What is an LLM (Large Language Model)?

Last reviewed by Moderation API

A large language model, or LLM, is a deep neural network with billions to trillions of parameters trained on internet-scale text to predict the next token in a sequence. That deceptively simple objective, applied at enormous scale, produces systems that can translate languages, write code, summarize documents, and classify nuanced content like hate speech or self-harm with a fluency that a decade of hand-crafted rules never reached.
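Stripped to its core, the next-token objective is just a probability distribution over a vocabulary: the network scores every candidate token and the decoder picks from that distribution. The sketch below uses a toy four-word vocabulary and made-up scores to show the mechanics; no real model is involved.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy illustration (not a real model): scores a network might assign
# to each candidate next token after the prompt "The cat sat on the".
vocab = ["mat", "moon", "hat", "quickly"]
logits = [4.1, 0.3, 2.2, -1.5]

probs = softmax(probs_in := logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks "mat"
```

A real model repeats this step billions of times during training, nudging its parameters so the probability of the actual next token goes up.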

How LLMs actually work

Virtually every modern LLM is built on the transformer architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." Transformers use a mechanism called self-attention to weigh the relationships between every token in a sequence, which lets the model capture long-range context in a way earlier recurrent networks could not.
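Self-attention can be sketched in a few lines. This is a minimal, single-head version over tiny hand-picked vectors; in a real transformer the queries, keys, and values come from learned linear projections, and many heads run in parallel.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: each token's output is a weighted
    mix of every token's value vector, with weights from
    softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]   # how much each token attends to each other token
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy token embeddings of dimension 2, standing in for a sequence.
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(toks, toks, toks)
```

Because every token attends to every other token in one step, distance in the sequence costs nothing, which is exactly the long-range-context advantage over recurrent networks.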

Training happens in two broad stages. First comes pretraining, where the model is shown trillions of tokens scraped from the web, books, code repositories, and curated datasets, and learns to predict the next token. Then comes post-training: supervised fine-tuning on human-written examples, followed by reinforcement learning from human feedback (RLHF) or constitutional AI methods that align the model with safety and helpfulness goals.

The result is a general-purpose reasoning engine that can be steered with natural-language instructions.

The major families

The commercial LLM market is dominated by a handful of model families. OpenAI's GPT series (GPT-3 in 2020, GPT-4 in 2023, and successors) kicked off the current wave. Anthropic's Claude models emphasize safety and long-context reasoning. Google DeepMind's Gemini family integrates with search and multimodal inputs. Meta's Llama series, released under open weights starting in 2023, made frontier-scale models available to anyone with a GPU and seeded an open ecosystem that now includes Mistral, Qwen, and DeepSeek. Parameter counts have scaled from 1.5 billion in GPT-2 to frontier models in the hundreds of billions to trillions, though the industry has increasingly shifted attention from raw parameter count to training-compute efficiency and inference cost per token.

Why LLMs transformed content moderation

Traditional moderation relied on keyword lists, regex, and narrow classifiers trained on a single label like "toxicity" or "spam." These systems struggle with sarcasm, coded language, code-switching between languages, and novel harms that do not match any existing training data.

LLMs changed the equation in three ways:

  • Zero-shot and few-shot classification: one model can enforce an arbitrary policy described in plain English, with no retraining required when the policy changes.
  • Context understanding: an LLM can read the full thread, the user's history, and the platform's rules together, which enables decisions that depend on intent rather than surface features.
  • Multilingual coverage: frontier models handle 50+ languages competently out of the box, which removes the need for a separate classifier per locale.
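The zero-shot pattern in the first bullet boils down to a prompt template: the policy travels with the request as plain English, so changing it requires no retraining. The template and labels below are illustrative assumptions, not any vendor's actual API.

```python
def build_moderation_prompt(policy: str, content: str) -> str:
    """Assemble a zero-shot classification prompt. The policy is plain
    English, so updating it means editing a string, not retraining."""
    return (
        "You are a content moderator. Apply the following policy:\n"
        f"{policy}\n\n"
        "Content to review:\n"
        f"{content}\n\n"
        "Answer with exactly one label: ALLOW or FLAG."
    )

policy = "Flag personal attacks and slurs; allow heated but civil disagreement."
prompt = build_moderation_prompt(
    policy, "Your argument is weak, but fair point on costs.")
# `prompt` would then be sent to any chat-completion endpoint, and the
# returned label drives the moderation decision.
```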

Platforms like Moderation API now combine LLM-based policy reasoning with faster specialized classifiers, using each where it performs best. The specialized models handle the high-volume obvious cases cheaply; the LLM handles the edge cases that actually need to read the conversation.
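That hybrid routing can be sketched as a threshold rule. The thresholds and labels here are illustrative assumptions, not Moderation API's actual configuration: confident scores from the cheap classifier resolve immediately, and only the ambiguous middle band pays for an LLM call.

```python
def route(fast_score: float, low: float = 0.1, high: float = 0.9) -> str:
    """Hybrid moderation routing. `fast_score` is the cheap specialized
    classifier's violation probability; thresholds are illustrative."""
    if fast_score >= high:
        return "flag"            # obviously violating: no LLM call needed
    if fast_score <= low:
        return "allow"           # obviously fine: no LLM call needed
    return "escalate_to_llm"     # edge case: needs full-context reasoning

decisions = [route(s) for s in (0.02, 0.55, 0.97)]
```

Since most platform traffic is obviously fine or obviously violating, this keeps expensive LLM inference reserved for the small slice of content that genuinely needs it.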

Risks, costs, and safety-tuned LLMs

LLMs are not a silver bullet. They hallucinate, producing confident but wrong answers. They can be jailbroken or manipulated through prompt injection. Inference is expensive and introduces latency that matters for real-time moderation of livestreams or chat. They also inherit biases from their training data.

The industry has responded with two patterns. The first is fine-tuning: adapting a base model on labeled moderation data to lock in policy behavior and reduce cost. The second is the rise of purpose-built safety LLMs. Meta's Llama Guard (released in December 2023 and iterated through Llama Guard 3 and 4) is an open-weight model trained specifically to classify prompts and responses against a taxonomy of harms. Similar safeguards, such as Google's ShieldGemma classifiers and the NVIDIA NeMo Guardrails framework, are widely deployed as a protective layer around general-purpose chatbots.
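The guard-layer pattern is simple to express in code: check the prompt before generation and the response after it. This is a generic sketch of the idea, not Llama Guard's actual interface; `generate` and `guard_classify` are hypothetical callables standing in for the chatbot and the safety model.

```python
def guarded_reply(user_prompt, generate, guard_classify):
    """Wrap a general-purpose model with a safety classifier, in the
    style of Llama Guard: screen the prompt, generate, screen the output.
    `generate` and `guard_classify` are hypothetical stand-ins."""
    if guard_classify(user_prompt) != "safe":
        return "[blocked: unsafe request]"
    reply = generate(user_prompt)
    if guard_classify(reply) != "safe":
        return "[blocked: unsafe response]"
    return reply

# Stub components to show the control flow; real deployments would call
# two separate models here.
reply = guarded_reply(
    "How do I appeal a ban?",
    generate=lambda p: "You can appeal via the account settings page.",
    guard_classify=lambda text: "safe",
)
```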

The direction of travel is clear: moderation is moving from narrow single-purpose classifiers toward ensembles, where a general-purpose LLM does the nuanced reasoning and a safety-tuned LLM polices its outputs.
