Technology
57 content moderation terms tagged technology.
- Abuse Detection
Abuse detection is the process of identifying harmful or abusive user-generated content — such as harassment, threats, or hate speech — using automated tools, human moderators, or a combination of both.
- Active Learning
Active learning is a training strategy where the model itself selects the most informative unlabeled examples for humans to annotate next, typically the cases it is least certain about. It reduces labeling cost by focusing human effort on the examples that will improve the model the most.
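The selection step described above can be sketched as uncertainty sampling, where the batch picked for annotation is the set of examples whose predicted probability sits closest to the decision boundary. The function and toy scores below are illustrative, not any particular library's API:

```python
def select_for_labeling(examples, predict_proba, batch_size=5):
    """Uncertainty sampling: rank unlabeled examples by how close the
    model's positive-class probability is to 0.5, and return the most
    uncertain batch for human annotation."""
    scored = [(abs(predict_proba(x) - 0.5), x) for x in examples]
    scored.sort(key=lambda pair: pair[0])
    return [x for _, x in scored[:batch_size]]

# Toy stand-in for a trained classifier's probability output.
toy_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.49, "e": 0.70}
picked = select_for_labeling(list(toy_scores), toy_scores.get, batch_size=2)
print(picked)  # ['d', 'b'] -- the two scores nearest the 0.5 boundary
```

Other selection strategies (margin sampling, entropy, query-by-committee) slot into the same loop by changing the scoring function.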
- Age Verification
Age verification is the process of confirming a user's age before granting access to age-restricted content or features, using methods ranging from government ID checks and credit card verification to facial age estimation and behavioral age assurance. Requirements vary by jurisdiction and by the risk profile of the platform.
- AI-Generated Content (AIGC)
AI-generated content (AIGC) is text, images, audio, or video produced by generative AI models such as ChatGPT, DALL-E, or Stable Diffusion, rather than by a human author. AIGC raises new moderation challenges around spam, misinformation, and deepfakes.
- AI Guardrails
AI guardrails are the rules, filters, and policies built around an AI system to keep its inputs and outputs within safe and ethical boundaries — preventing the model from generating harmful, biased, or off-policy content even when prompted to do so.
- AI Voice Cloning Scam
An AI voice cloning scam uses a few seconds of recorded speech — pulled from social media, voicemail, or a short phone call — to generate a synthetic copy of someone's voice, then impersonates them in a fake emergency call demanding money. The most common variant is the grandparent scam, where the cloned voice of a child or grandchild claims to be in jail, in the hospital, or stranded abroad.
- AI Watermarking
AI watermarking is the practice of embedding imperceptible signals into AI-generated text, images, audio, or video so that the content can later be identified as machine-produced. Techniques range from statistical token-level markers in language model output to pixel-level perturbations in generated images.
- Algorithmic Moderation
Algorithmic moderation is the use of rule-based or pattern-matching algorithms to automatically detect and manage inappropriate content, without requiring a human reviewer for each decision.
- Artificial Intelligence (AI) Moderation
AI moderation is the use of machine learning, natural language processing, and computer vision models to identify and filter harmful content at a speed and scale that human reviewers cannot match alone.
- Automated Moderation
Automated moderation is the use of software tools — including rules engines, AI classifiers, and word filters — to review and act on user-generated content without human intervention in the moment.
- Bot Detection
Bot detection is the identification of automated accounts and scripted traffic through a combination of behavioral analysis, device fingerprinting, network signals, and challenge-response checks. It underpins abuse prevention across spam, fraud, scraping, and inauthentic engagement.
- Business Email Compromise (BEC)
Business email compromise is a targeted fraud in which attackers impersonate an executive, employee, or vendor over email to redirect wire transfers, payroll, or invoice payments to accounts they control. It relies on social engineering and spoofed or compromised email accounts rather than malware.
- C2PA
C2PA is an open technical standard from the Coalition for Content Provenance and Authenticity for attaching cryptographically signed provenance metadata to media, recording how a file was captured, edited, and published. It is designed to help platforms and users distinguish authentic media from manipulated or synthetic content.
- Confusion Matrix
A confusion matrix is a table (2×2 for a binary classifier) that breaks model predictions into true positives, false positives, true negatives, and false negatives. It is the source table from which precision, recall, F1, and accuracy are calculated.
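For a binary harmful-vs-benign classifier, the four cells and the metrics derived from them can be computed directly. The labels below are a made-up toy set:

```python
def confusion_matrix(y_true, y_pred):
    """Tally binary predictions against ground-truth labels.
    1 = harmful (positive class), 0 = benign (negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # ground-truth labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]   # classifier output
tp, fp, tn, fn = confusion_matrix(y_true, y_pred)

precision = tp / (tp + fp)   # fraction of flags that were right
recall = tp / (tp + fn)      # fraction of harmful items caught
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, tn, fn)        # 3 1 3 1
print(precision, recall, f1) # 0.75 0.75 0.75
```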
- Content Filtering
Content filtering is the process of screening incoming user-generated content against a set of rules, classifiers, or word lists, and blocking or removing anything that matches an inappropriate or harmful category before it reaches other users.
- Contextual Analysis
Contextual analysis is the examination of a piece of content within its surrounding context — tone, intent, conversation history, and platform norms — to determine whether it actually violates a policy, rather than judging it on the literal words alone.
- Crypto Scam
A crypto scam is any fraud that exploits cryptocurrency rails to steal funds, including fake exchanges, celebrity giveaway scams, wallet drainers, and fraudulent token launches. The irreversibility of on-chain transactions makes recovery extremely difficult once assets have moved.
- Dark Web
The dark web is a portion of the internet that is not indexed by traditional search engines and can only be reached through anonymizing software such as Tor. Its anonymity makes it both a refuge for privacy-conscious users and a venue for illegal activity.
- Data Labeling
Data labeling is the process of annotating raw content with the correct categories so it can be used to train or evaluate a classifier. Reliable labeling depends on clear guidelines, well-trained annotators, and measuring inter-annotator agreement to catch ambiguous or inconsistent cases.
- Deepfake
A deepfake is a piece of synthetic media — typically video, audio, or image — in which a person's likeness or voice has been replaced or generated using deep learning. Deepfakes are widely used for non-consensual intimate imagery, fraud, impersonation scams, and political disinformation.
- Deepfake Scam
A deepfake scam uses AI-generated synthetic video or audio to impersonate a real person for fraud, including fake CEO video calls authorizing wire transfers, fabricated celebrity endorsements of investment schemes, and synthetic non-consensual intimate imagery used for sextortion. As generation tools get cheaper and faster, deepfakes are becoming a default layer in business email compromise and romance scams.
- F1 Score
The F1 score is the harmonic mean of precision and recall, used to evaluate a moderation classifier with a single number between 0 and 1. A perfect F1 of 1 means the system catches every harmful item without any false positives.
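In symbols, with precision $P$ and recall $R$, the harmonic mean can also be written directly in terms of the confusion-matrix counts:

```latex
F_1 = 2 \cdot \frac{P \cdot R}{P + R}
    = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
```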
- False Negative
A false negative in content moderation is an instance where harmful or policy-violating content slips through undetected by the moderation system. False negatives expose users to harm and indicate gaps in the model or rule set.
- False Positive
A false positive in content moderation is an instance where benign content is incorrectly flagged or removed as harmful by a moderation tool or algorithm. High false positive rates frustrate users and erode trust in the platform.
- Fraud Detection
Fraud detection is the process of identifying and preventing deceptive activity on a platform — including scams, fake accounts, payment fraud, and account takeovers — typically by combining behavioral signals, device fingerprinting, and machine learning models.
- Grooming Detection
Grooming detection is the conversation-level identification of patterns used by adults to build trust with minors for the purpose of sexual exploitation. It relies on analyzing multi-turn dialogue — including isolation tactics, boundary testing, and escalation toward sexual content — rather than judging single messages in isolation.
- Ground Truth
Ground truth is the human-labeled reference set that a classifier is trained and evaluated against, representing the correct answer for each example. The quality of the ground truth sets a hard ceiling on how good any model trained or measured against it can be.
- Hash Matching
Hash matching is a detection technique that compares the cryptographic or perceptual fingerprint of a piece of content against a database of known-harmful material, enabling instant identification without re-analyzing the media. It is the backbone of industry efforts to remove CSAM and terrorist content at scale.
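A minimal sketch of the lookup step, using a cryptographic hash for simplicity; real systems use perceptual hashes (e.g. PhotoDNA, PDQ) so that crops and re-encodes still match, whereas a cryptographic hash only catches byte-exact copies. The known-bad set here is a placeholder:

```python
import hashlib

# Hypothetical database of known-harmful fingerprints (SHA-256 hex digests).
known_hashes = {
    hashlib.sha256(b"known harmful file bytes").hexdigest(),
}

def is_known_harmful(content: bytes) -> bool:
    """Fingerprint the upload and look it up in the known-bad set."""
    return hashlib.sha256(content).hexdigest() in known_hashes

print(is_known_harmful(b"known harmful file bytes"))   # True: exact match
print(is_known_harmful(b"known harmful file bytes!"))  # False: one byte differs
```

The set lookup is O(1), which is what makes hash matching cheap enough to run on every upload.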
- LLM
An LLM (Large Language Model) is a neural network trained on huge volumes of text to understand and generate human-like language. In content moderation, LLMs power nuanced classification, intent detection, and policy reasoning that rule-based systems cannot match.
- LLM Hallucination
An LLM hallucination is a confident but factually incorrect or fabricated output produced by a large language model, such as invented citations, nonexistent people, or false historical claims. It stems from the model predicting plausible-sounding text rather than retrieving verified information.
- LLM Jailbreak
An LLM jailbreak is a prompt or sequence of prompts crafted to bypass a language model's built-in safety rules and elicit content the model is trained to refuse. Techniques include role-play framing, obfuscated instructions, and multi-turn manipulation that gradually erodes the model's guardrails.
- Machine Learning Moderation
Machine learning moderation is the use of supervised models trained on labeled examples of past moderation decisions to predict whether new content violates policy, improving accuracy and throughput as more data is collected.
- MLCommons safety categories
The MLCommons safety categories are a standardized taxonomy of 13 harm types — created by the MLCommons consortium — used to evaluate and compare the safety of AI models on a consistent set of risks such as violent crimes, hate, and self-harm.
- Model Drift
Model drift is the gradual decay in a classifier's accuracy as the language, topics, and attack patterns it sees in production diverge from the data it was trained on. Left unchecked, drift silently erodes precision and recall until the model is retrained on fresh examples.
- NLP
NLP (Natural Language Processing) is the branch of artificial intelligence that gives machines the ability to read, interpret, and generate human language. In content moderation, NLP powers toxicity classifiers, intent detection, and multilingual filtering.
- Nudity Detection
Nudity detection is the use of computer vision models to identify images or video frames containing nudity or sexually explicit material, so the platform can blur, remove, or escalate the content for human review.
- OCR (Optical Character Recognition)
Optical character recognition is the extraction of machine-readable text from images, scanned documents, and video frames. In moderation workflows it is used to surface hidden text in memes, screenshots, and overlays so that the content can be analyzed by the same classifiers applied to regular text.
- Perspective API
Perspective API is a public toxicity classification service from Google Jigsaw that scores text across attributes such as toxicity, severe toxicity, insult, threat, and identity attack. It is widely used by publishers, forums, and researchers as a baseline for detecting harmful conversational content.
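A sketch of the request body sent to the API's `comments:analyze` endpoint; the attribute names follow the public documentation, while the API key and HTTP call are omitted:

```python
import json

def build_request(text: str) -> str:
    """Build a Perspective-style analyze request body as a JSON string."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {
            "TOXICITY": {},
            "SEVERE_TOXICITY": {},
            "INSULT": {},
            "THREAT": {},
            "IDENTITY_ATTACK": {},
        },
    }
    return json.dumps(payload)

body = build_request("you are all wonderful")
print(body)
```

The response returns a score between 0 and 1 per requested attribute, which the caller compares against its own thresholds.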
- Phishing
Phishing is a social engineering attack in which the attacker impersonates a trusted entity — a bank, an employer, a well-known brand — in an email, text message, or website in order to trick the victim into handing over credentials, payment information, or access to a device. Variants include spear phishing, smishing (SMS), and vishing (voice).
- PhotoDNA
PhotoDNA is a hash-matching technology developed by Microsoft that creates a robust digital signature of an image, allowing platforms to detect known child sexual abuse material even after cropping, resizing, or minor edits. It is widely deployed across major platforms and reported into NCMEC's CyberTipline.
- PII Detection
PII detection is the automated identification of personally identifiable information — such as names, addresses, phone numbers, government IDs, and financial details — inside user-generated content so it can be redacted, blocked, or routed for review. It is central to doxxing prevention, privacy compliance, and safe data handling.
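A regex-based sketch of the redaction step. The patterns below are illustrative only: production systems combine far more patterns, checksums (e.g. for card numbers), and ML-based entity recognition:

```python
import re

# Illustrative patterns; keyed by the placeholder label used in redaction.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-867-5309."))
# Reach me at [EMAIL] or [PHONE].
```

Typed placeholders (rather than blanket deletion) preserve enough context for downstream review while keeping the raw values out of logs.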
- Post Moderation
Post moderation is the practice of letting user-generated content go live immediately and reviewing it after publication, removing anything that turns out to violate policy. It maximizes engagement speed at the cost of brief exposure to harmful content.
- Pre Moderation
Pre moderation is the practice of reviewing user-generated content before it goes live, blocking anything that violates policy from ever appearing publicly. It prevents harm but adds latency and can slow down community engagement.
- Precision
Precision is a moderation metric that measures what fraction of the items flagged by a classifier are actually harmful. High precision means few false positives — when the system takes action, it is almost always right. Precision is typically evaluated alongside recall, and the F1 score balances the two.
- Profanity Filter
A profanity filter is a tool that scans user-generated text against a list of offensive words and blocks, masks, or replaces matches with safe substitutes such as asterisks. Modern profanity filters also handle obfuscation, leetspeak, and context.
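A minimal sketch of masking with leetspeak normalization; the block list here is a stand-in for real offensive terms:

```python
# Map common character substitutions back to letters before matching.
LEET_MAP = str.maketrans("013457$@", "oieastsa")
BLOCKLIST = {"darn", "heck"}  # placeholder terms

def mask_profanity(text: str) -> str:
    """Mask any word whose normalized form is on the block list."""
    out = []
    for word in text.split():
        normalized = word.lower().translate(LEET_MAP).strip(".,!?")
        out.append("*" * len(word) if normalized in BLOCKLIST else word)
    return " ".join(out)

print(mask_profanity("What the h3ck is this"))  # What the **** is this
```

Normalizing before matching is what lets the filter catch `h3ck` with a block list that only contains `heck`; context handling (sarcasm, reclaimed terms) requires classifiers beyond a word list.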
- Prompt Injection
Prompt injection is an attack against an LLM-powered application in which adversarial instructions — embedded in user input, retrieved documents, or tool outputs — override the developer's system prompt and coerce the model into ignoring its guardrails. It is the top security risk in the OWASP Top 10 for LLM Applications.
- Recall
Recall is a moderation metric that measures what fraction of all the actually harmful items in a dataset were correctly caught by the system. High recall means few false negatives — fewer pieces of harmful content slipping through.
- Red Teaming (AI)
AI red teaming is the practice of adversarially probing a machine learning system — especially large language and multimodal models — to surface unsafe, biased, or policy-violating outputs before deployment. It combines manual attack crafting, automated prompt generation, and structured evaluation against a defined harm taxonomy.
- Rug Pull
A rug pull is a cryptocurrency exit scam in which the developers of a token or project abruptly abandon it and drain the liquidity pool, leaving holders with worthless assets. It typically follows aggressive marketing and artificial price pumping designed to attract retail buyers.
- Sentiment Analysis
Sentiment analysis is a natural language processing technique that classifies text according to its emotional tone, typically as positive, negative, or neutral, and sometimes along finer-grained dimensions such as anger, joy, or fear. It is commonly used alongside moderation signals to understand community health and user experience.
- SIM Swap
A SIM swap is an attack in which a fraudster social-engineers a mobile carrier into transferring a victim's phone number to a SIM they control, allowing them to intercept SMS-based two-factor authentication codes and take over bank, email, and crypto accounts. It is one of the primary reasons security practitioners discourage SMS 2FA for high-value accounts.
- Smishing
Smishing is phishing delivered over SMS, where attackers send text messages impersonating a bank, delivery service, toll authority, or government agency to trick the recipient into clicking a malicious link or handing over credentials. The compressed format of text messages makes it harder for victims to spot the usual red flags of a phishing attempt.
- True Negative
A true negative in content moderation is an instance where benign content is correctly left alone by the moderation system. True negatives represent the everyday content the model rightly ignores, and they keep false positive rates in check.
- True Positive
A true positive in content moderation is an instance where harmful or policy-violating content is correctly identified and acted on by the moderation system. True positives are the cases the model gets right — they drive precision and recall.
- Vishing
Vishing is voice phishing, where an attacker calls the victim and impersonates a bank, tax authority, or tech-support agent to extract credentials, payment details, or remote access to a device. Modern vishing increasingly uses AI voice cloning to impersonate specific individuals the victim knows and trusts.
- Vision-Language Model (VLM)
A vision-language model is a multimodal model that understands images and text together, letting a single model reason about a picture, its caption, and any overlaid text in one pass. VLMs are rapidly replacing stacks of single-purpose image classifiers in moderation pipelines because they can handle open-ended policies and adapt to new violation types without retraining.
- Zero-Shot Classification
Zero-shot classification is the ability of a large language model to assign labels it was never explicitly trained on, using only a natural-language description of each category. It lets teams spin up new policies without collecting and labeling a dedicated training set first.
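The prompt that drives zero-shot classification can be assembled from nothing but the label descriptions. The labels below are illustrative, and the actual call to an LLM is elided:

```python
# Hypothetical policy labels, each defined only by a natural-language description.
LABELS = {
    "harassment": "content that targets a person with insults or abuse",
    "spam": "unsolicited promotional or repetitive content",
    "none": "content that fits no policy category",
}

def build_prompt(text: str) -> str:
    """Assemble a zero-shot classification prompt from label descriptions."""
    lines = ["Classify the message into exactly one label.", ""]
    for name, description in LABELS.items():
        lines.append(f"- {name}: {description}")
    lines += ["", f"Message: {text}", "Label:"]
    return "\n".join(lines)

print(build_prompt("BUY CHEAP FOLLOWERS NOW!!!"))
```

Adding a policy means adding a dictionary entry, not collecting a labeled training set, which is the core appeal of the approach.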
