What is F1 Score?

Last reviewed by Moderation API

The F1 score is a single number that summarizes how well a binary classifier performs, and it shows up constantly in content moderation work. It ranges from 0 to 1, where 1 means perfect precision and recall. Higher is better.

Precision and recall

F1 is built on two underlying metrics, and it only makes sense once you understand both.

  • Precision answers the question "of everything the model flagged as harmful, how much actually was harmful?" It is true positives divided by (true positives + false positives).
  • Recall answers "of all the harmful content that actually exists, how much did the model catch?" It is true positives divided by (true positives + false negatives).

A classifier can be great at one and terrible at the other. A model that flags only the most obvious slurs will have near-perfect precision but awful recall. A model that flags every post will have perfect recall and useless precision.
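Both metrics reduce to a few lines of code. A minimal sketch using hypothetical confusion-matrix counts (the numbers are illustrative, not from any real model):

```python
def precision(tp, fp):
    # Of everything the model flagged, what fraction was actually harmful?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of all the harmful content that exists, what fraction was caught?
    return tp / (tp + fn) if (tp + fn) else 0.0

# Illustrative counts: the model flagged 120 posts, 90 of which were
# truly harmful, and it missed another 60 harmful posts.
tp, fp, fn = 90, 30, 60
print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```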

Calculating the F1 score

The formula is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

The harmonic mean punishes imbalance. If either precision or recall drops close to zero, F1 drops with it, even if the other metric is high.

That is the whole point of using F1 instead of a plain average: it refuses to let a model look good by cheating on one side of the tradeoff.
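The difference between the harmonic mean and a plain average is easy to see numerically. A sketch with illustrative values for a lopsided model:

```python
def f1(precision, recall):
    # Harmonic mean: collapses toward the weaker of the two inputs.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A lopsided model: near-perfect recall, poor precision.
p, r = 0.10, 0.95
print(f1(p, r))     # ~0.18 -- F1 stays close to the weak side
print((p + r) / 2)  # 0.525 -- a plain average hides the problem
```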

Why it matters for content moderation

Moderation is a domain where both kinds of errors carry real cost. A false negative leaves harmful content on the platform and risks user harm. A false positive removes legitimate speech, frustrates users, and generates appeals work for the operations team. F1 gives a quick read on whether a classifier is keeping both problems in check at once, which is why it typically shows up on the first page of any model evaluation report.

When F1 is the right metric

F1 is most useful when false positives and false negatives carry comparable weight.

It is also preferable to raw accuracy when the positive class is rare, which is almost always the case in moderation. If 1% of posts contain hate speech, a model that predicts "not hate speech" for everything scores 99% accuracy and catches nothing. F1 would score that model at zero and make the problem obvious.
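The 1% scenario above can be checked directly. A sketch with a synthetic dataset (1,000 posts, 10 positives) and a model that predicts "not hate speech" for everything, using the equivalent form F1 = 2TP / (2TP + FP + FN):

```python
# 1,000 posts, 10 of which (1%) contain hate speech.
labels = [1] * 10 + [0] * 990
preds = [0] * 1000  # the lazy model: never flags anything

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

print(accuracy)  # 0.99 -- looks excellent
print(f1)        # 0.0  -- reveals the model catches nothing
```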

Limitations

F1 assumes precision and recall matter equally, and that assumption often breaks in the real world. For CSAM detection, missing a true positive is catastrophic while a false positive is merely inconvenient, so teams usually optimize for recall and accept lower precision. For automated account bans the tradeoff is reversed: wrongly banning a legitimate user is the worse error, so precision takes priority. The F-beta score generalizes F1 by letting you weight recall over precision (or vice versa), and it is worth reaching for whenever the costs are asymmetric.
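The F-beta generalization follows the standard formula Fβ = (1 + β²) · P · R / (β² · P + R), where β > 1 weights recall more heavily and β < 1 favors precision. A sketch with illustrative precision/recall values:

```python
def f_beta(precision, recall, beta):
    # beta > 1 weights recall more heavily; beta < 1 favors precision.
    # beta = 1 recovers the plain F1 score.
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.9, 0.5  # high precision, mediocre recall (illustrative)
print(f_beta(p, r, 1))    # ~0.64 -- plain F1
print(f_beta(p, r, 2))    # ~0.55 -- F2 punishes the low recall harder
print(f_beta(p, r, 0.5))  # ~0.78 -- F0.5 rewards the high precision
```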

F1 also gets noisy on heavily imbalanced datasets, ignores true negatives entirely, and does not tell you anything about confidence calibration, drift, or the distribution of errors across user groups.

Treat it as a starting point for evaluation, not the final verdict on a model.