
Vision-Language Model (VLM)

Last reviewed by Moderation API

A vision-language model is a multimodal model that understands images and text together, letting a single model reason about a picture, its caption, and any text overlaid on the image in one pass. VLMs are increasingly replacing stacks of single-purpose image classifiers in moderation pipelines because they can interpret open-ended, natural-language policies and adapt to new violation types without retraining a separate model for each one.
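The "one pass" idea can be sketched as a single multimodal request that bundles the image, its caption, and the policy text together. This is a minimal illustration assuming an OpenAI-style chat-completions message format; the model name, URLs, and policy wording are placeholders, not a specific provider's API.

```python
import json

def build_moderation_request(image_url: str, caption: str, policy: str) -> dict:
    """Build one multimodal request asking a VLM to judge an image
    and its caption against an open-ended policy in a single pass."""
    return {
        "model": "vlm-placeholder",  # placeholder model name
        "messages": [
            {
                "role": "system",
                # The policy is plain natural language, so changing it
                # swaps the moderation behavior without any retraining.
                "content": (
                    f"You are a content moderator. Policy: {policy} "
                    "Answer with FLAG or ALLOW and a short reason."
                ),
            },
            {
                "role": "user",
                # Text and image travel in the same message, so the model
                # can reason about both together rather than separately.
                "content": [
                    {"type": "text", "text": f"Caption: {caption}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    }

request = build_moderation_request(
    image_url="https://example.com/post.jpg",
    caption="check my bio for deals",
    policy="No spam, scams, or off-platform solicitation.",
)
print(json.dumps(request, indent=2))
```

Because the policy lives in the prompt rather than in trained classifier weights, covering a new violation type is a text edit, not a retraining run.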
