Vision-Language Model (VLM)
A vision-language model is a multimodal model that processes images and text jointly, so a single model can reason about a picture, its caption, and any text overlaid on the image in one pass. In moderation pipelines, VLMs are increasingly replacing stacks of single-purpose image classifiers because one prompted model can apply open-ended policies and adapt to new violation types without retraining.
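
To make this concrete, here is a minimal sketch of a VLM-based moderation check, assuming access to an OpenAI-compatible chat completions endpoint with a vision-capable model; the model name, policy wording, and file path are illustrative, not a prescribed setup:

```python
# Minimal sketch: asking a VLM to apply an open-ended moderation policy
# to an image. Assumes an OpenAI-compatible chat completions endpoint and
# a vision-capable model; model name and policy text are placeholders.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLICY = (
    "Policy: flag images that depict weapons, hate symbols, or graphic "
    "violence, including in any text overlaid on the image. "
    "Answer with FLAG or ALLOW, then one sentence of reasoning."
)


def moderate_image(path: str) -> str:
    # Encode the image as a base64 data URL so it can travel in the request.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": POLICY},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content


print(moderate_image("upload.jpg"))
```

Because the policy lives in the prompt rather than in classifier weights, covering a new violation type is a prompt change rather than a retraining run.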
