PAPER PLAINE

Fresh research, simply explained. Updates twice daily.

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

How a single confusing text can fool systems that match images to captions

Researchers found a critical weakness in CLIP and similar image-text matching systems: a single generic piece of text can become a "hub," sitting unnaturally close to nearly every image in a dataset and earning high similarity scores even when it is meaningless. The culprit is hubness, a known quirk of high-dimensional embedding spaces in which a few points end up as near neighbors to a disproportionate share of all other points. This reveals that these widely used systems rest on flawed geometry in their shared representation space, leaving them vulnerable to subtle manipulation.
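To make the hubness idea concrete, here is a minimal simulation sketch (not the paper's method or real CLIP embeddings). It assumes only that CLIP-like encoders produce unit-normalized vectors clustered in a narrow cone, and shows how a single text vector pointed at the center of that cone can outscore an image's own caption for most images; the dimensions, noise level, and variable names are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 1000  # hypothetical embedding dimension and dataset size

# Simulated unit-norm image embeddings scattered around a shared direction,
# mimicking the narrow cone that contrastive encoders tend to occupy.
center = rng.normal(size=d)
center /= np.linalg.norm(center)
imgs = center + 0.5 * rng.normal(size=(n, d))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

# Matched caption embeddings, generated the same way.
caps = center + 0.5 * rng.normal(size=(n, d))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)

# A "hub" text: simply the normalized mean image direction.
hub = imgs.mean(axis=0)
hub /= np.linalg.norm(hub)

# For each image, does the generic hub beat that image's own caption?
sims_hub = imgs @ hub                    # cosine similarity to the hub
sims_pair = np.sum(imgs * caps, axis=1)  # similarity to the matched caption
frac = np.mean(sims_hub > sims_pair)
print(f"hub outranks the true caption for {frac:.0%} of images")
```

Because every embedding shares a component along the cone's center while the caption-specific components are nearly orthogonal noise in high dimensions, the centroid-aligned hub wins most pairwise comparisons, which is exactly the failure mode a retrieval or caption-evaluation system would suffer.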

Image-text matching systems power real applications—from photo search to automated caption evaluation—and companies rely on them to be robust. This vulnerability means a single malicious or even accidental hub text could poison search results or break evaluation metrics that measure whether AI-generated captions match human standards, undermining trust in systems used for content moderation, accessibility, and quality assurance.