PAPER PLAINE

Fresh research, simply explained. Updates twice daily.

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

How a single confusing text can fool systems that match images to captions

Researchers found a critical weakness in CLIP and similar image-text matching systems: a single generic piece of text can become a "hub," sitting unnaturally close to nearly every image in a dataset and earning high similarity scores even when it is meaningless. The culprit is hubness, a known quirk of high-dimensional embedding spaces in which a few points end up as near neighbors to a disproportionate share of all other points. This reveals that these widely used systems rest on flawed geometry in their shared representation space, leaving them vulnerable to subtle manipulation.
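To make the hubness idea concrete, here is a minimal simulation sketch (not the paper's method or real CLIP embeddings). It assumes only that CLIP-like encoders produce unit-normalized vectors clustered in a narrow cone, and shows how a single text vector pointed at the center of that cone can outscore an image's own caption for most images; the dimensions, noise level, and variable names are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 1000  # hypothetical embedding dimension and dataset size

# Simulated unit-norm image embeddings scattered around a shared direction,
# mimicking the narrow cone that contrastive encoders tend to occupy.
center = rng.normal(size=d)
center /= np.linalg.norm(center)
imgs = center + 0.5 * rng.normal(size=(n, d))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

# Matched caption embeddings, generated the same way.
caps = center + 0.5 * rng.normal(size=(n, d))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)

# A "hub" text: simply the normalized mean image direction.
hub = imgs.mean(axis=0)
hub /= np.linalg.norm(hub)

# For each image, does the generic hub beat that image's own caption?
sims_hub = imgs @ hub                    # cosine similarity to the hub
sims_pair = np.sum(imgs * caps, axis=1)  # similarity to the matched caption
frac = np.mean(sims_hub > sims_pair)
print(f"hub outranks the true caption for {frac:.0%} of images")
```

Because every embedding shares a component along the cone's center while the caption-specific components are nearly orthogonal noise in high dimensions, the centroid-aligned hub wins most pairwise comparisons, which is exactly the failure mode a retrieval or caption-evaluation system would suffer.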

Image-text matching systems power real applications—from photo search to automated caption evaluation—and companies rely on them to be robust. This vulnerability means a single malicious or even accidental hub text could poison search results or break evaluation metrics that measure whether AI-generated captions match human standards, undermining trust in systems used for content moderation, accessibility, and quality assurance.