Sentence Transformers has long been the go-to Python library for embedding and reranker models, especially for RAG pipelines and semantic search. With the v5.4 update, it catches up to the multimodal trend: you can now encode and compare text, images, audio, and video using the same API you already know. No separate pipelines, no awkward workarounds.
I’ve been using Sentence Transformers for years, and this is a genuinely useful addition. But let’s be honest: multimodal embedding is still a young field, and the results reflect that. More on that later.
What’s Actually New
Traditional embedding models turn text into fixed-size vectors. Multimodal ones map text, images, audio, and video into a shared embedding space. That means you can compare a text query against image documents, or find video clips matching an audio description, all with the same similarity functions.
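To make "same similarity functions" concrete, here is a minimal pure-Python sketch of cosine similarity applied across two sets of vectors. The 3-dimensional vectors are made up for illustration and are not output from any real model; the helper names `cosine_similarity` and `similarity_matrix` are my own, not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(queries, documents):
    """Rows correspond to queries, columns to documents."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]

# Toy vectors (assumed values, not real embeddings).
text_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_vectors = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]

scores = similarity_matrix(text_vectors, image_vectors)
for row in scores:
    print([round(s, 2) for s in row])
```

Nothing in the function cares which modality a vector came from. That is the whole point of a shared embedding space.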
Similarly, reranker models (Cross Encoders) now handle mixed-modality pairs. You can score the relevance of a text-image pair against a query, which opens up visual document retrieval, cross-modal search, and multimodal RAG. This is where things get interesting.
Installation: Pick Your Modalities
The library itself is the same, but each modality needs extra dependencies, which are installed as pip extras. If you only need images, install the image extra. Extras can be combined in a single command, for example images and video plus training support:
pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers"
pip install -U "sentence-transformers[image,video,train]"
A word of caution: VLM-based models like Qwen3-VL-2B need at least 8 GB of VRAM. The 8B variants want about 20 GB. If you don’t have a GPU, stick with CLIP-based models or text-only ones. On CPU, these multimodal models are painfully slow.
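If you want to encode that rule of thumb in a script, a rough sketch might look like the following. The thresholds are just the figures quoted above, and `pick_model_tier` and `detect_vram_gb` are hypothetical helper names of mine. The detection step assumes PyTorch may or may not be installed and falls back to "no GPU" when it is missing.

```python
def pick_model_tier(vram_gb):
    """Map GPU memory in GB to a model tier, using the rough figures from this post."""
    if vram_gb >= 20:
        return "8B VLM"
    if vram_gb >= 8:
        return "2B VLM"
    return "CLIP-based or text-only"

def detect_vram_gb():
    """Total memory of the first CUDA device in GB, or 0.0 if no usable GPU is found."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3

print(pick_model_tier(detect_vram_gb()))
```

Treat the cutoffs as approximate: the reported total can land slightly under a card's nominal size, and other processes may already be holding some of that memory.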
Embedding Models: The Basics
Loading a multimodal model is identical to a text-only one:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
The model auto-detects supported modalities. No extra config needed.
Encoding images works with URLs, local paths, or PIL Image objects:
img_embeddings = model.encode([
    "https://example.com/car.jpg",
    "https://example.com/bee.jpg",
])
Cross-modal similarity is where it gets interesting. You can compute similarities between text and image embeddings directly:
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A bee on a pink flower",
])
similarities = model.similarity(text_embeddings, img_embeddings)
The results are… okay. The correct matches get higher scores, but even the best scores hover around 0.5-0.7. That’s the modality gap at work: embeddings from different modalities cluster in separate regions of the space. Cross-modal similarities are lower than text-to-text, but the relative ordering is preserved, so retrieval still works.
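Here is a small sketch of what "relative ordering is preserved" means in practice. The scores are illustrative numbers I made up, not measured output, and I assume the similarity matrix has already been converted to plain nested lists (for example with `.tolist()` on a tensor). `best_match_per_query` is a hypothetical helper name.

```python
def best_match_per_query(similarities):
    """For each row (one query), return the column index with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarities]

# Illustrative scores only: rows are texts, columns are images.
similarities = [
    [0.61, 0.22],
    [0.18, 0.57],
]

matches = best_match_per_query(similarities)
print(matches)
```

None of the scores is anywhere near 1.0, yet each text still picks its own image, which is all a retrieval step needs.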
Encoding Queries vs Documents
For retrieval, use encode_query() and encode_document() instead of plain encode(). Many models prepend different instruction prompts depending on whether the input is a short query or a long document. This makes a real difference in retrieval quality.
query_emb = model.encode_query("Find images of red cars")
doc_emb = model.encode_document("https://example.com/red-car.jpg")
Reranker Models: Ranking Mixed Modalities
Rerankers work similarly but score relevance directly instead of computing embeddings:
from sentence_transformers import CrossEncoder
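# Use a distinct name so the embedding model loaded earlier is not overwritten
reranker = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
# Each pair is (query, document); the document here is an image URL
scores = reranker.predict([
    ("A red car", "https://example.com/red-car.jpg"),
    ("A bee on a flower", "https://example.com/bee.jpg"),
])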
This is useful for the retrieve-and-rerank pattern: use an embedding model for initial retrieval, then rerank the top candidates with a cross-encoder. It’s slower but more accurate.
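The two-stage flow can be sketched without loading any model. In the sketch below, `embed_score` and `rerank_score` are stand-in callables that I wrote purely for illustration (word overlap and an exact-phrase bonus). In a real pipeline they would be replaced by embedding similarity and cross-encoder scores respectively, and `retrieve_and_rerank` itself is a hypothetical helper, not a library function.

```python
def retrieve_and_rerank(query, documents, embed_score, rerank_score, top_k=3):
    """Two-stage search: cheap scoring over everything, costly scoring over the top_k."""
    # Stage 1: rank all documents with the fast scorer and keep the top_k.
    candidates = sorted(documents, key=lambda d: embed_score(query, d), reverse=True)[:top_k]
    # Stage 2: re-order only those candidates with the slower, more precise scorer.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

# Stand-in scorers (assumptions for the demo, not real models).
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def exact_phrase_bonus(query, doc):
    return word_overlap(query, doc) + (5 if query.lower() in doc.lower() else 0)

docs = ["a car painted red", "a red car on a street", "a bee on a flower", "red car"]
results = retrieve_and_rerank("red car", docs, word_overlap, exact_phrase_bonus, top_k=3)
print(results)
```

The expensive scorer only ever sees `top_k` items, which is why the pattern stays affordable even when the reranker is slow.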
What I Don’t Like
First, the modality gap is real. Cross-modal similarity scores are low enough that threshold-based filtering is tricky. You can’t just say “similarity > 0.8” because nothing will pass. You have to rely on relative ordering.
Second, GPU requirements are steep. The 2B-parameter models need at least 8 GB of VRAM, and the 8B ones about 20 GB. That rules out most consumer GPUs for the larger models. Cloud GPU or Colab is the practical option.
Third, audio and video support feels bolted on. The API works, but model availability is thin. Most supported models are text-image only. Audio and video models exist but are experimental.
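On the first complaint, one possible workaround for the threshold problem is to filter relative to the best score instead of against a fixed cutoff. This is a heuristic of my own, not something the library provides. `filter_relative`, the `ratio` value, and the sample scores are all assumptions for illustration, and the approach only makes sense when the best score is positive.

```python
def filter_relative(scored_items, ratio=0.9, max_items=5):
    """Keep items scoring at least `ratio` times the best score, best first."""
    if not scored_items:
        return []
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    cutoff = ranked[0][1] * ratio
    return [item for item, score in ranked if score >= cutoff][:max_items]

# Made-up cross-modal scores for three images.
scored = [("car.jpg", 0.62), ("bee.jpg", 0.21), ("truck.jpg", 0.58)]

kept = filter_relative(scored, ratio=0.9)
absolute = [item for item, score in scored if score > 0.8]
print(kept, absolute)
```

With these numbers the fixed 0.8 cutoff keeps nothing, while the relative filter keeps the two plausible candidates and drops the clear mismatch.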
Supported Models
The library supports several families:
- Qwen3-VL: Both embedding and reranker variants, 2B and 8B parameters. Best overall quality but GPU-heavy.
- CLIP-based: Lighter, CPU-friendly, but lower quality for cross-modal tasks.
- SigLIP: Similar to CLIP but trained with a different loss function. Good middle ground.
Check the Sentence Transformers model hub for the full list. New models are being added regularly.
Is It Worth It?
If you’re building multimodal RAG or cross-modal search, this is the easiest way to get started. The API is clean, the documentation is decent, and the library is well-maintained. Just be prepared for the modality gap and the GPU requirements.
If you only need text-to-text retrieval, stick with the existing Sentence Transformers models. They’re faster, cheaper, and more mature.
For everyone else: this is a solid step forward. Not perfect, but usable.