Finetuning Multimodal Embedding Models with Sentence Transformers: A Practical Walkthrough

Tom Aarsen is back with another practical post on Sentence Transformers, and this time it’s about training and finetuning multimodal models. If you missed his earlier post on using these models for text, images, audio, and video, go read that first. This one assumes you’ve got the basics down and want to get your hands dirty with actual training.

The running example here is Visual Document Retrieval (VDR). The idea: given a text query like “What was the company’s Q3 revenue?”, the model needs to find the right document screenshot from a pile of thousands. This is a very different beast from matching product photos to descriptions. You need to understand document layouts, charts, tables, and text together. General-purpose multimodal models like Qwen/Qwen3-VL-Embedding-2B are trained on all sorts of data, so they’re decent at everything but rarely great at any one thing. Finetuning changes that.

Aarsen finetuned that same Qwen model on VDR data and saw NDCG@10 jump from 0.888 to 0.947. That’s more than a small bump: it beat every other model he tested, including ones four times larger. That’s the kind of result that makes finetuning worth the effort.

What you need to train

The training pipeline for multimodal models uses the same SentenceTransformerTrainer you’d use for text-only models. The components are the same: model, dataset, loss function, training arguments, evaluator, and trainer. The only real difference is your dataset now includes images (or other modalities), and the model’s processor handles image preprocessing automatically. No manual resizing or tokenization gymnastics.

Model

You’ve got two paths. The straightforward one: finetune an existing multimodal embedding model. Pass the model ID to SentenceTransformer, optionally add model_kwargs and processor_kwargs to control things like attention implementation, precision, and image resolution bounds. Higher max_pixels means better quality but more memory. Aarsen’s example uses flash attention and bfloat16, which is sensible for modern GPUs.
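Here is a minimal sketch of what that first path might look like. The pixel bounds are illustrative values I picked, not recommendations from the post, and newer transformers releases may prefer `dtype` over `torch_dtype`, so treat the exact keys as assumptions to check against your installed versions. The import is deferred so the kwargs helper can be inspected on a machine without the GPU stack.

```python
# Sketch only: pixel bounds below are illustrative assumptions.
MODEL_ID = "Qwen/Qwen3-VL-Embedding-2B"


def build_load_kwargs(min_pixels=28 * 28, max_pixels=600 * 600):
    """Collect the loading options discussed above into one dict."""
    return {
        "model_kwargs": {
            # Requires the flash-attn package and a supported GPU.
            "attn_implementation": "flash_attention_2",
            "torch_dtype": "bfloat16",
        },
        "processor_kwargs": {
            # Larger max_pixels keeps more page detail but costs more memory.
            "min_pixels": min_pixels,
            "max_pixels": max_pixels,
        },
    }


def load_model(model_id=MODEL_ID, **pixel_overrides):
    """Load the embedding model; downloads weights on first call."""
    # Lazy import: only needed when the model is actually loaded.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id, **build_load_kwargs(**pixel_overrides))
```

Calling `load_model(max_pixels=400 * 400)` would be one way to trade some page detail for lower memory use.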

The second path: start from a raw VLM checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers will try to figure out the architecture, infer modalities from the processor, and set up pooling automatically. If it gets something wrong, you can edit the saved sentence_bert_config.json to fix modality settings or forward methods. This is more experimental but opens up more model choices.

Dataset

For VDR, you need pairs of text queries and document images. Aarsen used a dataset where each query maps to a specific document page screenshot. The dataset format is straightforward: a Hugging Face dataset with one column for the query text and one for the image. The model’s processor handles the image preprocessing, so you don’t need to resize or normalize manually.

One thing I appreciate: he’s explicit about the dataset format. No vague “prepare your data appropriately” nonsense. You need an anchor (the query) and a positive (the image). That’s it. With an in-batch negatives loss, one correct positive per query is enough; what matters more is that the other images in a batch really are wrong answers for that query, so duplicate pages landing in the same batch can hurt.
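To make the anchor/positive layout concrete, here is a small sketch that turns (query, image) pairs into a column-oriented dict. The file names are hypothetical, and the column names `anchor` and `positive` are my choice for illustration. In a real run the second column would hold images rather than path strings.

```python
def to_columns(pairs):
    """Turn (query, image) pairs into columns: anchor first, positive second."""
    columns = {"anchor": [], "positive": []}
    for query, image in pairs:
        if not query.strip():
            raise ValueError("every pair needs a non-empty query")
        columns["anchor"].append(query)
        columns["positive"].append(image)
    return columns


# Hypothetical examples; the paths stand in for document page screenshots.
pairs = [
    ("What was the company's Q3 revenue?", "pages/report_2023_p14.png"),
    ("How many employees joined in 2022?", "pages/report_2023_p31.png"),
]
columns = to_columns(pairs)

# With the `datasets` library installed, something like
#   Dataset.from_dict(columns).cast_column("positive", Image())
# would turn this into a dataset with a real image column.
```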

Loss Function

This is where things get interesting. Aarsen uses CachedMultipleNegativesRankingLoss with MatryoshkaLoss. The first one is a variant of the standard MultipleNegativesRankingLoss that uses gradient caching: the batch is embedded in small mini-batches, so you can train with a very large effective batch size without running out of memory. It’s designed for retrieval tasks where you have a batch of (query, positive) pairs and treat every other positive in the batch as a negative. Bigger batches mean more negatives per query and a stronger training signal; the price is some extra compute per step, so the cached variant buys memory headroom, not speed.
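The in-batch negatives idea is easier to see in a few lines of code. This is a pure-Python sketch of the objective, not the library’s implementation: cosine similarities are scaled (20 is the usual default, as far as I know) and each query is asked to pick its own positive out of every positive in the batch via cross-entropy.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_negatives_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, p) for p in positives]
        # Stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


aligned = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When every query sits next to its own positive the loss is close to zero; when the positives are swapped it is large. Every extra pair in the batch adds one more distractor per query.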

MatryoshkaLoss is for training models whose embeddings still work after truncation. It applies the underlying loss at several sizes at once, say the first 256, 128, 64, and 32 dimensions, so each prefix of the embedding is useful on its own. This lets you trade off accuracy for speed and storage at inference time without retraining. Aarsen’s results show that even at 32 dimensions, the finetuned model beats the base model at 256 dimensions. That’s a big deal for production systems where latency matters.
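At inference time, using a smaller Matryoshka size amounts to keeping the leading dimensions and renormalizing. A minimal sketch, assuming embeddings are plain lists of floats:

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Toy 3-dimensional embedding truncated to 2 dimensions.
small = truncate_and_normalize([3.0, 4.0, 12.0], dim=2)
```

In Sentence Transformers itself, passing `truncate_dim` when loading the model should give the same effect without manual slicing.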

Training Arguments and Evaluator

Standard stuff here. You set batch size, learning rate, number of epochs, warmup steps, and so on. The evaluator runs during training to track NDCG@10 on a held-out set. Aarsen used a custom evaluator that computes retrieval metrics on the VDR task. This is critical: without an evaluator that matches your actual use case, you’re flying blind.
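Since NDCG@10 is the number everything here hinges on, a short sketch of how it is computed for one query may help. This assumes binary relevance (a page is either the right answer or not), which is my simplification and not necessarily how the post’s evaluator is configured.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance for a single query."""
    # Discounted gain: a hit at 0-based position `rank` counts 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ordering puts every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


top_hit = ndcg_at_k(["page_7", "page_2"], {"page_7"})
second_hit = ndcg_at_k(["page_2", "page_7"], {"page_7"})
```

The reported score is the mean of this value over all evaluation queries, so the metric rewards putting the right page at rank 1, not just somewhere in the top 10.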

Trainer

The SentenceTransformerTrainer ties everything together. You pass the model, dataset, loss function, training arguments, and evaluator, then call train(). It handles batching, gradient accumulation, logging, checkpointing, and evaluation. Nothing exotic, but it works.
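Put together, the wiring might look roughly like this. The hyperparameters, output path, and Matryoshka sizes are placeholders I chose for illustration, not the values from the post, and the imports are deferred into the function so the sketch can be read and loaded without the training stack installed.

```python
# Placeholder settings: tune these for your own hardware and data.
HPARAMS = {
    "output_dir": "models/vdr-finetuned",  # hypothetical path
    "num_train_epochs": 1,
    "per_device_train_batch_size": 64,  # also the number of in-batch candidates
    "learning_rate": 2e-5,
    "bf16": True,
}
MATRYOSHKA_DIMS = [256, 128, 64, 32]
MINI_BATCH_SIZE = 1  # how many samples are embedded at once by the cached loss


def train(model, train_dataset, evaluator=None):
    """Finetune `model` on (anchor, positive) pairs and return the trainer."""
    from sentence_transformers import (
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )
    from sentence_transformers.training_args import BatchSamplers

    inner_loss = CachedMultipleNegativesRankingLoss(
        model, mini_batch_size=MINI_BATCH_SIZE
    )
    loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=MATRYOSHKA_DIMS)
    args = SentenceTransformerTrainingArguments(
        # Avoid duplicate samples in a batch, which would act as false negatives.
        batch_sampler=BatchSamplers.NO_DUPLICATES,
        **HPARAMS,
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()
    return trainer
```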

Results that matter

Aarsen’s finetuned model, trained on VDR data, achieved NDCG@10 of 0.947. The base model was at 0.888. That’s a 6.6% relative improvement. More importantly, it outperformed every other VDR model he tested, including some that were four times larger. This confirms what many of us have suspected: for specialized retrieval tasks, a well-finetuned small model can beat a generic large model.
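For anyone who wants to check the arithmetic behind that percentage, here it is spelled out. The headroom figure is my own framing, not something from the post: it asks how much of the gap between the base score and a perfect 1.0 the finetuning closed.

```python
base, finetuned = 0.888, 0.947

absolute_gain = finetuned - base
relative_gain_pct = 100 * absolute_gain / base
# Share of the remaining gap to a perfect score of 1.0 that was closed.
headroom_closed_pct = 100 * absolute_gain / (1.0 - base)

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain_pct:.1f}%")
print(f"headroom closed: {headroom_closed_pct:.1f}%")
```

Seen that way, the gain looks larger than 6.6% suggests, because the base model was already close to the ceiling.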

The Matryoshka dimensions experiment is also revealing. At 256 dimensions, the finetuned model hit 0.947. At 128 dimensions, it was 0.941. At 64 dimensions, 0.929. At 32 dimensions, 0.903. Compare that to the base model at 256 dimensions: 0.888. So even at 32 dimensions, the finetuned model is better than the base model at 8x the dimensions. For anyone deploying retrieval systems at scale, this is the kind of efficiency gain that matters.
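To put the efficiency point in storage terms, here is a quick back-of-the-envelope sketch using the scores quoted above. The corpus size of one million pages and float32 storage are assumptions of mine for illustration.

```python
# NDCG@10 of the finetuned model per embedding size, as quoted above.
SCORES = {256: 0.947, 128: 0.941, 64: 0.929, 32: 0.903}
BASE_SCORE_AT_256 = 0.888
NUM_PAGES = 1_000_000  # hypothetical corpus size


def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a dense index, assuming float32 values and no overhead."""
    return num_vectors * dim * bytes_per_value


for dim, score in SCORES.items():
    size_mib = index_bytes(NUM_PAGES, dim) / 2**20
    print(f"{dim:>3} dims: NDCG@10 {score:.3f}, about {size_mib:.0f} MiB")
```

Under those assumptions the 32-dimension index is one eighth the size of the 256-dimension one while still scoring above the base model.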

Training multimodal rerankers

Aarsen also briefly covers training multimodal reranker models. The approach is similar: you need a model that takes a query and a candidate document (as text or image) and outputs a relevance score. The dataset format changes slightly: you need (query, candidate, label) triples where the label indicates relevance. The loss function is typically a cross-entropy or contrastive loss. Rerankers are CrossEncoder models in Sentence Transformers, so training goes through CrossEncoderTrainer with cross-encoder losses instead of SentenceTransformerTrainer, but the workflow mirrors the embedding one.
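As a rough illustration of the reranker setup, here is what labeled triples and a binary cross-entropy objective could look like. The triples are hypothetical, and the loss is a plain-Python stand-in for whatever loss class the post actually uses.

```python
import math

# Hypothetical (query, candidate, label) triples; 1.0 means relevant.
triples = [
    ("What was the company's Q3 revenue?", "pages/report_2023_p14.png", 1.0),
    ("What was the company's Q3 revenue?", "pages/report_2023_p02.png", 0.0),
]


def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw relevance score, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident positive score is cheap for a relevant pair, costly for an irrelevant one.
loss_if_relevant = bce_with_logits(5.0, 1.0)
loss_if_irrelevant = bce_with_logits(5.0, 0.0)
```

Unlike the embedding setup, negatives here are explicit rows in the dataset, not borrowed from the rest of the batch.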

I won’t go into full detail here because the post covers it, but the key takeaway is that the same infrastructure works for both embedding and reranker models. If you can train one, you can train the other.

What I’d add

Aarsen’s post is solid, but I’d love to see more on data curation. VDR datasets aren’t easy to come by. You either need to create your own by pairing queries with document screenshots, or find a domain-specific dataset. The quality of your training data matters more than the model architecture or loss function. Aarsen doesn’t spend much time on this, but it’s where most of the work actually goes.

Also, the post assumes you have a decent GPU. Qwen3-VL-Embedding-2B is a 2B parameter model. Training it on a consumer GPU is possible with gradient checkpointing and low precision, but it’s not fast. If you’re working with limited hardware, you might want to start with a smaller VLM or use LoRA-style finetuning. Sentence Transformers integrates with PEFT, so you can attach a LoRA adapter to the model with add_adapter and train it with the usual trainer.
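To see why LoRA helps on limited hardware, it is worth counting trainable parameters. A rank-r adapter on a weight matrix of shape d_out × d_in adds two small matrices with r · (d_out + d_in) parameters in total. The layer size below is a made-up example, not the actual shape of any layer in the Qwen model.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix."""
    return d_out * d_in


def lora_extra_params(d_out, d_in, rank):
    """Parameters added by a rank-`rank` LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical square projection layer.
d_out, d_in, rank = 2048, 2048, 16
dense = full_params(d_out, d_in)
adapter = lora_extra_params(d_out, d_in, rank)
print(f"dense: {dense:,} params, adapter: {adapter:,} params, ratio: {dense // adapter}x")
```

Only the adapter weights receive gradients and optimizer state, which is where most of the memory saving comes from.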

Bottom line

This is a practical, no-nonsense guide to finetuning multimodal embedding models. The results speak for themselves: a 6.6% relative improvement in NDCG@10 on a challenging retrieval task, beating models four times larger. If you’re working on visual document retrieval, multimodal search, or any task that requires understanding images and text together, this is worth your time.

The post is on the Hugging Face blog, and the code is on GitHub. Go read it, try it on your own data, and see how much you can squeeze out of a finetuned model.