Running Transformers.js in a Chrome Extension: What I Learned Building With Gemma 4

I’ve been tinkering with on-device AI in browser extensions for a while, and the recent Transformers.js demo extension powered by Gemma 4 E2B caught my attention. Not because it’s flashy, but because it actually tackles the hard parts of running local models in a Chrome extension under Manifest V3.

If you’ve tried this yourself, you know the pain: service workers that can suspend, model caching that behaves differently than you expect, and messaging that turns into spaghetti if you’re not careful. The team behind this extension published their architecture, and it’s worth walking through because they made some smart choices.

Who should care about this

This is for developers who want to run local AI features in a Chrome extension without sending user data to a remote API. The constraints are real: Manifest V3 service workers are ephemeral, memory is limited, and you can’t just load a 2 GB model and hope for the best.

The extension in question is a browser assistant that lives in a side panel. You can ask it questions about the page you’re on, it can extract content, highlight elements, and even search through your browsing history using semantic similarity. All of it runs locally.

The architecture they settled on

The key decision was keeping heavy orchestration in the background service worker and keeping the UI thin. This isn’t revolutionary, but the execution matters.

Three entry points in manifest.json:

  • Background service worker – this is the control plane. Model initialization, agent lifecycle, tool execution, and shared services all live here.
  • Side panel – pure interaction layer. Chat input/output, streaming updates, setup controls. It sends requests to the background and renders results.
  • Content script – the page bridge. DOM extraction and highlight actions. It’s the only part that can directly access the page DOM.

This split avoids duplicate model loads across tabs, keeps the UI responsive, and respects Chrome’s security boundaries. The conversation history lives in the background too, not the UI. When you send a message, the side panel fires an event, the background appends it, runs inference, then pushes the updated message list back.

Messaging: keep it simple

Once you split things across runtimes, messaging becomes the backbone. They defined all messages as typed enums, which is the right call. The pattern is straightforward:

  • Side panel sends actions like AGENT_GENERATE_TEXT or CHECK_MODELS
  • Background processes them and emits updates like MESSAGES_UPDATE or DOWNLOAD_PROGRESS
  • Background talks to content script for EXTRACT_PAGE_DATA or HIGHLIGHT_ELEMENTS

The background is the single coordinator. The side panel requests actions and renders results; the content script carries out DOM work when the background asks for it. This hub-and-spoke pattern is common in extension architectures, and it works because it’s boring and predictable.
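To make the "typed messages" idea concrete, here’s a minimal sketch of what such a contract could look like. The message names come from the list above; the payload fields (prompt, messages, percent) and the describeMessage helper are my own assumptions, not the extension’s actual code. I’ve also used an `as const` object as a stand-in for a TypeScript enum, which gives the same compile-time checking.

```typescript
// Message names are from the article; every payload shape below is an assumption.
const MessageType = {
  AGENT_GENERATE_TEXT: "AGENT_GENERATE_TEXT",
  CHECK_MODELS: "CHECK_MODELS",
  MESSAGES_UPDATE: "MESSAGES_UPDATE",
  DOWNLOAD_PROGRESS: "DOWNLOAD_PROGRESS",
} as const;

type ChatMessage = { role: "user" | "assistant"; content: string };

// Discriminated union: the `type` field decides which payload fields exist.
type ExtensionMessage =
  | { type: typeof MessageType.AGENT_GENERATE_TEXT; prompt: string }
  | { type: typeof MessageType.CHECK_MODELS }
  | { type: typeof MessageType.MESSAGES_UPDATE; messages: ChatMessage[] }
  | { type: typeof MessageType.DOWNLOAD_PROGRESS; percent: number };

// Hypothetical helper: the compiler narrows `msg` inside each case, and the
// `never` check turns a forgotten message type into a compile error.
function describeMessage(msg: ExtensionMessage): string {
  switch (msg.type) {
    case MessageType.AGENT_GENERATE_TEXT:
      return `generate: ${msg.prompt}`;
    case MessageType.CHECK_MODELS:
      return "check models";
    case MessageType.MESSAGES_UPDATE:
      return `update: ${msg.messages.length} message(s)`;
    case MessageType.DOWNLOAD_PROGRESS:
      return `download: ${msg.percent}%`;
    default: {
      const unreachable: never = msg;
      throw new Error(`Unhandled message: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

In a real extension the same union type would annotate whatever goes through chrome.runtime.sendMessage and the chrome.runtime.onMessage listener, so both ends agree on the payloads.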

Model loading: the tricky part

They use two models: Gemma 4 for reasoning and tool decisions, and MiniLM for vector embeddings. The split makes sense because you don’t need a full LLM for embedding generation, and keeping them separate lets you manage memory more carefully.

All inference runs in the background service worker. This gives a single model host for all tabs and sessions, avoids duplicate memory usage, and keeps the side panel responsive. But there’s a catch: service workers can be suspended and restarted by Chrome at any time. Model runtime state should be treated as recoverable, not persistent.
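"Recoverable, not persistent" roughly translates to: never assume the model is already in memory, always ask for it through something that can load it again. Here’s a rough sketch of that idea. The createModelHost helper and the loader signature are hypothetical; the real loader would wrap whatever Transformers.js call the extension uses.

```typescript
// Hypothetical types: a loader reports progress and resolves to a model instance.
type ProgressListener = (percent: number) => void;
type Loader<T> = (onProgress: ProgressListener) => Promise<T>;

// Lazy holder: loads once, shares the in-flight promise between callers,
// and forgets a failed load so the next call can try again.
function createModelHost<T>(load: Loader<T>) {
  let pending: Promise<T> | null = null;

  return {
    // Returns the shared instance, starting the load only on first use.
    get(onProgress: ProgressListener = () => {}): Promise<T> {
      if (pending === null) {
        pending = load(onProgress).catch((err) => {
          pending = null; // allow a later call to retry
          throw err;
        });
      }
      return pending;
    },
    // True once a load has started (or finished) in this worker lifetime.
    isStarted(): boolean {
      return pending !== null;
    },
  };
}
```

Because module-level state disappears when the service worker is restarted, every message handler would call host.get() rather than holding on to a model reference. After a restart the first call simply triggers a reload, which should be fast if the artifacts are already cached.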

The model lifecycle is explicit:

  • CHECK_MODELS inspects what’s already cached and estimates remaining download size
  • INITIALIZE_MODELS downloads and initializes models, emitting progress to the UI
  • Long-lived instances are reused after setup
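The "estimate remaining download size" part of CHECK_MODELS is simple arithmetic once you know which files are already cached. A small sketch of that step, with the caveat that the ModelFile shape and the function name are made up for illustration, and that how you discover the cached URLs depends on how the library stores its artifacts (not shown here):

```typescript
// Hypothetical description of one model artifact; URLs and sizes are illustrative.
interface ModelFile {
  url: string;
  sizeBytes: number;
}

// Sums the sizes of all files that are not in the cache yet.
function estimateRemainingBytes(
  files: ModelFile[],
  cachedUrls: ReadonlySet<string>,
): number {
  return files
    .filter((file) => !cachedUrls.has(file.url))
    .reduce((total, file) => total + file.sizeBytes, 0);
}
```

A result of 0 would mean setup can skip the download step and go straight to initialization.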

One detail I appreciate: because models are loaded from the background service worker, artifacts are cached under the extension origin (chrome-extension://) rather than per-website origins. This gives one shared cache for the whole extension install, which is cleaner than I expected from a default setup.

What I’d do differently

A few things caught my attention. First, the conversation history living entirely in the background’s memory means you lose it when the service worker is terminated. They mention treating state as recoverable, but I’d want to see persistence for anything you don’t want to lose: chrome.storage.session if it only needs to survive worker restarts (it is cleared when the browser session ends), or IndexedDB if it needs to outlive the browser session.
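Here’s roughly what I have in mind for the history, as a sketch rather than a drop-in fix. The KeyValueArea interface is my own minimal subset of a promise-based storage area, the storage key is hypothetical, and the in-memory implementation only exists so the example runs outside a browser. In the extension you’d pass chrome.storage.session, possibly through a thin adapter.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Assumed minimal subset of a promise-based storage area.
interface KeyValueArea {
  get(key: string): Promise<Record<string, unknown>>;
  set(items: Record<string, unknown>): Promise<void>;
}

const HISTORY_KEY = "conversationHistory"; // hypothetical key name

// Reads the stored history, falling back to an empty list.
async function loadHistory(area: KeyValueArea): Promise<ChatMessage[]> {
  const stored = await area.get(HISTORY_KEY);
  const value = stored[HISTORY_KEY];
  return Array.isArray(value) ? (value as ChatMessage[]) : [];
}

// Appends one message and writes the whole list back.
async function appendMessage(
  area: KeyValueArea,
  message: ChatMessage,
): Promise<ChatMessage[]> {
  const next = [...(await loadHistory(area)), message];
  await area.set({ [HISTORY_KEY]: next });
  return next;
}

// In-memory stand-in so the sketch can run without the chrome.* APIs.
function createMemoryArea(): KeyValueArea {
  const data: Record<string, unknown> = {};
  return {
    async get(key) {
      const result: Record<string, unknown> = {};
      if (key in data) result[key] = data[key];
      return result;
    },
    async set(items) {
      Object.assign(data, items);
    },
  };
}
```

One caveat: appendMessage is a read-then-write, so two overlapping calls could drop a message. Since the background is already the single coordinator, running appends one at a time there would avoid that.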

Second, the messaging contract is clean but verbose. For a production extension, I’d consider wrapping this in a thin abstraction layer to reduce boilerplate. The enums approach works, but every new feature means touching three files.

Third, and this is a personal preference, I’d want better error handling around model downloads. Network failures mid-download can leave partial caches, and the recovery path isn’t always obvious to users. A retry mechanism with exponential backoff would be nice.
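For the retry part, something like the following would do. This is a generic sketch, not anything from the extension’s codebase: withRetry, backoffDelayMs, and the option names are all mine, and the sleep function is injectable only so the logic can be tested without waiting.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  retries: number; // how many retries are allowed after the first attempt
  baseMs: number; // delay before the first retry
  maxMs: number; // upper bound for any single delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task`, retrying on rejection with exponentially growing delays.
// Rethrows the last error once the retries are used up.
async function withRetry<T>(
  task: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < options.retries) {
        await sleep(backoffDelayMs(attempt, options.baseMs, options.maxMs));
      }
    }
  }
  throw lastError;
}
```

With baseMs set to 500 and maxMs set to 8000, the waits would be 500, 1000, 2000, 4000, and then 8000 ms for every retry after that. Retrying alone doesn’t clean up a partial cache, so the task you pass in would still need to handle that itself.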

The bottom line

This is a solid reference architecture for anyone building AI-powered Chrome extensions. The decisions around runtime separation, messaging patterns, and model lifecycle are well thought out. The codebase is open source, so you can steal the patterns directly.

If you’re building something similar, start with their architecture and adapt from there. Just remember that service workers can die at any moment, so treat your model state as recoverable. And don’t try to run a 7B model in a side panel directly – that way lies pain and unresponsive UIs.
