The assumption I keep hearing is that running AI models in a browser makes everything slow. More memory, more CPU, janky scrolling, drained battery. I understand where that comes from. Some early browser AI demos were genuinely painful.
But after six months of running Transformers.js inside a Chrome extension every day, I can tell you the reality is more nuanced than the assumption. The performance story depends almost entirely on model choice, quantization, and where in the browser's architecture you run the inference. Get those three things right and a browser-based ML model can be nearly invisible to the user.
Here's what I've actually observed.
What Transformers.js is and why it matters
Transformers.js is the JavaScript port of Hugging Face's Transformers library. It lets you run pre-trained models (text embedders, classifiers, translators, image processors) in the browser with no backend required. The library handles the heavy lifting: loading model weights, managing the ONNX Runtime backend, and providing a familiar Hugging Face-style API.
The execution happens either through WebAssembly (WASM) or WebGPU. WASM runs the model on the CPU in a sandboxed environment, compatible with every modern browser. WebGPU uses the GPU for parallel computation and is significantly faster for larger models, but requires hardware and browser support that isn't universal yet.
TraceMind uses Transformers.js to run the all-MiniLM-L6-v2 model locally for semantic search. When you visit a page, the extension extracts the text content, passes it through the model to generate a 384-dimensional embedding vector, and stores that vector in IndexedDB alongside the page metadata. When you search, your query goes through the same process and gets compared against all stored vectors.
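The comparison step is simple enough to sketch. This is a minimal illustration, not TraceMind's actual code: the StoredPage shape and the searchPages function are hypothetical names, and it assumes the embeddings have already been generated and loaded from storage.

```typescript
// Hypothetical record shape; the real storage schema is not shown in this post.
interface StoredPage {
  url: string;
  embedding: Float32Array; // 384 values for all-MiniLM-L6-v2
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every stored page against the query and keep the best matches.
function searchPages(
  query: Float32Array,
  pages: StoredPage[],
  topK: number
): { url: string; score: number }[] {
  return pages
    .map((p) => ({ url: p.url, score: cosineSimilarity(query, p.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A brute-force scan like this is linear in the number of stored pages, which is why it stays cheap for a personal history index.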
The interesting part from a performance standpoint is that this all happens inside an extension, not a regular webpage. That architectural choice has significant implications.
Extension service workers vs. page scripts
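Running ML inference in a Chrome extension is different from running it in a webpage. Extensions have access to a background service worker that runs independently of any active tab. It is event-driven rather than truly persistent (Chrome can shut it down when idle, which matters for cold starts later), but while it is alive it runs on its own thread.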
This matters because inference in a service worker doesn't compete with the main thread of whatever page you're viewing. When TraceMind processes a new page you've visited, that computation happens in the background, isolated from your current tab. You won't see frame drops or input lag from the embedding generation, even if it's happening at the same time you're scrolling through a different page.
If the inference ran in a content script injected into the active page, the story would be different. Content scripts run in an isolated JavaScript world, but they execute on the same main thread as the page, so heavy computation there would block rendering and input. TraceMind's architecture avoids this by isolating the ML work.
I've covered the broader topic of on-device AI in browser extensions separately, but the architecture decision is the single most important factor in whether a browser AI extension feels fast or sluggish.
The quantization question: how 30MB beats 120MB in every way
The all-MiniLM-L6-v2 model in its full float32 precision is around 90-120MB depending on the format. That's a significant payload to load into browser memory, and it takes a while to parse and initialize.
The quantized version, using 8-bit integer weights instead of 32-bit floats, is approximately 30MB. That's not just a smaller download. It means:
- Faster initial load (roughly 3-4x)
- Lower memory footprint (roughly 4x less RAM for the weights)
- Faster inference on CPU because more of the model fits in cache
- No meaningful reduction in embedding quality for general text similarity tasks
For text embedding specifically, quantization to int8 has been well-studied and the quality loss is minimal. The cosine similarity between embeddings produced by the quantized and full-precision versions of all-MiniLM-L6-v2 is extremely high, typically above 0.98 on standard benchmarks. For the purpose of finding "pages about React hook optimization patterns" in your history, the difference is undetectable.
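To make the "8-bit instead of 32-bit" trade concrete, here is a toy sketch of symmetric per-tensor int8 quantization. It is illustrative only: the function names are hypothetical, and the schemes ONNX Runtime actually applies to model weights are more involved than this.

```typescript
// Symmetric per-tensor quantization: one float scale plus one byte per value.
function quantizeInt8(values: Float32Array): { data: Int8Array; scale: number } {
  let maxAbs = 0;
  for (let i = 0; i < values.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(values[i]));
  }
  // Map the largest magnitude onto 127 so every value fits in [-127, 127].
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const data = new Int8Array(values.length);
  for (let i = 0; i < values.length; i++) {
    data[i] = Math.round(values[i] / scale);
  }
  return { data, scale };
}

// Recover approximate float values from the quantized form.
function dequantizeInt8(q: { data: Int8Array; scale: number }): Float32Array {
  const out = new Float32Array(q.data.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = q.data[i] * q.scale;
  }
  return out;
}
```

For a 384-value tensor, the float32 copy takes 1,536 bytes and the int8 copy takes 384, and each restored value lands within half a quantization step of the original. That is where the "roughly 4x" memory figure comes from, and why the precision loss tends to be small.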
The reason this matters so much for browser deployment is that a 30MB model is something a user will reasonably wait for on first install. A 120MB model would take significantly longer on an average connection, would consume more memory during a browsing session, and would be more likely to trigger Chrome's aggressive memory management (which can evict extension service workers and force a cold reload).
What the numbers actually look like
I've been tracking TraceMind's resource usage informally over the last six months using Chrome's task manager and the Performance panel.
Memory. The extension's service worker sits at 35-60MB when the model is loaded and warm. This includes the model weights, the ONNX runtime, and working memory for the current inference. If Chrome evicts the service worker due to inactivity, it restarts clean and reloads the model from the cached binary, which takes 2-4 seconds.
CPU during indexing. Processing a new page (extracting text with Mozilla Readability and running one inference pass through the model) takes between 15ms and 80ms depending on page length and hardware. My M2 MacBook Pro is at the fast end. An older Intel machine with no GPU acceleration takes closer to 50-80ms. Neither is user-perceptible.
CPU during search. Embedding the search query and scoring it against the index takes under 100ms for indexes up to several thousand pages. This is fast enough to feel instant in the UI.
Battery. Honestly, I was most skeptical about this one. After six months of daily use on a laptop, I haven't attributed any battery drain to TraceMind. The model only runs during indexing (triggered by page visits) and search (triggered by user action). It's not running a continuous loop. Total CPU time attributable to the extension per hour of browsing is probably under 2 seconds.
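A back-of-the-envelope check makes that estimate plausible. The page and search counts below are hypothetical inputs, not measurements; the per-operation times are taken from the ranges quoted in this post.

```typescript
// Rough CPU time spent on inference per hour of browsing, in seconds.
function cpuSecondsPerHour(
  pagesPerHour: number,
  msPerPage: number,
  searchesPerHour: number,
  msPerSearch: number
): number {
  return (pagesPerHour * msPerPage + searchesPerHour * msPerSearch) / 1000;
}

// Assumed workload: 30 page visits at 40ms each, 5 searches at 100ms each.
const estimatedSeconds = cpuSecondsPerHour(30, 40, 5, 100); // 1.7
```

Even if you double the assumed browsing rate, the total stays in the low single-digit seconds per hour.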
WebGPU: faster but more complicated
The WASM baseline is solid, but WebGPU is noticeably faster for embedding generation where supported. On a machine with a modern GPU and Chrome 120+, embedding a medium-length page drops from ~40ms to ~8ms on my hardware.
The catch is reliability. WebGPU support varies significantly across GPU models and driver versions. Some machines that technically support WebGPU still produce incorrect outputs from certain model operations, which is worse than falling back to WASM. Transformers.js handles the fallback gracefully, but you need to test your specific model carefully before committing to WebGPU as the primary backend.
TraceMind defaults to WASM for consistency, with the option to enable WebGPU acceleration where available. For most users, the WASM performance is already fast enough that WebGPU feels like a nice bonus rather than a necessity.
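One way to test a model before trusting WebGPU is to embed a fixed probe text on both backends and compare the results. The sketch below shows only the decision logic, under stated assumptions: the function names and the 0.01 tolerance are hypothetical, and producing the two probe embeddings is left to whatever inference code you already have.

```typescript
type Backend = "wasm" | "webgpu";

// True when the candidate matches the reference element-wise within a tolerance.
function outputsAgree(
  reference: Float32Array,
  candidate: Float32Array,
  tolerance: number
): boolean {
  if (reference.length !== candidate.length) return false;
  for (let i = 0; i < reference.length; i++) {
    const diff = Math.abs(reference[i] - candidate[i]);
    // A NaN diff fails this comparison, so corrupted outputs are rejected too.
    if (!(diff <= tolerance)) return false;
  }
  return true;
}

// WASM is the default; WebGPU is used only if the user opted in, a WebGPU
// probe embedding exists, and it agrees with the WASM reference.
function chooseBackend(
  userEnabledWebGpu: boolean,
  wasmProbe: Float32Array,
  webGpuProbe: Float32Array | null,
  tolerance: number = 1e-2
): Backend {
  if (!userEnabledWebGpu || webGpuProbe === null) return "wasm";
  return outputsAgree(wasmProbe, webGpuProbe, tolerance) ? "webgpu" : "wasm";
}
```

Treating WASM as the reference means a misbehaving GPU driver degrades speed rather than search quality.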
The cold start problem
One genuine performance concern with browser-based ML is cold start latency. When the extension service worker starts fresh (after browser restart, or after Chrome evicts it due to inactivity), it needs to re-initialize the model before processing the first page.
This initialization takes 2-5 seconds on a typical machine. During that window, if you visit and immediately leave a page, that page might not get indexed. The extension queues missed pages and processes them once the model is ready, but there's a brief gap.
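The queueing idea can be sketched in a few lines. This is an illustration rather than TraceMind's implementation: ColdStartQueue is a hypothetical name, and because it keeps the queue in memory, anything still queued is lost if the service worker is evicted before the model finishes loading.

```typescript
// Buffers work that arrives before the model is ready, then drains it in order.
class ColdStartQueue<T> {
  private pending: T[] = [];
  private handler: ((job: T) => void) | null = null;

  // Called for every visited page; runs immediately once the model is ready.
  submit(job: T): void {
    if (this.handler) {
      this.handler(job);
    } else {
      this.pending.push(job);
    }
  }

  // Called once when model initialization finishes.
  markReady(handler: (job: T) => void): void {
    this.handler = handler;
    const queued = this.pending;
    this.pending = [];
    for (const job of queued) {
      handler(job);
    }
  }

  // Number of jobs still waiting for the model.
  queuedCount(): number {
    return this.pending.length;
  }
}
```

The queuedCount value is also a convenient hook for showing users that initialization is still in progress.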
For search, the cold start means the first query after a long idle period might feel slow. Subsequent queries in the same session are fast because the model stays in memory.
This is a known limitation of service worker-based ML. The alternative, keeping the service worker permanently alive, isn't supported in Chrome's current extension model (Manifest V3), which terminates idle service workers. It's a trade-off I think is reasonable given the overall benefits.
Compression and deduplication: the other performance layer
Model inference isn't the only performance-sensitive part of the system. Storage and retrieval of indexed pages also matters at scale.
TraceMind uses lz-string compression on stored page content, achieving 50-70% size reduction on typical text. Combined with SHA-256-based deduplication (pages you revisit don't get re-indexed), the IndexedDB storage stays manageable even after months of heavy browsing.
The post on building local-first AI goes into more detail on the storage architecture decisions, including why IndexedDB was chosen over alternatives like SQLite-WASM or localStorage.
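The deduplication check itself is small. The sketch below assumes one design among several: it remembers the last content digest per URL and skips re-indexing when the digest is unchanged. DedupIndex is a hypothetical name, and the digest is passed in as a string so the example stays self-contained; in an extension it would typically come from the Web Crypto API (crypto.subtle.digest with "SHA-256").

```typescript
// Tracks the last content digest seen per URL so unchanged pages are skipped.
class DedupIndex {
  private digests: { [url: string]: string } = {};

  // Returns true when the page is new or its content digest has changed,
  // and records the new digest in that case.
  shouldIndex(url: string, contentDigest: string): boolean {
    if (this.digests[url] === contentDigest) {
      return false;
    }
    this.digests[url] = contentDigest;
    return true;
  }
}
```

Hashing the extracted text rather than the raw HTML would keep ads and timestamps from defeating the check, though that choice is also an assumption here.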
Comparison with cloud-based alternatives
The standard objection to local ML is that cloud-based embeddings from a provider like OpenAI or Cohere are higher quality than a 30MB quantized model running in WASM.
That's true in absolute terms. OpenAI's text-embedding-ada-002 and text-embedding-3-small models are larger, trained on more data, and produce higher-quality embeddings for some tasks.
But for browser history search, the comparison isn't really fair. all-MiniLM-L6-v2 was specifically trained and evaluated for semantic similarity tasks. Its embeddings are well-calibrated for the task of matching a natural-language query against a corpus of web page text. In my experience, it performs well enough that I've never thought "I wish this were more accurate." The practical ceiling for this use case is lower than people assume.
The cloud alternative also means sending every page you visit to an API, which is a non-starter for privacy-conscious users. I discussed the broader privacy landscape in my privacy-first extensions comparison, but the core point is that local inference removes a whole category of risk.
What this means for developers building browser AI
If you're considering building an extension that does ML inference, here's what I'd take from six months of running this in production:
Choose a model that's well-suited to the specific task and small enough to load comfortably in a browser. For text similarity, all-MiniLM-L6-v2 quantized hits a sweet spot. Resist the temptation to use a larger model unless your task genuinely requires it.
Run inference in a service worker or offscreen document, never in a content script. The architectural isolation is what keeps the user experience clean.
Implement caching and deduplication aggressively. Most pages users revisit don't change between visits. Skipping redundant inference work is free performance.
Plan for cold starts. Queue work that arrives before the model is ready, and give users feedback when initialization is happening.
Test battery and memory on representative hardware, not just your development machine. The MacBook Pro experience and the mid-range Windows laptop experience can be very different.
Browser-based AI is more practical than most people realize. The models are getting smaller, the runtimes are getting faster, and the browser APIs are maturing. For privacy-sensitive use cases especially, I think local inference is increasingly the right default.
If you want to see this in action, TraceMind is free to install and starts indexing immediately.
