BM25 is a remarkably durable algorithm. BM25 (Best Match 25) grew out of the Okapi retrieval system and took its familiar form in 1994. It is still the default ranking function in Lucene, and therefore in Elasticsearch and Solr, and the same keyword-matching approach sits behind most browser history search tools. Thirty years later, it's everywhere.
And for a lot of queries, it works fine. If you search for "Python dict comprehension syntax" and the page title contains "Python dict comprehension syntax," BM25 will find it. The problem shows up when your query and the document don't share vocabulary, which happens constantly when you're searching by intent rather than by keywords.
What BM25 actually does
BM25 scores each document based on how often your query terms appear in that document (term frequency, or TF), weighted down by how common those terms are across all documents (inverse document frequency, or IDF). Documents where rare query terms appear frequently rank higher.
There are two tuning parameters: k1 controls term-frequency saturation (how quickly repeated occurrences of a term stop adding value), and b controls length normalization (how strongly longer documents are penalized). Typical values for standard text retrieval are 1.2-2.0 for k1 and 0.75 for b.
The math is straightforward and that's a feature. BM25 is fast, requires no GPU, and produces interpretable results: you can always explain why a document ranked where it did. These properties made it the default choice for decades.
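To make the scoring concrete, here is a minimal sketch of BM25 in plain Python. It assumes a naive whitespace tokenizer and a three-document toy corpus, and the function names are made up for illustration; it is not the code any particular search engine runs. The IDF uses a common variant that adds 1 inside the logarithm so the weight never goes negative.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer for illustration: lowercase and split on whitespace.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` (hypothetical helper)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))  # each distinct query term counted once

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: b=0 ignores length, b=1 normalizes fully.
        norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms missing from the document contribute nothing
            # Rare terms get a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating TF: each repeat adds less, at a rate set by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "python dict comprehension syntax",
    "prevent layout shift during hydration",
    "python list comprehension examples",
]
scores = bm25_scores("python dict comprehension syntax", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document matches all four query terms and ranks highest, the third matches two, and the second scores exactly zero because it shares no terms with the query.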
Where BM25 breaks
The failure mode is vocabulary mismatch. If you type "how to stop flickering when React component mounts" but the page you're looking for says "prevent layout shift during hydration," BM25 scores that page close to zero. None of your query terms appear in the document.
Humans understand these are about the same thing. BM25 does not.
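The mismatch is easy to check by hand. This small sketch, again assuming a naive lowercase whitespace tokenizer and a made-up helper name, lists the terms a keyword scorer would have to work with:

```python
def shared_terms(query, document):
    # Terms present in both texts; a keyword scorer can only use these.
    return set(query.lower().split()) & set(document.lower().split())

query = "how to stop flickering when React component mounts"
page = "prevent layout shift during hydration"

print(shared_terms(query, page))  # prints set()
```

With an empty intersection, every term in the BM25 sum is skipped, so no choice of k1 or b can rescue the page.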
Here's a more concrete example from browser history search. You visit a Stack Overflow answer that fixes a race condition in async JavaScript. You remember it was about "event loop blocking" or something like that. You search your history for "async JavaScript problem that blocks the UI." The original page title was "Why does my setTimeout callback run out of order?" with an answer discussing the microtask queue.
BM25 score: near zero. Not a single query term matches the title. Chrome's built-in history search shows nothing useful. You go back to Google and spend 20 minutes finding the same answer again.
This is the Ctrl+H problem. It's not that the search is bad; it's that keyword matching is the wrong tool for intent-based retrieval.
I've written more about this gap in semantic search vs keyword search for knowledge workers if you want a broader framing beyond browser history specifically.
What dense vectors do differently
Dense vector search starts from a different premise: instead of asking "do these words appear in this document," it asks "do these texts mean similar things."
The process works like this. A neural network called a sentence transformer is trained on large datasets of text pairs: sentences that mean the same thing, sentences that are related, sentences that contradict each other. Through training, the network learns to map text onto a point in a high-dimensional space (TraceMind uses 384 dimensions) such that similar meanings land near each other.
After training, you pass any text through the model and get a vector back. The vector isn't interpretable the way BM25 scores are — you can't look at dimension 247 and understand what it represents. But the geometry is reliable: texts that mean similar things have vectors with high cosine similarity.
So "how to stop flickering when React component mounts" and "prevent layout shift during hydration" produce vectors that are close together. A search based on cosine similarity finds the relevant page even with zero keyword overlap.
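The ranking step itself is only a few lines. The sketch below uses hypothetical 4-dimensional vectors as stand-ins for real 384-dimensional embeddings; the numbers are invented to illustrate the geometry and did not come from any model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, NOT real model output.
flicker_query = [0.9, 0.1, 0.3, 0.0]   # "how to stop flickering when React component mounts"
pages = {
    "hydration": [0.8, 0.2, 0.4, 0.1],  # "prevent layout shift during hydration"
    "sourdough": [0.0, 0.9, 0.1, 0.8],  # an unrelated page
}

# Rank pages by similarity to the query, most similar first.
ranked = sorted(
    ((name, cosine_similarity(flicker_query, vec)) for name, vec in pages.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

In a real system the vectors would come from the embedding model, and only the similarity and sorting logic would look like this.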
The all-MiniLM-L6-v2 model
Not all embedding models are equal, and the choice matters a lot for an in-browser context. Large models like OpenAI's text-embedding-3-large produce better embeddings but require server-side API calls, network round-trips, and sending your data to a third party.
TraceMind uses all-MiniLM-L6-v2, a small distilled model that is compact enough to run efficiently in the browser. It produces 384-dimension embeddings. It runs on WebGPU when available (which gives it GPU acceleration without leaving the browser context) and falls back to WebAssembly when WebGPU isn't supported. I've found it produces search quality that's genuinely useful for browser history retrieval, where the context is your own pages and queries tend to be informal and intent-based.
The tradeoff is that all-MiniLM-L6-v2 is less powerful than larger models on complex semantic tasks. It doesn't do multi-step reasoning. But for "find me the page I read about X," it works very well. You can read more about how vector embeddings work in your browser if you want to go deeper on the technical mechanics.
Dense vectors have their own failure modes
It's worth being honest about where dense vector search breaks down too.
Exact-match queries. If you search for a specific error code — "ECONNREFUSED 127.0.0.1:5432" — BM25 will find pages containing that exact string better than dense vectors will. The vector model might return pages about database connection errors in general, which isn't what you want when you need the exact error.
Rare technical terms. Model training data is biased toward general text. Obscure library names, internal project terminology, or very new APIs may not embed meaningfully because the model has little training signal for them.
Specificity and precision. Dense vectors are good at capturing topic but can be fuzzy about specifics. "Python 2 syntax" and "Python 3 syntax" might end up closer together than you'd want because they're both about Python syntax.
These failure modes are real, and they're why using only dense vectors is also the wrong answer.
Why Reciprocal Rank Fusion is the actual solution
The right approach isn't to pick BM25 or dense vectors. It's to run both and merge the results intelligently. TraceMind does this using Reciprocal Rank Fusion (RRF).
RRF works by taking the ranked result lists from multiple retrieval methods and combining them into a single ranking. A document's combined score is based on its rank position in each list, not its raw score. The formula is:
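RRF_score(doc) = sum over methods of 1 / (k + rank_in_method)
where k is a constant (typically 60) that flattens the gap between adjacent ranks, so a single first-place finish can't dominate the fused ranking. A document that a method doesn't return adds nothing for that method. A document that ranks 3rd in dense vector search and 5th in BM25 gets a higher combined score than one that ranks 1st in only one method.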
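Here is a minimal sketch of that fusion step, assuming each retrieval method hands back an ordered list of document IDs. The function name and the IDs are invented for illustration, and this is not TraceMind's actual code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one list (hypothetical helper)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; documents absent from a list add nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented result lists: "both" is 3rd for vectors and 5th for BM25.
vector_results = ["vec-1", "vec-2", "both", "vec-4", "vec-5"]
bm25_results = ["kw-1", "kw-2", "kw-3", "kw-4", "both"]

fused = reciprocal_rank_fusion([vector_results, bm25_results])
for doc_id, score in fused[:3]:
    print(f"{score:.5f}  {doc_id}")
```

Working the arithmetic by hand: "both" scores 1/63 + 1/65, roughly 0.0313, while a document ranked 1st in only one list scores 1/61, roughly 0.0164. Appearing in both lists outweighs a single first place.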
In practice, this means:
- Intent queries ("that thing about React hydration") are dominated by vector results, which is correct
- Exact-match queries ("ECONNREFUSED 5432") are dominated by keyword results, which is also correct
- Queries that are both intent-based and keyword-rich get results that rank well on both signals
I've tested this on my own indexed history (around 800 pages, heavy developer content). RRF consistently beats either method alone. The worst cases are specific error code searches where the vector search occasionally pulls in tangentially related pages. In those cases, BM25 is the anchor that keeps the result useful.
What this means for browser history search
The Ctrl+H mental model is broken for knowledge workers. We don't remember what we searched for, we don't remember exact titles, and we search by how we think about the topic, not how the author wrote about it.
A hybrid BM25 + dense vector system with RRF is the correct architecture for this use case. It handles the full range of how humans actually search their history: fuzzy intent queries, exact error codes, half-remembered article titles, and everything in between.
If you want to see this in practice rather than just reading about it, TraceMind is a Chrome extension (also works on Brave and Edge) that applies this exact architecture to your browser history. Everything runs locally. Sub-100ms latency. Your data stays on your machine.
The best Chrome history extensions for 2026 post puts this approach in context against simpler tools if you want to understand the full landscape before deciding.
