Local AI tools have a speed problem that does not get discussed enough. Running an embedding model in the browser via WASM is genuinely fast. Querying 50 stored vectors against a new embedding is trivially fast. But at 5,000 pages, a naive linear scan starts to feel sluggish. At 50,000 pages, it is unusable for interactive search.
This is the problem that vector indexing solves, and it is the reason TraceMind uses Voy rather than a simple array scan.
The baseline: what makes vector search slow
When you type a search query, TraceMind generates an embedding for that query using all-MiniLM-L6-v2. The result is a 384-dimensional vector, a list of 384 floating-point numbers that encode the semantic meaning of your query.
To find relevant pages, you need to find stored page embeddings that are close to the query embedding in that 384-dimensional space. "Close" means having a high cosine similarity or low Euclidean distance.
The brute-force approach: compute the distance between the query vector and every stored vector, sort by distance, return the top results. This is exact, correct, and completely predictable. It is also O(n) in the number of stored vectors, which means it scales linearly with your history size.
For 100 pages, brute force is fine. For 10,000 pages, a distance computation on each of 10,000 x 384-dimensional vectors starts to consume real CPU time. For a browser extension targeting sub-100ms end-to-end search latency, linear scan becomes a bottleneck before you hit the kind of history sizes regular users accumulate over a year.
If you want to understand the embedding layer itself before diving into the indexing layer, the post on how vector embeddings work in your browser covers it thoroughly.
K-d trees: space-partitioning for fast nearest-neighbor lookup
A k-d tree (k-dimensional tree) is a space-partitioning data structure that divides a high-dimensional space into nested regions, building a binary tree from the dataset.
Construction works like this: take all your data points, pick a dimension to split on (often cycling through dimensions), find the median value in that dimension, put all points below the median in the left subtree and all points above in the right subtree. Recurse on each subtree until each node contains a small number of points (a leaf node).
Query works like this: start at the root, traverse the tree by comparing the query point to each split value, prune subtrees that cannot possibly contain a closer neighbor than the best candidate found so far. The pruning is what provides the speedup. Instead of checking every point, you check a fraction of them.
The catch is dimensionality. K-d trees work well in low-to-moderate dimensions (say, 2 to 20). As dimensionality increases, the pruning becomes less effective because there are more dimensions to fail on, and more subtrees that might contain closer neighbors. This is the "curse of dimensionality."
At 384 dimensions, k-d trees are not as efficient as HNSW (Hierarchical Navigable Small World) graphs or IVFPQ (Inverted File with Product Quantization) indexes, which are designed for higher-dimensional vector spaces. But HNSW and IVFPQ are significantly more complex to implement, and their advantages are most pronounced at very large dataset sizes.
For a browser extension managing thousands to tens of thousands of pages, a well-implemented k-d tree via Voy hits the right tradeoff: fast enough to maintain sub-100ms latency, simple enough to run reliably in a WASM environment without external dependencies.
What Voy actually is
Voy is a Rust-implemented vector similarity search library compiled to WASM, which means it runs natively in the browser without any server-side component. The WASM compilation is important: it gives Voy near-native performance while staying entirely within the browser security model.
The library exposes a straightforward API: create an index, add items (each item is an ID plus a vector), query the index with a vector to get the nearest neighbors, serialize and deserialize the index for persistence.
What makes Voy practical for TraceMind is the serialization. The index can be saved to IndexedDB and loaded back on the next browser session without rebuilding it from scratch. As you continue browsing and TraceMind indexes new pages, those new vectors are incrementally added to the existing index rather than requiring a full rebuild.
This incremental update model is critical for a background extension. You do not want the extension running expensive batch operations every time you start your browser. The index stays current as new pages come in, one page at a time.
How Voy integrates with TraceMind's search pipeline
TraceMind uses Reciprocal Rank Fusion (RRF) to combine two ranked lists:
-
Voy's approximate nearest-neighbor results — finds the top-k stored page embeddings closest to the query embedding. This is the semantic signal: pages conceptually related to your query even if they share no exact keywords.
-
FlexSearch full-text results — a BM25-like text index that matches exact terms, partial terms, and document frequency patterns. This catches proper nouns, version numbers, and specific terminology that semantic similarity can sometimes blur over.
RRF takes the two ranked lists and merges them using a formula that rewards documents appearing high in both lists and penalizes documents that only appear well in one. The merged list is the final search ranking.
This hybrid approach matters in practice. Pure semantic search sometimes surfaces documents that are thematically related but not specifically relevant. Pure keyword search misses documents that discuss the same concept with different vocabulary. The combination handles both failure modes better than either alone.
The post on building local-first AI with IndexedDB and WASM goes deeper on the storage and indexing architecture if you want the full picture.
The approximate nearest-neighbor tradeoff
One thing worth being explicit about: approximate nearest-neighbor search does not always return the exact closest vectors. It returns vectors that are very close with high probability, but with some chance of missing a slightly closer neighbor.
For semantic document retrieval, this is an acceptable tradeoff. The goal is not to find the single mathematically closest embedding. The goal is to surface a set of highly relevant documents. Approximate search typically achieves recall rates above 95% at query speeds orders of magnitude faster than exact search. The documents you miss at 95% recall are usually near the relevance boundary anyway.
The recall-speed tradeoff is tunable in most approximate nearest-neighbor implementations. Tighter recall requirements mean more candidate examination and slower queries. Looser recall requirements mean faster queries with slightly more misses.
For interactive search in a browser extension, the fast end of this tradeoff is the right default. Sub-100ms feels instant. 500ms feels slow. The few documents missed at high-speed approximate search are unlikely to materially affect the usefulness of the results.
Why WASM is the right runtime for this
The alternative to WASM for compute-heavy browser work is JavaScript. And JavaScript is genuinely fast for many tasks, but numerical linear algebra (which is what vector similarity computation reduces to) benefits substantially from typed arrays, SIMD instructions, and tight memory layout. WASM provides all of these in a way that JavaScript engines cannot reliably guarantee.
Voy's Rust implementation uses Rust's standard numeric types and memory model, compiled to WASM with near-native floating-point performance. The result is that vector distance computations in Voy run significantly faster than equivalent JavaScript implementations, and with more predictable latency characteristics.
The WASM module is loaded once at extension startup, and subsequent queries are pure compute with no network overhead. This is the performance profile you want for interactive search: fast, predictable, offline-capable.
What happens when the index gets very large
I think it is worth being honest about the scaling ceiling here.
At a few thousand pages, Voy's k-d tree is fast and effective. At tens of thousands of pages, query time starts to grow, though still manageable for typical use. At hundreds of thousands of pages, the k-d tree's dimensionality curse becomes more pronounced and HNSW-class indexes would be the right choice.
For a personal browser history extension, most users will accumulate thousands to maybe tens of thousands of pages over a year of active browsing. TraceMind's free plan retains 365 days of history. The practical corpus size for most users sits well within the range where Voy performs well.
If TraceMind ever needed to scale to much larger corpora, the right path would be to swap the Voy index for a more sophisticated approximate nearest-neighbor implementation. But that is an engineering decision that does not need to be made yet, and making it prematurely would add complexity without improving the experience for actual users.
This is the kind of architectural decision that TraceMind's local-first approach was built around: pick the technology that fits the current problem well, and do not over-engineer for scale that does not exist yet.
The practical result
The end effect of all this, k-d trees, WASM compilation, Voy's incremental index updates, RRF fusion, is that search in TraceMind feels fast regardless of how large your history has grown.
You type a query. The query gets embedded. The Voy index returns approximate nearest neighbors. FlexSearch returns keyword matches. RRF merges them. Results appear. The whole pipeline targets sub-100ms, which is below the threshold where most people perceive latency as lag.
That responsiveness is what makes semantic search useful as a daily tool rather than an occasional novelty. Slow search gets abandoned. Fast search becomes a reflex.
The underlying data structures are the reason that speed is achievable on consumer hardware, in a browser tab, without a server.
