Search Pipeline Architecture

Smart Search v0.12.0 -- multi-stage hybrid retrieval with cross-encoder reranking and MMR diversity

Stage 0: Query Input (always active)

Raw query from MCP tool, REST API, CLI, or desktop Quick Search. Normalized: leading/trailing punctuation stripped, internal punctuation preserved.
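The normalization described above can be sketched in a few lines; this is an illustrative reading of "leading/trailing punctuation stripped, internal punctuation preserved", not the actual entry-point code:

```python
import string

def normalize_query(raw: str) -> str:
    """Strip leading/trailing punctuation and whitespace; keep internal punctuation."""
    return raw.strip(string.whitespace + string.punctuation)
```

For example, `normalize_query("  what is RRF?  ")` yields `"what is RRF"`, while the internal hyphen in `"fts5-query!"` survives as `"fts5-query"`.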

Stage 1: Query Preprocessor (query_preprocessor.py)

Splits into two paths. FTS5 path: removes stopwords, OR-joins multi-term queries, preserves quoted phrases. Embedding path: whitespace normalization only -- the embedding model handles semantic context.

Stopwords: ~60 English
Latency: <0.1ms
Memory: 0 MB
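A minimal sketch of the two-path split, assuming a toy stopword set (the real list has ~60 entries) and illustrative function names rather than the actual `query_preprocessor.py` internals:

```python
import re

# Toy stopword set for illustration; the real preprocessor uses ~60 English stopwords.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "or"}

def preprocess(query: str) -> tuple[str, str]:
    """Return (fts5_query, embedding_text) for the two retrieval paths."""
    # Embedding path: whitespace normalization only.
    embed_text = " ".join(query.split())
    # FTS5 path: keep quoted phrases verbatim, drop stopwords, OR-join the rest.
    phrases = re.findall(r'"[^"]+"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    terms = [t for t in rest.split() if t.lower() not in STOPWORDS]
    fts_query = " OR ".join(phrases + terms)
    return fts_query, embed_text
```

For example, `preprocess('error in "token bucket" limiter')` yields `'"token bucket" OR error OR limiter'` for FTS5 and the whitespace-normalized original for the embedder.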
Stage 2a: FTS5 BM25 (fts.py)

Lexical retrieval via SQLite FTS5. Porter stemming handles inflections. BM25 ranks by term frequency, inverse document frequency, and length normalization.

score = BM25(tf, idf, dl, avgdl, k1=1.2, b=0.75)
Latency: <5ms
Over-fetch: 5x limit
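A self-contained sketch of this stage against an in-memory database (the schema and data are illustrative, not the actual `fts.py` tables). Note that SQLite's `bm25()` auxiliary function returns negated scores, so better matches sort first in ascending order, and its defaults are the k1=1.2, b=0.75 shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Porter tokenizer so "ranking" matches "ranks", as described above.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(text, tokenize='porter')")
conn.executemany("INSERT INTO chunks(text) VALUES (?)", [
    ("ranking documents with bm25",),
    ("vector embeddings for dense retrieval",),
    ("bm25 ranks by term frequency and length",),
])
# OR-joined query from the preprocessor; ascending bm25() = best match first.
rows = conn.execute(
    "SELECT rowid, bm25(chunks) FROM chunks WHERE chunks MATCH ? "
    "ORDER BY bm25(chunks)",
    ("ranking OR bm25",),
).fetchall()
```

Both BM25-bearing chunks come back (Porter stemming matches "ranking" against "ranks"); the embeddings-only chunk does not.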
Stage 2b: Vector Search (store.py + embedder.py)

Dense retrieval via LanceDB. Query embedded by snowflake-arctic-embed-m-v2.0 (256-dim, int8 ONNX), compared to indexed chunk embeddings via cosine similarity.

similarity = 1 - cosine_distance(query_vec, chunk_vec)
Latency: ~50ms
Over-fetch: 5x limit
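The similarity formula above reduces to a normalized dot product. A NumPy sketch of what LanceDB computes under the hood (function name and brute-force scan are illustrative; the real store uses LanceDB's index):

```python
import numpy as np

def cosine_similarity_search(query_vec, chunk_vecs, limit):
    """Rank chunks by cosine similarity = 1 - cosine_distance."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per chunk
    top = np.argsort(-sims)[:limit]   # highest similarity first
    return [(int(i), float(sims[i])) for i in top]
```

A query vector aligned with chunk 0 ranks it first with similarity 1.0, with the diagonal chunk second at ~0.707.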
Stage 3: Reciprocal Rank Fusion (fusion.py)

Merges both ranked lists without requiring score calibration. Documents appearing in both lists receive scores from both, naturally boosting consensus results. Scores normalized to 0-1.

RRF(d) = Σ 1 / (k + rank(d))  where k = 60
Constant k: 60
Latency: <1ms
Reference: Cormack et al. 2009
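The RRF formula above is a few lines of code. This sketch assumes scores are normalized by the best achievable value (rank 1 in both lists); the actual normalizer in `fusion.py` may differ:

```python
def rrf_fuse(bm25_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc ids (best first)."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc in enumerate(ranked, start=1):
            # Docs in both lists accumulate both terms, boosting consensus.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Normalize to 0-1 against the best achievable score (rank 1 in both lists).
    best = 2.0 / (k + 1)
    return sorted(((d, s / best) for d, s in scores.items()), key=lambda x: -x[1])
```

With `rrf_fuse(["a", "b", "c"], ["b", "d"])`, "b" wins despite never ranking first in either list, because it appears in both.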
Stage 4: Cross-Encoder Reranking (reranker.py, configurable)

Jointly scores each (query, chunk) pair through a cross-encoder transformer. Unlike the bi-encoder (stage 2b) which embeds query and document independently, the cross-encoder sees both texts together with full attention, catching subtle relevance signals. Reranks the top-20 fusion results.

score = CrossEncoder(concat(query, [SEP], chunk_text))
Model: TinyBERT-L-2-v2
Parameters: 14M
Latency: 30-60ms per 20 pairs
Memory: ~50MB (lazy-loaded)
Reference: Nogueira & Cho 2019
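The reranking step itself is model-agnostic: score each (query, chunk) pair jointly, then sort. In this sketch `score_fn` stands in for the TinyBERT cross-encoder's forward pass; the function name and shape are illustrative, not the `reranker.py` API:

```python
def rerank(query, chunks, score_fn, top_k=20):
    """Re-score the top fusion results with a joint (query, chunk) scorer.

    score_fn(query, chunk) stands in for the cross-encoder, which attends
    over both texts together rather than embedding them independently.
    """
    candidates = chunks[:top_k]            # only the top-20 fusion results
    scored = [(c, score_fn(query, c)) for c in candidates]
    return sorted(scored, key=lambda x: -x[1])
```

Even a crude word-overlap scorer demonstrates the reordering; the real cross-encoder replaces it with a learned relevance score.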
Stage 5: MMR Diversity Selection (mmr.py, configurable)

Maximal Marginal Relevance eliminates redundant results. It greedily selects the next result that maximizes relevance while penalizing similarity to already-selected results, ensuring 10 results cover 10 topics, not 3.

MMR(d) = λ · relevance(d) - (1 - λ) · max_sim(d, selected)  where λ = 0.8
Lambda: 0.8
Latency: <1ms
Memory: 0 MB
Reference: Carbonell & Goldstein 1998
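The greedy loop implied by the MMR formula can be sketched directly; the function signature and similarity-matrix input are illustrative, not the `mmr.py` interface:

```python
def mmr_select(relevance, pairwise_sim, n, lam=0.8):
    """Greedy Maximal Marginal Relevance over candidate indices.

    relevance[i]: relevance score of candidate i.
    pairwise_sim[i][j]: similarity between candidates i and j.
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < n:
        def mmr(i):
            # Penalize by the most similar already-selected result.
            penalty = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lambda trades relevance against diversity: with two near-duplicate top candidates, a low lambda skips the duplicate in favor of a less relevant but novel result, while a high lambda keeps both.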
Ranked Results

Top-N results with rank, normalized score (0-1), chunk text, source path, and section path. Delivered via MCP tools, REST API, CLI, or desktop Quick Search.

Component                | RAM (active)       | RAM (idle)         | Latency    | On Disk
Bi-encoder (snowflake)   | ~400 MB            | 0 (unloads)        | ~50ms      | 297 MB
Cross-encoder (TinyBERT) | ~50 MB             | 0 (unloads)        | 30-60ms    | ~15 MB
FTS5 index               | shared with SQLite | shared with SQLite | <5ms       | ~ corpus size
RRF + MMR                | negligible         | 0                  | <2ms       | 0
Total (search)           | ~650 MB peak       | ~200 MB            | ~100-150ms | ~312 MB

References

  1. Cormack, G.V., Clarke, C.L.A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR '09.
  2. Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
  3. Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021.
  4. Carbonell, J. & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR '98.
  5. Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.