Search Pipeline Architecture

Smart Search v0.12.0 -- multi-stage hybrid retrieval with cross-encoder reranking and MMR diversity

Stage 0: Query Input (always active)

Raw query from MCP tool, REST API, CLI, or desktop Quick Search. Normalized: leading/trailing punctuation stripped, internal punctuation preserved.
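The normalization described above can be sketched in a few lines; this is an illustrative reading of "leading/trailing punctuation stripped, internal punctuation preserved", not the actual entry-point code:

```python
import string

def normalize_query(raw: str) -> str:
    """Strip leading/trailing punctuation and whitespace; keep internal punctuation."""
    return raw.strip(string.whitespace + string.punctuation)
```

For example, `normalize_query("  what is RRF?  ")` yields `"what is RRF"`, while the internal hyphen in `"fts5-query!"` survives as `"fts5-query"`.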

Stage 1: Query Preprocessor (query_preprocessor.py)

Splits into two paths. FTS5 path: removes stopwords, OR-joins multi-term queries, preserves quoted phrases. Embedding path: whitespace normalization only -- the embedding model handles semantic context.

Stopwords: ~60 English
Latency: <0.1ms
Memory: 0 MB
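A minimal sketch of the two-path split, assuming a toy stopword set (the real list has ~60 entries) and illustrative function names rather than the actual `query_preprocessor.py` internals:

```python
import re

# Toy stopword set for illustration; the real preprocessor uses ~60 English stopwords.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "or"}

def preprocess(query: str) -> tuple[str, str]:
    """Return (fts5_query, embedding_text) for the two retrieval paths."""
    # Embedding path: whitespace normalization only.
    embed_text = " ".join(query.split())
    # FTS5 path: keep quoted phrases verbatim, drop stopwords, OR-join the rest.
    phrases = re.findall(r'"[^"]+"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    terms = [t for t in rest.split() if t.lower() not in STOPWORDS]
    fts_query = " OR ".join(phrases + terms)
    return fts_query, embed_text
```

For example, `preprocess('error in "token bucket" limiter')` yields `'"token bucket" OR error OR limiter'` for FTS5 and the whitespace-normalized original for the embedder.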
Stage 2a: FTS5 BM25 (fts.py)

Lexical retrieval via SQLite FTS5. Porter stemming handles inflections. BM25 ranks by term frequency, inverse document frequency, and length normalization.

score = BM25(tf, idf, dl, avgdl, k1=1.2, b=0.75)
Latency: <5ms
Over-fetch: 5x limit
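A self-contained sketch of this stage against an in-memory database (the schema and data are illustrative, not the actual `fts.py` tables). Note that SQLite's `bm25()` auxiliary function returns negated scores, so better matches sort first in ascending order, and its defaults are the k1=1.2, b=0.75 shown above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Porter tokenizer so "ranking" matches "ranks", as described above.
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(text, tokenize='porter')")
conn.executemany("INSERT INTO chunks(text) VALUES (?)", [
    ("ranking documents with bm25",),
    ("vector embeddings for dense retrieval",),
    ("bm25 ranks by term frequency and length",),
])
# OR-joined query from the preprocessor; ascending bm25() = best match first.
rows = conn.execute(
    "SELECT rowid, bm25(chunks) FROM chunks WHERE chunks MATCH ? "
    "ORDER BY bm25(chunks)",
    ("ranking OR bm25",),
).fetchall()
```

Both BM25-bearing chunks come back (Porter stemming matches "ranking" against "ranks"); the embeddings-only chunk does not.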
Stage 2b: Vector Search (store.py + embedder.py)

Dense retrieval via LanceDB. Query embedded by snowflake-arctic-embed-m-v2.0 (256-dim, int8 ONNX), compared to indexed chunk embeddings via cosine similarity.

similarity = 1 - cosine_distance(query_vec, chunk_vec)
Latency: ~50ms
Over-fetch: 5x limit
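The similarity formula above reduces to a normalized dot product. A NumPy sketch of what LanceDB computes under the hood (function name and brute-force scan are illustrative; the real store uses LanceDB's index):

```python
import numpy as np

def cosine_similarity_search(query_vec, chunk_vecs, limit):
    """Rank chunks by cosine similarity = 1 - cosine_distance."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per chunk
    top = np.argsort(-sims)[:limit]   # highest similarity first
    return [(int(i), float(sims[i])) for i in top]
```

A query vector aligned with chunk 0 ranks it first with similarity 1.0, with the diagonal chunk second at ~0.707.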
Stage 3: Reciprocal Rank Fusion (fusion.py)

Merges both ranked lists without requiring score calibration. Documents appearing in both lists receive scores from both, naturally boosting consensus results. Scores normalized to 0-1.

RRF(d) = Σ 1 / (k + rank(d))  where k = 60
Constant k: 60
Latency: <1ms
Reference: Cormack et al. 2009
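The RRF formula above is a few lines of code. This sketch assumes scores are normalized by the best achievable value (rank 1 in both lists); the actual normalizer in `fusion.py` may differ:

```python
def rrf_fuse(bm25_ranked, vector_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc ids (best first)."""
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, doc in enumerate(ranked, start=1):
            # Docs in both lists accumulate both terms, boosting consensus.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Normalize to 0-1 against the best achievable score (rank 1 in both lists).
    best = 2.0 / (k + 1)
    return sorted(((d, s / best) for d, s in scores.items()), key=lambda x: -x[1])
```

With `rrf_fuse(["a", "b", "c"], ["b", "d"])`, "b" wins despite never ranking first in either list, because it appears in both.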
Stage 4: Cross-Encoder Reranking (reranker.py, configurable)

Jointly scores each (query, chunk) pair through a cross-encoder transformer. Unlike the bi-encoder (stage 2b) which embeds query and document independently, the cross-encoder sees both texts together with full attention, catching subtle relevance signals. Reranks the top-20 fusion results.

score = CrossEncoder(concat(query, [SEP], chunk_text))
Model: TinyBERT-L-2-v2
Parameters: 14M
Latency: 30-60ms per 20 pairs
Memory: ~50MB (lazy-loaded)
Reference: Nogueira & Cho 2019
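The reranking step itself is model-agnostic: score each (query, chunk) pair jointly, then sort. In this sketch `score_fn` stands in for the TinyBERT cross-encoder's forward pass; the function name and shape are illustrative, not the `reranker.py` API:

```python
def rerank(query, chunks, score_fn, top_k=20):
    """Re-score the top fusion results with a joint (query, chunk) scorer.

    score_fn(query, chunk) stands in for the cross-encoder, which attends
    over both texts together rather than embedding them independently.
    """
    candidates = chunks[:top_k]            # only the top-20 fusion results
    scored = [(c, score_fn(query, c)) for c in candidates]
    return sorted(scored, key=lambda x: -x[1])
```

Even a crude word-overlap scorer demonstrates the reordering; the real cross-encoder replaces it with a learned relevance score.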
Stage 5: MMR Diversity Selection (mmr.py, configurable)

Maximal Marginal Relevance eliminates redundant results. It greedily selects the next result that maximizes relevance while penalizing similarity to already-selected results, ensuring 10 results cover 10 topics, not 3.

MMR(d) = λ · relevance(d) - (1 - λ) · max_sim(d, selected)  where λ = 0.8
Lambda: 0.8
Latency: <1ms
Memory: 0 MB
Reference: Carbonell & Goldstein 1998
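The greedy loop implied by the MMR formula can be sketched directly; the function signature and similarity-matrix input are illustrative, not the `mmr.py` interface:

```python
def mmr_select(relevance, pairwise_sim, n, lam=0.8):
    """Greedy Maximal Marginal Relevance over candidate indices.

    relevance[i]: relevance score of candidate i.
    pairwise_sim[i][j]: similarity between candidates i and j.
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < n:
        def mmr(i):
            # Penalize by the most similar already-selected result.
            penalty = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Lambda trades relevance against diversity: with two near-duplicate top candidates, a low lambda skips the duplicate in favor of a less relevant but novel result, while a high lambda keeps both.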
Ranked Results

Top-N results with rank, normalized score (0-1), chunk text, source path, and section path. Delivered via MCP tools, REST API, CLI, or desktop Quick Search.

Component                | RAM (active)       | RAM (idle)         | Latency    | On Disk
Bi-encoder (snowflake)   | ~400 MB            | 0 (unloads)        | ~50ms      | 297 MB
Cross-encoder (TinyBERT) | ~50 MB             | 0 (unloads)        | 30-60ms    | ~15 MB
FTS5 index               | shared with SQLite | shared with SQLite | <5ms       | ~ corpus size
RRF + MMR                | negligible         | 0                  | <2ms       | 0
Total (search)           | ~650 MB peak       | ~200 MB            | ~100-150ms | ~312 MB

References

  1. Cormack, G.V., Clarke, C.L.A., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR '09.
  2. Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
  3. Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021.
  4. Carbonell, J. & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR '98.
  5. Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.