The Problem with Traditional Search
When you’re indexing 10M+ documents, keyword-based search falls apart. Users don’t search the way documents are written. The semantic gap between query intent and document content is where traditional search engines fail.
Why Transformers Change Everything
Transformer-based embedding models map text into high-dimensional vector spaces where semantic similarity becomes geometric proximity. Two sentences with completely different words but the same meaning end up near each other in this space.
```python
from sentence_transformers import SentenceTransformer

# Bi-encoder: one forward pass per text, 384-dim output vectors
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = ["chunked document text", "..."]  # stand-in for your corpus
embeddings = model.encode(documents, batch_size=256)  # shape: (n_docs, 384)
```
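To see the proximity claim concretely, compare two paraphrases against an unrelated sentence. The sentences below are illustrative, not from our corpus; `util.cos_sim` is the cosine-similarity helper that ships with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two paraphrases with almost no word overlap, plus an unrelated sentence
a = model.encode("How do I get my money back?")
b = model.encode("What is the refund procedure?")
c = model.encode("The weather in Berlin is mild in spring.")

print(util.cos_sim(a, b))  # high similarity despite different wording
print(util.cos_sim(a, c))  # much lower similarity
```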
Architecture Overview
Our production pipeline has three stages:
- Ingestion — Documents are chunked, embedded, and indexed into a FAISS vector store
- Query Processing — User queries are expanded, embedded, and matched against the index
- Re-ranking — Top-k candidates are re-ranked using a cross-encoder for precision
The key insight: use a lightweight bi-encoder for initial retrieval (fast, sub-linear approximate nearest-neighbor lookup against the index) and a heavyweight cross-encoder for re-ranking (accurate, but one full forward pass per candidate, so cost grows linearly with the number of candidates).
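To make the two-stage shape concrete, here is a minimal sketch assuming a flat FAISS index and the publicly available ms-marco cross-encoder. The toy corpus, the cross-encoder choice, and the `search` helper are illustrative, not our production code:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

documents = [
    "FAISS is a library for efficient similarity search.",
    "Cross-encoders score query-document pairs jointly.",
    "Berlin has mild weather in spring.",
]  # stand-in for the real chunked corpus

# Ingestion: embed, normalize, and index (cosine via inner product)
doc_emb = bi_encoder.encode(documents, batch_size=256)
doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb.astype(np.float32))

def search(query: str, k: int = 50, top_n: int = 5):
    # Stage 1: cheap bi-encoder retrieval of k candidates
    q = bi_encoder.encode([query])
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    _, ids = index.search(q.astype(np.float32), min(k, index.ntotal))
    candidates = [documents[i] for i in ids[0]]
    # Stage 2: expensive cross-encoder scores each (query, candidate) pair
    scores = cross_encoder.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return ranked[:top_n]
```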
Achieving Sub-100ms Latency
Three optimizations got us under 100ms p99; each is sketched after the list:
- HNSW indexing in FAISS with `efSearch=64` — trades ~2% recall for 10x speed
- Query caching with semantic deduplication — if a similar query was seen recently, serve cached results
- Model quantization — INT8 quantization of the embedding model reduced inference time by 3x with negligible quality loss
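Swapping the flat index for HNSW is a construction-time change. A sketch, assuming the same 384-dim embeddings; `M=32` and `efConstruction=200` below are common defaults rather than values from our deployment (only `efSearch=64` comes from the numbers above):

```python
import faiss

d = 384  # all-MiniLM-L6-v2 embedding dimension
M = 32   # HNSW graph connectivity; common default, not tuned here

index = faiss.IndexHNSWFlat(d, M, faiss.METRIC_INNER_PRODUCT)
index.hnsw.efConstruction = 200  # build-time quality/speed trade-off
index.hnsw.efSearch = 64         # query-time setting: ~2% recall loss
                                 # for roughly 10x faster search
# index.add(doc_emb)  # same normalized embeddings as in the pipeline sketch
```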
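For the semantic query cache, one straightforward approach is to keep normalized embeddings of recent queries and serve the stored results when cosine similarity clears a threshold. The 0.95 threshold and FIFO eviction here are illustrative choices, not our production policy:

```python
import numpy as np

class SemanticCache:
    """Serve cached results when a new query embedding is close enough
    to a previously seen one (threshold and eviction are illustrative)."""

    def __init__(self, threshold: float = 0.95, max_entries: int = 10_000):
        self.threshold = threshold
        self.max_entries = max_entries
        self.embeddings = []  # normalized query vectors
        self.results = []     # cached result objects, same order

    def get(self, query_emb: np.ndarray):
        if not self.embeddings:
            return None
        q = query_emb / np.linalg.norm(query_emb)
        sims = np.stack(self.embeddings) @ q  # cosine sims to cached queries
        best = int(np.argmax(sims))
        return self.results[best] if sims[best] >= self.threshold else None

    def put(self, query_emb: np.ndarray, result) -> None:
        if len(self.embeddings) >= self.max_entries:  # naive FIFO eviction
            self.embeddings.pop(0)
            self.results.pop(0)
        self.embeddings.append(query_emb / np.linalg.norm(query_emb))
        self.results.append(result)
```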
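INT8 can be reached several ways; the sketch below uses PyTorch dynamic quantization of the Linear layers, assuming CPU inference. ONNX Runtime or Intel-optimized backends are alternatives, and we are not claiming this exact route:

```python
import torch
from torch import nn
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Dynamic INT8 quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time (CPU only)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
embeddings = quantized.encode(["example query"])
```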
Lessons Learned
Building search is humbling. The gap between a demo that impresses and a system that serves millions of queries reliably is enormous. Start with evaluation metrics, not architecture diagrams.