Architecture
How LegacyLens turns a natural language question into a grounded, streaming answer from legacy code.
Pipeline
User Query
│
┌─────▼─────────┐
│ Alias Norm. │ Normalize legacy identifiers (COBOL paragraphs, copybooks)
└─────┬─────────┘
│
┌─────▼─────────┐
│ Parallel │ Dense: voyage-code-3 (1024d, 32K context)
│ Encoding │ Sparse: SPLADE via FastEmbed + code identifier tokenizer
└─────┬─────────┘
│
┌─────▼─────────┐
│ Qdrant │ Hybrid search with Reciprocal Rank Fusion (RRF)
│ Hybrid Search │ Boolean metadata filtering by codebase + language
└─────┬─────────┘
│ top-20
┌─────▼─────────┐
│ Cross-Encoder │ ms-marco-MiniLM-L-6-v2 reranker
│ Rerank │ Re-scores by semantic relevance to query
└─────┬─────────┘
│ top-5
┌─────▼─────────┐
│ LLM │ Cerebras Qwen3-Coder (primary, 2000+ t/s)
│ Generation │ Groq fallback, streaming via SSE
└─────┬─────────┘
│
Streaming Response + Source Citations
Design Decisions
Custom Pipeline, not LangChain
~3K LOC custom RAG pipeline
LangChain adds five abstraction layers around what amounts to ~100 lines of retrieval + generation glue. Custom code means full control over the embedding strategy, search tuning, and streaming behavior.
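The glue in question is roughly this shape. A minimal sketch with every stage stubbed out; all function names and return values here are illustrative, not LegacyLens's actual API:

```python
# Sketch of the retrieval + generation glue a custom pipeline replaces
# LangChain with. Every function body is a stand-in for the real stage.

def normalize_aliases(query: str) -> str:
    return query  # stand-in for legacy-identifier normalization

def encode(query: str):
    # stand-in for parallel dense (voyage-code-3) + sparse (SPLADE) encoding
    return [0.0], {"calculate-totals": 1.0}

def hybrid_search(dense, sparse, top_k: int = 20):
    # stand-in for the Qdrant RRF query
    return [{"code": "CALCULATE-TOTALS.", "path": "payroll.cbl"}] * top_k

def rerank(query: str, hits, top_k: int = 5):
    # stand-in for cross-encoder rescoring
    return hits[:top_k]

def generate(query: str, chunks):
    # stand-in for the streaming LLM call
    yield f"Answer grounded in {len(chunks)} chunks."

def answer(query: str) -> str:
    query = normalize_aliases(query)
    dense, sparse = encode(query)
    chunks = rerank(query, hybrid_search(dense, sparse, top_k=20), top_k=5)
    return "".join(generate(query, chunks))
```

Each stage is a plain function call, so swapping an embedding model or search parameter is a one-line change rather than a framework adapter.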
Qdrant with Native Hybrid Search
Qdrant over ChromaDB, pgvector, Pinecone
Only Qdrant supports dense + sparse vectors with built-in Reciprocal Rank Fusion in a single query. ChromaDB has no hybrid search, pgvector requires a manual RRF implementation, and Pinecone is paid.
voyage-code-3 Embeddings
1024d, 32K context window
+13.8% code retrieval quality vs jina-v2. The 32K context window handles large COBOL paragraphs without truncation, and the 200M-token free tier covers both development and production.
SPLADE Sparse Vectors
SPLADE via FastEmbed + custom code identifier tokenizer
Dense embeddings miss exact identifier matches (PERFORM CALCULATE-TOTALS). SPLADE provides lexical precision where semantic search fails. Custom tokenizer handles COBOL naming conventions.
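The identifier tokenizer's job can be sketched in a few lines. A hypothetical helper (the function name and exact token order are assumptions, not the real implementation): emit the whole identifier plus its hyphen/underscore subtokens so the sparse index can match both exact and partial mentions:

```python
import re

def tokenize_identifier(name: str) -> list[str]:
    """Sketch of a code-identifier tokenizer for sparse indexing.
    COBOL names are hyphenated (CALCULATE-TOTALS), so a stock word
    tokenizer either drops them or splits them inconsistently."""
    parts = [p for p in re.split(r"[-_]", name) if p]
    # whole identifier first, then its subtokens, all lowercased
    tokens = [name.lower()] + [p.lower() for p in parts]
    seen: set = set()
    return [t for t in tokens if not (t in seen or seen.add(t))]
```

Keeping the full identifier as its own token preserves exact-match precision, while the subtokens let a query like "totals calculation" still reach `CALCULATE-TOTALS`.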
Metadata-Header Embedding Strategy
Prepend [FILE: x] [FUNCTION: y] [DIVISION: z] headers before embedding
COBOL is absent from embedding model training data. Headers inject natural-language anchors that bridge the vocabulary gap — the embedding model understands "FILE" and "FUNCTION" even if it has never seen COBOL.
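The header construction is simple. A minimal sketch (the helper name is illustrative): build the anchor string from whatever structural metadata the parser extracted, then prepend it to the raw chunk before embedding:

```python
def embedding_text(code: str, file: str, function: str = "",
                   division: str = "") -> str:
    """Sketch of the metadata-header embedding strategy: prepend
    natural-language anchors ([FILE], [FUNCTION], [DIVISION]) that the
    embedding model understands even when the code itself is COBOL."""
    header = f"[FILE: {file}]"
    if function:
        header += f" [FUNCTION: {function}]"
    if division:
        header += f" [DIVISION: {division}]"
    return f"{header}\n{code}"
```

The headers are part of the embedded text only; the stored chunk and the text shown to the LLM remain the original source.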
Cerebras Primary, Groq Fallback
Cerebras Qwen3-Coder (1M tokens/day free, 2000+ t/s)
Cerebras is 3-4x faster than Groq for code generation with a generous free tier. Groq provides automatic fallback (30 RPM free). Both support streaming SSE.
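The fallback logic reduces to catching a failure from the primary stream and restarting on the secondary. A sketch with both providers stubbed out (real code would call Cerebras, then Groq, and relay the tokens over SSE):

```python
def stream_with_fallback(prompt: str, primary, fallback):
    """Yield tokens from `primary`; on any error, switch to `fallback`.
    `primary`/`fallback` are generator functions standing in for the
    Cerebras and Groq streaming clients."""
    try:
        yield from primary(prompt)
    except Exception:
        yield from fallback(prompt)

def cerebras_stub(prompt: str):
    raise RuntimeError("rate limited")  # simulate primary provider failure
    yield  # unreachable; marks this function as a generator

def groq_stub(prompt: str):
    yield from ["grounded ", "answer"]
```

Because both providers stream, the fallback is transparent to the SSE client: it only ever sees a single token stream.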
Performance
| Metric | Value |
|---|---|
| Retrieval latency | 149–259ms |
| Full query (Cerebras) | 539–1,325ms |
| hit_rate@10 | 0.95 |
| MRR | 0.61 |
| Precision@5 | 95% of top-5 chunks relevant |
| Chunks indexed | 22,836 across 6 codebases |
| Monthly cost | $5 (Railway hosting only) |
Failure Mitigations
COBOL Training Gap
COBOL is absent from embedding model training. Metadata headers ([FILE], [FUNCTION], [DIVISION]) inject natural-language anchors the model understands.
Chunk Boundary Splits
Language-aware parsers respect syntax boundaries (divisions, paragraphs, functions). No mid-statement splits.
Identifier Collisions
Common names like CALCULATE-TOTALS appear across codebases. File-path metadata filtering scopes retrieval to the target codebase.
LLM Hallucinations
Grounded prompts with explicit source citations. The LLM explains retrieved code — it does not invent code that was not found.
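The grounding constraint is enforced in the prompt itself. An illustrative sketch (not the production prompt): number each retrieved (path, code) chunk so the model can cite `[n]`, and forbid answering beyond the sources:

```python
def build_grounded_prompt(question: str, chunks: list) -> str:
    """Sketch of a grounded prompt builder. `chunks` is a list of
    (source_path, code) pairs from the reranked retrieval results."""
    sources = "\n\n".join(
        f"[{i}] {path}\n{code}" for i, (path, code) in enumerate(chunks, 1)
    )
    return (
        "Explain the code using ONLY the sources below, citing them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

The numbered citations also make the streamed answer auditable: each `[n]` maps back to a concrete file path shown alongside the response.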
Parser Coverage
7 language-aware parsers with syntax-boundary-preserving chunking. Each parser extracts structural metadata (divisions, paragraphs, functions) used as embedding headers.
| Language | Parser | Technique |
|---|---|---|
| COBOL | Regex-based hierarchy | 95% coverage — divisions, sections, paragraphs, copybooks |
| C | tree-sitter AST | Functions, structs, includes, preprocessor directives |
| Fortran | Regex | Programs, subroutines, functions, modules |
| JCL | Regex | Jobs, steps, procedures, DD statements |
| PL/I | Regex | Procedures, blocks, declarations |
| RPG | Regex | Subroutines, calculations, specifications |
| Plaintext | Line-based | Fallback — fixed-size chunks with overlap |
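The COBOL parser's paragraph-level chunking can be sketched with a single regex. A simplified sketch assuming free-format source (the real parser also handles fixed-format column rules, sections, and copybooks): a paragraph header is an Area-A name ending in a period, and each chunk runs from one header to the next:

```python
import re

# A paragraph header: identifier at the start of a line, terminated by a period.
PARAGRAPH_RE = re.compile(r"^(?P<name>[A-Z0-9][A-Z0-9-]*)\s*\.\s*$", re.MULTILINE)

def chunk_paragraphs(source: str) -> list:
    """Split COBOL source into (paragraph_name, body) chunks at
    paragraph boundaries, so no chunk ever starts or ends mid-statement."""
    matches = list(PARAGRAPH_RE.finditer(source))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        chunks.append((m.group("name"), source[m.start():end].rstrip()))
    return chunks
```

Each extracted paragraph name then doubles as the `[FUNCTION: …]` metadata header used by the embedding strategy above.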