Architecture
How LegacyLens turns a natural language question into a grounded, streaming answer from legacy code.
Pipeline
User Query
│
┌─────▼─────────┐
│ Alias Norm. │ Normalize legacy identifiers (COBOL paragraphs, copybooks)
└─────┬─────────┘
│
┌─────▼─────────┐
│ Parallel │ Dense: voyage-code-3 (1024d, 32K context)
│ Encoding │ Sparse: SPLADE via FastEmbed + code identifier tokenizer
└─────┬─────────┘
│
┌─────▼─────────┐
│ Qdrant │ Hybrid search with Reciprocal Rank Fusion (RRF)
│ Hybrid Search │ Boolean metadata filtering by codebase + language
└─────┬─────────┘
│ top-20
┌─────▼─────────┐
│ Cross-Encoder │ ms-marco-MiniLM-L-6-v2 reranker
│ Rerank │ Re-scores by semantic relevance to query
└─────┬─────────┘
│ top-5
┌─────▼─────────┐
│ LLM │ Cerebras Qwen3-Coder (primary, 2000+ t/s)
│ Generation │ Groq fallback, streaming via SSE
└─────┬─────────┘
│
Streaming Response + Source Citations
Design Decisions
Custom Pipeline, not LangChain
~3K LOC custom RAG pipeline
LangChain adds five abstraction layers around what amounts to ~100 lines of retrieval + generation glue. Custom code means full control over the embedding strategy, search tuning, and streaming behavior.
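The glue in question is roughly this shape. A minimal sketch with every stage stubbed out; all function names and return values here are illustrative, not LegacyLens's actual API:

```python
# Sketch of the retrieval + generation glue a custom pipeline replaces
# LangChain with. Every function body is a stand-in for the real stage.

def normalize_aliases(query: str) -> str:
    return query  # stand-in for legacy-identifier normalization

def encode(query: str):
    # stand-in for parallel dense (voyage-code-3) + sparse (SPLADE) encoding
    return [0.0], {"calculate-totals": 1.0}

def hybrid_search(dense, sparse, top_k: int = 20):
    # stand-in for the Qdrant RRF query
    return [{"code": "CALCULATE-TOTALS.", "path": "payroll.cbl"}] * top_k

def rerank(query: str, hits, top_k: int = 5):
    # stand-in for cross-encoder rescoring
    return hits[:top_k]

def generate(query: str, chunks):
    # stand-in for the streaming LLM call
    yield f"Answer grounded in {len(chunks)} chunks."

def answer(query: str) -> str:
    query = normalize_aliases(query)
    dense, sparse = encode(query)
    chunks = rerank(query, hybrid_search(dense, sparse, top_k=20), top_k=5)
    return "".join(generate(query, chunks))
```

Each stage is a plain function call, so swapping an embedding model or search parameter is a one-line change rather than a framework adapter.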
Qdrant with Native Hybrid Search
Qdrant over ChromaDB, pgvector, Pinecone
Only Qdrant supports dense + sparse vectors with built-in Reciprocal Rank Fusion in a single query. ChromaDB has no hybrid search, pgvector requires a manual RRF implementation, and Pinecone is paid.
voyage-code-3 Embeddings
1024d, 32K context window
+13.8% code retrieval quality vs jina-v2. The 32K context window handles large COBOL paragraphs without truncation, and the 200M-token free tier covers both development and production.
SPLADE Sparse Vectors
SPLADE via FastEmbed + custom code identifier tokenizer
Dense embeddings miss exact identifier matches (PERFORM CALCULATE-TOTALS). SPLADE provides lexical precision where semantic search fails. Custom tokenizer handles COBOL naming conventions.
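The identifier tokenizer's job can be sketched in a few lines. A hypothetical helper (the function name and exact token order are assumptions, not the real implementation): emit the whole identifier plus its hyphen/underscore subtokens so the sparse index can match both exact and partial mentions:

```python
import re

def tokenize_identifier(name: str) -> list[str]:
    """Sketch of a code-identifier tokenizer for sparse indexing.
    COBOL names are hyphenated (CALCULATE-TOTALS), so a stock word
    tokenizer either drops them or splits them inconsistently."""
    parts = [p for p in re.split(r"[-_]", name) if p]
    # whole identifier first, then its subtokens, all lowercased
    tokens = [name.lower()] + [p.lower() for p in parts]
    seen: set = set()
    return [t for t in tokens if not (t in seen or seen.add(t))]
```

Keeping the full identifier as its own token preserves exact-match precision, while the subtokens let a query like "totals calculation" still reach `CALCULATE-TOTALS`.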
Metadata-Header Embedding Strategy
Prepend [FILE: x] [FUNCTION: y] [DIVISION: z] headers before embedding
COBOL is absent from embedding model training data. Headers inject natural-language anchors that bridge the vocabulary gap — the embedding model understands "FILE" and "FUNCTION" even if it has never seen COBOL.
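The header construction is simple. A minimal sketch (the helper name is illustrative): build the anchor string from whatever structural metadata the parser extracted, then prepend it to the raw chunk before embedding:

```python
def embedding_text(code: str, file: str, function: str = "",
                   division: str = "") -> str:
    """Sketch of the metadata-header embedding strategy: prepend
    natural-language anchors ([FILE], [FUNCTION], [DIVISION]) that the
    embedding model understands even when the code itself is COBOL."""
    header = f"[FILE: {file}]"
    if function:
        header += f" [FUNCTION: {function}]"
    if division:
        header += f" [DIVISION: {division}]"
    return f"{header}\n{code}"
```

The headers are part of the embedded text only; the stored chunk and the text shown to the LLM remain the original source.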
Cerebras Primary, Groq Fallback
Cerebras Qwen3-Coder (1M tokens/day free, 2000+ t/s)
Cerebras is 3-4x faster than Groq for code generation with a generous free tier. Groq provides automatic fallback (30 RPM free). Both support streaming SSE.
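The fallback logic reduces to catching a failure from the primary stream and restarting on the secondary. A sketch with both providers stubbed out (real code would call Cerebras, then Groq, and relay the tokens over SSE):

```python
def stream_with_fallback(prompt: str, primary, fallback):
    """Yield tokens from `primary`; on any error, switch to `fallback`.
    `primary`/`fallback` are generator functions standing in for the
    Cerebras and Groq streaming clients."""
    try:
        yield from primary(prompt)
    except Exception:
        yield from fallback(prompt)

def cerebras_stub(prompt: str):
    raise RuntimeError("rate limited")  # simulate primary provider failure
    yield  # unreachable; marks this function as a generator

def groq_stub(prompt: str):
    yield from ["grounded ", "answer"]
```

Because both providers stream, the fallback is transparent to the SSE client: it only ever sees a single token stream.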
Performance
| Metric | Value |
|---|---|
| Retrieval latency | 149–259ms |
| Full query (Cerebras) | 539–1,325ms |
| hit_rate@10 | 0.95 |
| MRR | 0.61 |
| Precision@5 | 95% of top-5 chunks relevant |
| Chunks indexed | 22,836 across 6 codebases |
| Monthly cost | $5 (Railway hosting only) |
Failure Mitigations
COBOL Training Gap
COBOL is absent from embedding model training. Metadata headers ([FILE], [FUNCTION], [DIVISION]) inject natural-language anchors the model understands.
Chunk Boundary Splits
Language-aware parsers respect syntax boundaries (divisions, paragraphs, functions). No mid-statement splits.
Identifier Collisions
Common names like CALCULATE-TOTALS appear across codebases. File-path metadata filtering scopes retrieval to the target codebase.
LLM Hallucinations
Grounded prompts with explicit source citations. The LLM explains retrieved code — it does not invent code that was not found.
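The grounding constraint is enforced in the prompt itself. An illustrative sketch (not the production prompt): number each retrieved (path, code) chunk so the model can cite `[n]`, and forbid answering beyond the sources:

```python
def build_grounded_prompt(question: str, chunks: list) -> str:
    """Sketch of a grounded prompt builder. `chunks` is a list of
    (source_path, code) pairs from the reranked retrieval results."""
    sources = "\n\n".join(
        f"[{i}] {path}\n{code}" for i, (path, code) in enumerate(chunks, 1)
    )
    return (
        "Explain the code using ONLY the sources below, citing them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

The numbered citations also make the streamed answer auditable: each `[n]` maps back to a concrete file path shown alongside the response.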
Parser Coverage
7 language-aware parsers with syntax-boundary-preserving chunking. Each parser extracts structural metadata (divisions, paragraphs, functions) used as embedding headers.
| Language | Parser | Technique |
|---|---|---|
| COBOL | Regex-based hierarchy | 95% coverage — divisions, sections, paragraphs, copybooks |
| C | tree-sitter AST | Functions, structs, includes, preprocessor directives |
| Fortran | Regex | Programs, subroutines, functions, modules |
| JCL | Regex | Jobs, steps, procedures, DD statements |
| PL/I | Regex | Procedures, blocks, declarations |
| RPG | Regex | Subroutines, calculations, specifications |
| Plaintext | Line-based | Fallback — fixed-size chunks with overlap |
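The COBOL parser's paragraph-level chunking can be sketched with a single regex. A simplified sketch assuming free-format source (the real parser also handles fixed-format column rules, sections, and copybooks): a paragraph header is an Area-A name ending in a period, and each chunk runs from one header to the next:

```python
import re

# A paragraph header: identifier at the start of a line, terminated by a period.
PARAGRAPH_RE = re.compile(r"^(?P<name>[A-Z0-9][A-Z0-9-]*)\s*\.\s*$", re.MULTILINE)

def chunk_paragraphs(source: str) -> list:
    """Split COBOL source into (paragraph_name, body) chunks at
    paragraph boundaries, so no chunk ever starts or ends mid-statement."""
    matches = list(PARAGRAPH_RE.finditer(source))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        chunks.append((m.group("name"), source[m.start():end].rstrip()))
    return chunks
```

Each extracted paragraph name then doubles as the `[FUNCTION: …]` metadata header used by the embedding strategy above.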