Architecture

How LegacyLens turns a natural language question into a grounded, streaming answer from legacy code.

Pipeline

  User Query
        │
  ┌─────▼─────────┐
  │  Alias Norm.   │  Normalize legacy identifiers (COBOL paragraphs, copybooks)
  └─────┬─────────┘
        │
  ┌─────▼─────────┐
  │  Parallel      │  Dense: voyage-code-3 (1024d, 32K context)
  │  Encoding      │  Sparse: SPLADE via FastEmbed + code identifier tokenizer
  └─────┬─────────┘
        │
  ┌─────▼─────────┐
  │  Qdrant        │  Hybrid search with Reciprocal Rank Fusion (RRF)
  │  Hybrid Search │  Boolean metadata filtering by codebase + language
  └─────┬─────────┘
        │  top-20
  ┌─────▼─────────┐
  │  Cross-Encoder │  ms-marco-MiniLM-L-6-v2 reranker
  │  Rerank        │  Re-scores by semantic relevance to query
  └─────┬─────────┘
        │  top-5
  ┌─────▼─────────┐
  │  LLM           │  Cerebras Qwen3-Coder (primary, 2000+ t/s)
  │  Generation    │  Groq fallback, streaming via SSE
  └─────┬─────────┘
        │
  Streaming Response + Source Citations
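
The top-20 to top-5 cut in the diagram can be sketched as a re-score-and-truncate step. This is a minimal illustration only: `rerank` and `token_overlap` are hypothetical names, and the toy overlap scorer stands in for the cross-encoder, which would be plugged in as the `score` callable.

```python
def rerank(query, candidates, score, keep=5):
    """Re-score candidates against the query and keep the best `keep`."""
    scored = [(score(query, text), text) for text in candidates]
    # sort() is stable, so tied candidates keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def token_overlap(query, text):
    """Toy stand-in scorer (NOT a cross-encoder): count shared tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


candidates = [
    "MOVE ZERO TO WS-COUNT",
    "PERFORM CALCULATE-TOTALS UNTIL END-OF-FILE",
    "ADD ITEM-PRICE TO WS-TOTAL",
]
top = rerank("where is calculate-totals performed", candidates,
             token_overlap, keep=2)
```

Keeping the scorer as a parameter means the retrieval order is only a tie-breaker; the final ranking is decided entirely by the second-stage model.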

Design Decisions

Custom Pipeline, not LangChain

~3K LOC custom RAG pipeline

LangChain adds 5 abstraction layers for what is ~100 lines of retrieval + generation glue. Custom code means full control over the embedding strategy, search tuning, and streaming behavior.

Qdrant with Native Hybrid Search

Qdrant over ChromaDB, pgvector, Pinecone

Of the options compared, only Qdrant supports dense and sparse vectors with built-in Reciprocal Rank Fusion in a single query. ChromaDB has no hybrid search, pgvector requires manual RRF, and Pinecone is paid.
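
For readers unfamiliar with RRF, the fusion that Qdrant performs server-side can be sketched in plain Python. This is an illustration of the general technique, not Qdrant's implementation; the function name `rrf_fuse`, the constant `k=60` (a commonly used value), and the chunk IDs are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical dense results
sparse_hits = ["chunk-c", "chunk-a", "chunk-d"]  # hypothetical sparse results
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse result lists.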

voyage-code-3 Embeddings

1024d, 32K context window

+13.8% code retrieval quality vs jina-v2. The 32K context window handles large COBOL paragraphs without truncation. The 200M free tokens cover development and production.

SPLADE Sparse Vectors

SPLADE via FastEmbed + custom code identifier tokenizer

Dense embeddings miss exact identifier matches (PERFORM CALCULATE-TOTALS). SPLADE provides lexical precision where semantic search fails. Custom tokenizer handles COBOL naming conventions.
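
One way a code identifier tokenizer could treat COBOL names is to emit each identifier whole plus its hyphen-separated parts. The sketch below is an assumption about the approach, not the project's tokenizer; `tokenize_identifiers` is a hypothetical name.

```python
import re


def tokenize_identifiers(text):
    """Emit each identifier whole plus its hyphen/underscore-separated parts."""
    tokens = []
    for ident in re.findall(r"[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*", text):
        whole = ident.lower()
        tokens.append(whole)             # exact-match token
        parts = re.split(r"[-_]", whole)
        if len(parts) > 1:
            tokens.extend(parts)         # partial-match tokens
    return tokens


tokens = tokenize_identifiers("PERFORM CALCULATE-TOTALS")
```

Keeping the whole identifier preserves exact lookups, while the parts let a query such as "totals" still match CALCULATE-TOTALS.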

Metadata-Header Embedding Strategy

Prepend [FILE: x] [FUNCTION: y] [DIVISION: z] headers before embedding

COBOL is absent from embedding model training data. Headers inject natural-language anchors that bridge the vocabulary gap — the embedding model understands "FILE" and "FUNCTION" even if it has never seen COBOL.
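
The header strategy amounts to a small string transformation applied before embedding. A minimal sketch, assuming a hypothetical helper `with_metadata_header` and a made-up file name:

```python
def with_metadata_header(code, file, function=None, division=None):
    """Prepend [FILE: ...] [FUNCTION: ...] [DIVISION: ...] tags to a chunk."""
    tags = [f"[FILE: {file}]"]
    if function:
        tags.append(f"[FUNCTION: {function}]")
    if division:
        tags.append(f"[DIVISION: {division}]")
    # Header on its own line, then the untouched source text.
    return " ".join(tags) + "\n" + code


text = with_metadata_header(
    "ADD ITEM-PRICE TO WS-TOTAL.",
    file="payroll.cbl",              # hypothetical file name
    function="CALCULATE-TOTALS",
    division="PROCEDURE",
)
```

The string returned here is what would be sent to the embedding model; the raw chunk can still be stored separately for display.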

Cerebras Primary, Groq Fallback

Cerebras Qwen3-Coder (1M tokens/day free, 2000+ t/s)

Cerebras is 3-4x faster than Groq for code generation with a generous free tier. Groq provides automatic fallback (30 RPM free). Both support streaming SSE.
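
The fallback behavior can be sketched independently of either provider's SDK. In the example below, `stream_with_fallback` and both provider functions are hypothetical stand-ins; real clients would be wrapped as callables that return a token iterator.

```python
def stream_with_fallback(prompt, providers):
    """Yield tokens from the first provider that starts streaming.

    `providers` is an ordered list of callables that take a prompt and
    return an iterator of text tokens. A provider that fails before
    yielding anything is skipped; a failure mid-stream is re-raised so
    the client never receives a spliced answer.
    """
    last_error = None
    for provider in providers:
        started = False
        try:
            for token in provider(prompt):
                started = True
                yield token
            return
        except Exception as exc:
            if started:
                raise
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


def flaky_primary(prompt):  # hypothetical stand-in for the primary
    raise ConnectionError("rate limited")
    yield  # unreachable, but makes this a generator function


def fallback(prompt):       # hypothetical stand-in for the fallback
    yield from ["CALCULATE-TOTALS ", "sums ", "line items."]


answer = "".join(
    stream_with_fallback("What does CALCULATE-TOTALS do?",
                         [flaky_primary, fallback])
)
```

Switching providers only before the first token is a design choice in this sketch: once output has reached the client, restarting on another model would duplicate or contradict text already shown.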

Performance

Metric                   Value
Retrieval latency        149–259 ms
Full query (Cerebras)    539–1,325 ms
hit_rate@10              0.95
MRR                      0.61
Precision@5              95% relevant
Chunks indexed           22,836 across 6 codebases
Monthly cost             $5/mo (Railway only)
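
For reference, hit rate and MRR are conventionally computed as below. The sketch uses the standard definitions with made-up data; the evaluation set and exact protocol behind the figures above are not described here, and the function names are assumptions.

```python
def hit_rate_at_k(ranked_results, relevant, k):
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(
        1 for results, gold in zip(ranked_results, relevant)
        if any(doc_id in gold for doc_id in results[:k])
    )
    return hits / len(ranked_results)


def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant ID (0 when none is found)."""
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Two hypothetical queries with made-up chunk IDs.
results = [["a", "b", "c"], ["x", "y", "z"]]
gold = [{"b"}, {"q"}]
hr = hit_rate_at_k(results, gold, k=2)
mrr = mean_reciprocal_rank(results, gold)
```

Here the first query finds its relevant chunk at rank 2 and the second finds nothing, so the hit rate is 0.5 and the MRR is 0.25.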

Failure Mitigations

COBOL Training Gap

COBOL is absent from embedding model training. Metadata headers ([FILE], [FUNCTION], [DIVISION]) inject natural-language anchors the model understands.

Chunk Boundary Splits

Language-aware parsers respect syntax boundaries (divisions, paragraphs, functions). No mid-statement splits.
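
As an illustration of boundary-respecting chunking, the sketch below splits COBOL procedure text at paragraph headers. It is a simplification, not the project's parser: it assumes sequence columns are already stripped, paragraph names start at the left margin, and statements are indented. `chunk_by_paragraph` is a hypothetical name.

```python
import re

# Assumption: a paragraph header is an unindented identifier followed by a
# period on its own line; indented lines are statements.
PARAGRAPH_HEADER = re.compile(r"^([A-Z0-9][A-Z0-9-]*)\.\s*$")


def chunk_by_paragraph(source):
    """Split COBOL procedure text into (paragraph_name, text) chunks."""
    chunks = []
    name, lines = None, []
    for line in source.splitlines():
        match = PARAGRAPH_HEADER.match(line)
        if match:
            if lines:  # close the previous paragraph before starting a new one
                chunks.append((name, "\n".join(lines)))
            name, lines = match.group(1), [line]
        else:
            lines.append(line)
    if lines:
        chunks.append((name, "\n".join(lines)))
    return chunks


source = """MAIN-LOGIC.
    PERFORM CALCULATE-TOTALS.
    STOP RUN.
CALCULATE-TOTALS.
    ADD ITEM-PRICE TO WS-TOTAL.
"""
chunks = chunk_by_paragraph(source)
```

Each chunk starts at a header and ends just before the next one, so no statement is split across chunks, and the paragraph name is available as metadata.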

Identifier Collisions

Common names like CALCULATE-TOTALS appear across codebases. File-path metadata filtering scopes retrieval to the target codebase.

LLM Hallucinations

Grounded prompts with explicit source citations. The LLM explains retrieved code — it does not invent code that was not found.
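
A grounded prompt of this kind can be assembled by numbering the retrieved chunks and instructing the model to cite them. The wording, the function name `build_grounded_prompt`, and the `file`/`text` keys below are assumptions for illustration, not the project's actual prompt.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to numbered sources.

    `chunks` is a list of dicts with hypothetical keys "file" and "text".
    """
    sources = "\n\n".join(
        f"[{i}] {chunk['file']}\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "What does CALCULATE-TOTALS do?",
    [{"file": "payroll.cbl", "text": "ADD ITEM-PRICE TO WS-TOTAL."}],
)
```

Numbered sources give the response something concrete to cite, which makes unsupported statements easier to spot.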

Parser Coverage

7 language-aware parsers with syntax-boundary-preserving chunking. Each parser extracts structural metadata (divisions, paragraphs, functions) used as embedding headers.

Language     Parser                   Technique
COBOL        Regex-based hierarchy    95% coverage: divisions, sections, paragraphs, copybooks
C            tree-sitter AST          Functions, structs, includes, preprocessor directives
Fortran      Regex                    Programs, subroutines, functions, modules
JCL          Regex                    Jobs, steps, procedures, DD statements
PL/I         Regex                    Procedures, blocks, declarations
RPG          Regex                    Subroutines, calculations, specifications
Plaintext    Line-based               Fallback: fixed-size chunks with overlap
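
The plaintext fallback row describes a sliding window over lines. A minimal sketch, assuming sizes are counted in lines; the function name `fixed_size_chunks` and the default values 40 and 5 are illustrative, not the project's settings.

```python
def fixed_size_chunks(lines, size=40, overlap=5):
    """Yield windows of `size` lines; consecutive windows share `overlap` lines."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    for start in range(0, len(lines), step):
        yield lines[start:start + size]
        if start + size >= len(lines):
            break  # the last window already reached the end of the file


lines = [f"line {n}" for n in range(1, 11)]
windows = list(fixed_size_chunks(lines, size=4, overlap=1))
```

With ten lines, a window of four, and an overlap of one, this produces three windows covering lines 1-4, 4-7, and 7-10.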