Problem Understanding
Legacy COBOL systems power critical infrastructure — banking, insurance, government — but the engineers who wrote them are retiring. New teams inherit millions of lines of undocumented code and need to understand what it does before they can modernize it. Manual code archaeology is slow, error-prone, and expensive.
The COBOL Challenge
COBOL is effectively absent from modern embedding model training data. Standard code search techniques that work well for Python or JavaScript fail badly — the embedding model doesn't understand COBOL syntax, naming conventions, or structure. A query like “where are customer totals calculated” won't match PERFORM CALCULATE-CUSTOMER-TOTALS through semantic similarity alone.
The key insight was metadata headers. By prepending structural context — [FILE: accounts.cbl] [FUNCTION: CALCULATE-CUSTOMER-TOTALS] [DIVISION: PROCEDURE] — before embedding each chunk, the model can leverage the natural-language anchors it does understand. Combined with SPLADE sparse vectors that provide exact lexical matching, this hybrid approach recovers most of the retrieval quality lost to the training gap.
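The header-prepending step can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the function name and the exact tag set are assumptions based on the example above.

```python
def chunk_to_embedding_input(chunk_text: str, file: str,
                             function: str, division: str) -> str:
    """Prepend structural metadata so the embedding model sees
    natural-language anchors it recognizes, even when the COBOL
    body itself is out-of-distribution."""
    header = f"[FILE: {file}] [FUNCTION: {function}] [DIVISION: {division}]"
    return f"{header}\n{chunk_text}"

text = chunk_to_embedding_input(
    "PERFORM CALCULATE-CUSTOMER-TOTALS.",
    file="accounts.cbl",
    function="CALCULATE-CUSTOMER-TOTALS",
    division="PROCEDURE",
)
```

The header tokens (`FILE`, `FUNCTION`, file names, paragraph names) are exactly the kind of English-adjacent text general-purpose embedding models handle well, which is what lets a natural-language query land near the right chunk.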
Why Custom RAG
I evaluated LangChain and LlamaIndex early on. Both add significant abstraction overhead for what turned out to be a straightforward pipeline: encode, search, rerank, generate. The core retrieval logic is ~100 lines. What needed custom attention was the language-specific parsing — 7 parsers that understand COBOL divisions, C function boundaries, Fortran subroutines, and JCL job steps. No framework provides that.
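To make the parsing work concrete, here is a sketch of how a regex-based COBOL chunker might split a PROCEDURE DIVISION into paragraphs. It is a simplification, not the project's parser: it assumes fixed-format source with a blank sequence area and ignores comment lines, sections, and continuation lines.

```python
import re

# A COBOL paragraph label in fixed format: a name starting in Area A
# (column 8), alone on its line, terminated by a period. Simplified --
# real sources also need handling for '*' comment indicators, SECTIONs, etc.
PARAGRAPH_RE = re.compile(r"^ {7}([A-Z0-9][A-Z0-9-]*)\.\s*$")

def split_paragraphs(source: str):
    """Yield (name, body) chunks for each paragraph in the source."""
    name, body = None, []
    for line in source.splitlines():
        m = PARAGRAPH_RE.match(line)
        if m:
            if name is not None:
                yield name, "\n".join(body)
            name, body = m.group(1), []
        elif name is not None:
            body.append(line)
    if name is not None:
        yield name, "\n".join(body)

sample = """\
       CALCULATE-CUSTOMER-TOTALS.
           MOVE ZERO TO WS-CUSTOMER-TOTAL.
           PERFORM ADD-LINE-ITEM.
       PRINT-REPORT.
           DISPLAY WS-CUSTOMER-TOTAL.
"""
chunks = dict(split_paragraphs(sample))
```

Each `(name, body)` pair maps directly onto a chunk plus its `[FUNCTION: ...]` metadata header, which is why the parsers and the embedding strategy had to be designed together.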
The custom pipeline also gave me full control over the embedding strategy. The metadata-header approach, SPLADE integration, and codebase-scoped filtering all required direct access to the indexing and retrieval logic — exactly the layers that frameworks abstract away.
Hybrid Search
Dense embeddings capture semantic meaning but miss exact identifiers. Sparse vectors (SPLADE) capture lexical matches but miss synonyms and paraphrases. Reciprocal Rank Fusion combines both rankings in a single Qdrant query — no external orchestration needed. This was the deciding factor for choosing Qdrant over alternatives that require manual fusion.
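Qdrant performs the fusion server-side, but the underlying math is simple enough to show client-side. A sketch of Reciprocal Rank Fusion, using the conventional k=60 constant (the document IDs are made up for illustration):

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion: score(d) = sum over each ranking of
    1 / (k + rank(d)). `rankings` is a list of result-ID lists,
    best result first. Documents ranked highly by BOTH the dense
    and sparse retrievers accumulate the largest scores."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["chunk-a", "chunk-b", "chunk-c"]   # semantic neighbors
sparse = ["chunk-c", "chunk-a", "chunk-d"]   # exact lexical matches
fused = rrf_fuse([dense, sparse])
# chunk-a and chunk-c rise to the top: each appears high in both lists.
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the dense and sparse sides, which is part of why it composes so cleanly in a single query.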
Trade-offs
- Regex parsers over tree-sitter: tree-sitter-cobol has 32 GitHub stars and is immature. Regex gives 95% coverage with zero external dependencies. C uses tree-sitter because the grammar is mature.
- Free-tier LLMs over GPT-4: Cerebras provides 1M tokens/day free at 2000+ tokens/sec. Quality is sufficient for code explanation — the retrieval pipeline does the heavy lifting.
- Single Qdrant collection: All 6 codebases share one collection with a `codebase` field for filtering. Simpler operations, single backup, one index to maintain.
- Embedding cache by content hash: Unchanged chunks skip re-embedding on reingest. Saves API calls and time when only a few files change.
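The content-hash cache amounts to keying vectors by a digest of the chunk text. A minimal sketch, where `embed_fn` stands in for the real embedding API call and the in-memory dict would be a persistent store in practice:

```python
import hashlib

class EmbeddingCache:
    """Skip re-embedding chunks whose content hasn't changed between
    ingests. Keyed by SHA-256 of the chunk text, so renames or moves
    that leave content intact still hit the cache."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache = {}  # hex digest -> embedding vector

    def embed(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embed_fn(text)  # only on cache miss
        return self._cache[key]

calls = []
cache = EmbeddingCache(lambda t: (calls.append(t), [0.1, 0.2])[1])
v1 = cache.embed("PERFORM CALCULATE-CUSTOMER-TOTALS.")
v2 = cache.embed("PERFORM CALCULATE-CUSTOMER-TOTALS.")  # cache hit, no API call
```

One caveat of hashing chunk text alone: if the metadata header is part of the embedded input, the hash should cover the header too, or a renamed file would reuse stale vectors.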
What I'd Build Next
A codebase-wide dependency graph that traces PERFORM chains and COPY statements across files — enabling impact analysis queries like “what breaks if I change this copybook?” Language support for CICS and DB2 embedded SQL. And a fine-tuned embedding model on a COBOL corpus to close the training gap without relying on metadata headers.