Problem Understanding
ISV support engineers spend significant time manually correlating signals across Kubernetes support bundles — cross-referencing pod statuses with events, tracing ownership chains through deployments, checking whether services have endpoints, and hunting through logs for error patterns. These bundles are offline .tar.gz archives, so cloud-based observability tools don't help. The work is repetitive, structured, and a natural fit for automation.
Why a 4-Stage Pipeline
My first instinct was to consider RAG — embed the bundle contents, retrieve relevant chunks, and feed them to an LLM. But Kubernetes manifests are structured data with well-defined relationships, not natural language documents. Vector similarity is the wrong retrieval mechanism when you can deterministically traverse ownerReference chains and label selectors.
Context stuffing was the other obvious approach: just dump everything into Claude's context window. But even a modest cluster produces 500K+ tokens of YAML. That exceeds most models' context windows, and even if it fit, the signal-to-noise ratio would be terrible.
The key insight is that most diagnostic information is extractable without an LLM at all. A CrashLoopBackOff is a CrashLoopBackOff — you don't need Claude to detect it. What you need Claude for is correlating multiple signals into a causal story: “the nginx pod is crash-looping because the ConfigMap it mounts doesn't exist, which is probably a deployment ordering issue.” That reasoning is what the LLM adds.
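To make the deterministic half concrete, here is a minimal sketch of a no-LLM crash-loop check. The type and field names are illustrative (a real implementation would decode the pod YAML from the bundle), but the point stands: the waiting reason in a pod's status is authoritative, so detection is a string comparison, not inference.

```go
package main

import "fmt"

// ContainerStatus mirrors the relevant fields of a pod's
// status.containerStatuses entry (names are illustrative).
type ContainerStatus struct {
	Name          string
	WaitingReason string // e.g. "CrashLoopBackOff", "ImagePullBackOff"
	RestartCount  int
}

// Finding is a diagnostic signal extracted without any LLM involvement.
type Finding struct {
	Pod, Container, Reason string
	Restarts               int
}

// DetectCrashLoops flags containers stuck in CrashLoopBackOff.
func DetectCrashLoops(pod string, statuses []ContainerStatus) []Finding {
	var out []Finding
	for _, s := range statuses {
		if s.WaitingReason == "CrashLoopBackOff" {
			out = append(out, Finding{
				Pod: pod, Container: s.Name,
				Reason: s.WaitingReason, Restarts: s.RestartCount,
			})
		}
	}
	return out
}

func main() {
	findings := DetectCrashLoops("nginx-7d9f-abcde", []ContainerStatus{
		{Name: "nginx", WaitingReason: "CrashLoopBackOff", RestartCount: 12},
		{Name: "sidecar", WaitingReason: "", RestartCount: 0},
	})
	fmt.Println(len(findings), findings[0].Container)
}
```

The LLM's job starts where this ends: explaining *why* the container is crash-looping, not *that* it is.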
The Resource Graph
Kubernetes resources form a directed acyclic graph. Deployments own ReplicaSets which own Pods. Services select Pods by label. Pods mount PVCs which reference StorageClasses. Building this graph explicitly (9 relationship types, 5 diagnostic queries) enables cross-file correlation that manual analysis misses.
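As a sketch of the ownership portion of that graph (one of the relationship types; the `Ref` shape is simplified, omitting namespaces), walking ownerReference edges upward answers "which Deployment does this Pod belong to?" deterministically:

```go
package main

import "fmt"

// Ref identifies a resource by kind and name (namespace omitted for brevity).
type Ref struct{ Kind, Name string }

// Graph stores directed "owns" edges built from ownerReferences,
// e.g. Deployment -> ReplicaSet -> Pod.
type Graph struct {
	owns    map[Ref][]Ref // owner -> owned resources
	ownedBy map[Ref]Ref   // owned -> its controller
}

func NewGraph() *Graph {
	return &Graph{owns: map[Ref][]Ref{}, ownedBy: map[Ref]Ref{}}
}

// AddOwnerEdge records one ownerReference relationship.
func (g *Graph) AddOwnerEdge(owner, owned Ref) {
	g.owns[owner] = append(g.owns[owner], owned)
	g.ownedBy[owned] = owner
}

// RootOwner follows ownership upward until it reaches a resource
// with no owner, e.g. Pod -> ReplicaSet -> Deployment.
func (g *Graph) RootOwner(r Ref) Ref {
	for {
		owner, ok := g.ownedBy[r]
		if !ok {
			return r
		}
		r = owner
	}
}

func main() {
	g := NewGraph()
	dep := Ref{"Deployment", "nginx"}
	rs := Ref{"ReplicaSet", "nginx-7d9f"}
	pod := Ref{"Pod", "nginx-7d9f-abcde"}
	g.AddOwnerEdge(dep, rs)
	g.AddOwnerEdge(rs, pod)
	fmt.Println(g.RootOwner(pod)) // {Deployment nginx}
}
```

The other relationship types (label selection, volume mounts, and so on) are additional edge kinds on the same structure; the diagnostic queries are traversals over it.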
Deterministic-First Philosophy
The 7 pre-filter heuristics catch 60-70% of common issues without any LLM involvement. The --no-llm flag gives you this for free — useful in air-gapped environments or when you want instant triage before burning API credits.
The LLM layer adds explanation, causal reasoning, and remediation commands. It investigates hypotheses using 5 focused tools (file reading, file listing, log search, relationship traversal, event lookup). This tool-use approach gives Claude targeted access to bundle data without front-loading everything into the prompt.
Validation
Post-generation validation checks the LLM's output against the resource graph and bundle contents. If Claude references a resource that doesn't exist in the graph, or cites a file path not in the bundle, the validation layer flags it. This catches hallucinations before they reach the user.
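A minimal sketch of that check, assuming the report has been parsed into resource and file-path claims (the `Claim` type and lookup sets are illustrative; the real validator works against the resource graph itself):

```go
package main

import "fmt"

// Claim is a resource or file path the LLM's report references.
type Claim struct{ Kind, Name, File string }

// Validate flags claims whose resource or file path is absent from the
// bundle. The lookup sets are built while parsing the bundle.
func Validate(claims []Claim, resources, files map[string]bool) []string {
	var flags []string
	for _, c := range claims {
		if c.Kind != "" && !resources[c.Kind+"/"+c.Name] {
			flags = append(flags,
				fmt.Sprintf("resource %s/%s not in graph", c.Kind, c.Name))
		}
		if c.File != "" && !files[c.File] {
			flags = append(flags,
				fmt.Sprintf("file path %s not in bundle", c.File))
		}
	}
	return flags
}

func main() {
	resources := map[string]bool{"Pod/nginx-7d9f-abcde": true}
	files := map[string]bool{"cluster-resources/pods.yaml": true}
	flags := Validate([]Claim{
		{Kind: "ConfigMap", Name: "nginx-config"},       // hallucinated resource
		{File: "cluster-resources/pods.yaml"},            // real file, passes
	}, resources, files)
	fmt.Println(flags)
}
```

Anything flagged here never reaches the user unannotated, which turns an open-ended generation problem into a checkable one.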
Trade-offs
- Go over Python: Replicated's stack is Go. Single binary distribution, no runtime dependencies, fast startup for CLI tooling.
- Hand-crafted fixtures over Kind: Deterministic test data, faster dev cycle, no external dependencies.
- Native Anthropic SDK over instructor-go: The official Go SDK has mature tool-use support; instructor-go would add an unnecessary layer of abstraction.
What I'd Build Next
An eval system that compares LLM findings against pre-filter ground truth. MCP server integration so bundlebot can be used as a tool within larger diagnostic workflows. And generated analyzers: when Claude identifies a pattern, auto-generate a Troubleshoot YAML analyzer that catches it deterministically in future bundles — extending Replicated's ecosystem rather than replacing it.