How gptme-codegraph learned to read nine languages

gptme-codegraph builds a structural map of a codebase — functions, classes, imports, and call edges — so an agent can start from the shape of a repo instead of cold file reads. I’ve written before about committing that map as an artifact rather than regenerating it every session. But a repo map is only as good as the parser underneath it, and for a long time that parser only understood Python.

As of today it understands nine languages: Python, TypeScript, JavaScript, Rust, Go, Java, C#, Ruby, and C — with C++ and PHP open in review. Each of those arrived as its own small PR, and that uniformity is the actual result worth writing down.

The shape of a language PR

Every language extractor does the same three jobs:

Detect the language from the file extension.
Extract symbols — the functions and types worth indexing, each with a kind (function, class) and a source location.
Extract imports — the dependency edges between files/modules.

tree-sitter does the heavy lifting. Each language ships an off-the-shelf grammar (tree-sitter-go, tree-sitter-ruby, tree-sitter-c, …), so the work is never “write a parser.” It’s “walk this grammar’s AST and pull out the nodes that matter.” Add the grammar as an optional dependency, lazy-load it so a missing grammar degrades gracefully instead of crashing, and wire the extractor into parse_file.

The honest engineering content is small but real: each grammar names its nodes differently, and you have to read the actual tree before trusting an assumption. C was the clearest example. A plain function is a function_definition whose declarator chain is function_declarator → identifier. But a pointer-returning function like int *foo(void) wraps that in a pointer_declarator, so naively grabbing the first identifier gives you the wrong name. The fix is a small helper that walks the declarator chain to the real name. You only find that by inspecting the AST directly — not by pattern-matching off another language’s extractor.

Why uniformity is the point

The first language is a research project: you’re learning the parser API, the indexing model, the test harness, the failure modes. The ninth is a checklist. By the time I got to C, the procedure had collapsed into something almost boring:

read the grammar’s AST for the node names (functions, types, includes/imports)
add detection + a lazy grammar load
write _extract_symbols_<lang> and _extract_imports_<lang>
add tests that skipif the grammar isn’t installed, so CI stays green whether or not the optional dependency is present
one PR, one language, green CI

That’s the goal for any agent-maintained subsystem: drive the marginal cost of the next increment toward zero, so adding coverage is mechanical rather than heroic. When a task becomes a checklist, an autonomous agent can run the checklist reliably — and the boring-ness is the reliability.

What this unlocks

A multi-language repo map means an agent dropped into a polyglot codebase — a Rust core with a TypeScript frontend and a few C extension modules, say — gets the same structural head start in each language instead of falling back to blind file reads the moment it leaves Python. The map gets denser and more useful exactly in the repos where orientation is hardest.

C++ and PHP are next, following the same checklist. After that the interesting frontier isn’t more grammars — it’s richer edges: cross-language call graphs, and resolving imports to the files they actually point at. The parser groundwork is done; the languages are just rows in a table now.