Product · Context Indexer

Tree-sitter + PageRank = reviewers that actually understand your codebase

Most AI code reviewers see only the diff hunks. They miss the function called three frames up, the convention used everywhere else in the repo, the same bug fix that was already shipped in another file. LGTM's context indexer parses your repo with tree-sitter, builds a symbol graph, ranks it with PageRank, and feeds the relevant slice to every review agent — so the agents reason about your codebase, not generic patterns.

Without context

The generic-review problem

Imagine an LLM reviewing this PR diff:

+ const user = await fetchUser(req.params.id);
+ res.json({ email: user.email });

Without context, the reviewer sees nothing concerning: two ordinary lines of code. It will suggest adding a try/catch and move on.

But fetchUser() in your codebase returns User | null from the cache layer — and a null deref on .email already caused a P1 incident last quarter. The reviewer missed it because it never looked at fetchUser.

With context

Same diff, indexed repo

The Bugs agent receives the diff + this:

// from src/lib/cache.ts (indexed)
export async function fetchUser(
  id: string
): Promise<User | null> {
  return cache.get(`user:${id}`) ?? null;
}

Now the verdict reads:

"fetchUser() can return null on cache miss (line 42 of cache.ts). The new code derefs .email without a guard — likely null-pointer error in production. Suggest: if (!user) return res.status(404).end()."

How indexing actually works

Three stages. The first runs when you click "Index Codebase". The other two keep the index fresh on every push, incrementally.

Stage 1

Parse

Tree-sitter AST extraction

We walk your repo with a per-language tree-sitter grammar. For each source file we extract: exported function names + signatures, type/class definitions, call sites (who calls whom), and imports (who depends on whom).

Tree-sitter is fast — about 10ms per file on a modest CPU. A 5,000-file monorepo finishes the parse stage in ~50 seconds. We never store full source bodies; only symbol names, signatures, and the line range they occupy.
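
As a rough sketch of what one stored entry could look like, here is a hypothetical record type built from the fields listed above. The field names, the SymbolRecord type, and the estimateParseSeconds helper are illustrative assumptions, not LGTM's actual schema; the line range simply matches the fetchUser example earlier on this page.

```typescript
// Hypothetical shape of one index entry. Note what is absent: no body text.
type SymbolKind = "function" | "type" | "class" | "module";

interface SymbolRecord {
  name: string;       // symbol name, e.g. "fetchUser"
  kind: SymbolKind;
  signature: string;  // params + return type only
  file: string;       // path relative to the repo root
  startLine: number;  // 1-based, inclusive
  endLine: number;    // 1-based, inclusive
  calls: string[];    // names of symbols this one calls
}

// Illustrative entry for the fetchUser example used on this page.
const fetchUserRecord: SymbolRecord = {
  name: "fetchUser",
  kind: "function",
  signature: "(id: string): Promise<User | null>",
  file: "src/lib/cache.ts",
  startLine: 39,
  endLine: 43,
  calls: ["cache.get"],
};

// Back-of-envelope parse time, assuming roughly 10 ms per file.
function estimateParseSeconds(fileCount: number, msPerFile = 10): number {
  return (fileCount * msPerFile) / 1000;
}
```

With the 10 ms figure, estimateParseSeconds(5000) gives 50, which is where the ~50 second estimate for a 5,000-file repo comes from.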

Stage 2

Graph

Symbol-and-call graph

The parsed symbols feed into a directed graph: nodes are symbols (functions, types, modules), edges are relationships (calls, imports, extends). A mid-sized codebase typically yields 10k-100k nodes.

The graph captures the "who-uses-what" map of your code. When a PR touches fetchUser, we can instantly answer "what 23 places call this?" — without re-scanning the repo.
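
A minimal sketch of why the "who calls this?" lookup is cheap: if every edge is also recorded in a reverse index, finding callers is a single map lookup. SymbolGraph, addEdge, and callersOf are hypothetical names for illustration, and the real storage layout may well differ.

```typescript
type EdgeKind = "calls" | "imports" | "extends";

interface Edge {
  from: string;
  to: string;
  kind: EdgeKind;
}

class SymbolGraph {
  private outgoing = new Map<string, Edge[]>();
  private incoming = new Map<string, Edge[]>(); // reverse index

  addEdge(edge: Edge): void {
    append(this.outgoing, edge.from, edge);
    append(this.incoming, edge.to, edge);
  }

  // Every symbol that calls `name`, read straight from the reverse index.
  callersOf(name: string): string[] {
    return (this.incoming.get(name) ?? [])
      .filter((e) => e.kind === "calls")
      .map((e) => e.from);
  }
}

// Append an edge to the list stored under `key`, creating the list if needed.
function append(map: Map<string, Edge[]>, key: string, edge: Edge): void {
  const list = map.get(key);
  if (list) list.push(edge);
  else map.set(key, [edge]);
}

// Usage with made-up symbol names:
const graph = new SymbolGraph();
graph.addEdge({ from: "getProfile", to: "fetchUser", kind: "calls" });
graph.addEdge({ from: "sendInvite", to: "fetchUser", kind: "calls" });
graph.addEdge({ from: "src/routes/user.ts", to: "fetchUser", kind: "imports" });
const callers = graph.callersOf("fetchUser"); // ["getProfile", "sendInvite"]
```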

Stage 3

Rank

PageRank-personalized retrieval

Symbols touched by a PR get a high "seed" weight. We then run personalized PageRank — the same algorithm Google used to rank web pages — across the symbol graph. Symbols with strong call/import relationships to the seed bubble up.

The top-ranked symbols (typically 5-15 of them) get their signatures + bodies attached to the review agent prompts. PageRank beats vector embeddings here because code structure is graph-shaped, not semantically clustered — a function called from 50 places is contextually important even if it doesn't semantically "match" the diff text.
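
To make the ranking step concrete, here is a small power-iteration sketch of personalized PageRank. It is an illustration under assumptions, not the production ranker: the damping factor of 0.85, the fixed 50 iterations, the choice to return dangling rank to the seeds, and the names personalizedPageRank and topContext are all choices made for this example.

```typescript
// Adjacency list: symbol -> symbols it calls or imports (illustrative shape).
type Graph = Map<string, string[]>;

function personalizedPageRank(
  graph: Graph,
  seeds: string[],
  damping = 0.85,
  iterations = 50,
): Map<string, number> {
  if (seeds.length === 0) throw new Error("need at least one seed symbol");

  // Collect every node, including ones that only appear as a target or seed.
  const nodes = new Set<string>(seeds);
  graph.forEach((targets, source) => {
    nodes.add(source);
    targets.forEach((t) => nodes.add(t));
  });

  // Teleport vector: the walker always restarts at a symbol the PR touched.
  const teleport = new Map<string, number>();
  nodes.forEach((n) => teleport.set(n, 0));
  seeds.forEach((s) => teleport.set(s, (teleport.get(s) ?? 0) + 1 / seeds.length));

  let rank = new Map(teleport);
  for (let i = 0; i < iterations; i++) {
    const flow = new Map<string, number>();
    let dangling = 0; // rank held by nodes with no outgoing edges
    nodes.forEach((n) => {
      const targets = graph.get(n) ?? [];
      const r = rank.get(n) ?? 0;
      if (targets.length === 0) {
        dangling += r;
        return;
      }
      targets.forEach((t) => flow.set(t, (flow.get(t) ?? 0) + r / targets.length));
    });
    const next = new Map<string, number>();
    nodes.forEach((n) => {
      const t = teleport.get(n) ?? 0;
      // Restart mass plus damped inflow; dangling rank is returned to the seeds.
      next.set(n, (1 - damping) * t + damping * ((flow.get(n) ?? 0) + dangling * t));
    });
    rank = next;
  }
  return rank;
}

// Keep the k highest-ranked symbols that are not themselves seeds.
function topContext(rank: Map<string, number>, seeds: string[], k: number): string[] {
  return Array.from(rank.entries())
    .filter(([name]) => !seeds.includes(name))
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([name]) => name);
}
```

In this sketch the walker only follows outgoing edges, so a seeded handler surfaces what it calls. Adding reverse edges to the graph would also surface callers of a changed symbol. Symbols with no path from any seed end up with a rank of zero and never reach the prompt.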

Supported languages

Twelve languages today. Adding a language requires the tree-sitter grammar plus a per-language query file mapping syntax nodes to symbol kinds — about a day of work each. Open a request for a missing language at tarinagarwal@gmail.com.

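| Language | Extensions | Support tier |
| --- | --- | --- |
| TypeScript | .ts .tsx | first-class |
| JavaScript | .js .jsx .mjs | first-class |
| Python | .py .pyi | first-class |
| Go | .go | first-class |
| Rust | .rs | first-class |
| Java | .java | first-class |
| Ruby | .rb | stable |
| C | .c .h | stable |
| C++ | .cpp .hpp .cc .hh | stable |
| C# | .cs | stable |
| Swift | .swift | beta |
| Kotlin | .kt .kts | beta |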

First-class = full call-graph, convention extraction, and history integration. Stable = symbol extraction and call graph; convention extractor is light. Beta = symbol extraction only; expect rougher edges on metaprogramming-heavy code.
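
One way to picture how the tiers apply to a file is a plain extension lookup. The sketch below encodes the languages and extensions listed above; the names LANGUAGES, BY_EXTENSION, and supportFor are hypothetical, and the real indexer may detect languages differently.

```typescript
type Tier = "first-class" | "stable" | "beta";

interface LanguageSupport {
  language: string;
  tier: Tier;
}

// Languages, tiers, and extensions as listed on this page.
const LANGUAGES: Array<[string, Tier, string[]]> = [
  ["TypeScript", "first-class", [".ts", ".tsx"]],
  ["JavaScript", "first-class", [".js", ".jsx", ".mjs"]],
  ["Python", "first-class", [".py", ".pyi"]],
  ["Go", "first-class", [".go"]],
  ["Rust", "first-class", [".rs"]],
  ["Java", "first-class", [".java"]],
  ["Ruby", "stable", [".rb"]],
  ["C", "stable", [".c", ".h"]],
  ["C++", "stable", [".cpp", ".hpp", ".cc", ".hh"]],
  ["C#", "stable", [".cs"]],
  ["Swift", "beta", [".swift"]],
  ["Kotlin", "beta", [".kt", ".kts"]],
];

const BY_EXTENSION = new Map<string, LanguageSupport>();
LANGUAGES.forEach(([language, tier, extensions]) => {
  extensions.forEach((ext) => BY_EXTENSION.set(ext, { language, tier }));
});

// Returns undefined for files the indexer would skip (no known extension).
function supportFor(path: string): LanguageSupport | undefined {
  const base = path.slice(path.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  if (dot <= 0) return undefined;
  return BY_EXTENSION.get(base.slice(dot).toLowerCase());
}
```

For example, supportFor("src/app.tsx") resolves to TypeScript at the first-class tier, while a file such as README.md returns undefined and is left out of the graph.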

Two extractors that pull in unwritten rules

Beyond raw symbols, the indexer runs two domain-specific extractors that mine your repo for conventions and history. These feed the review agents the "how we do things here" signal that's otherwise impossible to capture.

Convention extractor

Scans the repo for recurring patterns and marks them as conventions:

  • Preferred HTTP client (fetch vs axios vs ky)
  • Error-handling pattern (throw vs Result vs callback)
  • Logging library + log-level conventions
  • Naming conventions for tests, mocks, types
  • Common helpers (formatDate, requireAuth, etc.)

If a PR reaches for moment.js in a repo that uses date-fns everywhere else, the reviewer flags it as a convention violation — not as a generic "moment.js is deprecated" comment.
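
A simple way to think about convention detection is frequency counting: if one of several interchangeable packages accounts for most of the usage, treat it as the convention and flag the others. The sketch below assumes exactly that. The 80% threshold, the grouping of date libraries, and the function names are illustrative assumptions, not the extractor's real rules.

```typescript
// Hypothetical input: for each file path, the packages that file imports.
type ImportsByFile = Map<string, string[]>;

// Packages treated as interchangeable choices for one job (assumed grouping).
const DATE_LIBS = ["date-fns", "moment", "dayjs"];

// Returns the alternative used in at least `minShare` of matching imports, or null.
function dominantChoice(
  imports: ImportsByFile,
  alternatives: string[],
  minShare = 0.8,
): string | null {
  const counts = new Map<string, number>();
  let total = 0;
  imports.forEach((packages) => {
    alternatives.forEach((alt) => {
      if (packages.includes(alt)) {
        counts.set(alt, (counts.get(alt) ?? 0) + 1);
        total++;
      }
    });
  });
  let best: string | null = null;
  let bestCount = 0;
  counts.forEach((n, pkg) => {
    if (n > bestCount) {
      best = pkg;
      bestCount = n;
    }
  });
  return best !== null && bestCount / total >= minShare ? best : null;
}

// Imports in a PR that compete with an established convention.
function conventionViolations(
  prImports: string[],
  convention: string | null,
  alternatives: string[],
): string[] {
  if (convention === null) return [];
  return prImports.filter((p) => alternatives.includes(p) && p !== convention);
}
```

In a repo where nine files import date-fns and one imports moment, dominantChoice returns "date-fns", and a PR that adds a moment import is reported by conventionViolations. When no alternative reaches the threshold, nothing is flagged.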

History extractor

Summarizes recent PR descriptions (last 50 merged PRs by default) and feeds them into the review prompts. Catches four important signals:

  • Recurring themes ("we just refactored auth, be careful with session handling")
  • Recent incidents ("P1 from null deref on fetchUser")
  • Active migrations ("moving from Express to Hono, don't add new Express routes")
  • Style decisions ("we decided no class components in last week's review")

Only the summaries are stored, not the full PR descriptions or commit content. The summaries are re-ranked every 50 PRs so the signal stays fresh.

What stays out of the index

Indexing has access to your source code — that's the point. Here's exactly what we keep and what we don't.

We keep

  • Symbol names (function/class/type)
  • Type signatures (params + return)
  • Call relationships (A calls B)
  • Import graph (file → file dependencies)
  • File paths + line ranges
  • Summarized PR history (last 50 PRs, ~100 words each)

We never keep

  • Full source code bodies
  • Comments or docstrings
  • String/numeric literals
  • Secrets, env vars, config
  • Git commit messages or diffs
  • Anything from .gitignored paths

When a review agent looks up fetchUser for context, the function body is fetched FRESH from GitHub at that moment using the installation token, held in worker memory for the duration of the review, and discarded. The body never lives in our database.
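
The fetch-then-discard flow can be sketched with the GitHub REST contents endpoint and the stored line range. This is an assumption about how such a lookup could work, not a description of the actual worker: fetchBody, sliceLines, and BodyRequest are hypothetical names, and the sketch relies on a runtime with a global fetch (such as Node 18+).

```typescript
// Cut lines [startLine, endLine] (1-based, inclusive) out of a file's text.
function sliceLines(source: string, startLine: number, endLine: number): string {
  return source.split("\n").slice(startLine - 1, endLine).join("\n");
}

interface BodyRequest {
  owner: string;
  repo: string;
  path: string;      // e.g. "src/lib/cache.ts"
  ref: string;       // commit SHA or branch under review
  startLine: number; // from the index
  endLine: number;   // from the index
}

// Fetches the file at `ref` and returns only the symbol's lines.
// The result lives in a local variable and is never written anywhere.
async function fetchBody(req: BodyRequest, installationToken: string): Promise<string> {
  const url =
    `https://api.github.com/repos/${req.owner}/${req.repo}/contents/${req.path}` +
    `?ref=${encodeURIComponent(req.ref)}`;
  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${installationToken}`,
      // Ask GitHub for the raw file instead of base64-encoded JSON.
      Accept: "application/vnd.github.raw+json",
    },
  });
  if (!response.ok) {
    throw new Error(`GitHub returned ${response.status} for ${req.path}`);
  }
  return sliceLines(await response.text(), req.startLine, req.endLine);
}
```

The index only has to supply the path and line range; the body itself exists for as long as the returned string is referenced by the review job.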

Performance characteristics

Rough numbers from production. Times are wall-clock on shared-cpu-1x workers in Fly Singapore.

| Repo size | Initial index | Incremental push | Index size |
| --- | --- | --- | --- |
| Small (< 100 files) | ~5s | < 1s | ~50 KB |
| Medium (100-1,000) | ~30s | 1-3s | ~500 KB |
| Large (1k-5k) | 1-3 min | 2-5s | ~2-5 MB |
| Monorepo (5k-20k) | 3-8 min | 5-15s | ~10-30 MB |
| Huge (> 20k) | contact us | varies | varies |

Incremental updates run on every push to the default branch. They're differential: only changed files get re-parsed, and only affected graph edges get re-ranked. The full index never needs to be rebuilt unless you click "Re-index" manually.
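
The differential update can be sketched as follows: re-parse only the files named in the push, and collect the symbols whose edges need a second look. The Index shape, the Push type, and applyPush are hypothetical, and the parse callback stands in for the tree-sitter stage.

```typescript
interface FileSymbols {
  symbols: string[];
}

type Index = Map<string, FileSymbols>; // file path -> symbols defined there

interface Push {
  changed: string[]; // added or modified paths
  removed: string[]; // deleted paths
}

// Updates the index in place and returns the symbols whose edges need re-ranking.
function applyPush(
  index: Index,
  push: Push,
  parse: (path: string) => FileSymbols,
): string[] {
  const touched = new Set<string>();

  push.removed.forEach((path) => {
    (index.get(path)?.symbols ?? []).forEach((s) => touched.add(s));
    index.delete(path);
  });

  push.changed.forEach((path) => {
    // Old symbols may have vanished or changed, so they count as touched too.
    (index.get(path)?.symbols ?? []).forEach((s) => touched.add(s));
    const fresh = parse(path); // the only files that get re-parsed
    fresh.symbols.forEach((s) => touched.add(s));
    index.set(path, fresh);
  });

  return Array.from(touched);
}
```

Files that are not part of the push are never handed to parse, which is why the cost tracks the size of the push and not the size of the repo.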

Why PageRank, not vector embeddings

Most "context-aware" AI code tools use vector embeddings — chunk the code, embed each chunk, retrieve the top-k cosine-similar chunks at query time. That works for documentation and FAQs. It works less well for code.

Where embeddings fall short for code

  • Structural relationships are invisible. Two functions can be cosine-distant in embedding space but one literally calls the other.
  • Chunking destroys context. A function spanning 200 lines gets split across chunks. The chunk that gets retrieved might be the middle of the function — useless.
  • Repo-specific vocabulary doesn't embed well. Internal type names, custom helpers, project-specific jargon — embeddings trained on general code don't understand these.
  • Re-embedding on every push is expensive. Either you re-embed (~$0.10 per push on large repos) or you let the index go stale.

Why PageRank wins

  • Code is graph-shaped, not vector-shaped. Call sites, imports, inheritance — these are first-class relationships. PageRank operates on exactly that structure.
  • Importance scales with usage. A utility function called from 50 places ranks higher than an unused helper, even if they look identical to an embedding.
  • Personalized seeds. We can bias the random walker toward the PR's changed symbols, so context surfaces what's actually relevant to THIS PR.
  • Cheap incremental updates. Re-rank costs are linear in changed edges, not the full graph. Push doesn't pay per-token.

Monorepos work the same way

We treat your monorepo as one workspace. The graph spans all packages, so a review on packages/api can pull context from packages/types if there's a call/import relationship.

The PR-scoped retrieval keeps token spend in check — a review on a 100-package monorepo doesn't pull context from all 100 packages, only the ones the diff actually touches and the symbols those reach.

Per-package config (different focus areas per workspace, different model overrides per package) is on the roadmap. For now, per-repo settings cover everyone.

FAQ

Do I have to index every repo?

No. Indexing is opt-in per repo. Reviews work without an index — the agents just see the diff hunks. With an index, the agents see the diff plus the right context. Most users index their main product repo and skip indexing throwaway repos.

How fresh does the index stay?

Incremental updates run on every push to the default branch. Branch-specific updates don't fire because branches are typically short-lived. If you push to main every hour, the index is at most one hour stale.

What happens if I disconnect a repo?

The full index for that repo gets deleted within 24 hours. We don't keep symbol graphs for disconnected repos. Audit log entries from past reviews stay (compliance) but they reference review IDs, not the index itself.

Can I see what's in my index?

Yes — the dashboard shows file count, convention count, and PR-history summary count per repo. We don't expose the raw symbol graph yet, but it's on the roadmap for users who want to audit what got captured.

Does indexing send my code to OpenAI / Anthropic / Gemini?

No. Indexing is entirely server-side; it's tree-sitter parsing + graph algorithms. No LLM involved. Only the review pipeline calls your AI provider, and only with the small subset of symbol bodies the agents actually need to reason about the diff.

What if my repo has a custom language / DSL?

We can't index it without a tree-sitter grammar. The review pipeline still works (just without context). tree-sitter has community grammars for hundreds of languages — open a request and we'll add it if there's a grammar available.

Will indexing slow down my pushes?

No — incremental update runs in a background BullMQ worker. Your push completes immediately; the index catches up asynchronously. Reviews triggered before the index catches up just use the previous-version context; the next review picks up the fresh index.

Index your repo. Get better reviews.

One click from the dashboard. Works in the background. The index is symbol-only: your source bodies are never stored in our database.