Benchmark: three-condition experiment
This page documents the three-condition experiment comparing knowledge graph types for AI coding assistant sessions. The question is not just "does a graph beat raw files?" but "do causal edge semantics beat structural edges at matched token budgets?"
The three conditions
| | Condition A | Condition B | Condition C |
|---|---|---|---|
| Name | Baseline | Structural KG | Causal KG (memoire) |
| Context given to Claude | All raw source files concatenated | Graph with IMPORTS / CALLS / INHERITS edges | Full graph + DRIVES / SPECIFIES / ASSERTS_ON edges |
| File reads | N (one per file) | 0 | 0 |
| Ingest cost | None | None | One-time LLM extraction per markdown file |
| Token cost | Scales with project size | Fixed (~2,000–3,000 tok) | Fixed (~7,000–9,000 tok) |
Questions
Three questions that require understanding relationships, not just file content:
- "What will break if I change the pandas version?" — dependency fan-in
- "What files do I need to touch to add a new forecast endpoint?" — causal chains
- "What are the riskiest files to modify in this project and why?" — centrality reasoning
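To make the first question concrete, here is a minimal sketch of how dependency fan-in could be read off IMPORTS edges. The triple layout and the file names are illustrative assumptions, not memoire's actual graph schema.

```python
from collections import defaultdict

# Hypothetical structural edges as (source, edge_type, target) triples.
# File names are illustrative, not taken from a real memoire graph.
EDGES = [
    ("client.py", "IMPORTS", "pandas"),
    ("pipeline.py", "IMPORTS", "pandas"),
    ("model.py", "IMPORTS", "pandas"),
    ("app.py", "IMPORTS", "pipeline.py"),
    ("app.py", "CALLS", "model.py"),
]


def fan_in(edges, edge_type="IMPORTS"):
    """Map each target to the sorted list of sources that point at it."""
    incoming = defaultdict(set)
    for source, kind, target in edges:
        if kind == edge_type:
            incoming[target].add(source)
    return {target: sorted(sources) for target, sources in incoming.items()}


# Files that would be affected by a change to pandas.
print(fan_in(EDGES)["pandas"])
```

A structural graph answers this kind of question with a flat list of importers; it says nothing about which of them matters most.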
Model choices
memoire uses the cheapest available model for all internal LLM tasks:
| Provider | Extraction model | Input $/MTok | Output $/MTok |
|---|---|---|---|
| Claude / Anthropic | claude-haiku-4-5 | $0.80 | $4.00 |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 |
| Gemini | gemini-2.0-flash | $0.10 | $0.40 |
| Ollama | local model | $0.00 | $0.00 |
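As a rough illustration of how these per-million-token prices turn into a dollar figure, the sketch below multiplies token counts by the table's rates. The `llm_cost` helper and the example token counts are hypothetical and not part of memoire.

```python
# Prices in USD per million tokens, copied from the table above.
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
}


def llm_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one call given its token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative call: 10,000 input tokens and 1,000 output tokens (made-up sizes).
for model in PRICES:
    print(f"{model}: ${llm_cost(model, 10_000, 1_000):.4f}")
```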
How to run
# Install with dev dependencies
pip install -e ".[dev]"
# Run all three conditions (default)
python scripts/benchmark.py --project-root /path/to/your/project
# Save full results to JSON
python scripts/benchmark.py --project-root /path/to/your/project --output results.json
# Run specific conditions only
python scripts/benchmark.py --project-root /path/to/your/project --conditions structural causal
# Use a different session model
python scripts/benchmark.py --project-root /path/to/your/project \
--session-model claude-sonnet-4-6
# Use custom questions
python scripts/benchmark.py --project-root /path/to/your/project \
--questions "What is the highest-risk file?" "What depends on the auth module?"
Requirements: claude CLI must be installed and authenticated.
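If you want to script several runs, one option is a thin Python wrapper around the commands above. This sketch only assembles the flags shown on this page (`--project-root`, `--output`, `--conditions`, `--session-model`, `--questions`); the helper names are hypothetical, and it assumes you launch it from the repository root.

```python
import shlex
import subprocess
import sys


def build_benchmark_cmd(project_root, output=None, conditions=None,
                        session_model=None, questions=None):
    """Assemble the benchmark command line from the flags documented above."""
    cmd = [sys.executable, "scripts/benchmark.py", "--project-root", str(project_root)]
    if output:
        cmd += ["--output", str(output)]
    if conditions:
        cmd += ["--conditions", *conditions]
    if session_model:
        cmd += ["--session-model", session_model]
    if questions:
        cmd += ["--questions", *questions]
    return cmd


def run_benchmark(**kwargs):
    """Run the benchmark and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_benchmark_cmd(**kwargs), check=True)


if __name__ == "__main__":
    cmd = build_benchmark_cmd(
        "/path/to/your/project",
        output="results.json",
        conditions=["structural", "causal"],
    )
    # Print the command instead of running it, so the sketch is safe to try.
    print(shlex.join(cmd))
```

Passing arguments as a list avoids shell-quoting problems with multi-word questions.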
Results
Run against the testproject — a realistic S&P 500 forecasting service: 28 source files,
~19,000 tokens (FastAPI, ARIMA, Redis cache, PostgreSQL ORM, Prometheus metrics, auth,
scheduler, Streamlit dashboard, notifications, health checks, tests, docs).
| Question | A: Baseline | B: Structural | C: Causal | A→B | A→C |
|---|---|---|---|---|---|
| What will break if I change the pandas version? | 19,316 tok | 2,433 tok | 7,569 tok | −87.4% | −60.8% |
| What files to touch for a new forecast endpoint? | 19,245 tok | 2,370 tok | 7,583 tok | −87.7% | −60.6% |
| What are the riskiest files to modify? | 19,416 tok | 2,575 tok | 7,764 tok | −86.7% | −60.0% |
| Total tokens | 57,977 | 7,378 | 22,916 | −87.3% | −60.5% |
| | A: Baseline | B: Structural | C: Causal (w/ ingest) |
|---|---|---|---|
| Total cost | $0.04854 | $0.00766 | $0.02885 |
| Time (s) | 33.1 | 31.9 | 39.4 |
| File reads | 28 | 0 | 0 |
Session model: claude-haiku-4-5 ($0.80/MTok in, $4.00/MTok out)
Condition C ingest: claude-haiku-4-5, one-time cost $0.00875 (amortised per question above)
Token count method
Token counts use a 1 token ≈ 4 characters approximation. Costs are based on published pricing. Run the benchmark script on your own project for precise numbers.
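A minimal sketch of the approximation described above, assuming floor division. The benchmark script's exact rounding and file selection are not documented here, so `estimate_baseline_tokens` and its `suffixes` default are illustrative only.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # the "1 token is about 4 characters" rule of thumb


def estimate_tokens(text):
    """Approximate the token count of a string from its character length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_baseline_tokens(root, suffixes=(".py", ".md")):
    """Approximate the tokens needed to concatenate every matching file under root."""
    total_chars = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


print(estimate_tokens("x" * 4000))
```

A real tokenizer will give different numbers, especially for code with many short symbols, so treat these figures as estimates.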
What the results mean
B beats A on tokens by 87%. The structural graph (IMPORTS/CALLS/INHERITS) is extremely compact — 28 edges describing a 28-file project fit in ~2,400 tokens. Claude answers dependency questions correctly from this.
C beats A on tokens by 60%, but uses about 3× as many tokens as B.
The causal graph is larger because it carries 72 additional edges (DRIVES, SPECIFIES,
ASSERTS_ON) with rationales, observation counts, and side-effect metadata. That overhead
is the cost of richer semantics.
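The percentages quoted here follow directly from the "Total tokens" row of the results table. As a worked check (the `reduction_pct` helper is just for illustration):

```python
# Total tokens across the three questions, from the results table.
TOTALS = {"A": 57_977, "B": 7_378, "C": 22_916}


def reduction_pct(baseline, candidate):
    """Signed percentage change in tokens relative to the baseline."""
    return (candidate - baseline) / baseline * 100


print(f"A->B: {reduction_pct(TOTALS['A'], TOTALS['B']):.1f}%")
print(f"A->C: {reduction_pct(TOTALS['A'], TOTALS['C']):.1f}%")
print(f"C/B:  {TOTALS['C'] / TOTALS['B']:.1f}x")
```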
The scientific question — does C give better answers than B?
Token count does not measure answer quality. The key claim of memoire is that causal edges
produce ranked, rationale-bearing answers where structural edges produce correct but
unranked answers. For example:
- B (structural): "pandas is imported by client.py, pipeline.py, and model.py."
- C (causal): "pipeline.py is highest-risk: it has causal reachability 3, two network side effects, and DRIVES app.py, which is the API entry point."
This qualitative difference is not captured in the token benchmark above. A formal answer-quality evaluation (human raters or LLM-as-judge) is the next step.
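One way to picture where a ranked answer could come from is a reachability count over causal edges. The sketch below assumes that "causal reachability" means the number of distinct nodes reachable by following DRIVES edges; that definition, the edge list, and the file names are all assumptions made for illustration, not memoire's documented metric.

```python
from collections import deque

# Hypothetical causal edges as (source, edge_type, target) triples.
CAUSAL_EDGES = [
    ("pipeline.py", "DRIVES", "app.py"),
    ("pipeline.py", "DRIVES", "model.py"),
    ("model.py", "DRIVES", "dashboard.py"),
    ("client.py", "DRIVES", "pipeline.py"),
]


def causal_reachability(edges, start, edge_types=("DRIVES",)):
    """Count distinct nodes reachable from start via the given edge types (BFS)."""
    adjacency = {}
    for source, kind, target in edges:
        if kind in edge_types:
            adjacency.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) - 1  # exclude the start node itself


# Rank files by how much of the graph a change to them could affect.
files = ["app.py", "client.py", "model.py", "pipeline.py"]
ranking = sorted(files, key=lambda f: causal_reachability(CAUSAL_EDGES, f), reverse=True)
print(ranking)
```

A structural graph holds the same nodes but no directed "drives" relation, so it offers nothing comparable to sort on.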
Why savings grow with project size
| Project size | Baseline tokens | Structural tokens | Causal tokens | A→B | A→C |
|---|---|---|---|---|---|
| 28 files (code only) | ~58,000 | ~7,400 | ~23,000 | −87% | −60% |
| 55 files (code + PDFs + images) | ~349,000 | ~6,700 | ~23,300 | −98% | −93% |
| 100 files | ~200,000 | ~15,000 | ~28,000 | ~−93% | ~−86% |
| 200 files | ~400,000+ | ~20,000 | ~30,000 | ~−95% | ~−93% |
The graph size grows sublinearly: the top-100 edge cap in get_context() means the
causal context stabilises below roughly 10,000 tokens per question regardless of project
size (the table reports totals across the three benchmark questions).
PDFs and images inflate the baseline dramatically (a 15-page paper = ~30,000 raw tokens)
while adding only a handful of graph edges.
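The effect of an edge cap can be sketched in a few lines. This is not memoire's get_context() implementation: the `cap_edges` helper, the tuple layout, and ranking by a numeric weight are assumptions chosen only to show why a fixed cap bounds context size.

```python
import heapq


def cap_edges(edges, limit=100):
    """Keep the `limit` highest-weight edges; weight is the fourth tuple field."""
    return heapq.nlargest(limit, edges, key=lambda edge: edge[3])


def make_edges(count):
    """Build a synthetic chain of edges with increasing weights, for demonstration."""
    return [
        (f"file_{i}.py", "DRIVES", f"file_{i + 1}.py", float(i))
        for i in range(count)
    ]


# The number of edges sent as context stops growing once the cap is reached.
for total in (50, 500, 5000):
    print(total, "edges in graph ->", len(cap_edges(make_edges(total))), "edges in context")
```

Once the graph exceeds the cap, extra files add to the baseline but not to the number of edges sent as context.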
Comparison with Graphify
Graphify is the most widely cited tool claiming token reduction for AI coding assistants. Both tools were run on the identical Karpathy corpus: 3 code repos (nanoGPT, minGPT, micrograd) + 5 academic PDFs + 4 images (55 files total).
| | Graphify | memoire B (structural) | memoire C (causal) |
|---|---|---|---|
| Token reduction vs raw | 71.5× | 52.3× | 14.9× |
| Total context tokens | ~4,900* | 6,669 | 23,346 |
| Baseline tokens | ~350,000 | 348,981 | 348,981 |
| Ingest cost | First run (LLM) | None — static analysis only | One-time LLM extraction |
| Persistent daemon | No | Yes | Yes |
| Causal edge semantics | No | No | Yes (DRIVES, SPECIFIES, ASSERTS_ON) |
| Graph method | NetworkX + Leiden community detection | Pattern-based static extraction | Static + LLM promotion |
*Graphify's output token count estimated from their 71.5× claim on ~350k baseline.
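The ratios in the table, and the estimate in the footnote, can be reproduced from the token counts. The helper names below are illustrative only.

```python
def compression_ratio(baseline_tokens, context_tokens):
    """How many times smaller the context is than the raw baseline."""
    return baseline_tokens / context_tokens


def ratio_to_reduction_pct(ratio):
    """Convert an N-times compression ratio into a percentage reduction."""
    return (1 - 1 / ratio) * 100


BASELINE = 348_981  # memoire baseline tokens on the Karpathy corpus

structural = compression_ratio(BASELINE, 6_669)
causal = compression_ratio(BASELINE, 23_346)
print(f"structural: {structural:.1f}x ({ratio_to_reduction_pct(structural):.1f}% smaller)")
print(f"causal:     {causal:.1f}x ({ratio_to_reduction_pct(causal):.1f}% smaller)")

# Footnote estimate: Graphify's claimed 71.5x applied to a ~350k baseline.
graphify_tokens = 350_000 / 71.5
print(f"graphify:   ~{round(graphify_tokens, -2):.0f} tokens")
```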
Key takeaways:
- memoire structural (52×) vs Graphify (71.5×): The gap comes from Graphify's Leiden community detection, which collapses many related nodes into named clusters. memoire's structural graph preserves individual edges without lossy clustering.
- memoire causal (15×) vs structural (52×): The causal graph is intentionally larger because it carries DRIVES/SPECIFIES/ASSERTS_ON edges with rationales, observation counts, and side-effect metadata. That semantic richness is paid for with lower compression.
- The compression question is secondary to answer quality. A graph that compresses to 5,000 tokens but loses the causal structure ("why does X break when Y changes?") is less useful than one at 23,000 tokens that preserves it. The three-condition experiment above exists precisely to measure whether causal edges improve answer quality beyond structural edges.
Methodology note
The benchmark questions (pandas version, forecast endpoint, riskiest files) were designed for the testproject and are not domain-relevant to the Karpathy ML corpus. Token counts are still valid as compression measurements. For a full quality comparison, domain-specific questions should be used with each corpus.
Running your own benchmark
# Code-only projects
memoire init --provider claude
memoire ingest
python scripts/benchmark.py --project-root . --output my_results.json
# Projects with PDFs and images (install pdf extras first)
pip install "memoire-ai[pdf]"
memoire ingest
python scripts/benchmark.py --project-root . --output my_results.json
Share results at github.com/athammad/memoire/discussions.