Benchmark: three-condition experiment

This page documents the three-condition experiment comparing knowledge graph types for AI coding assistant sessions. The question is not just "does a graph beat raw files?" but "do causal edge semantics beat structural edges, and at what token cost?"

The three conditions

	Condition A	Condition B	Condition C
Name	Baseline	Structural KG	Causal KG (memoire)
Context given to Claude	All raw source files concatenated	Graph with IMPORTS / CALLS / INHERITS edges	Full graph + DRIVES / SPECIFIES / ASSERTS_ON edges
File reads	N (one per file)	0	0
Ingest cost	None	None	One-time LLM extraction per markdown file
Token cost	Scales with project size	Fixed (~2,000–3,000 tok)	Fixed (~7,000–9,000 tok)

Questions

Three questions that require understanding relationships, not just file content:

"What will break if I change the pandas version?" — dependency fan-in
"What files do I need to touch to add a new forecast endpoint?" — causal chains
"What are the riskiest files to modify in this project and why?" — centrality reasoning

Model choices

memoire uses each provider's cheapest available model for all internal LLM tasks:

Provider	Extraction model	Input $/MTok	Output $/MTok
Claude / Anthropic	`claude-haiku-4-5`	$0.80	$4.00
OpenAI	`gpt-4o-mini`	$0.15	$0.60
Gemini	`gemini-2.0-flash`	$0.10	$0.40
Ollama	local model	$0.00	$0.00
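
To make the pricing table concrete, the sketch below converts token counts into an estimated dollar cost. The prices are copied from the table above; the `PRICES` dictionary, the `estimate_cost` helper, and the example token counts are illustrative assumptions and not part of memoire.

```python
# Per-million-token prices in USD as (input, output), copied from the table above.
# The dictionary layout and function name are illustrative, not memoire's API.
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical extraction call: 5,000 tokens in, 1,000 tokens out.
    for model in PRICES:
        print(f"{model}: ${estimate_cost(model, 5_000, 1_000):.5f}")
```

Output tokens dominate the cost at these price ratios, so a short extraction prompt with a long structured response can cost more than the input size alone suggests.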

How to run

# Install with dev dependencies
pip install -e ".[dev]"

# Run all three conditions (default)
python scripts/benchmark.py --project-root /path/to/your/project

# Save full results to JSON
python scripts/benchmark.py --project-root /path/to/your/project --output results.json

# Run specific conditions only
python scripts/benchmark.py --project-root /path/to/your/project --conditions structural causal

# Use a different session model
python scripts/benchmark.py --project-root /path/to/your/project \
  --session-model claude-sonnet-4-6

# Use custom questions
python scripts/benchmark.py --project-root /path/to/your/project \
  --questions "What is the highest-risk file?" "What depends on the auth module?"

Requirements: claude CLI must be installed and authenticated.

Results

The benchmark was run against the testproject, a realistic S&P 500 forecasting service with 28 source files totalling ~19,000 tokens (FastAPI, ARIMA, Redis cache, PostgreSQL ORM, Prometheus metrics, auth, scheduler, Streamlit dashboard, notifications, health checks, tests, docs).

Question	A: Baseline	B: Structural	C: Causal	A→B	A→C
What will break if I change the pandas version?	19,316 tok	2,433 tok	7,569 tok	−87.4%	−60.8%
What files to touch for a new forecast endpoint?	19,245 tok	2,370 tok	7,583 tok	−87.7%	−60.6%
What are the riskiest files to modify?	19,416 tok	2,575 tok	7,764 tok	−86.7%	−60.0%
Total tokens	57,977	7,378	22,916	−87.3%	−60.5%

	A: Baseline	B: Structural	C: Causal (w/ ingest)
Total cost	$0.04854	$0.00766	$0.02885
Time (s)	33.1	31.9	39.4
File reads	28	0	0

Session model: claude-haiku-4-5 ($0.80/MTok in, $4.00/MTok out)
Condition C ingest: claude-haiku-4-5, one-time cost $0.00875 (amortised per question above)
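
The A→B and A→C columns in the first results table are signed percentage changes relative to the baseline. As a worked check, the sketch below recomputes the totals row from the per-question token counts reported above; the function and variable names are illustrative.

```python
# Per-question token counts copied from the results table above.
RESULTS = {
    "pandas version": {"A": 19_316, "B": 2_433, "C": 7_569},
    "forecast endpoint": {"A": 19_245, "B": 2_370, "C": 7_583},
    "riskiest files": {"A": 19_416, "B": 2_575, "C": 7_764},
}


def percent_change(baseline: int, value: int) -> float:
    """Signed percentage change from baseline (negative means fewer tokens)."""
    return (value - baseline) / baseline * 100


def totals(results: dict) -> dict:
    """Sum token counts per condition across all questions."""
    out = {"A": 0, "B": 0, "C": 0}
    for counts in results.values():
        for condition, tokens in counts.items():
            out[condition] += tokens
    return out


if __name__ == "__main__":
    t = totals(RESULTS)
    print(t)
    print(f"A->B: {percent_change(t['A'], t['B']):.1f}%")
    print(f"A->C: {percent_change(t['A'], t['C']):.1f}%")
```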

Token count method

Token counts use a 1 token ≈ 4 characters approximation. Costs are based on published pricing. Run the benchmark script on your own project for precise numbers.
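
A minimal sketch of the 4-characters-per-token approximation follows. It is not the code in scripts/benchmark.py: the helper names, the rounding direction, and the file-suffix filter are assumptions made for illustration.

```python
from pathlib import Path


def estimate_tokens(text: str) -> int:
    """Approximate token count using the 1 token ≈ 4 characters rule."""
    # Integer division rounds down; the real script may round differently.
    return len(text) // 4


def estimate_project_tokens(root: str, suffixes=(".py", ".md")) -> int:
    """Sum estimated tokens over files under root with the given suffixes."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


if __name__ == "__main__":
    print(estimate_project_tokens("."))
```

A character-based estimate tends to drift from a real tokenizer on code with many short symbols or on non-English text, so treat the figures as order-of-magnitude measurements.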

What the results mean

B beats A on tokens by 87%. The structural graph (IMPORTS/CALLS/INHERITS) is extremely compact — 28 edges describing a 28-file project fit in ~2,400 tokens. Claude answers dependency questions correctly from this.

C beats A on tokens by 60%, but uses about 3× as many tokens as B.
The causal graph is larger because it carries 72 additional edges (DRIVES, SPECIFIES, ASSERTS_ON) with rationales, observation counts, and side-effect metadata. That overhead is the cost of richer semantics.
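
A rough way to see why causal edges cost more tokens than structural ones is to compare serialised sizes. The sketch below is not memoire's schema: the class names and the `rationale`, `observations`, and `side_effects` fields are hypothetical stand-ins for the rationales, observation counts, and side-effect metadata mentioned above, and the example values are made up.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StructuralEdge:
    source: str
    target: str
    kind: str  # e.g. IMPORTS, CALLS, INHERITS


@dataclass
class CausalEdge(StructuralEdge):
    # Hypothetical field names; kind would be DRIVES, SPECIFIES, or ASSERTS_ON.
    rationale: str = ""
    observations: int = 0
    side_effects: list[str] = field(default_factory=list)


def approx_tokens(edge) -> int:
    """Estimate tokens of the JSON-serialised edge (1 token ≈ 4 characters)."""
    return len(json.dumps(asdict(edge))) // 4


if __name__ == "__main__":
    structural = StructuralEdge("model.py", "pandas", "IMPORTS")
    causal = CausalEdge(
        "pipeline.py", "app.py", "DRIVES",
        rationale="pipeline output feeds the API entry point",
        observations=3,
        side_effects=["network"],
    )
    print("structural:", approx_tokens(structural))
    print("causal:", approx_tokens(causal))
```

Each extra field adds a fixed overhead per edge, so the cost of richer semantics scales with the number of causal edges rather than with the size of the source files.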

The scientific question — does C give better answers than B?
Token count does not measure answer quality. The key claim of memoire is that causal edges produce ranked, rationale-bearing answers where structural edges produce correct but unranked answers. For example:

B (structural): "pandas is imported by client.py, pipeline.py, and model.py."
C (causal): "pipeline.py is highest-risk — it has causal reachability 3, two network side effects, and DRIVES app.py which is the API entry point."

This qualitative difference is not captured in the token benchmark above. A formal answer-quality evaluation (human raters or LLM-as-judge) is the next step.

Why savings grow with project size

Project size	Baseline tokens	Structural tokens	Causal tokens	A→B	A→C
28 files (code only)	~58,000	~7,400	~23,000	−87%	−60%
55 files (code + PDFs + images)	~349,000	~6,700	~23,300	−98%	−93%
100 files	~200,000	~15,000	~28,000	~−93%	~−86%
200 files	~400,000+	~20,000	~30,000	~−95%	~−93%

Graph size grows sublinearly with project size. Token figures in the table are totals across the three benchmark questions (the 28-file row matches the results above), so the top-100 edge cap in get_context() keeps the causal context at roughly 10,000 tokens or less per question regardless of project size. PDFs and images inflate the baseline dramatically (a 15-page paper = ~30,000 raw tokens) while adding only a handful of graph edges.
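
The effect of an edge cap on context size can be sketched as follows. This is not the implementation of get_context(): ranking edges by a numeric `score` field and the fixed `tokens_per_edge` figure are assumptions chosen only to show why a capped context stops growing.

```python
def cap_edges(edges: list[dict], limit: int = 100) -> list[dict]:
    """Keep the `limit` highest-scoring edges (ranking by 'score' is assumed)."""
    return sorted(edges, key=lambda e: e["score"], reverse=True)[:limit]


def context_tokens(edges: list[dict], tokens_per_edge: int = 75) -> int:
    """Rough context size, assuming a fixed average token cost per edge."""
    return len(edges) * tokens_per_edge


if __name__ == "__main__":
    # Hypothetical projects with a growing number of extracted edges.
    for edge_count in (50, 500, 5_000):
        edges = [{"id": i, "score": i} for i in range(edge_count)]
        kept = cap_edges(edges)
        print(edge_count, "edges ->", context_tokens(kept), "context tokens")
```

Once a project has more edges than the cap, the context size depends only on the cap and the per-edge cost, while the raw-file baseline keeps growing with every added file.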

Comparison with Graphify

Graphify is the most widely cited tool claiming token reduction for AI coding assistants. Both tools were run on the identical Karpathy corpus: 3 code repos (nanoGPT, minGPT, micrograd) + 5 academic PDFs + 4 images (55 files total).

	Graphify	memoire B (structural)	memoire C (causal)
Token reduction vs raw	71.5×	52.3×	14.9×
Total context tokens	~4,900*	6,669	23,346
Baseline tokens	~350,000	348,981	348,981
Ingest cost	First run (LLM)	None — static analysis only	One-time LLM extraction
Persistent daemon	No	Yes	Yes
Causal edge semantics	No	No	Yes (DRIVES, SPECIFIES, ASSERTS_ON)
Graph method	NetworkX + Leiden community detection	Pattern-based static extraction	Static + LLM promotion

*Graphify's output token count estimated from their 71.5× claim on ~350k baseline.
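
The reduction factors in the table, and the estimated Graphify token count in the footnote, follow from simple division. The sketch below reproduces that arithmetic from the figures above; the function names are illustrative.

```python
def compression_ratio(baseline_tokens: float, context_tokens: float) -> float:
    """How many times smaller the context is than the raw baseline."""
    return baseline_tokens / context_tokens


def implied_context_tokens(baseline_tokens: float, ratio: float) -> float:
    """Context size implied by a claimed reduction factor."""
    return baseline_tokens / ratio


if __name__ == "__main__":
    baseline = 348_981  # memoire baseline on the 55-file corpus
    print(f"structural: {compression_ratio(baseline, 6_669):.1f}x")
    print(f"causal:     {compression_ratio(baseline, 23_346):.1f}x")
    # Graphify's figure is back-calculated from its 71.5x claim on ~350k tokens.
    print(f"graphify:   ~{implied_context_tokens(350_000, 71.5):,.0f} tokens")
```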

Key takeaways:

memoire structural (52×) vs Graphify (71.5×): The gap comes from Graphify's Leiden community detection, which collapses many related nodes into named clusters. memoire's structural graph preserves individual edges without lossy clustering.
memoire causal (15×) vs structural (52×): The causal graph is intentionally larger because it carries DRIVES/SPECIFIES/ASSERTS_ON edges with rationales, observation counts, and side-effect metadata. That semantic richness comes at the cost of lower compression.
The compression question is secondary to answer quality. A graph that compresses to 5,000 tokens but loses the causal structure ("why does X break when Y changes?") is less useful than one at 23,000 tokens that preserves it. The three-condition experiment above is set up to test whether causal edges improve answer quality beyond structural edges, which the token counts alone cannot show.

Methodology note

The benchmark questions (pandas version, forecast endpoint, riskiest files) were designed for the testproject and are not domain-relevant to the Karpathy ML corpus. Token counts are still valid as compression measurements. For a full quality comparison, domain-specific questions should be used with each corpus.

Running your own benchmark

# Code-only projects
memoire init --provider claude
memoire ingest
python scripts/benchmark.py --project-root . --output my_results.json

# Projects with PDFs and images (install pdf extras first)
pip install "memoire-ai[pdf]"
memoire ingest
python scripts/benchmark.py --project-root . --output my_results.json

Share results at github.com/athammad/memoire/discussions.