
Benchmarks: vs Claude Code and Codex

Three-way comparison — Abstract vs Claude Code vs Codex CLI across startup, memory, throughput, and graph recall.

Abstract vs Claude Code vs Codex CLI

Claude Code numbers come from run_tool_bench_claude.sh --full; Codex numbers from run_tool_bench_codex.sh --full. All runs use each tool's non-interactive mode (claude -p, codex exec).

Claude Code v2.0.76 (Bun/JS, Anthropic Max plan). Codex CLI v0.118.0 (Node.js/Rust hybrid, OpenAI). Abstract v0.1.0 (Rust, OpenAI gpt-4o).


Infrastructure

| Metric | Abstract | Claude Code | Codex CLI |
| --- | --- | --- | --- |
| Startup | 22ms | 266ms | 57ms |
| Binary / package | 6.0 MB | 174 MB | ~15 MB |
| Peak RSS | 4.7 MB | 333 MB | 44.7 MB |
| --help latency | 20ms | 263ms | 57ms |
| Tool dispatch (Read) | 0.09ms | ~265ms (fork) | n/a |

Abstract is a single static Rust binary. Claude Code bundles the Bun runtime. Codex uses Node.js with a Rust sandbox component. Codex is significantly lighter than Claude Code but still 9.5x heavier than Abstract by peak RSS.
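The startup figures above come from the benchmark scripts; a minimal Python sketch of the same measurement (not part of those scripts — the command to time is whatever CLI you point it at) looks like this:

```python
import subprocess
import sys
import time

def measure_startup_ms(cmd, iterations=5):
    """Spawn `cmd` repeatedly and return the best wall-clock time in ms.

    Taking the minimum filters out scheduler noise, the usual convention
    for cold-start microbenchmarks.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000)
    return min(samples)

# Demo with the Python interpreter itself; swap in e.g. ["abstract", "--help"]
# or ["claude", "--help"] to reproduce the table's startup column.
print(f"{measure_startup_ms([sys.executable, '-c', '']):.1f}ms")
```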


Memory

Memory is where the three tools diverge most. Claude Code and Codex both make LLM calls for memory operations; Abstract uses an embedded graph database.

| Operation | Abstract | Claude Code | Codex CLI |
| --- | --- | --- | --- |
| Memory recall (agent) | 98us (graph) | 7545ms (Sonnet) | 5751ms (GPT) |
| Memory write (agent) | 28us (graph) | 20687ms | 5882ms |
| Memory recall (file I/O) | 1.3ms (text) | 17.5ms (grep) | n/a |
| MEMORY.md load | 9.6us | 17.1ms | n/a |
| File scan (100 files) | 1.2ms | 26.6ms | n/a |
| Session parse (20K lines) | ~53ms | 378.7ms | n/a |

Claude Code calls Sonnet every turn to rank which 5 memory files are relevant (~7.5 seconds). Codex runs the full agent pipeline for memory operations (~5.8 seconds). Abstract's graph does indexed lookups in 98 microseconds — no LLM call, no API cost.
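The architectural difference is an in-process indexed lookup versus a blocking network round trip to a model. A minimal sketch of the indexed side, using a plain dict as a stand-in for Abstract's graph index (not its actual API):

```python
import time

# Stand-in for an indexed memory graph: topic key -> related facts.
memory_graph = {
    "benchmarks": ["startup 22ms", "peak RSS 4.7 MB"],
    "codex": ["v0.118.0", "Node.js/Rust hybrid"],
}

start = time.perf_counter()
facts = memory_graph.get("codex", [])  # O(1) indexed recall, no I/O
elapsed_us = (time.perf_counter() - start) * 1e6

# An LLM-backed recall would instead serialize candidate memories into a
# prompt and block on an API call, which is where the multi-second
# latencies in the table come from.
print(facts, f"{elapsed_us:.1f}us")
```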


Agentic Throughput

End-to-end prompt-to-response latency. Abstract and Codex both use OpenAI models. Claude Code uses Anthropic Opus via Max plan.

| Metric | Abstract | Claude Code | Codex CLI |
| --- | --- | --- | --- |
| Simple prompt ("say OK") | 2122ms | 8942ms | 3843ms |
| Sequential (10 prompts) | 1564ms/req | 12079ms/req | 4152ms/req |

The throughput gap between Abstract and Codex (2.7x) is purely framework overhead — both hit the same OpenAI API. The gap between Codex and Claude Code (2.9x) includes both framework overhead and provider latency differences.
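The attribution follows directly from the per-request numbers in the table:

```python
abstract_ms = 1564   # sequential ms/request, from the table above
codex_ms = 4152
claude_ms = 12079

# Abstract and Codex hit the same OpenAI API, so this ratio is pure
# framework overhead.
print(f"Codex vs Abstract: {codex_ms / abstract_ms:.1f}x")

# Claude Code uses a different provider, so this ratio mixes framework
# overhead with provider latency.
print(f"Claude vs Codex:   {claude_ms / codex_ms:.1f}x")
```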


Token Consumption

| Factor | Abstract | Claude Code | Codex CLI |
| --- | --- | --- | --- |
| System prompt | ~2200 tokens | ~8000+ tokens | ~10000+ tokens |
| Tool definitions | 34 tools | ~40 tools | ~30 tools |
| "say OK" total tokens | n/a | n/a | 10180 |
| LLM call for recall | No | Yes (Sonnet) | Yes (GPT) |
| Per-turn memory overhead | 12us | ~7500ms | ~5800ms |

Codex used 10180 tokens for a 2-word response. The bulk is the system prompt, tool definitions, and workspace context that Codex resends every turn.
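Treating the ~10000-token system prompt from the table as a fixed per-turn floor (an approximation, since the table gives it only as a lower bound), the share of the run spent on resent context works out as:

```python
total_tokens = 10180   # Codex "say OK" run, from the table above
fixed_context = 10000  # approximate fixed per-turn context for Codex (table)

# The response itself is a handful of tokens, so nearly the entire bill
# is boilerplate that gets resent on every turn.
fixed_share = fixed_context / total_tokens
print(f"{fixed_share:.0%} of the tokens are fixed per-turn context")
```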


Summary

| Category | Abstract | Claude Code | Codex CLI |
| --- | --- | --- | --- |
| Startup | 22ms | 266ms (12x) | 57ms (2.6x) |
| RSS | 4.7 MB | 333 MB (71x) | 44.7 MB (9.5x) |
| Simple prompt | 2122ms | 8942ms (4.2x) | 3843ms (1.8x) |
| Throughput | 1564ms/req | 12079ms/req (7.7x) | 4152ms/req (2.7x) |
| Memory recall | 98us | 7545ms | 5751ms |
| Memory write | 28us | 20687ms | 5882ms |
| Graph memory | Yes | No | No |
| LLM for recall | No | Yes | Yes |

Ratios in parentheses are relative to Abstract.


Reproduce

```sh
# vs Claude Code
./run_tool_bench_claude.sh --iterations 20 --full

# vs Codex CLI
./run_tool_bench_codex.sh --iterations 20 --full

# Memory architecture
cargo run --release -p abstract-cli --example memory_bench
```

Full report: crates/abstract-cli/benchmarks/REPORT.md
