Cersei

Cersei vs Agno / LangGraph / CrewAI / PydanticAI

Five-axis benchmark of Cersei against the Python agent frameworks — instantiation, per-agent memory, max concurrent agents, graph-memory recall, semantic search.

General Agent Framework Benchmark

Cersei started as a Rust SDK for coding agents. With graph memory, in-process semantic search via cersei-embeddings, sub-agent orchestration, hooks, and a full permission system, it now competes directly with the general-purpose Python agent frameworks — Agno, LangGraph, CrewAI, PydanticAI.

The thing those frameworks' benchmarks don't measure is the production bottleneck that actually matters once you leave a notebook: how many live agent instances can you hold on one host before p99 latency or memory blows up? That is where Rust, native structs, and zero-overhead Arc sharing should decide the fight, and this page is the scoreboard.

Methodology

Every harness constructs the real agent with no network call: Agent(model=OpenAIChat(id="gpt-4o"), tools=[...]) for Agno, create_react_agent(ChatOpenAI(model="gpt-4o"), ...) for LangGraph, and so on. Model constructors are lazy in every framework measured here. We deliberately do not invoke a turn. Agno's own cookbook at cookbook/09_evals/performance/ takes the same position: inference cost is identical across frameworks, and measuring it muddies what we actually care about, which is the framework's own overhead. The axis-3 ramp measures pure concurrent construction capacity: how many agents can be built and held live before RSS or latency blows up. The full reproduction script is at bench/general-agents/run.sh.


Axis 1 — Instantiation Time

Time in μs to construct one ready-to-use agent with one tool attached, measured over 1000 samples (after 100 warmup iterations).

All five numbers below were measured on Apple M1 Pro via the same harness suite at bench/general-agents/.

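| Framework | Version | Instantiation p50 | Ratio vs Cersei |
|---|---|---|---|
| Cersei | 0.1.6-patch.2 | 7.12 μs | baseline |
| Agno | 2.5.17 | 6.50 μs | 0.9× |
| PydanticAI | 1.22.0 | 219.12 μs | 31× |
| LangGraph | 1.1.8 | 5 536.17 μs | 777× |
| CrewAI | 1.14.2 | 28 508.83 μs | 4 004× |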

Agno is marginally faster (6.50 μs vs 7.12 μs), but Cersei lands within about 10% of it despite carrying a batteries-included agent (memory manager, graph backend, hook chain, permission policy, cost tracker, cancel token, broadcast channel). Agno's class is genuinely lean; everything else is slower by a lot. Cersei is 31× faster than PydanticAI, 777× faster than LangGraph, and roughly 4,000× faster than CrewAI.

The Agent::builder().build() call path is pure struct + Arc allocation — no network, no token counting, no provider auth. First LLM request pays the provider init cost once; it is not charged to instantiation.
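The measurement protocol described for this axis (100 warmup constructions, then 1000 timed samples, report the median) can be sketched in a few lines of Python. This is a minimal illustration, not the harness in bench/general-agents/: `StandInAgent`, `noop_tool`, and `measure_instantiation_us` are hypothetical names, and a real run would swap the stand-in for the framework's actual constructor.

```python
import statistics
import time


class StandInAgent:
    """Hypothetical placeholder for a framework's agent class."""

    def __init__(self, tools):
        self.tools = list(tools)


def noop_tool():
    """Hypothetical single tool attached to each agent."""
    return None


def measure_instantiation_us(build, warmup=100, samples=1000):
    """Return one construction time per sample, in microseconds."""
    for _ in range(warmup):  # warm caches and lazy imports; timings discarded
        build()
    timings = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        agent = build()
        elapsed_ns = time.perf_counter_ns() - start
        timings.append(elapsed_ns / 1_000)
        del agent  # drop the instance so samples do not accumulate
    return timings


timings = measure_instantiation_us(lambda: StandInAgent(tools=[noop_tool]))
p50 = statistics.median(timings)
print(f"p50 = {p50:.2f} us over {len(timings)} samples")
```

Reporting the median rather than the mean keeps a single slow sample (a GC pause, a scheduler hiccup) from skewing the figure.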


Axis 2 — Per-Agent Memory

Bytes per agent, held over 1000 live instantiations. Cersei uses jemalloc::stats::allocated (delta / N); the Python frameworks use tracemalloc. This is the most honest cross-language comparison we can run — both count real bytes allocated by the framework.

| Framework | Per-agent memory | Ratio vs Cersei |
|---|---|---|
| Cersei | 704 B | baseline |
| Agno | 5.8 KiB (5 938 B) | 8.4× |
| PydanticAI | 8.7 KiB (8 892 B) | 12.6× |
| CrewAI | 17.7 KiB (18 157 B) | 25.8× |
| LangGraph | 30.2 KiB (30 910 B) | 44× |

This is the headline win. One Cersei agent is about 8× smaller than the leanest Python framework's agent and 44× smaller than LangGraph's. On a 4 GB process that difference is "tens of thousands of concurrent agents" vs "a few thousand before the GC starts thrashing".
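On the Python side, the "delta / N" idea can be sketched with the standard-library tracemalloc module. This is an illustration under assumptions rather than the actual harness: `StandInAgent` and `bytes_per_agent` are hypothetical names, and the list that holds the agents is itself counted in the delta.

```python
import tracemalloc


class StandInAgent:
    """Hypothetical placeholder for a framework's agent class."""

    def __init__(self, name):
        self.name = name
        self.history = []


def bytes_per_agent(build, n=1000):
    """Allocate n agents, keep them all live, and return traced bytes per agent."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    agents = [build(i) for i in range(n)]  # every instance stays referenced
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(agents) == n  # the agents are still live when 'after' is read
    return (after - before) / n


per_agent = bytes_per_agent(lambda i: StandInAgent(f"agent-{i}"))
print(f"{per_agent:.0f} bytes per stand-in agent")
```

tracemalloc only sees allocations made through Python's allocator, so memory taken by native extensions would be missed by a sketch like this.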


Axis 3 — Max Concurrent Agents

The production question: how many agents can you build and hold live on one host before RSS or per-construction latency falls apart?

Cersei ramp (measured, Apple M1 Pro, host process, no cgroup cap). Each row spawns N tokio::spawn tasks, each builds one agent, all N are held in a Vec until the step completes — then RSS is read:

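| N live agents | p50 per-build | p99 per-build | RSS (total) | Wall to build all N |
|---|---|---|---|---|
| 100 | 0.05 ms | 0.21 ms | 8.3 MB | 1.0 ms |
| 500 | 0.05 ms | 0.13 ms | 8.5 MB | 4.4 ms |
| 1 000 | 0.05 ms | 0.13 ms | 9.3 MB | 8.5 ms |
| 5 000 | 0.06 ms | 0.14 ms | 14.0 MB | 42.3 ms |
| 10 000 | 0.06 ms | 0.16 ms | 22.4 MB | 86.6 ms |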

Ten thousand concurrent Cersei agents, all constructed and held live, in 87 ms wall-clock on 22 MB of RSS total. That is ~1.4 KB incremental RSS per live agent (so tight only because the provider, permission policy, and cost tracker are shared via Arc), with p99 per-build construction latency staying under 200 μs at 10k concurrency.
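The ~1.4 KB figure follows from the first and last rows of the ramp: divide the RSS growth by the number of additional agents. A short worked check, assuming the table's "MB" means decimal megabytes (10^6 bytes):

```python
# (N live agents, total RSS in MB) taken from the first and last ramp rows
n_low, rss_low_mb = 100, 8.3
n_high, rss_high_mb = 10_000, 22.4
wall_ms_at_10k = 86.6

# Assumption: 1 MB = 1_000_000 bytes
incremental_bytes = (rss_high_mb - rss_low_mb) * 1_000_000 / (n_high - n_low)
amortized_build_us = wall_ms_at_10k * 1_000 / n_high

print(f"incremental RSS per agent: {incremental_bytes:.0f} B")
print(f"amortized wall time per agent: {amortized_build_us:.2f} us")
```

The incremental RSS comes out larger than the 704 B allocator-level figure from axis 2, which is expected: RSS counts whole pages plus per-task overhead, not just the bytes the agent struct requested.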

And the Python frameworks at the same concurrency

Each Python framework was sampled at N=100 and N=500 on the same box. Pushing further on CrewAI in particular would take 10+ minutes per step, so we stop where the slope is obvious:

| Framework | N=100 (RSS / wall) | N=500 (RSS / wall) |
|---|---|---|
| Cersei | 8.3 MB / 1.0 ms | 8.5 MB / 4.4 ms |
| Agno | 79.3 MB / 7.7 ms | 82.0 MB / 13.5 ms |
| PydanticAI | 122.0 MB / 28.9 ms | 123.2 MB / 125.0 ms |
| LangGraph | 193.5 MB / 361.3 ms | 193.5 MB / 2 246.5 ms |
| CrewAI | 1 739.3 MB / 11 628.5 ms | 1 739.3 MB / 50 697.4 ms |

Reading that CrewAI row: building 500 CrewAI agents takes about 51 seconds wall-clock and balloons to 1.7 GB RSS. Cersei builds the same 500 in 4.4 ms on 8.5 MB. That's roughly 11,500× the wall time and 204× the RSS for the same capacity. This is why "can I just spawn one agent per customer session" is not a question you can actually say yes to in Python agent frameworks today.
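A single ramp step for a Python framework can be approximated as: build N agents, keep them all referenced, then read wall time and RSS. The sketch below is one plausible shape, not the script in bench/general-agents/. It assumes sequential construction on a Unix host, uses a hypothetical `StandInAgent`, and reads peak RSS via `resource.getrusage`, which reports kibibytes on Linux and bytes on macOS.

```python
import resource
import sys
import time


class StandInAgent:
    """Hypothetical placeholder for a framework's agent class."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.messages = []


def peak_rss_mib():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kibibytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def ramp_step(build, n):
    """Build n agents, hold them live, and report wall time and peak RSS."""
    start = time.perf_counter()
    held = [build(i) for i in range(n)]
    wall_ms = (time.perf_counter() - start) * 1_000
    rss = peak_rss_mib()  # read while all n agents are still referenced
    return {"n": len(held), "wall_ms": wall_ms, "peak_rss_mib": rss}


for n in (100, 500):
    print(ramp_step(StandInAgent, n))
```

Because `ru_maxrss` is a high-water mark, each step in a real ramp would need its own process if per-step RSS is supposed to be independent of earlier steps.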

Reproduce this table yourself:

cd bench/general-agents
./run.sh
cat results/summary.json

Axis 4 — Graph Memory Recall Under Load (Cersei-only)

The Python frameworks don't ship an in-process graph database at all. Cersei does — Grafeo, schema-versioned, file-backed — and exposes it through cersei-memory::MemoryManager.

Under 10 concurrent readers × 100 recalls each, 10 000 nodes (macOS, Apple M1 Pro):

| Metric | Value |
|---|---|
| p50 | 94 ms |
| p95 | 120 ms |
| p99 | 139 ms |

Under higher concurrency (100+ simultaneous readers) recall serializes behind Grafeo's read path, inflating p50. Single-reader recall is ~98 μs — see the graph-memory benchmark for that baseline. Tuning the concurrent-read path is tracked as work-in-progress; we publish what we measure.
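The load shape used here (10 concurrent readers, 100 recalls each, 10 000 nodes, then p50/p95/p99 over all samples) can be sketched generically. The example below is not Grafeo or the Cersei harness: `recall_stand_in` is a hypothetical dictionary lookup standing in for a graph recall, and threads stand in for concurrent readers.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def recall_stand_in(store, key):
    """Hypothetical stand-in for one graph-memory recall."""
    return store.get(key)


def reader(store, recalls):
    """Issue `recalls` lookups and return each latency in microseconds."""
    latencies_us = []
    for i in range(recalls):
        start = time.perf_counter_ns()
        recall_stand_in(store, i % len(store))
        latencies_us.append((time.perf_counter_ns() - start) / 1_000)
    return latencies_us


def run_load(readers=10, recalls=100, nodes=10_000):
    """Run all readers concurrently and summarize the pooled latencies."""
    store = {i: f"node-{i}" for i in range(nodes)}
    with ThreadPoolExecutor(max_workers=readers) as pool:
        futures = [pool.submit(reader, store, recalls) for _ in range(readers)]
        samples = [lat for fut in futures for lat in fut.result()]
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"count": len(samples), "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


summary = run_load()
print(summary)
```

Pooling every reader's samples before taking percentiles is what makes contention visible: if readers serialize behind a shared path, the waiting time shows up in each individual latency.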


Axis 5 — Semantic Search Under Load (Cersei-only)

Same comparison: the Python agent frameworks don't ship HNSW semantic search in-process. Cersei does, via cersei-embeddings.

Under 50 concurrent agents × 100 queries each, 10 000 embedded chunks (cosine, 64-d vectors):

| Metric | Value |
|---|---|
| p50 | 51 μs |
| p95 | 132 μs |
| p99 | 354 μs |

Semantic search scales cleanly under concurrency because usearch uses lock-free HNSW internally — fifty concurrent agents each issuing 100 queries complete with sub-millisecond p95.
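For readers unfamiliar with what each query computes, here is a brute-force cosine top-k over 64-d vectors. It is deliberately not HNSW and not the cersei-embeddings or usearch API: an exact linear scan like this returns the true nearest neighbors, which HNSW approximates with a graph index to avoid scanning every chunk. All names are illustrative.

```python
import math
import random

DIM = 64  # matches the 64-d vectors used in this axis


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_top_k(query, chunks, k=5):
    """Exact search: score every chunk, return the k best as (score, index)."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, c)), idx) for idx, c in enumerate(chunks)]
    scored.sort(reverse=True)
    return scored[:k]


rng = random.Random(0)  # fixed seed so the example is repeatable
chunks = [normalize([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(1_000)]

# Querying with a stored chunk should return that chunk first, with score ~1.0
hits = cosine_top_k(chunks[42], chunks, k=3)
print(hits[0])
```

The linear scan costs one dot product per stored chunk per query, which is the cost an HNSW index is designed to avoid as the collection grows.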


Reproduce the numbers

Python deps are managed with uv — install via curl -LsSf https://astral.sh/uv/install.sh | sh. Each Python framework lives in its own venv extra via pyproject.toml so Agno's deps never collide with LangGraph's.

# Cersei (host)
cargo run --release -p cersei-agent --example general_agent_bench --features bench-full

# Everything, host processes via uv (default)
cd bench/general-agents
./run.sh

# Per-framework
./run.sh --only cersei
./run.sh --only agno
uv run --extra pydantic_ai python bench_pydantic_ai.py

# Opt-in: Docker with a cgroup cap (for the max-concurrent-under-4GB axis)
./run.sh --docker
BENCH_MEM_CAP=16g ./run.sh --docker

Every measurement lands as JSON in bench/general-agents/results/<framework>.json matching a shared schema. aggregate.py merges them into summary.json.
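The merge step is easy to picture even without the real script. The sketch below is a guess at the general shape, not the contents of aggregate.py: the `merge_results` helper and the `instantiation_p50_us` field are hypothetical, since the shared schema is not shown on this page.

```python
import json
import tempfile
from pathlib import Path


def merge_results(results_dir):
    """Collect every <framework>.json in results_dir under its file stem."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        if path.name == "summary.json":
            continue  # never fold a previous summary back into itself
        summary[path.stem] = json.loads(path.read_text())
    return summary


with tempfile.TemporaryDirectory() as tmp:
    # Field name is invented for illustration; values come from the axis 1 table
    Path(tmp, "cersei.json").write_text(json.dumps({"instantiation_p50_us": 7.12}))
    Path(tmp, "agno.json").write_text(json.dumps({"instantiation_p50_us": 6.50}))

    summary = merge_results(tmp)
    Path(tmp, "summary.json").write_text(json.dumps(summary, indent=2))
    print(json.dumps(summary, indent=2))
```

Keying the summary by file stem means a partial run (for example, only `--only cersei`) still produces a valid, smaller summary.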

Axis selection and scale tunables:

# Skip the slow graph/semantic axes during iteration
CERSEI_BENCH_AXES=1,2,3 cargo run --release ...

# Push graph/semantic scales up to 100k
CERSEI_BENCH_GRAPH_MAX=100000 CERSEI_BENCH_SEMANTIC_MAX=100000 cargo run --release ...
