Head-to-head long-term-memory benchmark for the Cersei memory stack against Mastra, Zep, and Supermemory using the ICLR 2025 LongMemEval dataset.


LongMemEval (ICLR 2025) is the 500-question long-term-memory benchmark Mastra used for its Observational Memory research, Zep reports on in the Graphiti paper, and Supermemory cites in its comparisons. We run the same 500 questions through four Cersei memory configurations so the numbers line up one-for-one with what those frameworks publish.

TL;DR

Ran longmemeval_s — 500 questions — on 2026-04-25 (0.1.8 re-run with the Memory++ stack) against four Cersei configurations. Answerer, judge and observer are all gemini-2.5-flash; EmbeddingMemory uses gemini-embedding-001 (3072-d, Matryoshka). The hybrid config adds Omega-style per-question-type RAG prompts, LLM query expansion (lex + vec + HyDE), abstention floors (vec_min=0.35, graph_min=0.30), and Jaccard ≥ 0.85 semantic dedup at ingest — all new in 0.1.8.

Config	Overall (macro, excl abstention)	Abstention	Correct / total	Input tokens (sum)	Avg wall (ms)
A. Full-context baseline (`JsonlMemory`)	87.6 %	93.3 % (28/30)	434 / 500	55.37 M	9 564
B. Semantic recall (`EmbeddingMemory`, usearch HNSW + gemini-embedding-001)	86.6 %	93.3 % (28/30)	428 / 500	2.90 M (19× fewer)	39 124
C. Graph substring (`GraphMemory`, grafeo)	2.2 %	96.7 % (29/30)	33 / 500	0.46 M	4 081
D. Hybrid (Observer + embed + graph + RRF + 0.1.8 Memory++)	86.3 %	90.0 % (27/30)	430 / 500	1.78 M (31× fewer)	183 733

Δ vs 0.1.7 (same gemini-2.5-flash judge): baseline +3.0 pp, embed +2.4 pp, hybrid +0.6 pp. The Omega prompts + query expansion delivered a targeted +2.5 pp on multi-session on hybrid (78.5 % → 81.0 %), which was the weak spot the plan aimed at. The FTS5 retrieval channel and three-way RRF with per-type RetrievalProfile are queued for 0.1.9; that pair is expected to close the rest of the gap to the 90 % stretch target.
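
The Jaccard ≥ 0.85 dedup step mentioned above can be pictured with a small sketch. This is not the Cersei implementation: it assumes similarity is computed over lowercase whitespace-separated word sets, and the names `jaccard` and `dedup_at_ingest` are hypothetical.

```rust
use std::collections::HashSet;

/// Jaccard similarity between the lowercase word sets of two strings.
/// (Assumption: word-level tokens; the real tokenization is not documented here.)
fn jaccard(a: &str, b: &str) -> f64 {
    let set_a: HashSet<String> = a.split_whitespace().map(|w| w.to_lowercase()).collect();
    let set_b: HashSet<String> = b.split_whitespace().map(|w| w.to_lowercase()).collect();
    if set_a.is_empty() && set_b.is_empty() {
        return 1.0; // two empty strings are treated as identical
    }
    let intersection = set_a.intersection(&set_b).count() as f64;
    let union = set_a.union(&set_b).count() as f64;
    intersection / union
}

/// Keeps a candidate only if its similarity to every already-kept item
/// stays below `threshold` (0.85 in the 0.1.8 hybrid config).
fn dedup_at_ingest(candidates: &[&str], threshold: f64) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for &candidate in candidates {
        if kept.iter().all(|k| jaccard(k, candidate) < threshold) {
            kept.push(candidate.to_string());
        }
    }
    kept
}
```

Under these assumptions, "User lives in Berlin" and "user lives in berlin" collapse into one entry, while "User owns a dog" is kept as a separate fact.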

Where this places us on the public leaderboard

Against the numbers collected here and across Mastra, Supermemory, Zep, Hindsight, and EmergenceMem reports:

System	Model	Overall	Delta vs Cersei Baseline (0.1.8)
Mastra OM	`gpt-5-mini`	94.87 %	+7.3
Mastra OM	`gemini-3-pro-preview`	93.27 %	+5.7
Hindsight	`gemini-3-pro-preview`	91.40 %	+3.8
Mastra OM	`gemini-3-flash-preview`	89.20 %	+1.6
Hindsight	`GPT-OSS-120B`	89.00 %	+1.4
Cersei Baseline (0.1.8)	`gemini-2.5-flash`	87.6 %	—
Cersei Embed-only (0.1.8)	`gemini-2.5-flash`	86.6 %	−1.0
Cersei Hybrid (0.1.8)	`gemini-2.5-flash`	86.3 %	−1.3
EmergenceMem Internal*	`gpt-4o`	86.00 %	−1.6
Supermemory	`gemini-3-pro-preview`	85.20 %	−2.4
Supermemory	`gpt-5`	84.60 %	−3.0
Mastra OM	`gpt-4o`	84.23 %	−3.3
Hindsight	`GPT-OSS-20B`	83.60 %	−4.0
EmergenceMem Simple	`gpt-4o`	82.40 %	−5.2
Oracle	`gpt-4o`	82.40 %	−5.2
Supermemory	`gpt-4o`	81.60 %	−6.0
Mastra RAG (topK 20)	`gpt-4o`	80.05 %	−7.5
Zep	`gpt-4o`	71.20 %	−16.4
Full context	`gpt-4o`	60.20 %	−27.4

Headlines

Cersei Baseline @ 87.6 % and Hybrid @ 86.3 % both clear Supermemory on gemini-3-pro-preview (85.2 %) and gpt-5 (84.6 %), Mastra OM on gpt-4o (84.23 %), Mastra RAG on gpt-4o (80.05 %), Zep on gpt-4o (71.2 %), and EmergenceMem Internal on gpt-4o (86.0 %).
0.1.8's Memory++ stack (Omega RAG prompts + query expansion + abstention floors + dedup) delivered a +2.5 pp lift on multi-session (the hybrid's weakest type) — the targeted gain the plan called for.
The remaining gap to Mastra OM / gemini-3-flash-preview (89.2 %) and gpt-5-mini (94.87 %) is concentrated on (a) answerer-model tier and (b) the A2/A3 levers queued for 0.1.9 — FTS5 channel + three-way RRF + per-type RetrievalProfile.
Pure semantic retrieval (EmbeddingMemory, 86.6 %) beats Mastra OM / gpt-4o (84.23 %) and Supermemory / gpt-4o (81.6 %) with 19× fewer input tokens than our own full-context baseline.
Graph-substring (2.2 %) is the honest floor — substring match can't paraphrase-match. It still hits 96.7 % abstention because it returns junk → the answerer correctly refuses.

Per-question-type breakdown (0.1.8)

Question type	n	Baseline	Embed	Graph	Hybrid
`knowledge-update`	72	87.5 %	90.3 %	0.0 %	90.3 %
`multi-session`	121	78.5 %	76.0 %	0.0 %	81.0 %
`single-session-assistant`	56	98.2 %	96.4 %	0.0 %	96.4 %
`single-session-preference`	30	80.0 %	83.3 %	13.3 %	76.7 %
`single-session-user`	64	96.9 %	89.1 %	0.0 %	90.6 %
`temporal-reasoning`	127	84.3 %	84.3 %	0.0 %	82.7 %

Hybrid wins outright on multi-session (81.0 %, +2.5 pp vs 0.1.7), exactly the weak type 0.1.8's Memory++ stack targeted. Baseline leads on single-session-user and single-session-assistant because, with the full haystack in the prompt, the evidence is always in front of the answerer for those types; anything that retrieves less risks losing the answer. Embed and hybrid tie on knowledge-update (90.3 %). On single-session-preference, embed beats hybrid by two questions (25 vs 23 of 30) because the Omega PREFERENCE_RAG_PROMPT that hybrid uses is tuned for multi-fact aggregation.
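
The overall figures in the TL;DR are the macro average of the six per-type accuracies in this table, with abstention questions (ids ending in `_abs`) scored separately. A minimal sketch of that aggregation follows; the `Row` struct and `summarize` function are hypothetical names, not the contents of report.rs.

```rust
use std::collections::BTreeMap;

/// One judged question (hypothetical shape, for illustration only).
struct Row {
    question_id: String,
    question_type: String,
    correct: bool,
}

/// Returns (macro accuracy over question types excluding abstention,
/// abstention accuracy), both as fractions in 0.0..=1.0.
fn summarize(rows: &[Row]) -> (f64, f64) {
    // question type -> (correct, total)
    let mut per_type: BTreeMap<&str, (u32, u32)> = BTreeMap::new();
    let mut abstention = (0u32, 0u32);
    for r in rows {
        // Abstention cases are identified by the "_abs" id suffix.
        let bucket: &mut (u32, u32) = if r.question_id.ends_with("_abs") {
            &mut abstention
        } else {
            per_type.entry(r.question_type.as_str()).or_insert((0, 0))
        };
        bucket.1 += 1;
        if r.correct {
            bucket.0 += 1;
        }
    }
    let ratio = |(c, t): (u32, u32)| if t == 0 { 0.0 } else { c as f64 / t as f64 };
    // Macro average: every question type counts equally, regardless of n.
    let macro_acc = if per_type.is_empty() {
        0.0
    } else {
        per_type.values().map(|&ct| ratio(ct)).sum::<f64>() / per_type.len() as f64
    };
    (macro_acc, ratio(abstention))
}
```

As a check against the table, the Baseline column averages to (87.5 + 78.5 + 98.2 + 80.0 + 96.9 + 84.3) / 6 ≈ 87.6 %, which matches the headline number even though the micro figure (434 / 500) is 86.8 %.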

Cost comparison

Config	Input tokens	Est. Gemini cost (`gemini-2.5-flash` @ $0.30/1M in, $2.50/1M out)	Avg wall / Q
Baseline	55.37 M	$16.61	9.6 s
Embed	2.90 M	$0.87 + ~$0.05 embed = $0.92 (~18× cheaper than baseline)	39.1 s
Graph	0.46 M	$0.14 (119× cheaper than baseline)	4.1 s
Hybrid	1.78 M	$0.53 + ~$5–7 observer + ~$0.20 query-expansion = $5–8 (~2–3× cheaper than baseline)	183.7 s
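
The dollar figures in this table follow from simple arithmetic on the input-token counts. The sketch below reproduces it; the function names are hypothetical, and it covers input tokens only (output-token, observer, and embedding costs are the separate estimates listed in the table).

```rust
/// Price per one million input tokens in USD
/// (the gemini-2.5-flash input rate quoted in the table header).
const USD_PER_M_INPUT: f64 = 0.30;

/// Input-token cost in USD for a token count expressed in millions.
fn input_cost_usd(tokens_m: f64) -> f64 {
    tokens_m * USD_PER_M_INPUT
}

/// How many times cheaper `cost` is than `baseline_cost`.
fn times_cheaper(baseline_cost: f64, cost: f64) -> f64 {
    baseline_cost / cost
}
```

For example, `input_cost_usd(55.37)` gives about 16.61 for the baseline, and with the rounded dollar figures from the table `times_cheaper(16.61, 0.14)` comes to roughly 118.6, reported above as 119×.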

The per-question wall time difference between embed (39.1 s) and hybrid (183.7 s) is the cost of the Observer pass plus the new 0.1.8 query-expansion step: hybrid runs gemini-2.5-flash once per haystack session (30–40 sessions per question on longmemeval_s) to extract structured observations, and once per question to generate {lex, vec, hyde} retrieval variants.

The four configurations

Config	Cersei backend	What it tests
A. `baseline`	`JsonlMemory` — full haystack in prompt	Control lower bound. The answerer sees every turn, capped only by the LLM context window.
B. `embed`	`EmbeddingMemory` (`usearch` HNSW + `gemini-embedding-001` at 3072-d, cosine)	Pure semantic retrieval. Directly comparable to Mastra RAG.
C. `graph`	`GraphMemory` (grafeo substring + query-word-overlap rerank)	Honest floor for the graph layer — no LLM extraction, no semantic matching.
D. `hybrid`	LLM fact extractor → EmbeddingMemory + GraphMemory → RRF fusion	The configuration that actually competes with Mastra's observational memory.
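
The RRF fusion step in config D can be sketched as standard Reciprocal Rank Fusion over the ranked lists returned by each channel. This is an illustration rather than the code in configs/hybrid.rs: the function name `rrf_fuse` is hypothetical, and the constant k = 60 used in the example is the value commonly used for RRF, assumed here because the document does not state Cersei's setting.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: every ranked list contributes 1 / (k + rank)
/// to a document's score, with rank starting at 1. Returns the `top_k`
/// highest-scoring ids with their fused scores.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64, top_k: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, &doc_id) in ranking.iter().enumerate() {
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused.truncate(top_k);
    fused
}
```

Calling `rrf_fuse(&[embed_hits, graph_hits], 60.0, 20)` with two ranked id lists would give a fused top-20. A document that appears in both lists accumulates two terms, which is why agreement between channels tends to outrank a single high placement.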

Methodology (non-negotiable)

Dataset: xiaowu0162/longmemeval — 500 questions, 6 question types (single-session-user, single-session-assistant, single-session-preference, multi-session, temporal-reasoning, knowledge-update) plus abstention cases (detected via question_id.ends_with("_abs")). Variants: longmemeval_s (~115k tokens/Q, headline), longmemeval_m (~1.5M tokens/Q), longmemeval_oracle (evidence-only control).
Judge rubric: verbatim port of Mastra's six question-type prompts (_inspirations/mastra/explorations/longmemeval/src/evaluation/longmemeval-metric.ts), which in turn copies from the official Python evaluator.
Observer rubric: verbatim port of OBSERVER_EXTRACTION_INSTRUCTIONS from Mastra's @mastra/memory — temporal anchoring, user-assertions-beat-questions, state-change rules, preservation of specifics.
Context-injection rubric: verbatim port of OBSERVATION_CONTEXT_PROMPT + OBSERVATION_CONTEXT_INSTRUCTIONS wrapped around retrieved context before it hits the answerer. Contains Mastra's KNOWLEDGE UPDATES / PLANNED ACTIONS / MOST RECENT USER INPUT guidance.
Models: answerer, judge, and observer all on gemini-2.5-flash, temperature 0 (0.3 for observer, matches Mastra's observation.modelSettings.temperature). Embeddings: gemini-embedding-001, 3072-dim, with Matryoshka truncation if requested.
Metric: overall_accuracy is the macro average across question types excluding abstention (matching Mastra cli.ts:99). Abstention is reported separately.
Top-k for retrieval configs: 20 (matches Mastra's published RAG config).
Concurrency: 4 in-flight questions × up to 6 observer calls per question.
Retry: Gemini embedding + completion calls retry with exponential backoff (up to 6 attempts, ~30 s window) on 429 / 5xx / transport errors.
Security: API keys flow through x-goog-api-key header, never in the URL query string. Error paths run any leaked URL through a redactor before logging. See leak post-mortem below.
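
The retry policy in the list above (up to 6 attempts, exponential backoff, only on 429 / 5xx / transport errors) can be sketched as follows. This is a simplified, blocking illustration using only the standard library; `CallError`, `is_retryable`, and `retry_with_backoff` are hypothetical names, and the 1 s base delay mentioned in the usage note is an assumption chosen because 1 + 2 + 4 + 8 + 16 s adds up to roughly the 30 s window quoted.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Failure classes for an API call (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum CallError {
    Status(u16),
    Transport,
}

/// Only rate limits, server errors, and transport failures are retried.
fn is_retryable(err: &CallError) -> bool {
    match err {
        CallError::Status(code) => *code == 429 || (500..600).contains(code),
        CallError::Transport => true,
    }
}

/// Runs `call` up to `max_attempts` times, sleeping base, 2*base, 4*base, ...
/// between attempts. Non-retryable errors are returned immediately.
fn retry_with_backoff<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut call: impl FnMut() -> Result<T, CallError>,
) -> Result<T, CallError> {
    let mut attempt = 1;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

With `max_attempts = 6` and `base_delay = Duration::from_secs(1)`, the waits total 31 s before the final attempt. A 400-class error other than 429 is returned on the first failure, since repeating the same request would not help.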

Reproduce

The runner expects GOOGLE_API_KEY in the environment — source from a gitignored .env, never commit it.

./bench/long-mem/setup.sh              # downloads oracle + _s (~280 MB)
echo "GOOGLE_API_KEY=AIza..." > .env   # gitignored
set -a; source .env; set +a            # export the key so the runner process inherits it
cargo run --release -p longmem-bench -- \
  --dataset s --config all --concurrency 4 --top-k 20 \
  --provider gemini \
  --answerer-model gemini-2.5-flash \
  --judge-model gemini-2.5-flash \
  --extractor-model gemini-2.5-flash

Budget: ~$20–30 in Gemini API cost on 2.5-flash, ~3–4 h wall time at concurrency 4. Hybrid alone is ~2.5 h; baseline + embed + graph together are ~1 h.

# 10 questions × 4 configs on the oracle variant (~30 s, <$0.05)
set -a; source .env; set +a
cargo run --release -p longmem-bench -- \
  --dataset oracle --config all --limit 10 --provider gemini

# Just the hybrid config on the 500-Q _s set
set -a; source .env; set +a
cargo run --release -p longmem-bench -- \
  --dataset s --config hybrid --provider gemini --concurrency 4

Security hygiene

API keys have leaked from this repo twice before we tightened things up — a hardcoded one in bench/term-bench/runner-google.sh and, later, keys embedded in reqwest error messages (?key=… in the URL) that ended up in tracked .log and rows-*.json files. Permanent fixes applied:

Gemini calls now use x-goog-api-key header, never query string — crates/cersei-embeddings/src/gemini.rs and crates/cersei-provider/src/gemini.rs.
Error messages are scrubbed: any string that could carry key=<…> runs through redact_url_key before it hits logs or result files.
.gitignore blocks bench/**/*.log, bench/**/results*/, bench/**/runner-*.sh, bench/**/abstract-output.jsonl, and .env*.
Runner scripts refuse to start without GOOGLE_API_KEY in env — no more inline keys.
Pre-commit sanity check (run manually): git ls-files | xargs grep -l -E "AIza[A-Za-z0-9_-]{35}|sk-[A-Za-z0-9_-]{30,}" must return zero files.
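
The scrubbing step in the list above can be illustrated with a small sketch. Only the name `redact_url_key` comes from this document; the body below is a hypothetical re-implementation that assumes the secret runs from `key=` up to the next delimiter, and it may differ from what the crates actually do.

```rust
/// Replaces the value of every `key=` parameter in `input` with a placeholder.
/// Hypothetical re-implementation for illustration, not the crate's code.
fn redact_url_key(input: &str) -> String {
    const NEEDLE: &str = "key=";
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find(NEEDLE) {
        let value_start = pos + NEEDLE.len();
        // Copy everything up to and including "key=", then the placeholder.
        out.push_str(&rest[..value_start]);
        out.push_str("REDACTED");
        let after = &rest[value_start..];
        // The secret ends at the next delimiter, or at the end of the string.
        let value_len = after
            .find(|c: char| c == '&' || c == '#' || c == '"' || c == ')' || c.is_whitespace())
            .unwrap_or(after.len());
        rest = &after[value_len..];
    }
    out.push_str(rest);
    out
}
```

The match is deliberately broad: any parameter whose name ends in `key=` is scrubbed, so an error string loses a little detail in exchange for never carrying a credential into a log or result file.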

What goes where

Source: bench/long-mem/
Dataset loader: bench/long-mem/src/dataset.rs
Judge (6 rubric ports + LLM call): bench/long-mem/src/judge.rs
Per-config retrieval: bench/long-mem/src/configs/{baseline,embed,graph,hybrid}.rs
Runner + RRF fusion: bench/long-mem/src/runner.rs, configs/hybrid.rs
Aggregation: bench/long-mem/src/report.rs
SDK wiring behind the configs:
- EmbeddingMemory — thin adapter between EmbeddingStore and the Memory trait (new in 0.1.8).
- GraphMemory::recall_top_k — scored retrieval by query-word overlap (new in 0.1.8).

Credits

Benchmark + rubric: Di Wu et al., ICLR 2025 · official repo.
Harness shape + abstention detection: adapted from Mastra's @mastra/longmemeval. Prompt strings are a verbatim port.

Raw results

Full JSON (summary + per-question rows) in bench/long-mem/results/:

results/
├── a-baseline-jsonl-longmemeval_s.json          # summary
├── a-baseline-jsonl-rows-longmemeval_s.json     # 500 per-question traces
├── b-embed-only-longmemeval_s.json
├── b-embed-only-rows-longmemeval_s.json
├── c-graph-substring-longmemeval_s.json
├── c-graph-substring-rows-longmemeval_s.json
├── d-hybrid-embed-graph-longmemeval_s.json
├── d-hybrid-embed-graph-rows-longmemeval_s.json
└── summary-longmemeval_s.json

Caveats

Our longmemeval_s numbers are measured with gemini-2.5-flash as answerer, judge, and observer. Mastra's Observational Memory research tests several judge/answerer combinations; when comparing numbers, make sure the model pair matches. To rerun against a different pair, change --answerer-model and --judge-model (for example, gpt-4o for both), along with --provider if the models come from another provider.
concurrency=4 (outer) × INNER_EXTRACT_CONCURRENCY=6 (hybrid fact extraction) keeps the run under the provider's rate limits (Gemini in the runs reported here). If your quota allows more throughput, bumping to --concurrency 8 should roughly halve wall time.
A one-shot bench like this measures recall quality with fixed retrieval — it does not exercise agents calling tools during answering. If you need tool-use-in-the-loop behaviour, add a cersei::Agent wrapper; the memory backends tested here all plug in via .memory(...) on the builder.
