LongMemEval — Memory Benchmark
Head-to-head long-term-memory benchmark for the Cersei memory stack against Mastra, Zep, and Supermemory using the ICLR 2025 LongMemEval dataset.
LongMemEval — Cersei vs Mastra / Zep / Supermemory
LongMemEval (ICLR 2025) is the 500-question long-term-memory benchmark Mastra used for its Observational Memory research, Zep reports on in the Graphiti paper, and Supermemory cites in its comparisons. We run the same 500 questions through four Cersei memory configurations so the numbers line up one-for-one with what those frameworks publish.
TL;DR
Ran longmemeval_s — 500 questions — on 2026-04-25 (0.1.8 re-run with the Memory++ stack) against four Cersei configurations. Answerer, judge and observer are all gemini-2.5-flash; EmbeddingMemory uses gemini-embedding-001 (3072-d, Matryoshka). The hybrid config adds Omega-style per-question-type RAG prompts, LLM query expansion (lex + vec + HyDE), abstention floors (vec_min=0.35, graph_min=0.30), and Jaccard ≥ 0.85 semantic dedup at ingest — all new in 0.1.8.
| Config | Overall (macro, excl abstention) | Abstention | Correct / total | Input tokens (sum) | Avg wall (ms) |
|---|---|---|---|---|---|
| A. Full-context baseline (JsonlMemory) | 87.6 % | 93.3 % (28/30) | 434 / 500 | 55.37 M | 9 564 |
| B. Semantic recall (EmbeddingMemory, usearch HNSW + gemini-embedding-001) | 86.6 % | 93.3 % (28/30) | 428 / 500 | 2.90 M (19× fewer) | 39 124 |
| C. Graph substring (GraphMemory, grafeo) | 2.2 % | 96.7 % (29/30) | 33 / 500 | 0.46 M | 4 081 |
| D. Hybrid (Observer + embed + graph + RRF + 0.1.8 Memory++) | 86.3 % | 90.0 % (27/30) | 430 / 500 | 1.78 M (31× fewer) | 183 733 |
Δ vs 0.1.7 (same gemini-2.5-flash judge): baseline +3.0 pp, embed +2.4 pp, hybrid +0.6 pp. The Omega prompts and query expansion delivered a targeted +2.5 pp on multi-session for hybrid (78.5 % → 81.0 %), which was the weak spot the plan aimed at. The FTS5 retrieval channel and three-way RRF with per-type RetrievalProfile are queued for 0.1.9; that pair is expected to close the rest of the gap to the 90 % stretch target.
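The ingest-time dedup listed in the TL;DR drops near-duplicates at Jaccard ≥ 0.85. A rough illustration follows, assuming the similarity is taken over lowercased word sets (the harness may tokenize differently). `token_set`, `jaccard`, and `dedup_at_ingest` are hypothetical names, not the SDK's API:

```rust
use std::collections::HashSet;

/// Lowercased word set for a piece of text (assumed tokenizer:
/// split on anything that is not alphanumeric).
fn token_set(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Keep an item only if its similarity to every already-kept item
/// stays below `threshold` (0.85 in the 0.1.8 run).
fn dedup_at_ingest(items: &[&str], threshold: f64) -> Vec<String> {
    let mut kept: Vec<(HashSet<String>, String)> = Vec::new();
    for item in items {
        let tokens = token_set(item);
        let is_dup = kept
            .iter()
            .any(|(seen, _)| jaccard(seen, &tokens) >= threshold);
        if !is_dup {
            kept.push((tokens, item.to_string()));
        }
    }
    kept.into_iter().map(|(_, text)| text).collect()
}
```

With this tokenizer, "User moved to Berlin in March" and "user moved to berlin in march!" collapse into one entry, while "User adopted a cat" survives because it shares only one word with the first.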
Where this places us on the public leaderboard
Against the numbers collected here and across Mastra, Supermemory, Zep, Hindsight, and EmergenceMem reports:
| System | Model | Overall | Delta vs Cersei Baseline (0.1.8) |
|---|---|---|---|
| Mastra OM | gpt-5-mini | 94.87 % | +7.3 |
| Mastra OM | gemini-3-pro-preview | 93.27 % | +5.7 |
| Hindsight | gemini-3-pro-preview | 91.40 % | +3.8 |
| Mastra OM | gemini-3-flash-preview | 89.20 % | +1.6 |
| Hindsight | GPT-OSS-120B | 89.00 % | +1.4 |
| Cersei Baseline (0.1.8) | gemini-2.5-flash | 87.6 % | — |
| Cersei Embed-only (0.1.8) | gemini-2.5-flash | 86.6 % | −1.0 |
| Cersei Hybrid (0.1.8) | gemini-2.5-flash | 86.3 % | −1.3 |
| EmergenceMem Internal* | gpt-4o | 86.00 % | −1.6 |
| Supermemory | gemini-3-pro-preview | 85.20 % | −2.4 |
| Supermemory | gpt-5 | 84.60 % | −3.0 |
| Mastra OM | gpt-4o | 84.23 % | −3.3 |
| Hindsight | GPT-OSS-20B | 83.60 % | −4.0 |
| EmergenceMem Simple | gpt-4o | 82.40 % | −5.2 |
| Oracle | gpt-4o | 82.40 % | −5.2 |
| Supermemory | gpt-4o | 81.60 % | −6.0 |
| Mastra RAG (topK 20) | gpt-4o | 80.05 % | −7.5 |
| Zep | gpt-4o | 71.20 % | −16.4 |
| Full context | gpt-4o | 60.20 % | −27.4 |
Headlines
- Cersei Baseline @ 87.6 % and Hybrid @ 86.3 % both clear Supermemory on gemini-3-pro-preview (85.2 %) and gpt-5 (84.6 %), Mastra OM on gpt-4o (84.23 %), Mastra RAG on gpt-4o (80.05 %), Zep on gpt-4o (71.2 %), and EmergenceMem Internal on gpt-4o (86.0 %).
- 0.1.8's Memory++ stack (Omega RAG prompts + query expansion + abstention floors + dedup) delivered a +2.5 pp lift on multi-session (the hybrid's weakest type), the targeted gain the plan called for.
- The remaining gap to Mastra OM / gemini-3-flash-preview (89.2 %) and gpt-5-mini (94.87 %) is concentrated on (a) answerer-model tier and (b) the A2/A3 levers queued for 0.1.9: FTS5 channel + three-way RRF + per-type RetrievalProfile.
- Pure semantic retrieval (EmbeddingMemory, 86.6 %) beats Mastra OM / gpt-4o (84.23 %) and Supermemory / gpt-4o (81.6 %) with 19× fewer input tokens than our own full-context baseline.
- Graph-substring (2.2 %) is the honest floor: substring match can't paraphrase-match. It still hits 96.7 % abstention because it returns junk → the answerer correctly refuses.
Per-question-type breakdown (0.1.8)
| Question type | n | Baseline | Embed | Graph | Hybrid |
|---|---|---|---|---|---|
| knowledge-update | 72 | 87.5 % | 90.3 % | 0.0 % | 90.3 % |
| multi-session | 121 | 78.5 % | 76.0 % | 0.0 % | 81.0 % |
| single-session-assistant | 56 | 98.2 % | 96.4 % | 0.0 % | 96.4 % |
| single-session-preference | 30 | 80.0 % | 83.3 % | 13.3 % | 76.7 % |
| single-session-user | 64 | 96.9 % | 89.1 % | 0.0 % | 90.6 % |
| temporal-reasoning | 127 | 84.3 % | 84.3 % | 0.0 % | 82.7 % |
Hybrid wins outright on multi-session (81.0 %, +2.5 pp vs 0.1.7), exactly the weak type 0.1.8's Memory++ stack targeted. Baseline leads on single-session-user and single-session-assistant because the answer is always somewhere in the full haystack for those types, so anything that retrieves less risks losing it. Embed and hybrid tie on knowledge-update; embed edges ahead of hybrid on single-session-preference (83.3 % vs 76.7 %) because the Omega PREFERENCE_RAG_PROMPT is tuned for multi-fact aggregation.
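The headline "Overall (macro, excl abstention)" figures can be recomputed from the per-type table as an unweighted mean over the six question types. A small sketch with the percentages copied from the table above; the constant and function names are illustrative only:

```rust
/// Per-type accuracy in percent, columns in table order:
/// [baseline, embed, graph, hybrid]. Abstention rows are not included.
const PER_TYPE: [(&str, [f64; 4]); 6] = [
    ("knowledge-update", [87.5, 90.3, 0.0, 90.3]),
    ("multi-session", [78.5, 76.0, 0.0, 81.0]),
    ("single-session-assistant", [98.2, 96.4, 0.0, 96.4]),
    ("single-session-preference", [80.0, 83.3, 13.3, 76.7]),
    ("single-session-user", [96.9, 89.1, 0.0, 90.6]),
    ("temporal-reasoning", [84.3, 84.3, 0.0, 82.7]),
];

/// Macro average for one config column: every question type weighs the
/// same, regardless of how many questions it contains.
fn macro_overall(config: usize) -> f64 {
    let sum: f64 = PER_TYPE.iter().map(|(_, acc)| acc[config]).sum();
    sum / PER_TYPE.len() as f64
}
```

Rounded to one decimal, columns 0 to 3 reproduce the reported 87.6 %, 86.6 %, 2.2 %, and 86.3 %. Because the mean is unweighted, the 30-question single-session-preference type moves the headline as much as the 127-question temporal-reasoning type.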
Cost comparison
| Config | Input tokens | Est. Gemini cost (gemini-2.5-flash @ $0.30/1M in, $2.50/1M out) | Avg wall / Q |
|---|---|---|---|
| Baseline | 55.37 M | $16.61 | 9.6 s |
| Embed | 2.90 M | $0.87 + ~$0.05 embed = $0.92 (~18× cheaper than baseline) | 39.1 s |
| Graph | 0.46 M | $0.14 (119× cheaper than baseline) | 4.1 s |
| Hybrid | 1.78 M | $0.53 + ~$5–7 observer + ~$0.20 query-expansion = $5–8 (~2–3× cheaper than baseline) | 183.7 s |
The per-question wall time difference between embed (39.1 s) and hybrid (183.7 s) is the cost of the Observer pass plus the new 0.1.8 query-expansion step: hybrid runs gemini-2.5-flash once per haystack session (30–40 sessions per question on longmemeval_s) to extract structured observations, and once per question to generate {lex, vec, hyde} retrieval variants.
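The cost column is plain arithmetic on the input-token totals. A sketch of that calculation, using the $0.30 per 1M input tokens from the table header; output-token cost is left out here, as it is in the table's figures, and the function names are hypothetical:

```rust
/// Input-side cost in USD: tokens (in millions) times the per-million price.
fn input_cost_usd(input_tokens_m: f64, usd_per_m: f64) -> f64 {
    input_tokens_m * usd_per_m
}

/// How many times cheaper `cost` is than `baseline_cost`.
fn times_cheaper(baseline_cost: f64, cost: f64) -> f64 {
    baseline_cost / cost
}

/// Hybrid total as a (low, high) range: answerer input cost plus the
/// observer estimate ($5 to $7) and the query-expansion estimate (~$0.20).
fn hybrid_cost_range_usd() -> (f64, f64) {
    let answerer = input_cost_usd(1.78, 0.30);
    (answerer + 5.0 + 0.20, answerer + 7.0 + 0.20)
}
```

For example, `input_cost_usd(55.37, 0.30)` gives about $16.61 for the baseline, and the embed row's $0.92 is roughly 18 times cheaper than that. The hybrid range comes out at about $5.7 to $7.7, which the table rounds to $5–8.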
The four configurations
| Config | Cersei backend | What it tests |
|---|---|---|
| A. baseline | JsonlMemory — full haystack in prompt | Control lower bound. The answerer sees every turn, capped only by the LLM context window. |
| B. embed | EmbeddingMemory (usearch HNSW + gemini-embedding-001 at 3072-d, cosine) | Pure semantic retrieval. Directly comparable to Mastra RAG. |
| C. graph | GraphMemory (grafeo substring + query-word-overlap rerank) | Honest floor for the graph layer — no LLM extraction, no semantic matching. |
| D. hybrid | LLM fact extractor → EmbeddingMemory + GraphMemory → RRF fusion | The configuration that actually competes with Mastra's observational memory. |
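Config D merges the embedding and graph rankings with RRF (reciprocal rank fusion). A minimal sketch of the technique, assuming the conventional constant k = 60 and plain string ids; both are illustrative, and the harness's actual constants in configs/hybrid.rs may differ:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: every list contributes 1 / (k + rank) to each
/// document it contains, with rank starting at 1. Documents that appear
/// near the top of several lists accumulate the highest fused score.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64, top_k: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in ranked_lists {
        for (idx, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (idx + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused.truncate(top_k);
    fused
}
```

Given an embedding ranking `["s3", "s1", "s7"]` and a graph ranking `["s1", "s9"]`, `s1` comes out first because it is the only id present in both lists. RRF only looks at ranks, so the cosine scores and the word-overlap scores never need to be put on a common scale.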
Methodology (non-negotiable)
- Dataset: xiaowu0162/longmemeval, 500 questions, 6 question types (single-session-user, single-session-assistant, single-session-preference, multi-session, temporal-reasoning, knowledge-update) plus abstention cases (detected via question_id.ends_with("_abs")). Variants: longmemeval_s (~115k tokens/Q, headline), longmemeval_m (~1.5M tokens/Q), longmemeval_oracle (evidence-only control).
- Judge rubric: verbatim port of Mastra's six question-type prompts (_inspirations/mastra/explorations/longmemeval/src/evaluation/longmemeval-metric.ts), which in turn copies from the official Python evaluator.
- Observer rubric: verbatim port of OBSERVER_EXTRACTION_INSTRUCTIONS from Mastra's @mastra/memory: temporal anchoring, user-assertions-beat-questions, state-change rules, preservation of specifics.
- Context-injection rubric: verbatim port of OBSERVATION_CONTEXT_PROMPT + OBSERVATION_CONTEXT_INSTRUCTIONS wrapped around retrieved context before it hits the answerer. Contains Mastra's KNOWLEDGE UPDATES / PLANNED ACTIONS / MOST RECENT USER INPUT guidance.
- Models: answerer, judge, and observer all on gemini-2.5-flash, temperature 0 (0.3 for observer, matches Mastra's observation.modelSettings.temperature). Embeddings: gemini-embedding-001, 3072-dim, with Matryoshka truncation if requested.
- Metric: overall_accuracy is the macro average across question types excluding abstention (matching Mastra cli.ts:99). Abstention is reported separately.
- Top-k for retrieval configs: 20 (matches Mastra's published RAG config).
- Concurrency: 4 in-flight questions × up to 6 observer calls per question.
- Retry: Gemini embedding + completion calls retry with exponential backoff (up to 6 attempts, ~30 s window) on 429 / 5xx / transport errors.
- Security: API keys flow through x-goog-api-key header, never in the URL query string. Error paths run any leaked URL through a redactor before logging. See leak post-mortem below.
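The retry bullet above can be pictured as follows. This is a sketch under stated assumptions, not the harness's code: a 1 s base delay doubling up to a 16 s cap is assumed because it yields five sleeps totalling 31 s across 6 attempts, close to the stated ~30 s window. Transport errors are not modelled, and every name here is hypothetical:

```rust
use std::time::Duration;

/// Assumed cap on a single sleep.
const MAX_DELAY: Duration = Duration::from_secs(16);

/// Delay before retry number `retry` (0-based): base * 2^retry, capped at `max`.
fn backoff_delay(retry: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(retry)).min(max)
}

/// An HTTP status is worth retrying if it is 429 or any 5xx.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Run `op` up to `max_attempts` times, sleeping between failed attempts.
/// `op` returns Err(status) on failure; `sleep` is injected so the
/// schedule can be exercised without actually waiting.
fn retry_with_backoff<T>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, u16>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(status) => {
                attempt += 1;
                if !is_retryable(status) || attempt >= max_attempts {
                    return Err(status);
                }
                sleep(backoff_delay(attempt - 1, base, MAX_DELAY));
            }
        }
    }
}
```

In production the `sleep` argument would be something like `std::thread::sleep` (or an async timer); a non-retryable status such as 400 returns immediately after the first attempt.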
Reproduce
The runner expects GOOGLE_API_KEY in the environment — source from a gitignored .env, never commit it.
./bench/long-mem/setup.sh # downloads oracle + _s (~280 MB)
echo "GOOGLE_API_KEY=AIza..." > .env # gitignored
set -a; source .env; set +a # auto-export so the cargo child process inherits the key
cargo run --release -p longmem-bench -- \
--dataset s --config all --concurrency 4 --top-k 20 \
--provider gemini \
--answerer-model gemini-2.5-flash \
--judge-model gemini-2.5-flash \
--extractor-model gemini-2.5-flash

Budget: ~$20–30 in Gemini API cost on 2.5-flash, ~3–4 h wall time at concurrency 4. Hybrid alone is ~2.5 h; baseline + embed + graph together are ~1 h.

# 10 questions × 4 configs on the oracle variant (~30 s, <$0.05)
set -a; source .env; set +a # auto-export so the cargo child process inherits the key
cargo run --release -p longmem-bench -- \
--dataset oracle --config all --limit 10 --provider gemini

# Just the hybrid config on the 500-Q _s set
set -a; source .env; set +a # auto-export so the cargo child process inherits the key
cargo run --release -p longmem-bench -- \
--dataset s --config hybrid --provider gemini --concurrency 4

Security hygiene
API keys have leaked from this repo twice before we tightened things up — a hardcoded one in bench/term-bench/runner-google.sh and, later, keys embedded in reqwest error messages (?key=… in the URL) that ended up in tracked .log and rows-*.json files. Permanent fixes applied:
- Gemini calls now use x-goog-api-key header, never query string: crates/cersei-embeddings/src/gemini.rs and crates/cersei-provider/src/gemini.rs.
- Error messages are scrubbed: any string that could carry key=<…> runs through redact_url_key before it hits logs or result files.
- .gitignore blocks bench/**/*.log, bench/**/results*/, bench/**/runner-*.sh, bench/**/abstract-output.jsonl, and .env*.
- Runner scripts refuse to start without GOOGLE_API_KEY in env, so no more inline keys.
- Pre-commit sanity check (run manually): git ls-files | xargs grep -l -E "AIza[A-Za-z0-9_-]{35}|sk-[A-Za-z0-9_-]{30,}" must return zero files.
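To make the scrubbing step concrete, here is a minimal sketch of the kind of redaction described above. It is not the repo's redact_url_key (whose signature is not shown in this document); `redact_key_param` is a hypothetical stand-in that errs on the side of over-redacting, since it also matches parameters that merely end in `key=`:

```rust
/// Replace the value of every `key=` parameter with a placeholder.
/// A value is assumed to run until the next `&`, quote, closing
/// parenthesis, or whitespace character.
fn redact_key_param(message: &str) -> String {
    const NEEDLE: &str = "key=";
    let mut out = String::with_capacity(message.len());
    let mut rest = message;
    while let Some(pos) = rest.find(NEEDLE) {
        let value_start = pos + NEEDLE.len();
        // Copy everything up to and including "key=" unchanged.
        out.push_str(&rest[..value_start]);
        let value = &rest[value_start..];
        let value_end = value
            .find(|c: char| {
                c == '&' || c == '"' || c == '\'' || c == ')' || c.is_whitespace()
            })
            .unwrap_or(value.len());
        if value_end > 0 {
            out.push_str("REDACTED");
        }
        // Continue scanning after the (now dropped) secret value.
        rest = &value[value_end..];
    }
    out.push_str(rest);
    out
}
```

Applied to an error string such as `error sending request for url (https://example.invalid/v1/models?key=AIzaSECRET&alt=json)`, the secret is replaced while the rest of the message, including `&alt=json)`, is preserved for debugging.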
What goes where
- Source: bench/long-mem/
- Dataset loader: bench/long-mem/src/dataset.rs
- Judge (6 rubric ports + LLM call): bench/long-mem/src/judge.rs
- Per-config retrieval: bench/long-mem/src/configs/{baseline,embed,graph,hybrid}.rs
- Runner + RRF fusion: bench/long-mem/src/runner.rs, configs/hybrid.rs
- Aggregation: bench/long-mem/src/report.rs
- SDK wiring behind the configs:
  - EmbeddingMemory: thin adapter between EmbeddingStore and the Memory trait (new in 0.1.8).
  - GraphMemory::recall_top_k: scored retrieval by query-word overlap (new in 0.1.8).
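For readers unfamiliar with query-word-overlap scoring, the idea behind GraphMemory::recall_top_k can be sketched as below. This is an illustration of the general technique only; the real method's signature, tokenization, and tie-breaking are not documented here, and `recall_top_k_sketch` is a hypothetical free function:

```rust
use std::collections::HashSet;

/// Lowercased word set (assumed tokenizer: split on non-alphanumerics).
fn words(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Score = number of distinct query words that also occur in the memory.
fn overlap_score(query: &str, memory: &str) -> usize {
    let memory_words = words(memory);
    words(query)
        .iter()
        .filter(|w| memory_words.contains(*w))
        .count()
}

/// Return the `k` best-scoring memories with a non-zero score,
/// highest score first.
fn recall_top_k_sketch<'a>(
    query: &str,
    memories: &[&'a str],
    k: usize,
) -> Vec<(&'a str, usize)> {
    let mut scored: Vec<(&'a str, usize)> = memories
        .iter()
        .map(|m| (*m, overlap_score(query, m)))
        .filter(|(_, score)| *score > 0)
        .collect();
    // Stable sort: equal scores keep their original order.
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored.truncate(k);
    scored
}
```

For the query "When did the user move to Berlin?", the memory "User moved to Berlin in March" scores 3 (user, to, berlin). Note that "move" does not match "moved": purely lexical overlap has no notion of inflection or paraphrase, which is consistent with the graph config acting as a floor rather than a competitor.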
Credits
- Benchmark + rubric: Di Wu et al., ICLR 2025 · official repo.
- Harness shape + abstention detection: adapted from Mastra's @mastra/longmemeval. Prompt strings are a verbatim port.
Raw results
Full JSON (summary + per-question rows) in bench/long-mem/results/:
results/
├── a-baseline-jsonl-longmemeval_s.json # summary
├── a-baseline-jsonl-rows-longmemeval_s.json # 500 per-question traces
├── b-embed-only-longmemeval_s.json
├── b-embed-only-rows-longmemeval_s.json
├── c-graph-substring-longmemeval_s.json
├── c-graph-substring-rows-longmemeval_s.json
├── d-hybrid-embed-graph-longmemeval_s.json
├── d-hybrid-embed-graph-rows-longmemeval_s.json
└── summary-longmemeval_s.json

Caveats
- Our longmemeval_s numbers are measured with gemini-2.5-flash as answerer, judge, and observer. Mastra's Observational Memory research tests several judge/answerer combinations, so make sure the model pair matches when comparing numbers. Rerunning against a different pair, e.g. --answerer-model gpt-4o / --judge-model gpt-4o, is a one-line flag change.
- concurrency=4 (outer) × INNER_EXTRACT_CONCURRENCY=6 (hybrid fact extraction) keeps us under the provider's rate limits. If you have higher-tier throughput, bumping to --concurrency 8 should roughly halve wall time.
- A one-shot bench like this measures recall quality with fixed retrieval; it does not exercise agents calling tools during answering. If you need tool-use-in-the-loop behaviour, add a cersei::Agent wrapper; the memory backends tested here all plug in via .memory(...) on the builder.