Fixture suite
This is the benchmark closest to the live graph engine today.
- Uses 8 saved production-style query fixtures
- Scores the final graph output against handwritten expectations
- Best proxy on this page for production regression risk
This page covers three distinct graph benchmark families: the current LLM-backed one-shot graph builder evaluated over saved query fixtures, an upstream event-grouping benchmark over frozen production search-result subsets, and the newer snapshot-based structural reconstruction benchmarks over labeled event-pair datasets, used for method research. The published feed may point at either the original small eval set or a larger expanded structural benchmark split. The grouping and reconstruction benchmarks are intentionally narrower and upstream: they measure components of the pipeline, not end-to-end production graph quality.
This is the benchmark closest to the live graph engine today.
This is an earlier-stage research benchmark for deterministic methods.
This is the benchmark for canonical event construction before edge proposal.
In the duplicate table, F1, precision, and recall measure whether a method predicts the same duplicate labels a human gave.
- precision: when we say two events are duplicates, how often are we right?
- recall: of the true duplicate pairs, how many did we recover?
- F1: a single score balancing both

The reconstruction benchmark is necessary, but not sufficient, for a production replacement.
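To make the pairwise reading concrete, here is a minimal Python sketch of duplicate-pair scoring, assuming predictions and gold labels are given as collections of event-ID pairs; the function name and data shapes are illustrative, not the dashboard's actual code.

```python
def duplicate_prf1(predicted_pairs, gold_pairs):
    """Pairwise duplicate scoring over event-ID pairs (illustrative helper)."""
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: two predicted duplicate pairs, one of which matches the gold labels.
print(duplicate_prf1([("evt_a", "evt_b"), ("evt_c", "evt_d")],
                     [("evt_a", "evt_b"), ("evt_e", "evt_f")]))
# -> (0.5, 0.5, 0.5)
```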
We benchmark on saved real search_results payloads captured from production query flows, then replay the exact same inputs for each model.
Example fixture topics include "AI regulation" and "chip exports". Each fixture has a handwritten annotation, and a run gets credit only for matching those expectations.
Quality is the mean passed-check fraction across fixtures. We also expose fixture pass rate and check pass rate to avoid hiding partial misses.
- mean_quality_score = average of (checks passed / total checks) across fixtures
- fixture_pass_rate = share of fixtures with zero failed checks
- median_graph_wall_time_s and p95_graph_wall_time_s capture latency

A score of 1.000 means a model matched our current corpus annotations, not that it solved graph truth in general. The benchmark is only as strong as fixture coverage and annotation quality.
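As a rough illustration of how these aggregates relate, here is a Python sketch that computes them from hypothetical per-fixture results; the dict keys and the nearest-rank p95 are assumptions, not the production scorer.

```python
import statistics

def summarize_run(fixture_results):
    """Aggregate per-fixture results into the leaderboard metrics.

    Illustrative only: each item is assumed to look like
    {"checks_passed": 5, "checks_total": 6, "graph_wall_time_s": 12.3}.
    """
    quality = [r["checks_passed"] / r["checks_total"] for r in fixture_results]
    latencies = sorted(r["graph_wall_time_s"] for r in fixture_results)
    p95_index = round(0.95 * (len(latencies) - 1))  # nearest-rank p95 (assumption)
    return {
        # mean per-fixture fraction of checks passed
        "mean_quality_score": statistics.mean(quality),
        # share of fixtures with zero failed checks
        "fixture_pass_rate": sum(q == 1.0 for q in quality) / len(fixture_results),
        # pooled share of individual checks passed across all fixtures
        "check_pass_rate": sum(r["checks_passed"] for r in fixture_results)
        / sum(r["checks_total"] for r in fixture_results),
        "median_graph_wall_time_s": statistics.median(latencies),
        "p95_graph_wall_time_s": latencies[p95_index],
    }
```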
This is the existing production-style benchmark: saved query fixtures, annotated expected graphs, and provider/model comparisons for the LLM-backed builder. Click a row for run metadata and artifact links.
| Variant | Quality | Fixture pass | Check pass | Median latency | P95 latency |
|---|---|---|---|---|---|
| Loading graph leaderboard… | | | | | |
This benchmark isolates canonical event construction before any edge proposal. Each case contains frozen
production-style search_results, and the score reflects whether the production graph-service path
groups markets into the correct event partitions. Read this as upstream grouping quality, not end-to-end graph quality.
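For readers who want the pairwise F1 and singleton-accuracy readings spelled out, the sketch below shows one plausible Python computation over predicted versus gold event partitions; the singleton-accuracy definition in particular is an assumption about what that column means.

```python
from itertools import combinations

def partition_pairs(partition):
    """All unordered within-group pairs of a partition (a list of lists of market IDs)."""
    pairs = set()
    for group in partition:
        pairs.update(frozenset(p) for p in combinations(group, 2))
    return pairs

def pairwise_f1(predicted, gold):
    """Pairwise F1 between a predicted and a gold event partition."""
    pred_pairs, gold_pairs = partition_pairs(predicted), partition_pairs(gold)
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def singleton_accuracy(predicted, gold):
    """Share of gold singleton markets that the prediction also keeps alone.
    This reading of the column is an assumption, not confirmed by the page."""
    pred_singletons = {g[0] for g in predicted if len(g) == 1}
    gold_singletons = [g[0] for g in gold if len(g) == 1]
    if not gold_singletons:
        return 1.0
    return sum(m in pred_singletons for m in gold_singletons) / len(gold_singletons)
```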
| Variant | Split | Pairwise F1 | Singleton accuracy | Cases | Updated |
|---|---|---|---|---|---|
| Loading grouping benchmark… | | | | | |
This is the new offline research benchmark. It asks how much graph structure we can recover from frozen snapshot data without relying on the full current LLM graph-construction path. The duplicate table measures pairwise duplicate recovery directly. The clustering table measures whether related pairs land together and unrelated pairs stay separated. Click a method row for threshold/count details and interpretation notes.
| Method | F1 | Precision | Recall | Evaluated | Updated |
|---|---|---|---|---|---|
| Loading duplicate benchmark… | | | | | |
| Method | Balance | Related Recall | None Separation | Evaluated | Updated |
|---|---|---|---|---|---|
| Loading clustering benchmark… | | | | | |
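The sketch below illustrates one plausible way to compute the clustering-table metrics from labeled event pairs; the mapping of event IDs to clusters, the pair-label format, and especially the "Balance" formula (modeled here as a harmonic mean of the two rates) are assumptions, not the dashboard's confirmed definitions.

```python
def clustering_scores(cluster_of, labeled_pairs):
    """Score a clustering against labeled event pairs.

    Assumptions: `cluster_of` maps event ID -> cluster ID, and `labeled_pairs`
    is a list of (event_a, event_b, label) with label in {"related", "none"}.
    """
    related = [(a, b) for a, b, lbl in labeled_pairs if lbl == "related"]
    unrelated = [(a, b) for a, b, lbl in labeled_pairs if lbl == "none"]
    # Related recall: labeled-related pairs that land in the same cluster.
    related_recall = (
        sum(cluster_of[a] == cluster_of[b] for a, b in related) / len(related)
        if related else 0.0
    )
    # None separation: labeled-unrelated pairs that stay in different clusters.
    none_separation = (
        sum(cluster_of[a] != cluster_of[b] for a, b in unrelated) / len(unrelated)
        if unrelated else 0.0
    )
    denom = related_recall + none_separation
    # "Balance" modeled as the harmonic mean of the two rates (assumption).
    balance = 2 * related_recall * none_separation / denom if denom else 0.0
    return {"related_recall": related_recall,
            "none_separation": none_separation,
            "balance": balance}
```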