Fixture suite
This is the benchmark closest to the live graph engine today.
- Uses 8 saved production-style query fixtures
- Scores the final graph output against handwritten expectations
- Best proxy on this page for production regression risk
This page covers three distinct graph benchmark families: the current LLM-backed one-shot graph builder evaluated over saved query fixtures, an upstream event-grouping benchmark over frozen production search-result subsets, and the newer snapshot-based structural reconstruction benchmarks over labeled event-pair datasets, used for method research. The published feed may point at either the original small eval set or a larger expanded structural benchmark split. The grouping and reconstruction benchmarks are intentionally narrower and upstream: they measure components of the pipeline, not end-to-end production graph quality.
This is the benchmark closest to the live graph engine today.
This is an earlier-stage research benchmark for deterministic methods.
This is the benchmark for canonical event construction before edge proposal.
In the duplicate table, F1, precision, and recall measure whether a method predicts the same duplicate labels a human gave.
- precision: when we say two events are duplicates, how often are we right?
- recall: of the true duplicate pairs, how many did we recover?
- F1: a single score balancing both

The reconstruction benchmark is necessary, but not sufficient, for a production replacement.
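To make the pairwise reading concrete, here is a minimal Python sketch of duplicate-pair scoring, assuming predictions and gold labels are given as collections of event-ID pairs; the function name and data shapes are illustrative, not the dashboard's actual code.

```python
def duplicate_prf1(predicted_pairs, gold_pairs):
    """Pairwise duplicate scoring over event-ID pairs (illustrative helper)."""
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: two predicted duplicate pairs, one of which matches the gold labels.
print(duplicate_prf1([("evt_a", "evt_b"), ("evt_c", "evt_d")],
                     [("evt_a", "evt_b"), ("evt_e", "evt_f")]))
# -> (0.5, 0.5, 0.5)
```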
We benchmark on saved real search_results payloads captured from production query flows, then replay the exact same inputs for each model.
Example fixture topics include "AI regulation" and "chip exports". Each fixture has a handwritten annotation, and a run gets credit only for matching those expectations.
Quality is the mean passed-check fraction across fixtures. We also expose fixture pass rate and check pass rate to avoid hiding partial misses.
- mean_quality_score = average of (checks passed / total checks) across fixtures
- fixture_pass_rate = share of fixtures with zero failed checks
- median_graph_wall_time_s and p95_graph_wall_time_s capture latency

A score of 1.000 means a model matched our current corpus annotations, not that it solved graph truth in general. The benchmark is only as strong as fixture coverage and annotation quality.
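As a rough illustration of how these aggregates relate, here is a Python sketch that computes them from hypothetical per-fixture results; the dict keys and the nearest-rank p95 are assumptions, not the production scorer.

```python
import statistics

def summarize_run(fixture_results):
    """Aggregate per-fixture results into the leaderboard metrics.

    Illustrative only: each item is assumed to look like
    {"checks_passed": 5, "checks_total": 6, "graph_wall_time_s": 12.3}.
    """
    quality = [r["checks_passed"] / r["checks_total"] for r in fixture_results]
    latencies = sorted(r["graph_wall_time_s"] for r in fixture_results)
    p95_index = round(0.95 * (len(latencies) - 1))  # nearest-rank p95 (assumption)
    return {
        # mean per-fixture fraction of checks passed
        "mean_quality_score": statistics.mean(quality),
        # share of fixtures with zero failed checks
        "fixture_pass_rate": sum(q == 1.0 for q in quality) / len(fixture_results),
        # pooled share of individual checks passed across all fixtures
        "check_pass_rate": sum(r["checks_passed"] for r in fixture_results)
        / sum(r["checks_total"] for r in fixture_results),
        "median_graph_wall_time_s": statistics.median(latencies),
        "p95_graph_wall_time_s": latencies[p95_index],
    }
```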
This is the existing production-style benchmark: saved query fixtures, annotated expected graphs, and provider/model comparisons for the LLM-backed builder. Click a row for run metadata and artifact links.
| Variant | Quality | Fixture pass | Check pass | Median latency | P95 latency |
|---|---|---|---|---|---|
| Loading graph leaderboard… | | | | | |
This benchmark isolates canonical event construction before any edge proposal. Each case contains frozen
production-style search_results, and the score reflects whether the production graph-service path
groups markets into the correct event partitions. Read this as upstream grouping quality, not end-to-end graph quality.
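For readers who want the pairwise F1 and singleton-accuracy readings spelled out, the sketch below shows one plausible Python computation over predicted versus gold event partitions; the singleton-accuracy definition in particular is an assumption about what that column means.

```python
from itertools import combinations

def partition_pairs(partition):
    """All unordered within-group pairs of a partition (a list of lists of market IDs)."""
    pairs = set()
    for group in partition:
        pairs.update(frozenset(p) for p in combinations(group, 2))
    return pairs

def pairwise_f1(predicted, gold):
    """Pairwise F1 between a predicted and a gold event partition."""
    pred_pairs, gold_pairs = partition_pairs(predicted), partition_pairs(gold)
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def singleton_accuracy(predicted, gold):
    """Share of gold singleton markets that the prediction also keeps alone.
    This reading of the column is an assumption, not confirmed by the page."""
    pred_singletons = {g[0] for g in predicted if len(g) == 1}
    gold_singletons = [g[0] for g in gold if len(g) == 1]
    if not gold_singletons:
        return 1.0
    return sum(m in pred_singletons for m in gold_singletons) / len(gold_singletons)
```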
| Variant | Split | Pairwise F1 | Singleton accuracy | Cases | Updated |
|---|---|---|---|---|---|
| Loading grouping benchmark… | | | | | |
This is the new offline research benchmark. It asks how much graph structure we can recover from frozen snapshot data without relying on the full current LLM graph-construction path. The duplicate table measures pairwise duplicate recovery directly. The clustering table measures whether related pairs land together and unrelated pairs stay separated. Click a method row for threshold/count details and interpretation notes.
| Method | F1 | Precision | Recall | Evaluated | Updated |
|---|---|---|---|---|---|
| Loading duplicate benchmark… | | | | | |
| Method | Balance | Related Recall | None Separation | Evaluated | Updated |
|---|---|---|---|---|---|
| Loading clustering benchmark… | | | | | |
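The sketch below illustrates one plausible way to compute the clustering-table metrics from labeled event pairs; the mapping of event IDs to clusters, the pair-label format, and especially the "Balance" formula (modeled here as a harmonic mean of the two rates) are assumptions, not the dashboard's confirmed definitions.

```python
def clustering_scores(cluster_of, labeled_pairs):
    """Score a clustering against labeled event pairs.

    Assumptions: `cluster_of` maps event ID -> cluster ID, and `labeled_pairs`
    is a list of (event_a, event_b, label) with label in {"related", "none"}.
    """
    related = [(a, b) for a, b, lbl in labeled_pairs if lbl == "related"]
    unrelated = [(a, b) for a, b, lbl in labeled_pairs if lbl == "none"]
    # Related recall: labeled-related pairs that land in the same cluster.
    related_recall = (
        sum(cluster_of[a] == cluster_of[b] for a, b in related) / len(related)
        if related else 0.0
    )
    # None separation: labeled-unrelated pairs that stay in different clusters.
    none_separation = (
        sum(cluster_of[a] != cluster_of[b] for a, b in unrelated) / len(unrelated)
        if unrelated else 0.0
    )
    denom = related_recall + none_separation
    # "Balance" modeled as the harmonic mean of the two rates (assumption).
    balance = 2 * related_recall * none_separation / denom if denom else 0.0
    return {"related_recall": related_recall,
            "none_separation": none_separation,
            "balance": balance}
```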