The primary regression benchmark uses 89 causal reasoning questions with 323 concepts across 6 categories:
finance, US politics, geopolitics, AI/tech, science, biorisk/biotech.
View dataset on GitHub.
Each question has 2-6 concepts typed as upstream (cause), downstream (effect), indicator (signal), or component (sub-event)
Concepts seeded from production corpus themes, not generated blindly
Reviewed by 3 independent reviewers (Codex, Gemini, ChatGPT) for retrieval fitness
45 candidates dropped for duplicate retrieval paths, weak corpus fit, or vague concepts
Published as search_frozen_regression when used for stable algorithm comparisons
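For illustration, a single benchmark question could be represented roughly like the record below. The schema and example content here are invented for this sketch, not taken from the published dataset.

# Hypothetical shape of one benchmark question; field names and values are
# illustrative only, not the published dataset schema.
question = {
    "question": "Will 30-year US mortgage rates fall below 6% this year?",
    "category": "finance",
    "concepts": [
        {"id": "c_fed_rate_cut", "text": "Fed rate cut", "type": "upstream"},              # cause
        {"id": "c_mortgage_apps", "text": "mortgage application volume", "type": "indicator"},  # signal
        {"id": "c_home_sales", "text": "existing home sales", "type": "downstream"},        # effect
    ],
}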
Live Slice
The live slice measures the current production-style retrieval path against the current production corpus. It is intended for operational robustness, not for strict historical regression.
Published as search_live_slice
Must include a slice timestamp or version label
Must include corpus counts and per-platform breakdowns
Scores are not comparable across time without checking corpus drift
Snapshot Restore Frame
The benchmark-of-record should run against a restored Cloud SQL backup, not directly against the moving production corpus. This is the frame for matrix refreshes and website numbers.
Published as search_snapshot_restore
Must record backup IDs, restore instances, datasets, and bundle manifest
Extension + ads freeze from polybridge-extension
Oracle freezes from polybridge-prod
Judging
Each (concept, market) pair is judged by an LLM on a 3-point scale:
Relevant (1.0) - market directly measures or tracks the concept
Partial (0.5) - same broad domain, loosely connected
Irrelevant (0.0) - no meaningful causal connection
Judgments are cached and keyed by the stable concept ID rather than the concept text. The cache key is
concept_id::market_question, so rewording a concept does not invalidate existing judgments.
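A minimal sketch of this judging-and-caching flow, assuming a simple in-memory cache; judge_with_llm and the function names are placeholders, not the repo's API.

# Sketch of the 3-point judging scale with a judgment cache keyed by stable
# concept ID, so concept rewording does not trigger re-judging.
SCORES = {"relevant": 1.0, "partial": 0.5, "irrelevant": 0.0}

_judgment_cache: dict[str, float] = {}

def judge_with_llm(concept_text: str, market_question: str) -> str:
    # Placeholder for the actual LLM judge call; returns "relevant", "partial", or "irrelevant".
    raise NotImplementedError

def judge(concept_id: str, concept_text: str, market_question: str) -> float:
    # Cache key uses the stable concept ID plus the market question text.
    key = f"{concept_id}::{market_question}"
    if key not in _judgment_cache:
        label = judge_with_llm(concept_text, market_question)
        _judgment_cache[key] = SCORES[label]
    return _judgment_cache[key]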
Metrics
Judged Coverage @5 (primary) - % of concepts with at least one relevant or partial result in top 5
MRR @5 - mean reciprocal rank of first relevant result
nDCG @5 - normalized discounted cumulative gain
Strict Coverage @5 - % of concepts with an exact-phrase match in the top 5 (no LLM judge)
95% bootstrap confidence intervals (10K samples) at concept level
Per-concept-type breakdown: upstream vs downstream vs indicator vs component
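The sketch below shows one way to compute the headline metrics from per-concept judged gains (1.0 / 0.5 / 0.0 for the top-5 results, in rank order). The function names and the treatment of partial credit in MRR are assumptions; the authoritative implementation lives in the search-eval repo.

import random

def judged_coverage_at_5(gains: list[list[float]]) -> float:
    # Fraction of concepts with at least one relevant-or-partial hit in the top 5.
    return sum(any(g > 0 for g in per_concept[:5]) for per_concept in gains) / len(gains)

def mrr_at_5(gains: list[list[float]], min_gain: float = 1.0) -> float:
    # Reciprocal rank of the first relevant result, 0 if none in the top 5.
    # min_gain=1.0 counts only fully relevant hits; counting partials is a choice.
    rr = [
        next((1 / (i + 1) for i, g in enumerate(per_concept[:5]) if g >= min_gain), 0.0)
        for per_concept in gains
    ]
    return sum(rr) / len(rr)

def bootstrap_ci(gains, metric, n=10_000, alpha=0.05):
    # Concept-level bootstrap: resample concepts with replacement, recompute
    # the metric, and take the empirical 95% interval.
    samples = sorted(metric(random.choices(gains, k=len(gains))) for _ in range(n))
    return samples[int(n * alpha / 2)], samples[int(n * (1 - alpha / 2)) - 1]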
Benchmark Discipline
Every published search run should declare its benchmark frame explicitly so humans and agents do not confuse algorithm changes with corpus drift.
Always record benchmark family, slice, and slice timestamp
Always record whether the entry is freeze-backed or live-slice
Always record corpus size and concept count
Always record the freeze tag, datasets, and bundle manifest for benchmark-of-record runs
Never replace the frozen benchmark with a live slice
Version new benchmark frames instead of mutating old ones
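As a concrete illustration, the frame metadata attached to a published entry might carry fields along these lines. The field names are illustrative, not the actual publish schema; the family/slice values match the snapshot-restore commands further down this page.

# Illustrative frame metadata for one published run; field names are assumptions.
entry_frame = {
    "benchmark_family": "search_snapshot_restore",
    "benchmark_slice": "prod_frozen_restore",
    "slice_timestamp": "<TS>",                 # e.g. output of date -u +%Y%m%dT%H%M%SZ
    "freeze_backed": True,                     # False for live-slice entries
    "corpus_size": "<market count>",           # plus per-platform breakdowns for live slices
    "concept_count": 323,
    "freeze_tag": "search-benchmark-<TS>",
    "bundle_manifest": "data/benchmark_bundles/<bundle-dir>/manifest.json",
}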
Reading the Results
Judged Coverage @5
The primary metric. "For what % of causal concepts did the system find at least one relevant market in the top 5?"
Higher is better. 90% means 10% of concepts returned no useful results.
This uses LLM-judged relevance (relevant or partial counts as a hit).
A relevant-only score (no partial credit) would be lower. Check the judgment
breakdown in the eval results for the split.
MRR (Mean Reciprocal Rank)
How high does the first relevant result rank? MRR of 1.0 means rank 1 every time.
MRR of 0.5 means rank 2 on average. Captures whether the system puts the best result
first, not just somewhere in the top 5.
If coverage is high but MRR is low, the system finds relevant markets but buries them.
nDCG@5
Ranking quality across all 5 positions, weighted by relevance. Rewards having
multiple relevant results and penalizes relevant results at lower ranks.
More nuanced than MRR since it considers the full result list, not just the first hit.
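For reference, a sketch of nDCG@5 under one common formulation (linear gains, log2 discount); the repo's exact variant may differ.

import math

def ndcg_at_5(gains: list[float]) -> float:
    # gains: judged scores (1.0 / 0.5 / 0.0) of the top-5 results, in rank order.
    def dcg(scores):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(scores))
    ideal = dcg(sorted(gains, reverse=True)[:5])
    return dcg(gains[:5]) / ideal if ideal > 0 else 0.0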
Strict Coverage
Phrase matching only, no LLM judge. "Does the market title literally contain the expected
keywords?" This will always be low for causal retrieval because the value is finding
non-obvious connections (e.g., "Fed rate cut" market for a "mortgage rates" concept).
Useful as a sanity check and for fast iteration without burning judge calls.
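A strict hit can be approximated with case-insensitive substring matching, as in the sketch below; the actual matcher may normalize text differently.

def strict_hit_at_5(expected_phrases: list[str], top5_titles: list[str]) -> bool:
    # A concept counts as a strict hit if any expected phrase appears verbatim
    # (case-insensitive) in any of the top-5 market titles. Illustrative only.
    return any(
        phrase.lower() in title.lower()
        for phrase in expected_phrases
        for title in top5_titles
    )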
Known limitations
Corpus gaps: Some concepts (clinical trials, niche legislation) have no matching prediction market. Failures here are not retrieval bugs.
Noise markets: ~1,100 formulaic markets (FDV launches, stock strike variants) pollute the embedding space. ~6% of top-5 results are noise. See the filtering brief for details.
Partial credit: Judged coverage counts "partial" matches (same domain, loosely connected) as hits. This inflates the headline number vs a strict relevant-only score.
Corpus size matters: Compare entries with the same corpus size and benchmark family. A live-slice score and a frozen-regression score answer different questions.
What moves the number
Filtering noise markets from the corpus (estimated +3-4pp coverage)
Re-ranking a larger candidate set (e.g., top-20 down to top-5) with a cross-encoder
Query expansion to bridge semantic gaps (e.g., expand "monetary policy" to also search for "rate cut", "FOMC"; see the sketch after this list)
Better embeddings for the market corpus (structured metadata, not just the question title)
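As an illustration of the query-expansion idea, the sketch below searches the original concept plus its expansion terms and merges the candidate pools before re-ranking down to the top 5; search and expand_terms are placeholders, not existing functions.

def retrieve_with_expansion(concept: str, search, expand_terms, k: int = 20) -> list[dict]:
    # Query expansion sketch: run the concept text plus a handful of expansion
    # terms (e.g. "monetary policy" -> "rate cut", "FOMC") through the same
    # retriever, then merge and dedupe candidates for a re-ranking pass.
    queries = [concept] + expand_terms(concept)
    seen, merged = set(), []
    for q in queries:
        for market in search(q, limit=k):
            if market["id"] not in seen:
                seen.add(market["id"])
                merged.append(market)
    return merged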
The benchmark leaderboard tracks which approaches actually work. Run the eval, publish, and compare.
Adding Results
All eval tooling lives in the search-eval repo.
Clone it, run the benchmark against a restored snapshot frame, then publish to this leaderboard with the freeze and bundle metadata attached. See the repo README for full setup instructions.
# From the search-eval repo root:
# 1. Create backups and restore temporary benchmark instances.
#    Capture the backup IDs from this step as EXT_BACKUP_ID and ORACLE_BACKUP_ID (used below).
TS="$(date -u +%Y%m%dT%H%M%SZ)"
LABEL="search-benchmark-${TS}"
EXT_RESTORE="polybridge-extension-bench-${TS}"
ORACLE_RESTORE="polybridge-prod-bench-${TS}"
# 2. Run the benchmark against those restored instances / temporary services
# (exact commands depend on the surface)
# 3. Write the freeze tag
.venv/bin/python scripts/create_benchmark_freeze_metadata.py \
--label "${LABEL}" \
--instance-frame "surface=extension_ads,source_instance=polybridge-extension,backup_id=${EXT_BACKUP_ID},restore_instance=${EXT_RESTORE}" \
--instance-frame "surface=oracle,source_instance=polybridge-prod,backup_id=${ORACLE_BACKUP_ID},restore_instance=${ORACLE_RESTORE}"
# 4. Create the benchmark bundle manifest
.venv/bin/python scripts/create_benchmark_bundle.py \
--label "${LABEL}" \
--benchmark-family search_snapshot_restore \
--benchmark-slice prod_frozen_restore \
--slice-timestamp "${TS}" \
--artifact <artifact-path> \
--dataset <dataset-path> \
--service extension-api \
--service oracle-api \
--freeze-metadata experiments/results/benchmark_freezes/${LABEL}.json \
--skip-snapshot
# 5. Push to this leaderboard with explicit frame metadata
.venv/bin/python scripts/publish_benchmark.py \
--label "My experiment name" \
--kind research \
--benchmark-family search_snapshot_restore \
--benchmark-slice prod_frozen_restore \
--freeze-metadata experiments/results/benchmark_freezes/${LABEL}.json \
--bundle-manifest data/benchmark_bundles/<bundle-dir>/manifest.json
Retrieval Overlap Analysis
How do vector search (Gemini embeddings) and keyword-extracted text search (Sonnet keywords) compare? Each method finds ~90% of concepts, but they miss different ones.
Vector-unique hits
Concepts found by embeddings but missed by keyword search. Typically abstract or indirect concepts where semantic similarity matters.
Keyword-unique hits
Concepts found by keyword search but missed by embeddings. Typically entity-specific concepts where exact name matching matters.
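The split itself is plain set arithmetic over per-method hit sets; a rough sketch (the function name and report shape are illustrative):

def overlap_report(vector_hits: set[str], keyword_hits: set[str], all_concepts: set[str]) -> dict:
    # vector_hits / keyword_hits: concept IDs each method covered (at least one
    # relevant-or-partial result in its top 5).
    return {
        "both": len(vector_hits & keyword_hits),
        "vector_unique": len(vector_hits - keyword_hits),
        "keyword_unique": len(keyword_hits - vector_hits),
        "missed_by_both": len(all_concepts - vector_hits - keyword_hits),
    }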