The primary regression benchmark uses 89 causal reasoning questions with 323 concepts across 6 categories:
finance, US politics, geopolitics, AI/tech, science, biorisk/biotech.
View dataset on GitHub.
Each question has 2-6 concepts typed as upstream (cause), downstream (effect), indicator (signal), or component (sub-event)
Concepts seeded from production corpus themes, not generated blindly
Reviewed by 3 independent reviewers (Codex, Gemini, ChatGPT) for retrieval fitness
45 candidates dropped for duplicate retrieval paths, weak corpus fit, or vague concepts
Published as search_frozen_regression when used for stable algorithm comparisons
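For illustration, a single benchmark question could be represented roughly like the record below. The schema and example content here are invented for this sketch, not taken from the published dataset.

# Hypothetical shape of one benchmark question; field names and values are
# illustrative only, not the published dataset schema.
question = {
    "question": "Will 30-year US mortgage rates fall below 6% this year?",
    "category": "finance",
    "concepts": [
        {"id": "c_fed_rate_cut", "text": "Fed rate cut", "type": "upstream"},              # cause
        {"id": "c_mortgage_apps", "text": "mortgage application volume", "type": "indicator"},  # signal
        {"id": "c_home_sales", "text": "existing home sales", "type": "downstream"},        # effect
    ],
}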
Live Slice
The live slice measures the current production-style retrieval path against the current production corpus. It is intended for operational robustness, not for strict historical regression.
Published as search_live_slice
Must include a slice timestamp or version label
Must include corpus counts and per-platform breakdowns
Scores are not comparable across time without checking corpus drift
Snapshot Restore Frame
The benchmark-of-record should run against a restored Cloud SQL backup, not directly against the moving production corpus. This is the frame for matrix refreshes and website numbers.
Published as search_snapshot_restore
Must record backup IDs, restore instances, datasets, and bundle manifest
Extension + ads freeze from polybridge-extension
Oracle freezes from polybridge-prod
Judging
Each (concept, market) pair is judged by an LLM on a 3-point scale:
Relevant (1.0) - market directly measures or tracks the concept
Partial (0.5) - same broad domain, loosely connected
Irrelevant (0.0) - no meaningful causal connection
Judgments are cached and keyed by the stable concept ID rather than the concept text. The cache key is
concept_id::market_question, so rewording a concept does not invalidate existing judgments.
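A minimal sketch of this judging-and-caching flow, assuming a simple in-memory cache; judge_with_llm and the function names are placeholders, not the repo's API.

# Sketch of the 3-point judging scale with a judgment cache keyed by stable
# concept ID, so concept rewording does not trigger re-judging.
SCORES = {"relevant": 1.0, "partial": 0.5, "irrelevant": 0.0}

_judgment_cache: dict[str, float] = {}

def judge_with_llm(concept_text: str, market_question: str) -> str:
    # Placeholder for the actual LLM judge call; returns "relevant", "partial", or "irrelevant".
    raise NotImplementedError

def judge(concept_id: str, concept_text: str, market_question: str) -> float:
    # Cache key uses the stable concept ID plus the market question text.
    key = f"{concept_id}::{market_question}"
    if key not in _judgment_cache:
        label = judge_with_llm(concept_text, market_question)
        _judgment_cache[key] = SCORES[label]
    return _judgment_cache[key]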
Metrics
Judged Coverage @5 (primary) - % of concepts with at least one relevant or partial result in top 5
MRR @5 - mean reciprocal rank of first relevant result
nDCG @5 - normalized discounted cumulative gain
Strict Coverage @5 - % of concepts with an exact-phrase match in the top 5 (no LLM judge)
95% bootstrap confidence intervals (10K samples) at concept level
Per-concept-type breakdown: upstream vs downstream vs indicator vs component
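The sketch below shows one way to compute the headline metrics from per-concept judged gains (1.0 / 0.5 / 0.0 for the top-5 results, in rank order). The function names and the treatment of partial credit in MRR are assumptions; the authoritative implementation lives in the search-eval repo.

import random

def judged_coverage_at_5(gains: list[list[float]]) -> float:
    # Fraction of concepts with at least one relevant-or-partial hit in the top 5.
    return sum(any(g > 0 for g in per_concept[:5]) for per_concept in gains) / len(gains)

def mrr_at_5(gains: list[list[float]], min_gain: float = 1.0) -> float:
    # Reciprocal rank of the first relevant result, 0 if none in the top 5.
    # min_gain=1.0 counts only fully relevant hits; counting partials is a choice.
    rr = [
        next((1 / (i + 1) for i, g in enumerate(per_concept[:5]) if g >= min_gain), 0.0)
        for per_concept in gains
    ]
    return sum(rr) / len(rr)

def bootstrap_ci(gains, metric, n=10_000, alpha=0.05):
    # Concept-level bootstrap: resample concepts with replacement, recompute
    # the metric, and take the empirical 95% interval.
    samples = sorted(metric(random.choices(gains, k=len(gains))) for _ in range(n))
    return samples[int(n * alpha / 2)], samples[int(n * (1 - alpha / 2)) - 1]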
Benchmark Discipline
Every published search run should declare its benchmark frame explicitly so humans and agents do not confuse algorithm changes with corpus drift.
Always record benchmark family, slice, and slice timestamp
Always record whether the entry is freeze-backed or live-slice
Always record corpus size and concept count
Always record the freeze tag, datasets, and bundle manifest for benchmark-of-record runs
Never replace the frozen benchmark with a live slice
Version new benchmark frames instead of mutating old ones
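As a concrete illustration, the frame metadata attached to a published entry might carry fields along these lines. The field names are illustrative, not the actual publish schema; the family/slice values match the snapshot-restore commands further down this page.

# Illustrative frame metadata for one published run; field names are assumptions.
entry_frame = {
    "benchmark_family": "search_snapshot_restore",
    "benchmark_slice": "prod_frozen_restore",
    "slice_timestamp": "<TS>",                 # e.g. output of date -u +%Y%m%dT%H%M%SZ
    "freeze_backed": True,                     # False for live-slice entries
    "corpus_size": "<market count>",           # plus per-platform breakdowns for live slices
    "concept_count": 323,
    "freeze_tag": "search-benchmark-<TS>",
    "bundle_manifest": "data/benchmark_bundles/<bundle-dir>/manifest.json",
}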
Reading the Results
Judged Coverage @5
The primary metric. "For what % of causal concepts did the system find at least one relevant market in the top 5?"
Higher is better. 90% means 10% of concepts returned no useful results.
This uses LLM-judged relevance (relevant or partial counts as a hit).
A relevant-only score (no partial credit) would be lower. Check the judgment
breakdown in the eval results for the split.
MRR (Mean Reciprocal Rank)
How high does the first relevant result rank? MRR of 1.0 means rank 1 every time.
MRR of 0.5 means rank 2 on average. Captures whether the system puts the best result
first, not just somewhere in the top 5.
If coverage is high but MRR is low, the system finds relevant markets but buries them.
nDCG@5
Ranking quality across all 5 positions, weighted by relevance. Rewards having
multiple relevant results and penalizes relevant results at lower ranks.
More nuanced than MRR since it considers the full result list, not just the first hit.
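For reference, a sketch of nDCG@5 under one common formulation (linear gains, log2 discount); the repo's exact variant may differ.

import math

def ndcg_at_5(gains: list[float]) -> float:
    # gains: judged scores (1.0 / 0.5 / 0.0) of the top-5 results, in rank order.
    def dcg(scores):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(scores))
    ideal = dcg(sorted(gains, reverse=True)[:5])
    return dcg(gains[:5]) / ideal if ideal > 0 else 0.0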
Strict Coverage
Phrase matching only, no LLM judge. "Does the market title literally contain the expected
keywords?" This will always be low for causal retrieval because the value is finding
non-obvious connections (e.g., "Fed rate cut" market for a "mortgage rates" concept).
Useful as a sanity check and for fast iteration without burning judge calls.
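A strict hit can be approximated with case-insensitive substring matching, as in the sketch below; the actual matcher may normalize text differently.

def strict_hit_at_5(expected_phrases: list[str], top5_titles: list[str]) -> bool:
    # A concept counts as a strict hit if any expected phrase appears verbatim
    # (case-insensitive) in any of the top-5 market titles. Illustrative only.
    return any(
        phrase.lower() in title.lower()
        for phrase in expected_phrases
        for title in top5_titles
    )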
Known limitations
Corpus gaps: Some concepts (clinical trials, niche legislation) have no matching prediction market. Failures here are not retrieval bugs.
Noise markets: ~1,100 formulaic markets (FDV launches, stock strike variants) pollute the embedding space. ~6% of top-5 results are noise. See the filtering brief for details.
Partial credit: Judged coverage counts "partial" matches (same domain, loosely connected) as hits. This inflates the headline number vs a strict relevant-only score.
Corpus size matters: Compare entries with the same corpus size and benchmark family. A live-slice score and a frozen-regression score answer different questions.
What moves the number
Filtering noise markets from the corpus (estimated +3-4pp coverage)
Re-ranking a larger candidate set (e.g., top-20 down to top-5) with a cross-encoder
Query expansion to bridge semantic gaps (e.g., expand "monetary policy" to also search for "rate cut", "FOMC"; see the sketch after this list)
Better embeddings for the market corpus (structured metadata, not just the question title)
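As an illustration of the query-expansion idea, the sketch below searches the original concept plus its expansion terms and merges the candidate pools before re-ranking down to the top 5; search and expand_terms are placeholders, not existing functions.

def retrieve_with_expansion(concept: str, search, expand_terms, k: int = 20) -> list[dict]:
    # Query expansion sketch: run the concept text plus a handful of expansion
    # terms (e.g. "monetary policy" -> "rate cut", "FOMC") through the same
    # retriever, then merge and dedupe candidates for a re-ranking pass.
    queries = [concept] + expand_terms(concept)
    seen, merged = set(), []
    for q in queries:
        for market in search(q, limit=k):
            if market["id"] not in seen:
                seen.add(market["id"])
                merged.append(market)
    return merged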
The benchmark leaderboard tracks which approaches actually work. Run the eval, publish, and compare.
Adding Results
All eval tooling lives in the search-eval repo.
Clone it, run the benchmark against a restored snapshot frame, then publish to this leaderboard with the freeze and bundle metadata attached. See the repo README for full setup instructions.
# From the search-eval repo root:
# 1. Create backups and restore temporary benchmark instances.
#    Capture the backup IDs from this step as EXT_BACKUP_ID and ORACLE_BACKUP_ID (used below).
TS="$(date -u +%Y%m%dT%H%M%SZ)"
LABEL="search-benchmark-${TS}"
EXT_RESTORE="polybridge-extension-bench-${TS}"
ORACLE_RESTORE="polybridge-prod-bench-${TS}"
# 2. Run the benchmark against those restored instances / temporary services
# (exact commands depend on the surface)
# 3. Write the freeze tag
.venv/bin/python scripts/create_benchmark_freeze_metadata.py \
--label "${LABEL}" \
--instance-frame "surface=extension_ads,source_instance=polybridge-extension,backup_id=${EXT_BACKUP_ID},restore_instance=${EXT_RESTORE}" \
--instance-frame "surface=oracle,source_instance=polybridge-prod,backup_id=${ORACLE_BACKUP_ID},restore_instance=${ORACLE_RESTORE}"
# 4. Create the benchmark bundle manifest
.venv/bin/python scripts/create_benchmark_bundle.py \
--label "${LABEL}" \
--benchmark-family search_snapshot_restore \
--benchmark-slice prod_frozen_restore \
--slice-timestamp "${TS}" \
--artifact <artifact-path> \
--dataset <dataset-path> \
--service extension-api \
--service oracle-api \
--freeze-metadata experiments/results/benchmark_freezes/${LABEL}.json \
--skip-snapshot
# 5. Push to this leaderboard with explicit frame metadata
.venv/bin/python scripts/publish_benchmark.py \
--label "My experiment name" \
--kind research \
--benchmark-family search_snapshot_restore \
--benchmark-slice prod_frozen_restore \
--freeze-metadata experiments/results/benchmark_freezes/${LABEL}.json \
--bundle-manifest data/benchmark_bundles/<bundle-dir>/manifest.json
Retrieval Overlap Analysis
How do vector search (Gemini embeddings) and keyword-extracted text search (Sonnet keywords) compare? Each method finds ~90% of concepts, but they miss different ones.
Vector-unique hits
Concepts found by embeddings but missed by keyword search. Typically abstract or indirect concepts where semantic similarity matters.
Keyword-unique hits
Concepts found by keyword search but missed by embeddings. Typically entity-specific concepts where exact name matching matters.
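The split itself is plain set arithmetic over per-method hit sets; a rough sketch (the function name and report shape are illustrative):

def overlap_report(vector_hits: set[str], keyword_hits: set[str], all_concepts: set[str]) -> dict:
    # vector_hits / keyword_hits: concept IDs each method covered (at least one
    # relevant-or-partial result in its top 5).
    return {
        "both": len(vector_hits & keyword_hits),
        "vector_unique": len(vector_hits - keyword_hits),
        "keyword_unique": len(keyword_hits - vector_hits),
        "missed_by_both": len(all_concepts - vector_hits - keyword_hits),
    }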