Below is a handoff-ready dataset-page audit checklist, explicitly aligned to the Data Extraction metric.

This is designed so any team member can run an audit in 5–10 minutes per page.

Purpose: identify why a dataset page does or does not lead to successful data extraction.

Score each item as Yes / No / Needs work.


Section 1: Relevance (answers “Is this the dataset I need?”)

Above the fold, can a user confirm relevance in ≤10 seconds?

  • Clear, specific title (what + where + when)
  • One-sentence description stating the problem this dataset solves
  • Explicit coverage shown:
    • Geography
    • Time range
    • Granularity
  • License / price clearly visible (free vs paid vs mixed)

If any item fails → extraction probability is low.


Section 2: Extraction Clarity (answers “Can I get something usable?”)

Before clicking, is the extraction outcome obvious?

  • Primary CTA clearly labeled (e.g. “Download CSV”, “Export chart”)
  • CTA states whether signup or payment is required
  • No surprise friction after click
  • Multiple extraction paths visible if applicable:
    • raw data
    • sample / filtered data
    • visualization export

Rule: no ambiguity about what happens next.


Section 3: Immediate Use (enables extraction)

Can users assess usefulness before committing?

  • Inline preview present (table, schema, or visualization)
  • Preview shows real data (not placeholders)
  • At least one low-friction extraction option available (sample download or chart export)

If users must “trust blindly,” extraction drops sharply.


Section 4: Trust & Reliability (supports extraction)

Once relevance is clear, is the dataset credible?

  • Source clearly identified
  • Methodology / notes available (expandable is fine)
  • “Last updated” or update frequency visible
  • Versioning or change log if applicable

Trust is secondary to relevance, but missing trust signals block extraction.


Section 5: Lateral Expansion (increases session value)

After one extraction, is there a clear next step?

  • Related datasets shown (topic / geography)
  • Logical “next dataset” suggestions
  • Same-publisher or same-collection links

Goal: multi-dataset sessions, not brand persuasion.


Section 6: Platform Signals (minimal, non-blocking)

Platform identity should not interrupt extraction.

  • Light “Published via DataHub” attribution present
  • Optional “Publish your own dataset” link
  • No modal or interrupt before first extraction

If platform messaging competes with extraction, the page fails this check.


Section 7: Instrumentation Check (mandatory)

Without tracking, optimization is impossible.

  • Page views tracked
  • All extraction events tracked distinctly:
    • raw downloads
    • sample downloads
    • visualization exports
  • No errors or broken links

If extraction events are not tracked → page is not auditable.
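
As a reference point, here is a minimal sketch of what distinct extraction-event tracking could look like. The event names, payload fields, and print-based sink are illustrative assumptions, not an existing schema.

```python
# Minimal sketch of distinct extraction-event tracking. Event names, payload
# fields, and the print-based sink are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

# One event name per extraction path so modalities stay comparable.
EXTRACTION_EVENTS = {"raw_download", "sample_download", "visualization_export"}

@dataclass
class ExtractionEvent:
    dataset_id: str
    event: str        # must be one of EXTRACTION_EVENTS
    session_id: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def track_extraction(dataset_id: str, event: str, session_id: str) -> ExtractionEvent:
    """Validate and emit one extraction event; print stands in for a real sink."""
    if event not in EXTRACTION_EVENTS:
        raise ValueError(f"Unknown extraction event: {event!r}")
    evt = ExtractionEvent(dataset_id, event, session_id)
    print(evt)  # replace with the actual analytics pipeline call
    return evt

track_extraction("example-dataset-id", "sample_download", "sess-42")
```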


Final Audit Verdict

After completing the checklist, assign one verdict:

  • ✅ Extraction-ready
  • ⚠️ Partially extractable (fixable)
  • ❌ Extraction-blocked

Only pages marked ✅ should be used as benchmarks.


One-Line Reminder (top of checklist)

“A dataset page succeeds only if a user leaves with a reusable artifact.”



Below are two copy-paste-ready appendices, written to slot directly after the existing documents. They are concise, explicit, and operational.


Appendix C: How the Dataset Page Checklist Maps to Metrics

Purpose: make clear which metrics should move when a checklist item is improved. This ensures audits translate into measurable outcomes.


Section 1: Relevance (Is this the dataset I need?)

Checklist focus:

  • Clear scope, coverage, and problem statement above the fold.

Primary metrics affected:

  • Bounce rate ↓
  • Median time on page ↑
  • Data extractions / 1,000 views ↑

Diagnostic signal:

  • High page views + low time on page (<10s) indicates relevance failure.

Section 2: Extraction Clarity (What happens if I click?)

Checklist focus:

  • Clear CTA labeling, no surprise friction.

Primary metrics affected:

  • Click-through to extraction actions ↑
  • Abandoned extraction attempts ↓
  • Data extractions / 1,000 views ↑

Diagnostic signal:

  • Users hover or scroll but do not click CTAs.

Section 3: Immediate Use (Can I assess value now?)

Checklist focus:

  • Preview, sample data, visualizations.

Primary metrics affected:

  • Preview interactions ↑
  • First extraction rate ↑
  • Data extractions / 1,000 views ↑

Diagnostic signal:

  • Long time on page + low extraction → insufficient preview.

Section 4: Trust & Reliability (Should I rely on this?)

Checklist focus:

  • Source, methodology, freshness.

Primary metrics affected:

  • Completion of extraction after preview ↑
  • Repeat visits to the same dataset ↑

Diagnostic signal:

  • Preview interaction occurs, but extraction does not.

Section 5: Lateral Expansion (What next?)

Checklist focus:

  • Related datasets, logical next steps.

Primary metrics affected:

  • Multi-dataset sessions ↑
  • Extractions per session ↑

Diagnostic signal:

  • Single-page sessions dominate despite successful extraction.

Section 6: Platform Signals (Non-blocking identity)

Checklist focus:

  • Minimal, contextual platform messaging.

Primary metrics affected:

  • None directly (guardrail only)

Diagnostic signal:

  • Any interruption before first extraction correlates with higher bounce.

Section 7: Instrumentation Check

Checklist focus:

  • Correct event tracking.

Primary metrics affected:

  • All metrics become trustworthy.

Diagnostic signal:

  • Page cannot be evaluated or compared.

Summary Mapping Table (conceptual)

  • Relevance → Time on page, Bounce rate
  • Clarity → CTA clicks
  • Immediate Use → First extraction
  • Trust → Completion after preview
  • Expansion → Multi-dataset sessions
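
To make the diagnostic signals above actionable, here is one possible sketch that turns them into automated flags. Every field name and threshold below is an assumption chosen to mirror the signals in this appendix, not an existing analytics schema.

```python
# Illustrative translation of the diagnostic signals above into automated flags.
# All field names and thresholds are assumptions.
def diagnose_page(stats: dict) -> list[str]:
    """Map per-page stats to the checklist section most likely failing."""
    flags = []
    if stats["views"] > 500 and stats["median_time_s"] < 10:
        flags.append("Section 1: relevance (traffic, but <10s on page)")
    if stats["cta_clicks"] == 0 and stats["scroll_depth"] > 0.5:
        flags.append("Section 2: extraction clarity (scrolling, but no CTA clicks)")
    if stats["median_time_s"] > 30 and stats["extractions"] == 0:
        flags.append("Section 3: immediate use (long dwell, no extraction)")
    if stats["preview_interactions"] > 0 and stats["extractions"] == 0:
        flags.append("Section 4: trust (preview used, extraction abandoned)")
    if stats["extractions"] > 0 and stats["multi_dataset_sessions"] == 0:
        flags.append("Section 5: lateral expansion (no follow-on datasets)")
    return flags

print(diagnose_page({
    "views": 1200, "median_time_s": 8, "cta_clicks": 0, "scroll_depth": 0.7,
    "extractions": 0, "preview_interactions": 0, "multi_dataset_sessions": 0,
}))
```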


Appendix D: Benchmark Ranges (Early-Stage, High-Intent Pages)

Purpose: give the team directional targets, not rigid KPIs. These are conservative ranges derived from data portal, marketplace, and CRO research (OECD 2018; Nielsen Norman Group; Statista usage analyses).


Primary Metric Benchmarks

Data Extractions per 1,000 Dataset Page Views

  • <10 → Poor (page likely fails relevance or clarity)
  • 10–25 → Acceptable (baseline utility)
  • 25–50 → Good (clear value realization)
  • >50 → Excellent (benchmark candidate)

Early goal: move top pages into the 25–50 range.
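
As a sketch, these bands can be applied programmatically; the function name, band labels, and boundary handling below are illustrative, not a fixed spec.

```python
# Sketch of the benchmark bands above as a classification helper.
def classify_extraction_density(extractions: int, views: int) -> str:
    """Return the benchmark band for data extractions per 1,000 page views."""
    if views == 0:
        return "No traffic"
    per_1000 = extractions / views * 1000
    if per_1000 < 10:
        return "Poor"
    if per_1000 <= 25:
        return "Acceptable"
    if per_1000 <= 50:
        return "Good"
    return "Excellent"

print(classify_extraction_density(extractions=38, views=1000))  # -> Good
```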


Supporting Benchmarks

Bounce Rate

  • >70% → likely relevance failure
  • 40–70% → normal for search landings
  • <40% → strong alignment with intent

Median Time on Page

  • <10s → misleading search match
  • 10–30s → skim / evaluation
  • >30s → engaged assessment

Multi-Dataset Sessions

  • <10% → no lateral expansion
  • 10–25% → healthy
  • >25% → strong data discovery behavior


Extraction Modality Mix (diagnostic only)

No “correct” mix yet, but watch trends:

  • Rising visualization exports → non-technical demand increasing
  • Rising raw downloads → analyst / developer demand
  • High sample-only usage → preview useful, full access unclear or gated too early
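
A rough sketch of computing the modality mix from tracked extraction events; the event names follow the hypothetical instrumentation sketch in Section 7 and are assumptions.

```python
# Rough sketch of the modality mix as shares of all extraction events.
from collections import Counter

def modality_mix(events: list[str]) -> dict[str, float]:
    """Return each extraction modality's share of all extraction events."""
    counts = Counter(events)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}

print(modality_mix([
    "raw_download", "sample_download", "sample_download", "visualization_export",
]))
```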

Guardrails (What not to optimize for yet)

  • Email capture rate
  • Account creation rate
  • Subscription rate

These should improve after extraction density improves.


One-Line Benchmark Reminder

“A page with traffic but no extraction is not an asset—it’s leakage.”


If you want, I can next:

  • condense all appendices into a single internal playbook, or
  • translate benchmarks into automated alerts (e.g. flag pages below threshold).