Dataset Page Audit Checklist
Below is a handoff-ready dataset-page audit checklist, explicitly aligned to the Data Extraction metric.
This is designed so any team member can run an audit in 5–10 minutes per page.
Purpose: identify why a dataset page does or does not lead to successful data extraction.
Score each item Yes / No / Needs work.
Section 1: Relevance (answers “Is this the dataset I need?”)
Above the fold, can a user confirm relevance in ≤10 seconds?
- Clear, specific title (what + where + when)
- One-sentence description stating the problem this dataset solves
- Explicit coverage shown:
  - Geography
  - Time range
  - Granularity
- License / price clearly visible (free vs paid vs mixed)
If any item fails → extraction probability is low.
Section 2: Extraction Clarity (answers “Can I get something usable?”)
Before clicking, is the extraction outcome obvious?
- Primary CTA clearly labeled (e.g. “Download CSV”, “Export chart”)
- CTA states whether signup or payment is required
- No surprise friction after click
- Multiple extraction paths visible if applicable:
  - raw data
  - sample / filtered data
  - visualization export
Rule: no ambiguity about what happens next.
Section 3: Immediate Use (enables extraction)
Can users assess usefulness before committing?
- Inline preview present (table, schema, or visualization)
- Preview shows real data (not placeholders)
- At least one low-friction extraction option available (sample download or chart export)
If users must “trust blindly,” extraction drops sharply.
Section 4: Trust & Reliability (supports extraction)
Once relevance is clear, is the dataset credible?
- Source clearly identified
- Methodology / notes available (expandable is fine)
- “Last updated” or update frequency visible
- Versioning or change log if applicable
Trust is secondary, but missing trust blocks extraction.
Section 5: Lateral Expansion (increases session value)
After one extraction, is there a clear next step?
- Related datasets shown (topic / geography)
- Logical “next dataset” suggestions
- Same-publisher or same-collection links
Goal: multi-dataset sessions, not brand persuasion.
Section 6: Platform Signals (minimal, non-blocking)
Platform identity should not interrupt extraction.
- Light “Published via DataHub” attribution present
- Optional “Publish your own dataset” link
- No modal or interrupt before first extraction
If platform messaging competes with extraction, the page fails this section.
Section 7: Instrumentation Check (mandatory)
Without tracking, optimization is impossible.
- Page views tracked
- All extraction events tracked distinctly:
  - raw downloads
  - sample downloads
  - visualization exports
- Errors / broken links absent
If extraction events are not tracked → page is not auditable.
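As a minimal sketch of what “tracked distinctly” means in practice (the event names, payload fields, and `emit_extraction_event` helper below are illustrative assumptions, not an existing API):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# One distinct event name per extraction modality -- never a single
# generic "download" event. These names are assumptions; use whatever
# taxonomy your analytics pipeline defines.
EXTRACTION_EVENTS = {"raw_download", "sample_download", "visualization_export"}

@dataclass
class ExtractionEvent:
    event: str        # one of EXTRACTION_EVENTS
    dataset_id: str
    session_id: str
    timestamp: str    # ISO 8601, UTC

def emit_extraction_event(event: str, dataset_id: str, session_id: str) -> dict:
    """Build a distinct, auditable extraction event (hypothetical schema)."""
    if event not in EXTRACTION_EVENTS:
        raise ValueError(f"untracked extraction modality: {event}")
    return asdict(ExtractionEvent(
        event=event,
        dataset_id=dataset_id,
        session_id=session_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ))  # hand the payload to your analytics client here
```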
Final Audit Verdict
After completing the checklist, assign one verdict:
- ✅ Extraction-ready
- ⚠️ Partially extractable (fixable)
- ❌ Extraction-blocked
Only pages marked ✅ should be used as benchmarks.
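To make the verdict mechanical rather than a judgment call, the per-section scores can be aggregated. The rule below (a failed item in Sections 1–3 or 7 blocks the page outright) is an assumption for illustration; the checklist itself does not prescribe a weighting:

```python
def audit_verdict(section_results: dict[int, str]) -> str:
    """Map per-section results ("yes" / "needs_work" / "no") to a verdict.

    Assumption: Sections 1-3 and 7 gate extraction directly, so a "no"
    in any of them blocks the page outright.
    """
    blocking_sections = {1, 2, 3, 7}
    if any(section_results.get(s) == "no" for s in blocking_sections):
        return "❌ Extraction-blocked"
    if any(result != "yes" for result in section_results.values()):
        return "⚠️ Partially extractable (fixable)"
    return "✅ Extraction-ready"

# Example: trust (Section 4) needs work, everything else passes.
print(audit_verdict({1: "yes", 2: "yes", 3: "yes", 4: "needs_work",
                     5: "yes", 6: "yes", 7: "yes"}))
# -> ⚠️ Partially extractable (fixable)
```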
One-Line Reminder (top of checklist)
“A dataset page succeeds only if a user leaves with a reusable artifact.”
Two appendices follow: Appendix C maps each checklist section to the metrics it should move, and Appendix D provides benchmark ranges. Both are written to slot directly after the checklist.
Appendix C: How the Dataset Page Checklist Maps to Metrics
Purpose: make clear which metrics should move when a checklist item is improved. This ensures audits translate into measurable outcomes.
Section 1: Relevance (Is this the dataset I need?)
Checklist focus:
- Clear scope, coverage, and problem statement above the fold.
Primary metrics affected:
- Bounce rate ↓
- Median time on page ↑
- Data extractions / 1,000 views ↑
Diagnostic signal:
- High page views + low time on page (<10s) indicates relevance failure.
Section 2: Extraction Clarity (What happens if I click?)
Checklist focus:
- Clear CTA labeling, no surprise friction.
Primary metrics affected:
- Click-through to extraction actions ↑
- Abandoned extraction attempts ↓
- Data extractions / 1,000 views ↑
Diagnostic signal:
- Users hover or scroll but do not click CTAs.
Section 3: Immediate Use (Can I assess value now?)
Checklist focus:
- Preview, sample data, visualizations.
Primary metrics affected:
- Preview interactions ↑
- First extraction rate ↑
- Data extractions / 1,000 views ↑
Diagnostic signal:
- Long time on page + low extraction → insufficient preview.
Section 4: Trust & Reliability (Should I rely on this?)
Checklist focus:
- Source, methodology, freshness.
Primary metrics affected:
- Completion of extraction after preview ↑
- Repeat visits to the same dataset ↑
Diagnostic signal:
- Preview interaction occurs, but extraction does not.
Section 5: Lateral Expansion (What next?)
Checklist focus:
- Related datasets, logical next steps.
Primary metrics affected:
- Multi-dataset sessions ↑
- Extractions per session ↑
Diagnostic signal:
- Single-page sessions dominate despite successful extraction.
Section 6: Platform Signals (Non-blocking identity)
Checklist focus:
- Minimal, contextual platform messaging.
Primary metrics affected:
- None directly (guardrail only)
Diagnostic signal:
- Any interruption before first extraction correlates with higher bounce.
Section 7: Instrumentation Check
Checklist focus:
- Correct event tracking.
Primary metrics affected:
- All metrics become trustworthy.
Diagnostic signal:
- Page cannot be evaluated or compared.
Summary Mapping Table (conceptual)
- Relevance → Time on page, Bounce rate
- Clarity → CTA clicks
- Immediate Use → First extraction
- Trust → Completion after preview
- Expansion → Multi-dataset sessions
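These diagnostic signals can be mechanized as a single triage pass over a page’s metrics. The field names and thresholds below are illustrative assumptions, aligned with the ranges in Appendix D:

```python
def diagnose(page: dict) -> list[str]:
    """Flag the checklist sections most likely failing, per the signals above.

    `page` keys are hypothetical metric names; thresholds mirror Appendix D.
    """
    flags = []
    if page["views"] > 1000 and page["median_time_on_page_s"] < 10:
        flags.append("Section 1: relevance (traffic, but <10s dwell)")
    if page["cta_hover_rate"] > 0.10 and page["cta_click_rate"] < 0.02:
        flags.append("Section 2: clarity (hover/scroll, but no CTA clicks)")
    if page["median_time_on_page_s"] > 30 and page["extractions_per_1000"] < 10:
        flags.append("Section 3: immediate use (long dwell, low extraction)")
    if page["preview_rate"] > 0.20 and page["extraction_after_preview"] < 0.10:
        flags.append("Section 4: trust (previews happen, extraction doesn't)")
    if page["multi_dataset_session_share"] < 0.10:
        flags.append("Section 5: expansion (single-page sessions dominate)")
    return flags or ["no diagnostic signal fired"]
```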
Appendix D: Benchmark Ranges (Early-Stage, High-Intent Pages)
Purpose: give the team directional targets, not rigid KPIs. These are conservative ranges derived from data portal, marketplace, and CRO research (OECD 2018; Nielsen Norman Group; Statista usage analyses).
Primary Metric Benchmarks
Data Extractions per 1,000 Dataset Page Views
- <10 → Poor (page likely fails relevance or clarity)
- 10–25 → Acceptable (baseline utility)
- 25–50 → Good (clear value realization)
- >50 → Excellent (benchmark candidate)
Early goal: move top pages into the 25–50 range.
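The banding is trivial to compute; the sketch below encodes only the ranges listed above, and the same pattern extends to the supporting benchmarks that follow:

```python
def extraction_density(extractions: int, page_views: int) -> float:
    """Data extractions per 1,000 dataset page views."""
    return 1000 * extractions / page_views

def density_band(per_1000: float) -> str:
    """Classify a page against the benchmark ranges above."""
    if per_1000 < 10:
        return "Poor"
    if per_1000 < 25:
        return "Acceptable"
    if per_1000 <= 50:
        return "Good"
    return "Excellent"

# Example: 180 extractions on 6,000 views -> 30.0 per 1,000 -> "Good"
print(density_band(extraction_density(180, 6000)))
```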
Supporting Benchmarks
Bounce Rate
- >70% → likely relevance failure
- 40–70% → normal for search landings
- <40% → strong alignment with intent
Median Time on Page
- <10s → misleading search match
- 10–30s → skim / evaluation
- >30s → engaged assessment
Multi-Dataset Sessions
- <10% → no lateral expansion
- 10–25% → healthy
- >25% → strong data discovery behavior
Extraction Modality Mix (diagnostic only)
No “correct” mix yet, but watch trends:
- Rising visualization exports → non-technical demand increasing
- Rising raw downloads → analyst / developer demand
- High sample-only usage → preview useful, full access unclear or gated too early
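To watch these trends, a per-period share computation is enough. The event names below match the hypothetical taxonomy from the instrumentation sketch in Section 7:

```python
from collections import Counter

MODALITIES = ("raw_download", "sample_download", "visualization_export")

def modality_mix(events: list[str]) -> dict[str, float]:
    """Share of each extraction modality among a period's events."""
    counts = Counter(e for e in events if e in MODALITIES)
    total = sum(counts.values()) or 1  # avoid div-by-zero on empty periods
    return {m: counts[m] / total for m in MODALITIES}

# Example: sample downloads dominating suggests the preview works but
# full access is unclear or gated too early (per the note above).
print(modality_mix(["sample_download"] * 7 + ["raw_download"] * 2
                   + ["visualization_export"]))
```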
Guardrails (What not to optimize for yet)
- Email capture rate
- Account creation rate
- Subscription rate
These should rise naturally once extraction density improves.
One-Line Benchmark Reminder
“A page with traffic but no extraction is not an asset—it’s leakage.”