Dataset Page Audit Checklist
Below is a handoff-ready dataset-page audit checklist, explicitly aligned to the Data Extraction metric.
This is designed so any team member can run an audit in 5–10 minutes per page.
Purpose: identify why a dataset page does or does not lead to successful data extraction.
Score each item Yes / No / Needs work.
Section 1: Relevance (answers “Is this the dataset I need?”)
Above the fold, can a user confirm relevance in ≤10 seconds?
- Clear, specific title (what + where + when)
- One-sentence description stating the problem this dataset solves
- Explicit coverage shown:
  - Geography
  - Time range
  - Granularity
- License / price clearly visible (free vs paid vs mixed)
If any item fails → extraction probability is low.
Section 2: Extraction Clarity (answers “Can I get something usable?”)
Before clicking, is the extraction outcome obvious?
- Primary CTA clearly labeled (e.g. “Download CSV”, “Export chart”)
- CTA states whether signup or payment is required
- No surprise friction after click
- Multiple extraction paths visible if applicable:
  - raw data
  - sample / filtered data
  - visualization export
Rule: no ambiguity about what happens next.
Section 3: Immediate Use (enables extraction)
Can users assess usefulness before committing?
- Inline preview present (table, schema, or visualization)
- Preview shows real data (not placeholders)
- At least one low-friction extraction option available (sample download or chart export)
If users must “trust blindly,” extraction drops sharply.
Section 4: Trust & Reliability (supports extraction)
Once relevance is clear, is the dataset credible?
- Source clearly identified
- Methodology / notes available (expandable is fine)
- “Last updated” or update frequency visible
- Versioning or change log if applicable
Trust is secondary, but missing trust blocks extraction.
Section 5: Lateral Expansion (increases session value)
After one extraction, is there a clear next step?
- Related datasets shown (topic / geography)
- Logical “next dataset” suggestions
- Same-publisher or same-collection links
Goal: multi-dataset sessions, not brand persuasion.
Section 6: Platform Signals (minimal, non-blocking)
Platform identity should not interrupt extraction.
- Light “Published via DataHub” attribution present
- Optional “Publish your own dataset” link
- No modal or interrupt before first extraction
If platform messaging competes with extraction, the page fails this section.
Section 7: Instrumentation Check (mandatory)
Without tracking, optimization is impossible.
- Page views tracked
- All extraction events tracked distinctly:
  - raw downloads
  - sample downloads
  - visualization exports
- Errors / broken links absent
If extraction events are not tracked → page is not auditable.
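As a minimal sketch of what “tracked distinctly” means in practice (the event names, payload fields, and `emit_extraction_event` helper below are illustrative assumptions, not an existing API):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# One distinct event name per extraction modality -- never a single
# generic "download" event. These names are assumptions; use whatever
# taxonomy your analytics pipeline defines.
EXTRACTION_EVENTS = {"raw_download", "sample_download", "visualization_export"}

@dataclass
class ExtractionEvent:
    event: str        # one of EXTRACTION_EVENTS
    dataset_id: str
    session_id: str
    timestamp: str    # ISO 8601, UTC

def emit_extraction_event(event: str, dataset_id: str, session_id: str) -> dict:
    """Build a distinct, auditable extraction event (hypothetical schema)."""
    if event not in EXTRACTION_EVENTS:
        raise ValueError(f"untracked extraction modality: {event}")
    return asdict(ExtractionEvent(
        event=event,
        dataset_id=dataset_id,
        session_id=session_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ))  # hand the payload to your analytics client here
```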
Final Audit Verdict
After completing the checklist, assign one verdict:
- ✅ Extraction-ready
- ⚠️ Partially extractable (fixable)
- ❌ Extraction-blocked
Only pages marked ✅ should be used as benchmarks.
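To make the verdict mechanical rather than a judgment call, the per-section scores can be aggregated. The rule below (a failed item in Sections 1–3 or 7 blocks the page outright) is an assumption for illustration; the checklist itself does not prescribe a weighting:

```python
def audit_verdict(section_results: dict[int, str]) -> str:
    """Map per-section results ("yes" / "needs_work" / "no") to a verdict.

    Assumption: Sections 1-3 and 7 gate extraction directly, so a "no"
    in any of them blocks the page outright.
    """
    blocking_sections = {1, 2, 3, 7}
    if any(section_results.get(s) == "no" for s in blocking_sections):
        return "❌ Extraction-blocked"
    if any(result != "yes" for result in section_results.values()):
        return "⚠️ Partially extractable (fixable)"
    return "✅ Extraction-ready"

# Example: trust (Section 4) needs work, everything else passes.
print(audit_verdict({1: "yes", 2: "yes", 3: "yes", 4: "needs_work",
                     5: "yes", 6: "yes", 7: "yes"}))
# -> ⚠️ Partially extractable (fixable)
```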
One-Line Reminder (top of checklist)
“A dataset page succeeds only if a user leaves with a reusable artifact.”
Two appendices follow: Appendix C maps each checklist section to the metrics it should move, and Appendix D provides benchmark ranges. Both are written to slot directly after the checklist.
Appendix C: How the Dataset Page Checklist Maps to Metrics
Purpose: make clear which metrics should move when a checklist item is improved. This ensures audits translate into measurable outcomes.
Section 1: Relevance (Is this the dataset I need?)
Checklist focus:
- Clear scope, coverage, and problem statement above the fold.
Primary metrics affected:
- Bounce rate ↓
- Median time on page ↑
- Data extractions / 1,000 views ↑
Diagnostic signal:
- High page views + low time on page (<10s) indicates relevance failure.
Section 2: Extraction Clarity (What happens if I click?)
Checklist focus:
- Clear CTA labeling, no surprise friction.
Primary metrics affected:
- Click-through to extraction actions ↑
- Abandoned extraction attempts ↓
- Data extractions / 1,000 views ↑
Diagnostic signal:
- Users hover or scroll but do not click CTAs.
Section 3: Immediate Use (Can I assess value now?)
Checklist focus:
- Preview, sample data, visualizations.
Primary metrics affected:
- Preview interactions ↑
- First extraction rate ↑
- Data extractions / 1,000 views ↑
Diagnostic signal:
- Long time on page + low extraction → insufficient preview.
Section 4: Trust & Reliability (Should I rely on this?)
Checklist focus:
- Source, methodology, freshness.
Primary metrics affected:
- Completion of extraction after preview ↑
- Repeat visits to the same dataset ↑
Diagnostic signal:
- Preview interaction occurs, but extraction does not.
Section 5: Lateral Expansion (What next?)
Checklist focus:
- Related datasets, logical next steps.
Primary metrics affected:
- Multi-dataset sessions ↑
- Extractions per session ↑
Diagnostic signal:
- Single-page sessions dominate despite successful extraction.
Section 6: Platform Signals (Non-blocking identity)
Checklist focus:
- Minimal, contextual platform messaging.
Primary metrics affected:
- None directly (guardrail only)
Diagnostic signal:
- Any interruption before first extraction correlates with higher bounce.
Section 7: Instrumentation Check
Checklist focus:
- Correct event tracking.
Primary metrics affected:
- All metrics become trustworthy.
Diagnostic signal:
- Page cannot be evaluated or compared.
Summary Mapping Table (conceptual)
- Relevance → Time on page, Bounce rate
- Clarity → CTA clicks
- Immediate Use → First extraction
- Trust → Completion after preview
- Expansion → Multi-dataset sessions
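These diagnostic signals can be mechanized as a single triage pass over a page’s metrics. The field names and thresholds below are illustrative assumptions, aligned with the ranges in Appendix D:

```python
def diagnose(page: dict) -> list[str]:
    """Flag the checklist sections most likely failing, per the signals above.

    `page` keys are hypothetical metric names; thresholds mirror Appendix D.
    """
    flags = []
    if page["views"] > 1000 and page["median_time_on_page_s"] < 10:
        flags.append("Section 1: relevance (traffic, but <10s dwell)")
    if page["cta_hover_rate"] > 0.10 and page["cta_click_rate"] < 0.02:
        flags.append("Section 2: clarity (hover/scroll, but no CTA clicks)")
    if page["median_time_on_page_s"] > 30 and page["extractions_per_1000"] < 10:
        flags.append("Section 3: immediate use (long dwell, low extraction)")
    if page["preview_rate"] > 0.20 and page["extraction_after_preview"] < 0.10:
        flags.append("Section 4: trust (previews happen, extraction doesn't)")
    if page["multi_dataset_session_share"] < 0.10:
        flags.append("Section 5: expansion (single-page sessions dominate)")
    return flags or ["no diagnostic signal fired"]
```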
Appendix D: Benchmark Ranges (Early-Stage, High-Intent Pages)
Purpose: give the team directional targets, not rigid KPIs. These are conservative ranges derived from data portal, marketplace, and CRO research (OECD 2018; Nielsen Norman Group; Statista usage analyses).
Primary Metric Benchmarks
Data Extractions per 1,000 Dataset Page Views
- <10 → Poor (page likely fails relevance or clarity)
- 10–25 → Acceptable (baseline utility)
- 25–50 → Good (clear value realization)
- >50 → Excellent (benchmark candidate)
Early goal: move top pages into the 25–50 range.
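The banding is trivial to compute; the sketch below encodes only the ranges listed above, and the same pattern extends to the supporting benchmarks that follow:

```python
def extraction_density(extractions: int, page_views: int) -> float:
    """Data extractions per 1,000 dataset page views."""
    return 1000 * extractions / page_views

def density_band(per_1000: float) -> str:
    """Classify a page against the benchmark ranges above."""
    if per_1000 < 10:
        return "Poor"
    if per_1000 < 25:
        return "Acceptable"
    if per_1000 <= 50:
        return "Good"
    return "Excellent"

# Example: 180 extractions on 6,000 views -> 30.0 per 1,000 -> "Good"
print(density_band(extraction_density(180, 6000)))
```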
Supporting Benchmarks
Bounce Rate
- >70% → likely relevance failure
- 40–70% → normal for search landings
- <40% → strong alignment with intent
Median Time on Page
- <10s → misleading search match
- 10–30s → skim / evaluation
- >30s → engaged assessment
Multi-Dataset Sessions
- <10% → no lateral expansion
- 10–25% → healthy
- >25% → strong data discovery behavior
Extraction Modality Mix (diagnostic only)
No “correct” mix yet, but watch trends:
- Rising visualization exports → non-technical demand increasing
- Rising raw downloads → analyst / developer demand
- High sample-only usage → preview useful, full access unclear or gated too early
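To watch these trends, a per-period share computation is enough. The event names below match the hypothetical taxonomy from the instrumentation sketch in Section 7:

```python
from collections import Counter

MODALITIES = ("raw_download", "sample_download", "visualization_export")

def modality_mix(events: list[str]) -> dict[str, float]:
    """Share of each extraction modality among a period's events."""
    counts = Counter(e for e in events if e in MODALITIES)
    total = sum(counts.values()) or 1  # avoid div-by-zero on empty periods
    return {m: counts[m] / total for m in MODALITIES}

# Example: sample downloads dominating suggests the preview works but
# full access is unclear or gated too early (per the note above).
print(modality_mix(["sample_download"] * 7 + ["raw_download"] * 2
                   + ["visualization_export"]))
```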
Guardrails (What not to optimize for yet)
- Email capture rate
- Account creation rate
- Subscription rate
These should rise naturally once extraction density improves.
One-Line Benchmark Reminder
“A page with traffic but no extraction is not an asset—it’s leakage.”