Motivation / SCQH

Situation

  • DataHub has substantial existing distribution:

    • ~500k visitors (and ~20 years of SEO history)
    • Hundreds of datasets
    • High-intent behavior: ~20–50% download rates on datasets
  • Current operating posture:

    • Downloads do not require signup → limited capture of who users are / what they want
    • A “Premium” offer exists, but inbound interest is not handled reliably or consistently
  • Current revenue signal:

    • One notable one-off customer
    • ~3–4 recurring customers, mostly logistics-related
    • Roughly $200–300 MRR (a ~$500 annual figure has also been referenced)

Complication

  • Under-exploiting the existing asset (traffic + intent):

    • Weak measurement and understanding of conversion (beyond early analytics experiments)
    • Limited funnel progression: awareness → consideration → “conversion” (defined at least as signup, not only payment)
    • Limited user capture (no signup for download), and limited reliable follow-up on premium intent
  • Dataset quality + coverage issues:

    • Some “core datasets” are out of date despite meaningful traffic (example: gold prices)
    • Need to fix/maintain scraping scripts and update pipelines for key datasets
    • Not adding new datasets systematically to leverage SEO/distribution (esp. long-tail)
  • Strategy tension in the background:

    • “Substack for Data” / third-party publishers is a possible direction, but currently secondary to exploiting the existing site and catalog
  • Open uncertainty about focus and sequencing:

    • Many possible fronts (conversion instrumentation, conversion improvement, monetization validation, dataset maintenance, dataset expansion)
    • Need a working hypothesis for what to do first, when, and with what effort/expected return

Question

  • Top-level question candidates (choose one, or hold several as working hypotheses):

    • Q1: What is the best near-term strategy to convert existing high-intent traffic into measurable relationships (signup) and validated revenue, while improving dataset freshness?
    • Q2: Given limited bandwidth, what sequencing of “conversion system” work vs “dataset maintenance/expansion” maximizes learning and impact over the next cycle?
    • Q3: What is the minimum viable operating model that reliably turns DataHub’s SEO traffic into (a) updated high-value datasets and (b) a monetizable funnel—without over-investing upfront?
  • Sub-questions (structured as an issue tree):

    • Conversion measurement and baseline

      • What is the current baseline of “conversion” at each stage?

        • What counts as conversion right now (downloads only? any existing signup?)
        • What is the current dataset-level download rate distribution (since you see ~20–50% overall)?
      • How are we currently instrumenting conversion?

        • What events are tracked now (page view, dataset view, download click, outbound, etc.)?
        • Where are the gaps (e.g., download events not reliably captured; no identity capture)?
      • What baseline targets would be meaningful (near-term)?

        • What would “better” look like: a higher download rate, a higher signup rate, more premium inquiries, or all three?
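The instrumentation and baseline questions above can be made concrete with a minimal sketch. All event names (`dataset_view`, `download_click`, `signup`) and the in-memory event log are illustrative assumptions; the real analytics backend will differ:

```python
from collections import Counter

# Hypothetical event log: (event_name, dataset_slug) tuples.
# In practice these would come from the site's analytics backend.
events = [
    ("dataset_view", "gold-prices"),
    ("dataset_view", "gold-prices"),
    ("download_click", "gold-prices"),
    ("dataset_view", "country-codes"),
    ("download_click", "country-codes"),
    ("signup", "country-codes"),
]

def funnel_rates(events):
    """Compute per-dataset view -> download -> signup rates."""
    counts = {}
    for name, slug in events:
        counts.setdefault(slug, Counter())[name] += 1
    rates = {}
    for slug, c in counts.items():
        views = c["dataset_view"]
        rates[slug] = {
            "views": views,
            "download_rate": c["download_click"] / views if views else 0.0,
            "signup_rate": c["signup"] / views if views else 0.0,
        }
    return rates

print(funnel_rates(events))
```

Even this toy version surfaces the dataset-level download-rate distribution the baseline question asks about, and makes the measurement gaps (no identity capture, unreliable download events) explicit as missing event types.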
    • Funnel improvement (awareness → consideration → conversion-as-signup)

      • If “conversion” includes signup, what is the minimal signup capture that doesn’t harm downloads?

        • Do we keep frictionless downloads and add an optional/light capture?
        • Or gate certain actions (e.g., bulk download / API / freshest data) behind signup?
      • What are the most plausible levers to increase conversion without inventing new product surface area?

        • Improve dataset pages for clarity/trust (metadata completeness; freshness indicators)
        • Better calls-to-action around signup/premium on high-intent pages
      • Where is intent highest (by topic/category)?

        • Logistics datasets (days-of-week, geographic info, etc.)
        • “Gold prices” style high-traffic datasets with freshness sensitivity
      • What should be the immediate objective function?

        • Maximize signup capture?
        • Maximize premium inquiries?
        • Maximize successful downloads while capturing attribution?
    • Premium and monetization responsiveness

      • What is the current premium offer and how is it presented?

        • What are users currently being offered (and on which pages)?
      • What breaks today in responding to premium interest?

        • Where do inquiries land (email? form?) and what is the failure mode (latency, ownership, process)?
      • What is the minimum reliable workflow to respond consistently?

        • Ownership, SLA, templated responses, qualification questions
      • What is the “validation” step for demand?

        • Which signals count (inbound asks, conversion to calls, paid pilots, upgrades)?
      • How do current paying customers map to dataset categories?

        • Especially the logistics cluster: what are they actually paying for?
    • Core dataset freshness and maintenance

      • What proportion of “core datasets” are out of date?

        • By traffic share (not by count): which outdated datasets matter most because they drive significant visits/downloads?
      • What is broken in the update pipeline?

        • Scraping scripts: which ones are failing; how often; why?
        • Data ingestion/refresh cadence: what is desired vs current?
      • What is the minimal operational standard for freshness?

        • A defined refresh schedule for top datasets
        • A visible “last updated” and/or “data current through” marker
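The minimal operational standard above can be operationalized as a staleness check over catalog metadata. This is a sketch under assumptions: the `last_updated` field and per-dataset refresh cadence are hypothetical names, not existing DataHub schema:

```python
from datetime import date, timedelta

# Hypothetical catalog metadata: slug -> (last_updated, refresh_cadence_days).
catalog = {
    "gold-prices": (date(2024, 1, 10), 7),      # weekly refresh expected
    "country-codes": (date(2023, 12, 1), 365),  # yearly refresh is fine
}

def stale_datasets(catalog, today):
    """Return datasets whose last update is older than their refresh cadence."""
    return [
        slug
        for slug, (last_updated, cadence_days) in catalog.items()
        if today - last_updated > timedelta(days=cadence_days)
    ]

print(stale_datasets(catalog, today=date(2024, 6, 1)))
```

A check like this, run on a schedule, also powers the visible “last updated” / “data current through” marker and flags which scraping scripts to fix first (weighted by the traffic-share question above).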
    • Dataset expansion (systematic publishing)

      • What does “add a lot more datasets systematically” mean in practice?

        • What sources/areas are you prioritizing (long tail; “ordinary data”; competitor catch-up)?
      • What internal tooling/workflow is required to publish more datasets?

        • How much of it is manual vs scripted?
        • What is the bottleneck: discovery, scraping, cleaning, metadata, publishing?
      • What is the expected payoff loop?

        • More datasets → more SEO landings → more high-intent downloads → more signups/premium asks
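The payoff loop can be sanity-checked with back-of-envelope arithmetic. Every number below is an illustrative assumption, not a measured value (only the ~20–50% download-rate range comes from the source):

```python
# Illustrative assumptions (not measured): what a batch of new
# long-tail datasets might contribute per month via SEO landings.
new_datasets = 50
visits_per_dataset = 100   # assumed monthly SEO landings each
download_rate = 0.30       # mid-range of the observed ~20-50%
signup_rate = 0.05         # assumed share of downloaders who sign up

visits = new_datasets * visits_per_dataset
downloads = visits * download_rate
signups = downloads * signup_rate

print(f"{visits} visits -> {downloads:.0f} downloads -> {signups:.0f} signups/month")
```

The point is not the specific numbers but the shape: each stage multiplies, so the expansion bet only pays off if the earlier funnel stages (signup capture, premium follow-up) are in place to catch the resulting intent.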
    • Strategic direction (secondary but shaping)

      • How much should “Substack for Data” / third-party publishers influence near-term choices?

        • What foundation (identity, publishing workflow, permissions) would be required later?
        • Which near-term steps are “no-regrets” foundations for that future (e.g., authorship, profiles, signup, analytics)?
    • Prioritization, sequencing, and effort hypothesis

      • What is the smallest set of actions that plausibly unlocks the next round of learning?

        • Instrumentation + one funnel change + fix one high-traffic stale dataset (as an example pattern)
      • What is the expected effort level for each front?

        • Analytics / funnel work
        • Premium responsiveness process
        • Fixing top scraping scripts
        • Adding new datasets
      • What are the leading indicators to decide whether to “invest time and energy” further?

        • Signup capture rate improvement
        • Premium inquiries handled + conversion to calls
        • Revenue movement (even small)
        • Maintenance reliability on top datasets