Motivation / SCQH

Situation

  • DataHub has substantial existing distribution:

    • ~500k visitors (and ~20 years of SEO history)
    • Hundreds of datasets
    • High-intent behavior: ~20–50% download rates on datasets
  • Current operating posture:

    • Downloads do not require signup → limited capture of who users are / what they want
    • A “Premium” offer exists, but inbound interest is not handled reliably or consistently
  • Current revenue signal:

    • One notable one-off customer
    • ~3–4 recurring customers, mostly logistics-related
    • Roughly $200–300 MRR (a ~$500 annual figure has also been referenced)

Complication

  • Under-exploiting the existing asset (traffic + intent):

    • Weak measurement and understanding of conversion (beyond early analytics experiments)
    • Limited funnel progression: awareness → consideration → “conversion” (defined at least as signup, not only payment)
    • Limited user capture (no signup for download), and limited reliable follow-up on premium intent
  • Dataset quality + coverage issues:

    • Some “core datasets” are out of date despite meaningful traffic (example: gold prices)
    • Need to fix/maintain scraping scripts and update pipelines for key datasets
    • Not adding new datasets systematically to leverage SEO/distribution (esp. long-tail)
  • Strategy tension in the background:

    • “Substack for Data” / third-party publishers is a possible direction, but currently secondary to exploiting the existing site and catalog
  • Open uncertainty about focus and sequencing:

    • Many possible fronts (conversion instrumentation, conversion improvement, monetization validation, dataset maintenance, dataset expansion)
    • Need a working hypothesis for what to do first, when, and with what effort/expected return

Question

  • Top-level question candidates (choose one, or hold several as working hypotheses):

    • Q1: What is the best near-term strategy to convert existing high-intent traffic into measurable relationships (signup) and validated revenue, while improving dataset freshness?
    • Q2: Given limited bandwidth, what sequencing of “conversion system” work vs “dataset maintenance/expansion” maximizes learning and impact over the next cycle?
    • Q3: What is the minimum viable operating model that reliably turns DataHub’s SEO traffic into (a) updated high-value datasets and (b) a monetizable funnel—without over-investing upfront?
  • Sub-questions (structured as an issue tree):

    • Conversion measurement and baseline

      • What is the current baseline of “conversion” at each stage?

        • What counts as conversion right now (downloads only? any existing signup?)
        • What is the current dataset-level download rate distribution (since you see ~20–50% overall)?
      • How are we currently instrumenting conversion?

        • What events are tracked now (page view, dataset view, download click, outbound, etc.)?
        • Where are the gaps (e.g., download events not reliably captured; no identity capture)?
      • What baseline targets would be meaningful (near-term)?

        • What would “better” look like: a higher download rate, a higher signup rate, more premium inquiries, or all three?
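The instrumentation and baseline questions above can be made concrete with a minimal sketch. All event names (`dataset_view`, `download_click`, `signup`) and the in-memory event log are illustrative assumptions; the real analytics backend will differ:

```python
from collections import Counter

# Hypothetical event log: (event_name, dataset_slug) tuples.
# In practice these would come from the site's analytics backend.
events = [
    ("dataset_view", "gold-prices"),
    ("dataset_view", "gold-prices"),
    ("download_click", "gold-prices"),
    ("dataset_view", "country-codes"),
    ("download_click", "country-codes"),
    ("signup", "country-codes"),
]

def funnel_rates(events):
    """Compute per-dataset view -> download -> signup rates."""
    counts = {}
    for name, slug in events:
        counts.setdefault(slug, Counter())[name] += 1
    rates = {}
    for slug, c in counts.items():
        views = c["dataset_view"]
        rates[slug] = {
            "views": views,
            "download_rate": c["download_click"] / views if views else 0.0,
            "signup_rate": c["signup"] / views if views else 0.0,
        }
    return rates

print(funnel_rates(events))
```

Even this toy version surfaces the dataset-level download-rate distribution the baseline question asks about, and makes the measurement gaps (no identity capture, unreliable download events) explicit as missing event types.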
    • Funnel improvement (awareness → consideration → conversion-as-signup)

      • If “conversion” includes signup, what is the minimal signup capture that doesn’t harm downloads?

        • Do we keep frictionless downloads and add an optional/light capture?
        • Or gate certain actions (e.g., bulk download / API / freshest data) behind signup?
      • What are the most plausible levers to increase conversion without inventing new product surface area?

        • Improve dataset pages for clarity/trust (metadata completeness; freshness indicators)
        • Better calls-to-action around signup/premium on high-intent pages
      • Where is intent highest (by topic/category)?

        • Logistics datasets (days-of-week, geographic info, etc.)
        • “Gold prices” style high-traffic datasets with freshness sensitivity
      • What should be the immediate objective function?

        • Maximize signup capture?
        • Maximize premium inquiries?
        • Maximize successful downloads while capturing attribution?
    • Premium and monetization responsiveness

      • What is the current premium offer and how is it presented?

        • What are users currently being offered (and on which pages)?
      • What breaks today in responding to premium interest?

        • Where do inquiries land (email? form?) and what is the failure mode (latency, ownership, process)?
      • What is the minimum reliable workflow to respond consistently?

        • Ownership, SLA, templated responses, qualification questions
      • What is the “validation” step for demand?

        • Which signals count (inbound asks, conversion to calls, paid pilots, upgrades)?
      • How do current paying customers map to dataset categories?

        • Especially the logistics cluster: what are they actually paying for?
    • Core dataset freshness and maintenance

      • What proportion of “core datasets” are out of date?

        • By traffic share (not by count): which outdated datasets matter most because they drive significant visits/downloads?
      • What is broken in the update pipeline?

        • Scraping scripts: which ones are failing; how often; why?
        • Data ingestion/refresh cadence: what is desired vs current?
      • What is the minimal operational standard for freshness?

        • A defined refresh schedule for top datasets
        • A visible “last updated” and/or “data current through” marker
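The minimal operational standard above can be operationalized as a staleness check over catalog metadata. This is a sketch under assumptions: the `last_updated` field and per-dataset refresh cadence are hypothetical names, not existing DataHub schema:

```python
from datetime import date, timedelta

# Hypothetical catalog metadata: slug -> (last_updated, refresh_cadence_days).
catalog = {
    "gold-prices": (date(2024, 1, 10), 7),      # weekly refresh expected
    "country-codes": (date(2023, 12, 1), 365),  # yearly refresh is fine
}

def stale_datasets(catalog, today):
    """Return datasets whose last update is older than their refresh cadence."""
    return [
        slug
        for slug, (last_updated, cadence_days) in catalog.items()
        if today - last_updated > timedelta(days=cadence_days)
    ]

print(stale_datasets(catalog, today=date(2024, 6, 1)))
```

A check like this, run on a schedule, also powers the visible “last updated” / “data current through” marker and flags which scraping scripts to fix first (weighted by the traffic-share question above).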
    • Dataset expansion (systematic publishing)

      • What does “add a lot more datasets systematically” mean in practice?

        • What sources/areas are you prioritizing (long tail; “ordinary data”; competitor catch-up)?
      • What internal tooling/workflow is required to publish more datasets?

        • How much of it is manual vs scripted?
        • What is the bottleneck: discovery, scraping, cleaning, metadata, publishing?
      • What is the expected payoff loop?

        • More datasets → more SEO landings → more high-intent downloads → more signups/premium asks
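The payoff loop can be sanity-checked with back-of-envelope arithmetic. Every number below is an illustrative assumption, not a measured value (only the ~20–50% download-rate range comes from the source):

```python
# Illustrative assumptions (not measured): what a batch of new
# long-tail datasets might contribute per month via SEO landings.
new_datasets = 50
visits_per_dataset = 100   # assumed monthly SEO landings each
download_rate = 0.30       # mid-range of the observed ~20-50%
signup_rate = 0.05         # assumed share of downloaders who sign up

visits = new_datasets * visits_per_dataset
downloads = visits * download_rate
signups = downloads * signup_rate

print(f"{visits} visits -> {downloads:.0f} downloads -> {signups:.0f} signups/month")
```

The point is not the specific numbers but the shape: each stage multiplies, so the expansion bet only pays off if the earlier funnel stages (signup capture, premium follow-up) are in place to catch the resulting intent.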
    • Strategic direction (secondary but shaping)

      • How much should “Substack for Data” / third-party publishers influence near-term choices?

        • What foundation (identity, publishing workflow, permissions) would be required later?
        • Which near-term steps are “no-regrets” foundations for that future (e.g., authorship, profiles, signup, analytics)?
    • Prioritization, sequencing, and effort hypothesis

      • What is the smallest set of actions that plausibly unlocks the next round of learning?

        • Instrumentation + one funnel change + fix one high-traffic stale dataset (as an example pattern)
      • What is the expected effort level for each front?

        • Analytics / funnel work
        • Premium responsiveness process
        • Fixing top scraping scripts
        • Adding new datasets
      • What are the leading indicators to decide whether to “invest time and energy” further?

        • Signup capture rate improvement
        • Premium inquiries handled + conversion to calls
        • Revenue movement (even small)
        • Maintenance reliability on top datasets