Push Data, Get Interface: Evolving DataHub into an AI-Native Data API

Context

I’ll recreate what I’ve heard so far, and you can correct me, Anu.

One question was: how can Luccas get involved in updating datasets on DataHub?

The background context is that we see an opportunity emerging. There is growing usage of tools like Supabase or Resend. We think this is driven by AI because AI is familiar with those tools: they are well-documented, CLI/API-oriented, and integrate cleanly with the React ecosystem and tooling AI tends to use to build apps.

That suggests that one of the things AI needs to do is store and share data, and particularly use data efficiently. AI is very inefficient in terms of tokens if you ask it to parse a 10MB CSV. It might use tools, but this is not optimal. Just as AI could set up its own SMTP server but instead uses Resend because it provides a clean API, we want to be that API for handling data: storing it, shaping it, and making it ready for analysis.

This is an AI-oriented framing. It could be for big data, but especially for small data.

That leads to the question: how do we evolve DataHub in this direction? What does the development process look like? What should we build?

My summary is that I would develop two tools.

First, a tool for organizing data-wrangling workflows. The wrangling would mostly be done by AI, but a “Data Presser” assistant would provide the skills and infrastructure, encapsulating best practices and shaping the workflow.

Second, a command-line tool for publishing to DataHub and interacting with DataHub. That led to the question: what does the DataHub API look like that this CLI would use?

Boiled down: I want to push a file or files and get back a useful interface.

By “useful interface,” I mean:

  • A UI for humans to consume the data.
  • An API for tools or AI to query and use the data.

Crudely: I want to push data and get a CSV viewer and query interface. For AI, I want something like a DuckDB interface with a pipe to it.

I can run DuckDB locally, but I want this in the cloud. I want to share it, reproduce workflows, move data around. I also want useful extras: push data and automatically get summary statistics and general information about the dataset.
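The “useful extras” idea — push data, automatically get summary statistics back — can be sketched with a tiny profiler. This is a minimal illustration in stdlib Python; the shape of the profile output is an assumption, not a committed API.

```python
import csv
import io
import statistics

def profile_csv(text: str) -> dict:
    """Compute per-column summary stats for a CSV string.

    Illustrative sketch of the auto-profile a push could return;
    the exact output shape here is an assumption, not a defined API.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {"row_count": len(rows), "columns": {}}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        try:
            nums = [float(v) for v in values]
            profile["columns"][col] = {
                "type": "numeric",
                "min": min(nums),
                "max": max(nums),
                "mean": statistics.mean(nums),
            }
        except ValueError:
            profile["columns"][col] = {
                "type": "text",
                "distinct": len(set(values)),
            }
    return profile

if __name__ == "__main__":
    sample = "city,temp\nBerlin,7\nParis,9\nBerlin,5\n"
    print(profile_csv(sample))
```

In the cloud version, this computation would run server-side at push time, so the client never needs Pandas or NumPy installed locally.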

Right now, if I’m wrangling data, I need to install Pandas, NumPy, and other dependencies. I don’t want to do that. If I can push data to the cloud and wrangle it there, that’s compelling.

For prototyping, I would probably ignore DataHub.io initially and build a minimal stack on R2 with just the minimal AI API I want. That allows rapid iteration. Once stable, I would back-port it into DataHub.


Distilled Key Points

Situation

  • AI-native tools (Supabase, Resend) succeed because they are API-first, well-documented, and CLI-oriented.
  • AI needs to store, share, and efficiently query data.
  • Current workflows (e.g. Pandas + local installs) are friction-heavy.

Complication

  • DataHub is not yet optimized as an AI-first data API.
  • AI usage of raw files (e.g. large CSVs) is token-inefficient.
  • There is no minimal “push data → get UI + query API” pipeline.

Core Proposition

Build DataHub as the “Resend for data”:

  • Push files.

  • Automatically get:

    • Human UI (viewer, summary stats).
    • Machine API (queryable, e.g. DuckDB interface in the cloud).

Proposed Architecture Direction

  1. Data Presser Assistant

    • AI-driven wrangling workflows.
    • Encodes best practices.
    • Organizes transformation pipelines.
  2. DataHub CLI

    • Push files.

    • Receive:

      • Web UI.
      • Query endpoint (DuckDB-like).
      • Basic auto-profiling.
  3. Prototype Strategy

    • Build minimal stack on R2.
    • Implement minimal AI-facing API.
    • Iterate quickly.
    • Back-port into DataHub core later.

Immediate Next Action Questions

Keep these tight:

  1. What is the minimal contract of “push → get back”?

    • Viewer only?
    • Viewer + SQL endpoint?
    • Viewer + SQL + profiling?
  2. What is the minimal query interface?

    • REST SQL endpoint?
    • DuckDB-over-HTTP?
    • Something even simpler?
  3. Is the first target user:

    • AI agent?
    • CLI power user?
    • Human data explorer?
  4. What is the smallest viable demo?

    • Upload CSV → get auto-profile + SQL endpoint in <60 seconds?

If you answer those four, the prototype spec will almost write itself.

Refined MVP

Clarified Answers

1. Minimal Contract (“push → get back”)

  • Push CSV

  • Get back:

    • Web viewer (table UI)
    • SQL endpoint (queryable)
    • Optional auto-profile (summary stats)

That is sufficient and coherent.
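The contract above could be expressed as a single response object that the CLI or an AI agent receives after a push. A sketch of that shape, assuming JSON-over-HTTP; every field name and URL here is hypothetical, not a defined DataHub API.

```python
# Hypothetical response to a push; all field names and URLs are
# illustrative, not a defined DataHub API.
push_response = {
    "dataset": "data.csv",
    "viewer_url": "https://example.org/d/abc123",       # human UI (table viewer)
    "query_url": "https://example.org/d/abc123/query",  # POST SQL here
    "profile": {                                        # optional auto-profile
        "row_count": 1200,
        "columns": {"city": {"type": "text"}, "temp": {"type": "numeric"}},
    },
}

def is_minimal_contract(resp: dict) -> bool:
    """Check the minimal contract: viewer + query endpoint; profile optional."""
    return "viewer_url" in resp and "query_url" in resp

print(is_minimal_contract(push_response))
```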


2. Minimal Query Interface

Given speed and simplicity:

  • HTTP SQL endpoint: POST /query with a SQL statement → returns JSON. (Backed by DuckDB internally.)

This avoids DuckDB-over-HTTP complexity while remaining:

  • AI-friendly
  • CLI-friendly
  • Easy to document
  • Easy to iterate

You can evolve toward DuckDB-over-HTTP later if needed.
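The core of the POST /query endpoint — SQL in, JSON out — can be sketched without any of the cloud pieces. This uses stdlib sqlite3 as a stand-in for the DuckDB runtime (same idea, different engine); the endpoint path and response shape are assumptions.

```python
import json
import sqlite3

def run_query(conn: sqlite3.Connection, sql: str) -> str:
    """Execute SQL and return the JSON payload a POST /query endpoint
    would serve. sqlite3 stands in for DuckDB here; the
    {"columns": [...], "rows": [...]} shape is an assumption."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    return json.dumps({"columns": columns, "rows": rows})

if __name__ == "__main__":
    # Simulate a pushed dataset loaded into the query engine.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (city TEXT, temp REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [("Berlin", 7), ("Paris", 9)])
    print(run_query(conn, "SELECT city, temp FROM t ORDER BY temp"))
```

Wrapping this function behind an HTTP handler is the only extra step; the contract stays identical when the engine is swapped to DuckDB.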


3. First Target User

  1. CLI power user
  2. Then AI agent

Correct sequencing. If the CLI works cleanly, AI will work automatically.


4. Smallest Viable Demo

Strong candidate:

Upload CSV → get:

  • URL to viewer
  • /query endpoint
  • Auto profile JSON in <60 seconds

Even smaller:

Upload CSV → instant viewer

That is the absolute minimal wedge.

Streaming UI while uploading is attractive but non-essential for MVP. It adds complexity without increasing learning value.
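The “upload CSV → instant viewer” wedge is, at its core, CSV-to-HTML-table rendering. A minimal sketch of that core; a real viewer would add paging, sorting, and type-aware formatting.

```python
import csv
import html
import io

def csv_to_table(text: str) -> str:
    """Render a CSV string as an HTML table — the kernel of the
    'instant viewer' wedge. Deliberately minimal: no paging,
    no sorting, no type inference."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return "<table></table>"
    head = "".join(f"<th>{html.escape(c)}</th>" for c in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        for row in rows[1:]
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"

if __name__ == "__main__":
    print(csv_to_table("city,temp\nBerlin,7\n"))
```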


Refined MVP Definition

DataHub Lite v0

datahub push data.csv

Returns:

  • Viewer URL
  • Query endpoint
  • Profile summary

Backed by:

  • R2 storage
  • DuckDB runtime
  • Simple HTTP query API
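End to end, datahub push just ties these pieces together. A hypothetical sketch of the flow, with the R2 upload mocked as a local write; the function name, storage layout, and URLs are all assumptions, not a real service.

```python
import pathlib
import tempfile
import uuid

def push(path: str, store: pathlib.Path) -> dict:
    """Sketch of `datahub push data.csv`: 'upload' the file (a local
    write standing in for R2), mint a dataset ID, and return the
    push contract. URLs and field names are illustrative only."""
    data = pathlib.Path(path).read_bytes()
    dataset_id = uuid.uuid4().hex[:12]
    store.mkdir(parents=True, exist_ok=True)
    (store / dataset_id).write_bytes(data)   # stand-in for the R2 upload
    base = f"https://example.org/d/{dataset_id}"
    return {
        "viewer_url": base,                  # human UI
        "query_url": f"{base}/query",        # machine API
        "profile_url": f"{base}/profile",    # auto-profile
    }

if __name__ == "__main__":
    store = pathlib.Path(tempfile.mkdtemp())
    src = store / "data.csv"
    src.write_text("city,temp\nBerlin,7\n")
    print(push(str(src), store))
```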

Strategic Note

You have now defined something very sharp:

“Resend for tabular data.”

Push file → instantly usable in browser + via SQL.

That is clear, differentiated, and AI-native.


One Final Design Question (Critical)

Is the dataset:

  • Immutable after push?
  • Or updatable/replaceable?

Immutability simplifies everything and accelerates shipping.
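Immutability pairs naturally with content addressing: if the dataset ID is a hash of the bytes, an “update” is simply a new dataset, and viewer URLs, query results, and caches can never go stale. A sketch of the idea; the ID scheme is an assumption, not a design decision.

```python
import hashlib

def dataset_id(data: bytes) -> str:
    """Content-addressed dataset ID: same bytes, same ID, forever.
    An assumed scheme, illustrating why immutability simplifies
    caching, sharing, and reproducibility."""
    return "ds-" + hashlib.sha256(data).hexdigest()[:12]

v1 = dataset_id(b"city,temp\nBerlin,7\n")
v2 = dataset_id(b"city,temp\nBerlin,8\n")  # an "update" yields a new dataset
print(v1, v2, v1 != v2)
```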

Raw

Prompt

Questions were asked at the moment. So maybe I'll try and recreate what I've heard so far, and you can correct me as well, Anu. So one question I've heard is, the prompt had been she was like, how can Lucas get involved in updating data sets on Data Hub? And the other background context was, we see this opportunity emerging of like, there's growing usage of tools like Superbase or Resend. And this is, we think, driven by AI because AI is kind of familiar with those tools because they're well-documented, they have an orientation to kind of CLI, API usage and integration, and also with like the React ecosystem and tooling that AI tends to use to build apps, right? And so that makes us think, like, well, one of the things AI is needs to do is store data and share data, maybe. And in particularly, actually efficiently use data, because one of the other things for AI is like, one of the things we've been doing recently is like, I want to plug into data, but AI is actually very inefficient in terms of using tokens if you get it to like kind of parse a, you know, a 10 megabyte CSV file probably. You know, it might use certain tools, but putting these things, so one of the things AI is want to do is like, it could work out a way to set up its own SMTP server, but it uses Resend instead because Resend is an API that kind of takes care of all of that for you. So similarly, we want to kind of be that API that takes care of like handling data, storing data, but like maybe allowing it to make it ready for kind of analysis of certain kinds, right? And this is like this kind of AI orientation, particularly maybe, you know, it could be for really big data, but for small data. So this brought us to the experience of like how would we develop DataHub to be oriented that way, in general. 
And I also make the point that, by the way, I wanted to make a side point in my experience that when AI, the way AI wants to use something is also quite a good way that many users want to use something when it goes to look for the docs, the way it wants to use tools. Maybe its command-line orientation is a bit different, but as an actually underlying kind of bare bones on which you could build a UI, it's actually often very good at, like, what it wants is actually quite sensible, right? And it's often what users are looking for. And so the question then was, like, how do we iterate? Just as I, the thing that I then ran with, which I think is a question you didn't ask, but I was like, how would I evolve DataHub in this direction, both in terms of process, what way would I do the development work to kind of get there? And, you know, what should we, what should we, what should we have? And, you know, as a summary, I would say, well, I would develop two tools. One at the moment, one is like really a tool for kind of organizing data wrangling workflows, the data wrangling mostly being done by AI, but like where this kind of data presser assistant provides the skills, kind of infrastructure, you know, shaping what's going on. And, you know, maybe just even kind of encapsulating knowledge about best practice of how we do this. And the other would be, you know, even though I say it's maybe a tool or it's a repo or kind of set of skills, whatever. And then the other end, there's like how do I, there's a command line tool for publishing to DataHub and interacting with DataHub. And then there's also like, there was a discussion that we evolved into of like, what does DataHub look like? What does the API to DataHub look like that that command line tool will use? And, you know, if I were to boil it down into a kind of a sentence or two, I basically want to be able to push file or files. And get back useful interface. 
Now useful, and that useful interface, so I want to kind of give it, give it sticky data or content and data, and I get back a UI for humans to consume or look at, and I want to get back an API that's useful for data, sorry, for, you know, tools or AI to use around that data. I want to be able to kind of query or use it in some way. You know, to put that most crudely, I want to push up data, and, you know, I want to get a CSV viewer and kind of queryer if I'm pushing a CSV, and for an API or an AI, you know, for an AI, I'd want something which allows me to basically have a DuckDB interface, you know, just kind of give me DuckDB with a kind of pipe to it, you know. And that would be kind of the kind of thing that I would want. And, you know, I might wanna be able to do something like, I can run DuckDB locally, but what I wanna do this in the cloud, well, I wanna share it with other people. I wanna, you know, I wanna be able to reproduce what I'm doing. I wanna move data around in this way. I might also want other certain useful things that it would do for me, like I push some data, and it's like this, it will give me a bunch of summary statistics and just kind of general info about my data, which is kind of helpful. And, you know, the thing that I'm starting to find, like, the other day, like, I've got some data wrangling I'm doing, and I need to install Pandas and, you know, I'm gonna fucking install NumPy and Pandas and all kinds of stuff. I don't wanna do that. You know, if I can just push it to the cloud and like wrangle it there, that's kind of cool. So, uh, You know, I, I don't know, that, that was like a recreation of some of your questions and then kind of some of where I was going. The final point was like, if we wanted to build that out, I'd almost like, for the moment, for a prototype, lose DataHub.io, create my own little, like, mini stack built on R2 with just like the minimal AI API I want because I can iterate very quickly with that. 
And then when that's stable, I would back pull that into DataHub. Yeah? Okay, can you create a tidy transcript of what I just said, using all of my original language, but removing filler words, adding some formatting? And then can you distill very concisely the key points that I'm making, probably in a maybe situation-complication-question or just in a minimal way, and what I'm suggesting as next actions or any next prompt questions. Keep them simple that I should be addressing.

Answering the 4 questions

answering the four questions asked so far, I think the minimal contract is push and get back a viewer plus an SQL endpoint or some kind of endpoint. What's the minimal query interface? I think probably it would be, I don't know if DuckDB over HTTP is possible. The first kind of tech target user, I think, is the CLI power user, followed by the AI agent. And what's the smallest viable demo is upload CSV and get maybe an auto profile endpoint in less than 60 seconds. That would be cool. Yeah. Even easier, the other one would be push and get a UI, you know, push a CSV and get a UI back in 60 seconds or even faster. What would be cool is, you know, is even already working as the CSV is streaming in, but that's super fancy.