Dataset Push API (Direct Upload, No GitHub)
Date: 2026-03-06
Context: Internal-only API for the datahub.io team to push datasets programmatically without a GitHub repository. Intended to be used by the datapressr CLI tool.
Motivation
The current publication/post workflow requires a GitHub repository as the source of truth. For automated dataset publishing pipelines (e.g., datapressr), we want to push dataset files directly to R2 and create the necessary DB records, bypassing GitHub and Inngest entirely.
Auth
Single ADMIN_API_KEY environment variable. All requests include:
Authorization: Bearer <ADMIN_API_KEY>
No DB model, no key management UI. Set in .env, share with the team manually.
Schema Changes
Make both GitHub fields nullable (one migration, safe — all existing rows already have values):
ghRepository String? @map("gh_repository") // null = direct push
ghBranch String? @map("gh_branch") // null = direct push; R2 key always uses "main"
API Endpoints
Create publication
POST /api/v1/publications
Body: { slug: string, name?: string }
Returns: { id, slug }
Create dataset
POST /api/v1/publications/:slug/datasets
Body: { name: string, title?: string, description?: string }
Returns: { id, name }
Creates a Post with ghRepository=null, ghBranch=null. No webhook, no Inngest event.
Upload file (get presigned URL)
POST /api/v1/publications/:slug/datasets/:name/files
Body: {
path: string, // e.g. "data/gdp.csv"
size: number, // bytes
contentType: string, // MIME type
content?: string // raw content, required for datapackage.json
}
Returns: { uploadUrl: string } // presigned R2 URL, expires 1hr
- Creates (or updates) a Blob record with syncStatus: SUCCESS
- If content is provided and path matches datapackage.json, parses the JSON and stores it as blob.metadata (required for page rendering)
- CLI uploads file bytes directly to the returned presigned URL
- Revalidates Next.js cache tags
Delete file
DELETE /api/v1/publications/:slug/datasets/:name/files/*path
Deletes Blob record + R2 file.
Delete dataset
DELETE /api/v1/publications/:slug/datasets/:name
Deletes Post + all Blobs + entire R2 prefix ({postId}/).
File Upload Flow
1. POST /api/v1/publications/:slug/datasets/:name/files
{ path: "datapackage.json", size: 512, contentType: "application/json", content: "{...}" }
→ { uploadUrl: "https://r2...?X-Amz-Signature=..." }
2. PUT {uploadUrl} ← CLI uploads directly to R2
Body: raw file bytes
(repeat for each file; datapackage.json must be sent with content field)
No publish/finalize step. Page is live once files are in R2 and Blob records exist.
Key Constraint
blob.metadata.datapackage must be populated for dataset pages to render (the DataPackageLayout reads from it). This means the CLI must send datapackage.json content inline in the files request. All other files (CSVs, READMEs, etc.) can omit content.
R2 Key Structure
Direct uploads follow the same structure as GitHub-synced datasets, using "main" as the branch:
{postId}/main/raw/{path}
ghBranch is stored as null in the DB to distinguish direct-push posts from GitHub-synced ones, but the R2 key always uses "main" as the branch segment. The upload handler hardcodes "main" when constructing the R2 key.
What Doesn't Change
- GitHub-based sync flow is untouched
- The /api/raw/ endpoint serves files for any post (works for direct uploads)
- Dataset page rendering (DataPackageLayout, ProgrammaticAccessSection) reads from Blobs and works immediately once records exist
- Inngest is not involved
CLI Usage (datapressr)
# One-time: create publication
dpr publish init --slug my-org
# Create dataset
dpr publish create my-org/world-gdp --title "World GDP Data"
# Push all files from local directory
dpr publish push my-org/world-gdp ./data/
→ walks directory
→ sends datapackage.json with content inline
→ gets presigned URL for each file
→ uploads directly to R2
Out of Scope
- API key management UI
- Per-publication key scoping
- Publish/finalize step
- Support for non-team users (this is internal only)