Dataset Push API (Direct Upload, No GitHub)

Date: 2026-03-06

Context: Internal-only API for the datahub.io team to push datasets programmatically without a GitHub repository. Intended to be used by the datapressr CLI tool.


Motivation

The current publication/post workflow requires a GitHub repository as the source of truth. For automated dataset publishing pipelines (e.g., datapressr), we want to push dataset files directly to R2 and create the necessary DB records, bypassing GitHub and Inngest entirely.


Auth

Single ADMIN_API_KEY environment variable. All requests include:

Authorization: Bearer <ADMIN_API_KEY>

No DB model, no key management UI. Set in .env, share with the team manually.
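
A minimal sketch of the bearer check, assuming a Node.js runtime. The function name isAuthorized and the SHA-256 step are illustrative assumptions, not existing code. Hashing both values first gives timingSafeEqual the equal-length buffers it requires, so the comparison does not leak the key through timing.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper: true only when the Authorization header carries
// the expected admin key as a Bearer token.
function isAuthorized(
  authorizationHeader: string | null | undefined,
  expectedKey: string | undefined,
): boolean {
  // Fail closed if ADMIN_API_KEY is unset or the header is missing.
  if (!expectedKey || !authorizationHeader) return false;

  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false;
  const token = authorizationHeader.slice(prefix.length);
  if (token.length === 0) return false;

  // Hash both sides so the buffers have equal length, as timingSafeEqual requires.
  const presented = createHash("sha256").update(token).digest();
  const expected = createHash("sha256").update(expectedKey).digest();
  return timingSafeEqual(presented, expected);
}
```

A route handler could call this as isAuthorized(request.headers.get("authorization"), process.env.ADMIN_API_KEY) and respond with 401 when it returns false.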


Schema Changes

Make both GitHub fields on Post nullable (one migration; safe, because relaxing a column to nullable leaves existing rows untouched and they all already have values):

ghRepository  String?  @map("gh_repository")   // null = direct push
ghBranch      String?  @map("gh_branch")        // null = direct push; R2 key always uses "main"

API Endpoints

Create publication

POST /api/v1/publications
Body: { slug: string, name?: string }
Returns: { id, slug }

Create dataset

POST /api/v1/publications/:slug/datasets
Body: { name: string, title?: string, description?: string }
Returns: { id, name }

Creates a Post with ghRepository=null, ghBranch=null. No webhook, no Inngest event.

Upload file (get presigned URL)

POST /api/v1/publications/:slug/datasets/:name/files
Body: {
  path: string,          // e.g. "data/gdp.csv"
  size: number,          // bytes
  contentType: string,   // MIME type
  content?: string       // raw content, required for datapackage.json
}
Returns: { uploadUrl: string }  // presigned R2 URL, expires in 1 hour
  • Creates (or updates) a Blob record with syncStatus: SUCCESS
  • If content is provided and path matches datapackage.json, parses the JSON and stores it in blob.metadata as blob.metadata.datapackage (required for page rendering)
  • CLI uploads file bytes directly to the returned presigned URL
  • Revalidates Next.js cache tags
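
The datapackage.json handling in the list above could be sketched as follows. The names FileRequestBody and metadataFor are hypothetical. The sketch also assumes that only a datapackage.json at the dataset root counts as a match, and that a missing content field is rejected rather than silently skipped.

```typescript
// Request body fields as listed in the endpoint spec.
interface FileRequestBody {
  path: string;
  size: number;
  contentType: string;
  content?: string;
}

// Assumption: only the root-level file is treated as the data package descriptor.
const DATAPACKAGE_PATH = "datapackage.json";

// Hypothetical helper: returns the value to store in blob.metadata,
// or null when the file carries no metadata.
function metadataFor(req: FileRequestBody): { datapackage: unknown } | null {
  if (req.path !== DATAPACKAGE_PATH) return null;
  if (req.content === undefined) {
    throw new Error("content is required for datapackage.json");
  }
  // JSON.parse throws SyntaxError on malformed input; a handler would
  // presumably turn that into a 4xx response.
  return { datapackage: JSON.parse(req.content) };
}
```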

Delete file

DELETE /api/v1/publications/:slug/datasets/:name/files/*path

Deletes Blob record + R2 file.

Delete dataset

DELETE /api/v1/publications/:slug/datasets/:name

Deletes Post + all Blobs + entire R2 prefix ({postId}/).
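
One detail worth illustrating is the trailing slash in the prefix: without it, deleting post "abc" would also match keys belonging to post "abcd". The sketch below filters an in-memory key listing to show the effect. Both function names are hypothetical, and a real handler would more likely pass the prefix to the bucket's list operation than filter locally.

```typescript
// Hypothetical helper: prefix covering every object that belongs to one post.
function postPrefix(postId: string): string {
  return `${postId}/`;
}

// Selects the keys a dataset delete would remove from a key listing.
function keysToDelete(allKeys: string[], postId: string): string[] {
  const prefix = postPrefix(postId);
  return allKeys.filter((key) => key.startsWith(prefix));
}
```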


File Upload Flow

1. POST /api/v1/publications/:slug/datasets/:name/files
   { path: "datapackage.json", size: 512, contentType: "application/json", content: "{...}" }
   → { uploadUrl: "https://r2...?X-Amz-Signature=..." }

2. PUT {uploadUrl}   ← CLI uploads directly to R2
   Body: raw file bytes

(repeat for each file; datapackage.json must be sent with content field)

No publish/finalize step. Page is live once files are in R2 and Blob records exist.
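
On the CLI side, building the step 1 body could look like the sketch below. FilePushRequest and buildFilePushRequest are hypothetical names. The sketch assumes the file has already been read into memory and that datapackage.json is UTF-8 encoded.

```typescript
// Body for POST .../files, mirroring the fields in the endpoint spec.
interface FilePushRequest {
  path: string;
  size: number;
  contentType: string;
  content?: string;
}

// Hypothetical helper: builds the step 1 body for one file.
// `bytes` is the same payload that step 2 PUTs to the presigned URL.
function buildFilePushRequest(
  path: string,
  bytes: Uint8Array,
  contentType: string,
): FilePushRequest {
  const body: FilePushRequest = { path, size: bytes.byteLength, contentType };
  // Only datapackage.json needs its content inline; other files omit it.
  if (path === "datapackage.json") {
    body.content = new TextDecoder("utf-8").decode(bytes);
  }
  return body;
}
```

Step 2 then sends the same bytes in a PUT to the returned uploadUrl.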


Key Constraint

blob.metadata.datapackage must be populated for dataset pages to render (the DataPackageLayout reads from it). This means the CLI must send datapackage.json content inline in the files request. All other files (CSVs, READMEs, etc.) can omit content.


R2 Key Structure

Direct uploads follow the same structure as GitHub-synced datasets, using "main" as the branch:

{postId}/main/raw/{path}

ghBranch is stored as null in the DB to distinguish direct-push posts from GitHub-synced ones. It does not affect the R2 key: the upload handler hardcodes "main" as the branch segment when constructing it.
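
A minimal sketch of the key construction. The helper name directUploadKey is hypothetical, and the path validation is an added assumption (relative paths only, no empty, "." or ".." segments) so that a file cannot be written outside the post's prefix.

```typescript
// Branch segment used for every direct upload, even though ghBranch is null in the DB.
const DIRECT_UPLOAD_BRANCH = "main";

// Hypothetical helper: builds the R2 object key for a directly uploaded file.
function directUploadKey(postId: string, path: string): string {
  // Assumption: reject absolute paths and segments that could escape the prefix.
  const segments = path.split("/");
  if (segments.some((s) => s === "" || s === "." || s === "..")) {
    throw new Error(`invalid file path: ${path}`);
  }
  return `${postId}/${DIRECT_UPLOAD_BRANCH}/raw/${path}`;
}
```

For a post with ID "p1", directUploadKey("p1", "data/gdp.csv") yields "p1/main/raw/data/gdp.csv".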


What Doesn't Change

  • GitHub-based sync flow is untouched
  • /api/raw/ endpoint serves files for any post (works for direct uploads)
  • Dataset page rendering (DataPackageLayout, ProgrammaticAccessSection) reads from Blobs — works immediately once records exist
  • Inngest is not involved

CLI Usage (datapressr)

# One-time: create publication
dpr publish init --slug my-org

# Create dataset
dpr publish create my-org/world-gdp --title "World GDP Data"

# Push all files from local directory
dpr publish push my-org/world-gdp ./data/
  → walks directory
  → sends datapackage.json with content inline
  → gets presigned URL for each file
  → uploads directly to R2
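
The CLI has to turn an argument such as my-org/world-gdp into the publication slug and dataset name used in the endpoint paths. A sketch of that mapping, with hypothetical names (DatasetRef, parseDatasetRef, filesEndpoint) and a placeholder base URL supplied by the caller:

```typescript
interface DatasetRef {
  slug: string; // publication slug, e.g. "my-org"
  name: string; // dataset name, e.g. "world-gdp"
}

// Hypothetical helper: splits "<publication>/<dataset>" into its two parts.
function parseDatasetRef(ref: string): DatasetRef {
  const [slug, name, ...rest] = ref.split("/");
  if (!slug || !name || rest.length > 0) {
    throw new Error(`expected <publication>/<dataset>, got "${ref}"`);
  }
  return { slug, name };
}

// Builds the files endpoint URL for a dataset.
function filesEndpoint(baseUrl: string, ref: DatasetRef): string {
  const base = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  const slug = encodeURIComponent(ref.slug);
  const name = encodeURIComponent(ref.name);
  return `${base}/api/v1/publications/${slug}/datasets/${name}/files`;
}
```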

Out of Scope

  • API key management UI
  • Per-publication key scoping
  • Publish/finalize step
  • Support for non-team users (this is internal only)