Project Architecture

DataHub.io is a multitenant Next.js app for publishing data and data-driven content. Users create publications (like a Substack newsletter, but for data) and add posts (datasets, Observable notebooks) that are synced from GitHub. Readers can subscribe to publications, like posts, and browse content.

Tech stack

  • Framework: Next.js 14 (App Router) with TypeScript
  • Database: PostgreSQL (Neon on Vercel in prod, Docker locally) with Prisma ORM
  • Storage: Cloudflare R2 (prod/staging), MinIO (local) for raw content files and trees
  • Authentication: NextAuth v4 with GitHub OAuth
  • Background Jobs: Inngest for sync/delete workflows
  • Search: Typesense for content indexing
  • API Layer: tRPC v10 with React Query
  • Styling: Tailwind CSS with Headless UI components
  • Deployment: Vercel
  • Content Processing: Cloudflare Worker (external repo) for markdown parsing and metadata extraction

High-level data flow

  1. User authenticates via GitHub OAuth and creates a publication.
  2. User adds posts to the publication, each linked to a GitHub repository/branch.
  3. A sync is triggered either manually from the dashboard or by a GitHub push webhook received at app/api/webhook/route.ts.
  4. The Inngest sync function pulls files from GitHub, writes raw files to R2/MinIO, and creates/updates Blob rows in Postgres.
  5. Each markdown upload to storage triggers the Cloudflare Worker via a queue.
  6. The worker parses markdown, updates Blob metadata, and indexes in Typesense.
  7. Next.js serves pages at /<publication-slug>/<post-name> using metadata from Postgres and raw content from storage.

Directory structure

app/                        # Next.js App Router pages
  [publication]/             # Public publication pages
    [post]/[[...slug]]/      # Post content renderer (dataset/notebook pages)
    subscribe/               # Publication subscription page
  api/                       # API route handlers
    auth/[...nextauth]/      # NextAuth endpoints
    inngest/                 # Inngest webhook handler
    trpc/[trpc]/             # tRPC endpoint
    webhook/                 # GitHub webhook intake
    generate/                # AI content generation
  dashboard/                 # Authenticated dashboard
    admin/                   # Admin panel
    posts/                   # Post management (new, edit, settings)
    publications/            # Publication management
    subscriptions/           # User subscription management
  home/                      # Marketing/landing pages
    collections/             # Data collections browser
    pricing/                 # Pricing page
    publish/                 # Publish CTA page
    solutions/               # Solution pages
  login/                     # Authentication page
  user/[username]/           # Public user profile page
components/                  # React components
  auth/                      # Authentication components
  form/                      # Form components
  front/                     # Public-facing site components
  icons/                     # Icon components
  layouts/                   # Layout components
  preview/                   # Content preview components
  MDX.tsx                    # MDX compilation component
  mdx-components-factory.tsx # MDX component mappings
middleware.ts                # URL routing for multitenant app (rewrites, auth guards)
lib/                         # Shared utilities
  __tests__/                 # Unit tests for lib functions
  front/                     # Frontend-specific utilities
  hooks/                     # React hooks
  markdown.ts                # Markdown processing pipeline
  content-store.ts           # S3/MinIO storage abstraction
  github.ts                  # GitHub API client helpers
  actions.ts                 # Server actions
  app-config.ts              # Application configuration loader
server/                      # Server-side code
  api/
    routers/                 # tRPC router definitions
      __tests__/             # Router unit tests
      post.ts                # Post CRUD, sync, settings
      publication.ts         # Publication CRUD
      like.ts                # Like/unlike posts
      subscription.ts        # Publication subscriptions
      user.ts                # User profile management
      home.ts                # Homepage data (featured posts, collections)
    root.ts                  # tRPC router composition
    trpc.ts                  # tRPC context and middleware
  auth.ts                    # NextAuth configuration
  db.ts                      # Prisma client singleton
inngest/                     # Background job definitions
  client.ts                  # Inngest client and event types
  functions.ts               # Sync and delete function implementations
prisma/
  schema.prisma              # Database schema
  migrations/                # Prisma migrations
e2e/                         # End-to-end tests (Playwright)
  renderer/                  # Public page rendering tests
  fixtures/                  # Test fixtures and helpers
  global-setup.ts            # Test seeding and setup

Domain model

Entities defined in prisma/schema.prisma:

| Model | Purpose |
| --- | --- |
| User | Authenticated user (GitHub OAuth) |
| Publication | A named publication owned by a user (like a Substack) |
| Post | A content item (dataset or Observable notebook) linked to a GitHub repo, belongs to a Publication |
| Blob | A file record (metadata + sync status) belonging to a Post |
| Like | User-to-Post like relationship |
| PostStat | Daily views/downloads stats per Post |
| PostAuthor | Many-to-many Post-to-User author relationship |
| PublicationSubscription | User subscription to a Publication |
| Account, Session, VerificationToken | NextAuth authentication models |

Key relationships:

  • User 1->N Publication (owner)
  • Publication 1->N Post
  • Post 1->N Blob
  • Post N<->N User (via PostAuthor)
  • Post 1->N Like
  • Post 1->N PostStat
  • Publication 1->N PublicationSubscription

Entity-relationship diagram (Mermaid source):

erDiagram
  User ||--o{ Publication : owns
  User ||--o{ Like : gives
  User ||--o{ PostAuthor : authors
  User ||--o{ PublicationSubscription : subscribes
  Publication ||--o{ Post : contains
  Publication ||--o{ PublicationSubscription : has
  Post ||--o{ Blob : has_files
  Post ||--o{ Like : receives
  Post ||--o{ PostStat : tracks
  Post ||--o{ PostAuthor : has_authors

Post types (enum PostType):

  • DATAPACKAGE - standard markdown/data dataset
  • OBSERVABLE - Observable Framework notebook
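
The entities and relationships above can be pictured as plain TypeScript types. This is only a sketch: the model names, the relations, and the PostType values come from this document, while every field name (slug, ownerId, publicationId, and so on) is an assumption for illustration. prisma/schema.prisma remains the source of truth.

```typescript
// Hypothetical field names throughout; check prisma/schema.prisma for the real shape.
type PostType = "DATAPACKAGE" | "OBSERVABLE";

interface User {
  id: string;
  username: string;
}

interface Publication {
  id: string;
  slug: string;
  ownerId: string; // User 1->N Publication
}

interface Post {
  id: string;
  name: string;
  type: PostType;
  publicationId: string; // Publication 1->N Post
  authorIds: string[]; // Post N<->N User (PostAuthor in the real schema)
}

// Named BlobRecord here only to avoid clashing with the global DOM Blob type.
interface BlobRecord {
  id: string;
  postId: string; // Post 1->N Blob
  path: string;
}

// All posts that belong to one publication.
function postsIn(publication: Publication, posts: Post[]): Post[] {
  return posts.filter((post) => post.publicationId === publication.id);
}

// Public URL path of a post: /<publication-slug>/<post-name>.
function publicPath(publication: Publication, post: Post): string {
  return `/${publication.slug}/${post.name}`;
}
```

For example, a post named gdp in a publication with slug acme would be served at /acme/gdp.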

Major components

Authentication

  • NextAuth config: server/auth.ts
  • OAuth endpoints: app/api/auth/[...nextauth]/route.ts
  • Login page: app/login/page.tsx

Dashboard (authenticated)

  • Layout: app/dashboard/layout.tsx
  • Main dashboard: app/dashboard/page.tsx
  • Publication management: app/dashboard/publications/
  • Post management: app/dashboard/posts/
  • Subscription management: app/dashboard/subscriptions/page.tsx
  • Profile settings: app/dashboard/settings/page.tsx
  • Admin panel: app/dashboard/admin/page.tsx

Public rendering (publication/post pages)

  • Publication index: app/[publication]/page.tsx - lists posts in a publication
  • Post renderer: app/[publication]/[post]/[[...slug]]/page.tsx - renders dataset/notebook content
  • Subscribe page: app/[publication]/subscribe/page.tsx
  • User profile: app/user/[username]/page.tsx
  • MDX compilation: components/MDX.tsx
  • Markdown/MDX processing: lib/markdown.ts
  • MDX component factory: components/mdx-components-factory.tsx
  • Remark plugins (Obsidian wiki-links, embeds, callouts): lib/remark-plugins.tsx

Content sync (Inngest)

  • GitHub client: lib/github.ts
  • Inngest client + events: inngest/client.ts
  • Sync/delete functions: inngest/functions.ts
  • GitHub webhook intake: app/api/webhook/route.ts
  • Inngest handler: app/api/inngest/route.ts

Storage

  • S3/MinIO abstraction: lib/content-store.ts
  • Raw content served via storage redirect

Homepage / marketing

All landing pages live in app/home/ and are served at the root domain via middleware rewrites (e.g. datahub.io/pricing -> app/home/pricing/page.tsx).

  • Landing page: app/home/page.tsx (served at /)
  • Pricing: app/home/pricing/page.tsx (served at /pricing)
  • Publish CTA: app/home/publish/page.tsx (served at /publish)
  • Collections: app/home/collections/page.tsx (served at /collections)
  • Solutions: app/home/solutions/page.tsx (served at /solutions)
    • Sub-pages: logistics/, global-country-region-reference-data/, worldwide-postal-code-database/

Middleware routing (middleware.ts)

The middleware handles all URL routing for the multitenant app:

  • /, /pricing, /collections, /solutions, /publish -> rewritten to app/home/*
  • /@username -> rewritten to app/user/*
  • /<publication>/<post>/_r/-/<file> -> rewritten to raw file API
  • /<publication>/<post>/r/<file> -> rewritten to legacy raw file API
  • /<publication>/<post>/datapackage.json -> rewritten to raw file API
  • /dashboard/* -> auth guard (redirect to /login if unauthenticated)
  • /login -> redirect to /dashboard if already authenticated
  • Custom domains -> rewritten to /_domain={hostname}/*
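
A minimal sketch of these rules as a pure decision function may help when reading middleware.ts. It is not the actual implementation: the RouteDecision type, the decideRoute name, and the /home and /user rewrite targets are assumptions inferred from the directory layout. Custom domains and the legacy /r/ and datapackage.json routes are left out for brevity, and the raw-file case only reports the parsed parts because the raw file API path is not stated here.

```typescript
// Hypothetical decision type; real middleware would return NextResponse objects.
type RouteDecision =
  | { kind: "rewrite"; target: string }
  | { kind: "redirect"; target: string }
  | { kind: "rawFile"; publication: string; post: string; file: string }
  | { kind: "next" };

// Root-domain sections that are served from app/home/.
const HOME_SEGMENTS = ["pricing", "collections", "solutions", "publish"];

// Matches /<publication>/<post>/_r/-/<file>.
const RAW_FILE = /^\/([^/]+)\/([^/]+)\/_r\/-\/(.+)$/;

function decideRoute(pathname: string, isAuthenticated: boolean): RouteDecision {
  const firstSegment = pathname.split("/")[1] ?? "";

  // Auth guard for the dashboard.
  if (firstSegment === "dashboard") {
    return isAuthenticated ? { kind: "next" } : { kind: "redirect", target: "/login" };
  }
  // Signed-in users skip the login page.
  if (pathname === "/login") {
    return isAuthenticated ? { kind: "redirect", target: "/dashboard" } : { kind: "next" };
  }
  // Marketing pages live under app/home/.
  if (pathname === "/") {
    return { kind: "rewrite", target: "/home" };
  }
  if (HOME_SEGMENTS.indexOf(firstSegment) !== -1) {
    return { kind: "rewrite", target: `/home${pathname}` };
  }
  // /@username -> user profile page.
  if (firstSegment.charAt(0) === "@") {
    return { kind: "rewrite", target: `/user/${firstSegment.slice(1)}` };
  }
  // Raw file requests are handed to the raw file API.
  const raw = RAW_FILE.exec(pathname);
  if (raw) {
    return { kind: "rawFile", publication: raw[1], post: raw[2], file: raw[3] };
  }
  return { kind: "next" };
}
```

In a Next.js middleware, the rewrite, redirect, and next decisions would typically map to NextResponse.rewrite, NextResponse.redirect, and NextResponse.next. Keeping the decision logic pure makes it easy to unit test without constructing requests.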

API surface

HTTP route handlers (app/api/)

| Route | File | Purpose |
| --- | --- | --- |
| /api/auth/[...nextauth] | app/api/auth/[...nextauth]/route.ts | NextAuth OAuth endpoints |
| /api/trpc/[trpc] | app/api/trpc/[trpc]/route.ts | tRPC endpoint |
| /api/webhook | app/api/webhook/route.ts | GitHub push webhook |
| /api/inngest | app/api/inngest/route.ts | Inngest function handler |
| /api/generate | app/api/generate/route.ts | AI content generation |

tRPC routers (via /api/trpc/[trpc])

Composed in server/api/root.ts:

| Router | File | Key procedures |
| --- | --- | --- |
| post | server/api/routers/post.ts | CRUD, sync, settings, tree management |
| publication | server/api/routers/publication.ts | Create, update, delete publications |
| like | server/api/routers/like.ts | Like/unlike posts, get like status |
| subscriptions | server/api/routers/subscription.ts | Subscribe/unsubscribe to publications |
| user | server/api/routers/user.ts | Profile management, avatar upload |
| home | server/api/routers/home.ts | Homepage data, featured content |

Sync architecture (Inngest)

Configuration:

  • inngest/client.ts - Event definitions and client setup
  • inngest/functions.ts - Sync and delete functions

Trigger and event flow (Mermaid source):

graph TD
    subgraph "Trigger Sources"
        GH[GitHub Push] -->|Webhook| WH[Webhook Handler]
        UI[Manual UI Trigger] -->|Direct| IE[Inngest Event]
        WH -->|Validate & Filter| IE
    end

    subgraph "Event Processing"
        IE -->|post/sync| SF[Sync Function]
        IE -->|post/delete| DF[Delete Function]
    end

Trigger points

  • Automatic: GitHub push webhooks to the tracked branch, validated by webhook secret.
  • Manual: user-initiated sync from dashboard with force option.
  • Initial sync: on post creation, full repository processing.
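
The "validate and filter" step for automatic triggers could look roughly like the sketch below. It assumes the webhook signature has already been checked against the webhook secret, and it only decides which posts a push should sync. The ref and repository.full_name fields are part of GitHub's push payload and post/sync is the event name used in this document. The TrackedPost shape, the event data fields, and the eventsForPush helper are hypothetical.

```typescript
// Subset of the GitHub push webhook payload that this sketch needs.
interface PushPayload {
  ref: string; // e.g. "refs/heads/main"
  repository: { full_name: string }; // e.g. "owner/repo"
}

// Hypothetical view of a Post row: which repo and branch it tracks.
interface TrackedPost {
  id: string;
  repo: string;
  branch: string;
}

// Hypothetical payload for the post/sync event.
interface SyncEvent {
  name: "post/sync";
  data: { postId: string; force: boolean };
}

const BRANCH_PREFIX = "refs/heads/";

// Return one sync event per post that tracks the pushed repo and branch.
function eventsForPush(payload: PushPayload, posts: TrackedPost[]): SyncEvent[] {
  // Tag pushes and other non-branch refs never trigger a sync.
  if (payload.ref.indexOf(BRANCH_PREFIX) !== 0) return [];
  const branch = payload.ref.slice(BRANCH_PREFIX.length);

  return posts
    .filter((post) => post.repo === payload.repository.full_name && post.branch === branch)
    .map((post) => ({ name: "post/sync" as const, data: { postId: post.id, force: false } }));
}
```

A manual sync from the dashboard would emit the same event directly, presumably with force set according to the user's choice.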

Sync flow

  1. Fetch post details and publication/user info from Postgres.
  2. Update sync status to PENDING.
  3. Load post configuration, include/exclude patterns, validate root directory.
  4. Fetch current tree from storage and repo tree from GitHub API.
  5. Compare SHA hashes to detect changes (early exit if trees match).
  6. For changed files:
    • Filter by supported extensions (.md, .json, .yaml).
    • Apply include/exclude patterns.
    • Upload raw file to R2/MinIO.
    • Parse frontmatter, compute metadata (title, description, URL).
    • Store/update Blob metadata in Postgres.
  7. Handle deletions (remove files from storage, clean metadata).
  8. Upload new tree, update sync status to SUCCESS, revalidate Next.js cache tags.
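
Steps 4 to 7 boil down to a tree diff. The following sketch shows one way to express it, assuming each tree is reduced to a root SHA plus a map from file path to blob SHA. The TreeSnapshot type and diffTrees function are illustrative names rather than the ones in inngest/functions.ts, and include/exclude patterns are omitted.

```typescript
// Hypothetical snapshot shape: root SHA plus path -> blob SHA.
interface TreeSnapshot {
  sha: string;
  files: Record<string, string>;
}

interface TreeDiff {
  changed: string[]; // new or modified files to upload and re-index
  deleted: string[]; // files to remove from storage and metadata
}

// Extensions listed in the sync flow above.
const SUPPORTED_EXTENSIONS = [".md", ".json", ".yaml"];

function isSupported(path: string): boolean {
  return SUPPORTED_EXTENSIONS.some((ext) => path.slice(-ext.length) === ext);
}

function diffTrees(stored: TreeSnapshot, remote: TreeSnapshot): TreeDiff {
  // Early exit: identical root SHAs mean nothing changed.
  if (stored.sha === remote.sha) return { changed: [], deleted: [] };

  // A file changed if it is new or its blob SHA differs from the stored one.
  const changed = Object.keys(remote.files)
    .filter(isSupported)
    .filter((path) => stored.files[path] !== remote.files[path])
    .sort();

  // A file was deleted if it was stored before but is gone from the repo.
  const deleted = Object.keys(stored.files)
    .filter(isSupported)
    .filter((path) => !(path in remote.files))
    .sort();

  return { changed, deleted };
}
```

Comparing blob SHAs rather than file contents means unchanged files never need to be downloaded, which keeps repeated syncs of large repositories cheap.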

Error handling

  • Concurrency limit: 5 per account.
  • Cancellation: on new sync or delete events for the same post.
  • Non-retriable errors: invalid root dir, YAML parse errors, invalid datapackage format.
  • Retriable errors: GitHub API rate limits, network timeouts, storage upload failures.
  • Error messages stored in database, visible in dashboard.
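
The retriable versus non-retriable split can be sketched as a small classifier. The failure codes below are hypothetical labels for the categories listed above and are not identifiers from the codebase. In an Inngest function, the non-retriable cases would typically be signalled by throwing the NonRetriableError class that the inngest package exports, while any other thrown error is retried.

```typescript
// Hypothetical failure codes, one per category named in this section.
type SyncFailureCode =
  | "INVALID_ROOT_DIR"
  | "YAML_PARSE_ERROR"
  | "INVALID_DATAPACKAGE"
  | "GITHUB_RATE_LIMIT"
  | "NETWORK_TIMEOUT"
  | "STORAGE_UPLOAD_FAILED";

// Failures that retrying cannot fix: the repository content itself is wrong.
const NON_RETRIABLE_CODES: readonly SyncFailureCode[] = [
  "INVALID_ROOT_DIR",
  "YAML_PARSE_ERROR",
  "INVALID_DATAPACKAGE",
];

interface SyncFailure {
  code: SyncFailureCode;
  retry: boolean;
  message: string; // stored in the database and shown in the dashboard
}

function classifyFailure(code: SyncFailureCode, message: string): SyncFailure {
  const retry = NON_RETRIABLE_CODES.indexOf(code) === -1;
  return { code, retry, message };
}
```

For instance, classifyFailure("YAML_PARSE_ERROR", "bad frontmatter") yields retry: false, whereas a GITHUB_RATE_LIMIT failure yields retry: true.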

Cloudflare Worker (external repo)

Repository: datahub-cloudflare-workers (cloned alongside this repo, default ../datahub-cloudflare-workers).

Purpose: process markdown uploads from R2/MinIO, parse frontmatter, update Blob metadata, and index content in Typesense.

  • Consumes queue messages from R2 events (prod/staging) or MinIO webhook (dev at /queue).
  • Skips non-markdown files, supports publish: false to delete content.
  • Storage key pattern: {postId}/{branch}/raw/{path}
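
As an illustration of the key pattern, here is a sketch of helpers that build and parse such keys and decide whether a file is markdown. The function names are hypothetical, and the parser assumes that postId contains no slash. A branch name that itself contains /raw/ would be ambiguous under this scheme.

```typescript
interface StorageKeyParts {
  postId: string;
  branch: string;
  path: string;
}

// Build a key following {postId}/{branch}/raw/{path}.
function buildStorageKey({ postId, branch, path }: StorageKeyParts): string {
  return `${postId}/${branch}/raw/${path}`;
}

// Parse a key back into its parts; returns null if it does not match the pattern.
function parseStorageKey(key: string): StorageKeyParts | null {
  // The branch is matched lazily so that branches with slashes (feature/x) still work.
  const match = /^([^/]+)\/(.+?)\/raw\/(.+)$/.exec(key);
  if (!match) return null;
  return { postId: match[1], branch: match[2], path: match[3] };
}

// The worker skips anything that is not markdown (assumed here to mean a .md extension).
function isMarkdown(path: string): boolean {
  return /\.md$/i.test(path);
}
```

For example, buildStorageKey({ postId: "p1", branch: "main", path: "docs/intro.md" }) produces p1/main/raw/docs/intro.md, and parsing that key returns the original parts.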

Queues:

  • Dev: markdown-processing-queue-dev
  • Staging: datahub-markdown-queue-staging
  • Prod: datahub-markdown-queue

External services

| Service | Purpose | Config location |
| --- | --- | --- |
| PostgreSQL (Neon/Vercel) | Users, publications, posts, blobs, metadata | POSTGRES_PRISMA_URL in .env |
| R2 (Cloudflare) / MinIO | Raw markdown files, tree blobs | S3_* vars in .env |
| Typesense | Search indexing | Worker config |
| Inngest | Background sync/delete jobs | inngest/functions.ts |
| GitHub OAuth | Authentication | NEXT_PUBLIC_AUTH_GITHUB_ID, AUTH_GITHUB_SECRET |
| Posthog | Analytics | NEXT_PUBLIC_POSTHOG_KEY |
| Brevo | Email notifications | BREVO_* vars |
| Cloudflare Turnstile | Captcha | TURNSTILE_* vars |

Testing

Unit tests (Jest)

  • Location: lib/__tests__/, server/api/routers/__tests__/, lib/*.test.ts
  • Run: pnpm test
  • Framework: Jest + Testing Library

E2E tests (Playwright)

  • Location: e2e/renderer/ (public page tests)
  • Config: playwright.config.ts
  • Global setup: e2e/global-setup.ts (seeds test data via Inngest)
  • Run: pnpm exec playwright test --project=renderer-chromium
  • See: e2e/README.md for full guide

CI

  • GitHub Actions workflows in .github/workflows/:
    • e2e.yml - E2E tests against Vercel preview deployments
    • lint.yml - ESLint checks
    • unit.yml - Jest unit tests
    • sync-ga-views.yml - Google Analytics view sync