Project Architecture
DataHub.io is a multitenant Next.js app for publishing data and data-driven content. Users create publications (like a Substack newsletter, but for data) and add posts (datasets, Observable notebooks) that are synced from GitHub. Readers can subscribe, like, and browse.
Tech stack
- Framework: Next.js 14 (App Router) with TypeScript
- Database: PostgreSQL (Neon on Vercel in prod, Docker locally) with Prisma ORM
- Storage: Cloudflare R2 (prod/staging), MinIO (local) for raw content files and trees
- Authentication: NextAuth v4 with GitHub OAuth
- Background Jobs: Inngest for sync/delete workflows
- Search: Typesense for content indexing
- API Layer: tRPC v10 with React Query
- Styling: Tailwind CSS with Headless UI components
- Deployment: Vercel
- Content Processing: Cloudflare Worker (external repo) for markdown parsing and metadata extraction
High-level data flow
- User authenticates via GitHub OAuth and creates a publication.
- User adds posts to the publication, each linked to a GitHub repository/branch.
- A sync is triggered (manual or GitHub webhook) via app/api/webhook/route.ts.
- Inngest sync function pulls files from GitHub, writes raw files to R2/MinIO, and creates/updates Blob rows in Postgres.
- Each markdown upload to storage triggers the Cloudflare Worker via queue.
- The worker parses markdown, updates Blob metadata, and indexes in Typesense.
- Next.js serves pages at /<publication-slug>/<post-name> using metadata from Postgres and raw content from storage.
Directory structure
app/ # Next.js App Router pages
  [publication]/ # Public publication pages
    [post]/[[...slug]]/ # Post content renderer (dataset/notebook pages)
    subscribe/ # Publication subscription page
  api/ # API route handlers
    auth/[...nextauth]/ # NextAuth endpoints
    inngest/ # Inngest webhook handler
    trpc/[trpc]/ # tRPC endpoint
    webhook/ # GitHub webhook intake
    generate/ # AI content generation
  dashboard/ # Authenticated dashboard
    admin/ # Admin panel
    posts/ # Post management (new, edit, settings)
    publications/ # Publication management
    subscriptions/ # User subscription management
  home/ # Marketing/landing pages
    collections/ # Data collections browser
    pricing/ # Pricing page
    publish/ # Publish CTA page
    solutions/ # Solution pages
  login/ # Authentication page
  user/[username]/ # Public user profile page
components/ # React components
  auth/ # Authentication components
  form/ # Form components
  front/ # Public-facing site components
  icons/ # Icon components
  layouts/ # Layout components
  preview/ # Content preview components
  MDX.tsx # MDX compilation component
  mdx-components-factory.tsx # MDX component mappings
middleware.ts # URL routing for multitenant app (rewrites, auth guards)
lib/ # Shared utilities
  __tests__/ # Unit tests for lib functions
  front/ # Frontend-specific utilities
  hooks/ # React hooks
  markdown.ts # Markdown processing pipeline
  content-store.ts # S3/MinIO storage abstraction
  github.ts # GitHub API client helpers
  actions.ts # Server actions
  app-config.ts # Application configuration loader
server/ # Server-side code
  api/
    routers/ # tRPC router definitions
      __tests__/ # Router unit tests
      post.ts # Post CRUD, sync, settings
      publication.ts # Publication CRUD
      like.ts # Like/unlike posts
      subscription.ts # Publication subscriptions
      user.ts # User profile management
      home.ts # Homepage data (featured posts, collections)
    root.ts # tRPC router composition
    trpc.ts # tRPC context and middleware
  auth.ts # NextAuth configuration
  db.ts # Prisma client singleton
inngest/ # Background job definitions
  client.ts # Inngest client and event types
  functions.ts # Sync and delete function implementations
prisma/
  schema.prisma # Database schema
  migrations/ # Prisma migrations
e2e/ # End-to-end tests (Playwright)
  renderer/ # Public page rendering tests
  fixtures/ # Test fixtures and helpers
  global-setup.ts # Test seeding and setup
Domain model
Entities defined in prisma/schema.prisma:
| Model | Purpose |
|---|---|
| User | Authenticated user (GitHub OAuth) |
| Publication | A named publication owned by a user (like a Substack) |
| Post | A content item (dataset or Observable notebook) linked to a GitHub repo, belongs to a Publication |
| Blob | A file record (metadata + sync status) belonging to a Post |
| Like | User-to-Post like relationship |
| PostStat | Daily views/downloads stats per Post |
| PostAuthor | Many-to-many Post-to-User author relationship |
| PublicationSubscription | User subscription to a Publication |
| Account, Session, VerificationToken | NextAuth authentication models |
Key relationships:
- User 1->N Publication (owner)
- Publication 1->N Post
- Post 1->N Blob
- Post N<->N User (via PostAuthor)
- Post 1->N Like
- Post 1->N PostStat
- Publication 1->N PublicationSubscription
erDiagram
User ||--o{ Publication : owns
User ||--o{ Like : gives
User ||--o{ PostAuthor : authors
User ||--o{ PublicationSubscription : subscribes
Publication ||--o{ Post : contains
Publication ||--o{ PublicationSubscription : has
Post ||--o{ Blob : has_files
Post ||--o{ Like : receives
Post ||--o{ PostStat : tracks
Post ||--o{ PostAuthor : has_authors
Post types (enum PostType):
- DATAPACKAGE - standard markdown/data dataset
- OBSERVABLE - Observable Framework notebook
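To make the many-to-many author relationship concrete, here is a minimal in-memory sketch of how a PostAuthor join row links a Post to its Users. The field names (id, username, publicationId, postId, userId) and the authorsOf helper are hypothetical; the authoritative shapes are in prisma/schema.prisma.

```typescript
// Hypothetical, simplified shapes; real fields are defined in prisma/schema.prisma.
type PostType = "DATAPACKAGE" | "OBSERVABLE";

interface User {
  id: string;
  username: string;
}

interface Post {
  id: string;
  publicationId: string; // Publication 1->N Post
  type: PostType;
}

// Join row for the Post N<->N User relationship.
interface PostAuthor {
  postId: string;
  userId: string;
}

// Resolve the authors of a post by walking the join rows.
function authorsOf(postId: string, postAuthors: PostAuthor[], users: User[]): User[] {
  const authorIds = new Set(
    postAuthors.filter((pa) => pa.postId === postId).map((pa) => pa.userId),
  );
  return users.filter((u) => authorIds.has(u.id));
}

const users: User[] = [
  { id: "u1", username: "alice" },
  { id: "u2", username: "bob" },
  { id: "u3", username: "carol" },
];
const post: Post = { id: "p1", publicationId: "pub1", type: "DATAPACKAGE" };
const postAuthors: PostAuthor[] = [
  { postId: "p1", userId: "u1" },
  { postId: "p1", userId: "u3" },
  { postId: "p2", userId: "u2" },
];

const authorNames = authorsOf(post.id, postAuthors, users).map((u) => u.username);
```

In the real app this lookup would presumably be a Prisma relation query rather than an in-memory filter.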
Major components
Authentication
- NextAuth config: server/auth.ts
- OAuth endpoints: app/api/auth/[...nextauth]/route.ts
- Login page: app/login/page.tsx
Dashboard (authenticated)
- Layout: app/dashboard/layout.tsx
- Main dashboard: app/dashboard/page.tsx
- Publication management: app/dashboard/publications/
- Post management: app/dashboard/posts/
- Subscription management: app/dashboard/subscriptions/page.tsx
- Profile settings: app/dashboard/settings/page.tsx
- Admin panel: app/dashboard/admin/page.tsx
Public rendering (publication/post pages)
- Publication index: app/[publication]/page.tsx - lists posts in a publication
- Post renderer: app/[publication]/[post]/[[...slug]]/page.tsx - renders dataset/notebook content
- Subscribe page: app/[publication]/subscribe/page.tsx
- User profile: app/user/[username]/page.tsx
- MDX compilation: components/MDX.tsx
- Markdown/MDX processing: lib/markdown.ts
- MDX component factory: components/mdx-components-factory.tsx
- Remark plugins (Obsidian wiki-links, embeds, callouts): lib/remark-plugins.tsx
Content sync (Inngest)
- GitHub client: lib/github.ts
- Inngest client + events: inngest/client.ts
- Sync/delete functions: inngest/functions.ts
- GitHub webhook intake: app/api/webhook/route.ts
- Inngest handler: app/api/inngest/route.ts
Storage
- S3/MinIO abstraction: lib/content-store.ts
- Raw content served via storage redirect
Homepage / marketing
All landing pages live in app/home/ and are served at the root domain via middleware rewrites (e.g. datahub.io/pricing -> app/home/pricing/page.tsx).
- Landing page: app/home/page.tsx (served at /)
- Pricing: app/home/pricing/page.tsx (served at /pricing)
- Publish CTA: app/home/publish/page.tsx (served at /publish)
- Collections: app/home/collections/page.tsx (served at /collections)
- Solutions: app/home/solutions/page.tsx (served at /solutions)
  - Sub-pages: logistics/, global-country-region-reference-data/, worldwide-postal-code-database/
Middleware routing (middleware.ts)
The middleware handles all URL routing for the multitenant app:
- /, /pricing, /collections, /solutions, /publish -> rewritten to app/home/*
- /@username -> rewritten to app/user/*
- /<publication>/<post>/_r/-/<file> -> rewritten to raw file API
- /<publication>/<post>/r/<file> -> rewritten to legacy raw file API
- /<publication>/<post>/datapackage.json -> rewritten to raw file API
- /dashboard/* -> auth guard (redirect to /login if unauthenticated)
- /login -> redirect to /dashboard if already authenticated
- Custom domains -> rewritten to /_domain={hostname}/*
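The routing rules above can be read as a single decision function. The following is a sketch only: it models the rules as a pure function rather than as actual Next.js middleware, and the names resolveRoute, RouteDecision, and ROOT_HOST, the rule ordering, and the exact rewrite targets (/home..., /user/...) are assumptions, not taken from middleware.ts.

```typescript
// Sketch of the routing rules as a pure function (not the real middleware.ts).
type RouteDecision =
  | { kind: "rewrite"; to: string }
  | { kind: "redirect"; to: string }
  | { kind: "raw-file"; legacy: boolean }
  | { kind: "next" };

const ROOT_HOST = "datahub.io"; // assumed root domain
const HOME_PATHS = new Set(["/", "/pricing", "/collections", "/solutions", "/publish"]);

function resolveRoute(hostname: string, pathname: string, isAuthenticated: boolean): RouteDecision {
  // Custom domains -> /_domain={hostname}/*
  if (hostname !== ROOT_HOST) {
    return { kind: "rewrite", to: `/_domain=${hostname}${pathname}` };
  }
  // /login -> redirect to /dashboard if already authenticated
  if (pathname === "/login") {
    return isAuthenticated ? { kind: "redirect", to: "/dashboard" } : { kind: "next" };
  }
  // /dashboard/* -> auth guard
  if (pathname === "/dashboard" || pathname.startsWith("/dashboard/")) {
    return isAuthenticated ? { kind: "next" } : { kind: "redirect", to: "/login" };
  }
  // Marketing pages (and their sub-pages) -> app/home/*
  const firstSegment = "/" + (pathname.split("/")[1] ?? "");
  if (HOME_PATHS.has(firstSegment)) {
    return { kind: "rewrite", to: pathname === "/" ? "/home" : `/home${pathname}` };
  }
  // /@username -> app/user/*
  if (pathname.startsWith("/@")) {
    return { kind: "rewrite", to: `/user/${pathname.slice(2)}` };
  }
  // Raw file routes: current, legacy, and datapackage.json
  if (/^\/[^/]+\/[^/]+\/_r\/-\/.+$/.test(pathname)) return { kind: "raw-file", legacy: false };
  if (/^\/[^/]+\/[^/]+\/r\/.+$/.test(pathname)) return { kind: "raw-file", legacy: true };
  if (/^\/[^/]+\/[^/]+\/datapackage\.json$/.test(pathname)) return { kind: "raw-file", legacy: false };
  // Everything else falls through to the [publication]/[post] pages.
  return { kind: "next" };
}
```

Keeping the decision logic free of request and response objects makes each rule easy to unit test; a real middleware would translate each decision into the corresponding rewrite, redirect, or pass-through response.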
API surface
HTTP route handlers (app/api/)
| Route | File | Purpose |
|---|---|---|
| /api/auth/[...nextauth] | app/api/auth/[...nextauth]/route.ts | NextAuth OAuth endpoints |
| /api/trpc/[trpc] | app/api/trpc/[trpc]/route.ts | tRPC endpoint |
| /api/webhook | app/api/webhook/route.ts | GitHub push webhook |
| /api/inngest | app/api/inngest/route.ts | Inngest function handler |
| /api/generate | app/api/generate/route.ts | AI content generation |
tRPC routers (via /api/trpc/[trpc])
Composed in server/api/root.ts:
| Router | File | Key procedures |
|---|---|---|
| post | server/api/routers/post.ts | CRUD, sync, settings, tree management |
| publication | server/api/routers/publication.ts | Create, update, delete publications |
| like | server/api/routers/like.ts | Like/unlike posts, get like status |
| subscriptions | server/api/routers/subscription.ts | Subscribe/unsubscribe to publications |
| user | server/api/routers/user.ts | Profile management, avatar upload |
| home | server/api/routers/home.ts | Homepage data, featured content |
Sync architecture (Inngest)
Configuration:
- inngest/client.ts - Event definitions and client setup
- inngest/functions.ts - Sync and delete functions
graph TD
subgraph "Trigger Sources"
GH[GitHub Push] -->|Webhook| WH[Webhook Handler]
UI[Manual UI Trigger] -->|Direct| IE[Inngest Event]
WH -->|Validate & Filter| IE
end
subgraph "Event Processing"
IE -->|post/sync| SF[Sync Function]
IE -->|post/delete| DF[Delete Function]
end
Trigger points
- Automatic: GitHub push webhooks to the tracked branch, validated by webhook secret.
- Manual: user-initiated sync from dashboard with force option.
- Initial sync: on post creation, full repository processing.
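For the automatic trigger, a webhook handler typically has to do two things before emitting a sync event: verify the delivery against the shared secret and check that the push targets the tracked branch. The sketch below uses GitHub's documented scheme (an HMAC-SHA256 of the raw body, sent in the X-Hub-Signature-256 header as sha256=<hex>) and the ref field of the push payload. Whether app/api/webhook/route.ts validates exactly this way is an assumption, and the function names are hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the value GitHub would send in X-Hub-Signature-256 for this body.
function signPayload(secret: string, rawBody: string): string {
  return "sha256=" + createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Constant-time comparison of the received header against the expected signature.
function isValidSignature(secret: string, rawBody: string, header: string | null): boolean {
  if (!header) return false;
  const expected = Buffer.from(signPayload(secret, rawBody));
  const received = Buffer.from(header);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Push payloads carry the full ref, e.g. "refs/heads/main".
function isTrackedBranch(payload: { ref?: string }, trackedBranch: string): boolean {
  return payload.ref === `refs/heads/${trackedBranch}`;
}

// Hypothetical delivery used to exercise the helpers.
const rawBody = JSON.stringify({ ref: "refs/heads/main" });
const header = signPayload("example-secret", rawBody);
const accepted =
  isValidSignature("example-secret", rawBody, header) &&
  isTrackedBranch(JSON.parse(rawBody), "main");
```

The signature must be computed over the raw request body, not over a re-serialized JSON object, since any difference in whitespace or key order changes the digest.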
Sync flow
- Fetch post details and publication/user info from Postgres.
- Update sync status to PENDING.
- Load post configuration, include/exclude patterns, validate root directory.
- Fetch current tree from storage and repo tree from GitHub API.
- Compare SHA hashes to detect changes (early exit if trees match).
- For changed files:
  - Filter by supported extensions (.md, .json, .yaml).
  - Apply include/exclude patterns.
  - Upload raw file to R2/MinIO.
  - Parse frontmatter, compute metadata (title, description, URL).
  - Store/update Blob metadata in Postgres.
- Handle deletions (remove files from storage, clean metadata).
- Upload new tree, update sync status to SUCCESS, revalidate Next.js cache tags.
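The change-detection part of the sync flow can be sketched as a small diff over two trees. This is an illustration under assumptions: the stored tree is modelled as a root SHA plus a flat path-to-SHA map (the real stored tree format is not described here), include/exclude patterns are omitted, and the names TreeSnapshot, planSync, and isSupported are hypothetical.

```typescript
// Hypothetical flat tree shape: root SHA plus path -> blob SHA.
interface TreeSnapshot {
  sha: string;
  files: Record<string, string>;
}

interface SyncPlan {
  changed: string[]; // new or modified files to upload and index
  deleted: string[]; // files to remove from storage and metadata
}

const SUPPORTED_EXTENSIONS = [".md", ".json", ".yaml"];

function isSupported(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Returns null when the root SHAs match (early exit), otherwise the work to do.
function planSync(stored: TreeSnapshot, remote: TreeSnapshot): SyncPlan | null {
  if (stored.sha === remote.sha) return null;

  const has = (obj: Record<string, string>, key: string): boolean =>
    Object.prototype.hasOwnProperty.call(obj, key);

  const changed = Object.keys(remote.files)
    .filter((p) => isSupported(p) && stored.files[p] !== remote.files[p])
    .sort();
  const deleted = Object.keys(stored.files)
    .filter((p) => !has(remote.files, p))
    .sort();
  return { changed, deleted };
}

const storedTree: TreeSnapshot = {
  sha: "tree-a",
  files: { "README.md": "1", "old.md": "2", "data.json": "3" },
};
const remoteTree: TreeSnapshot = {
  sha: "tree-b",
  files: { "README.md": "1", "data.json": "4", "new.yaml": "5", "image.png": "6" },
};
const plan = planSync(storedTree, remoteTree);
```

Here README.md is skipped because its SHA is unchanged, image.png is skipped because its extension is unsupported, and old.md is scheduled for deletion because it no longer exists in the repo tree.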
Error handling
- Concurrency limit: 5 per account.
- Cancellation: on new sync or delete events for the same post.
- Non-retriable errors: invalid root dir, YAML parse errors, invalid datapackage format.
- Retriable errors: GitHub API rate limits, network timeouts, storage upload failures.
- Error messages stored in database, visible in dashboard.
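One way to express the retriable/non-retriable split is to throw a dedicated error type for failures that a retry cannot fix. The Inngest SDK exports a NonRetriableError class for this purpose; the sketch below defines a local stand-in so it runs without the SDK. The SyncFailure codes and helper names are hypothetical and do not come from inngest/functions.ts.

```typescript
// Local stand-in for the NonRetriableError class exported by the "inngest" package.
class NonRetriableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NonRetriableError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, NonRetriableError.prototype);
  }
}

// Hypothetical failure codes mirroring the two lists above.
type SyncFailure =
  | "INVALID_ROOT_DIR"
  | "YAML_PARSE_ERROR"
  | "INVALID_DATAPACKAGE"
  | "GITHUB_RATE_LIMIT"
  | "NETWORK_TIMEOUT"
  | "STORAGE_UPLOAD_FAILED";

const NON_RETRIABLE: ReadonlySet<SyncFailure> = new Set<SyncFailure>([
  "INVALID_ROOT_DIR",
  "YAML_PARSE_ERROR",
  "INVALID_DATAPACKAGE",
]);

// Build the error to throw from a sync step.
function toSyncError(kind: SyncFailure, message: string): Error {
  return NON_RETRIABLE.has(kind)
    ? new NonRetriableError(`${kind}: ${message}`)
    : new Error(`${kind}: ${message}`);
}

// A plain Error is retried by the job runner; a NonRetriableError fails immediately.
function isRetriable(err: unknown): boolean {
  return !(err instanceof NonRetriableError);
}

const yamlError = toSyncError("YAML_PARSE_ERROR", "bad indentation in frontmatter");
const rateLimitError = toSyncError("GITHUB_RATE_LIMIT", "secondary rate limit hit");
```

Failing fast on content errors avoids burning retries on input that will not change until the author pushes a fix, while transient failures still get another attempt.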
Cloudflare Worker (external repo)
Repository: datahub-cloudflare-workers (cloned alongside this repo, default ../datahub-cloudflare-workers).
Purpose: process markdown uploads from R2/MinIO, parse frontmatter, update Blob metadata, and index content in Typesense.
- Consumes queue messages from R2 events (prod/staging) or MinIO webhook (dev at /queue).
- Skips non-markdown files, supports publish: false to delete content.
- Storage key pattern: {postId}/{branch}/raw/{path}
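The storage key pattern above can be built and parsed with a pair of small helpers. This is a sketch, not the worker's code: the helper names are hypothetical, the parser assumes the branch name does not itself contain a /raw/ segment, and treating only .md as markdown is an assumption.

```typescript
interface StorageKeyParts {
  postId: string;
  branch: string;
  path: string;
}

// Build a key following {postId}/{branch}/raw/{path}.
function buildStorageKey({ postId, branch, path }: StorageKeyParts): string {
  return `${postId}/${branch}/raw/${path}`;
}

// Parse a key back into its parts; returns null if it does not fit the pattern.
// The branch match is non-greedy, so the first "/raw/" after the branch splits the key.
function parseStorageKey(key: string): StorageKeyParts | null {
  const match = /^([^/]+)\/(.+?)\/raw\/(.+)$/.exec(key);
  if (!match) return null;
  const [, postId, branch, path] = match;
  return { postId, branch, path };
}

// Assumption: only .md files count as markdown for the worker.
function isMarkdownKey(key: string): boolean {
  const parts = parseStorageKey(key);
  return parts !== null && parts.path.toLowerCase().endsWith(".md");
}

const exampleKey = buildStorageKey({ postId: "post123", branch: "main", path: "docs/intro.md" });
const exampleParts = parseStorageKey(exampleKey);
```

A queue consumer could call isMarkdownKey on each incoming object key and skip everything else before doing any parsing or indexing work.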
Queues:
- Dev: markdown-processing-queue-dev
- Staging: datahub-markdown-queue-staging
- Prod: datahub-markdown-queue
External services
| Service | Purpose | Config location |
|---|---|---|
| PostgreSQL (Neon/Vercel) | Users, publications, posts, blobs, metadata | POSTGRES_PRISMA_URL in .env |
| R2 (Cloudflare) / MinIO | Raw markdown files, tree blobs | S3_* vars in .env |
| Typesense | Search indexing | Worker config |
| Inngest | Background sync/delete jobs | inngest/functions.ts |
| GitHub OAuth | Authentication | NEXT_PUBLIC_AUTH_GITHUB_ID, AUTH_GITHUB_SECRET |
| Posthog | Analytics | NEXT_PUBLIC_POSTHOG_KEY |
| Brevo | Email notifications | BREVO_* vars |
| Cloudflare Turnstile | Captcha | TURNSTILE_* vars |
Testing
Unit tests (Jest)
- Location: lib/__tests__/, server/api/routers/__tests__/, lib/*.test.ts
- Run: pnpm test
- Framework: Jest + Testing Library
E2E tests (Playwright)
- Location: e2e/renderer/ (public page tests)
- Config: playwright.config.ts
- Global setup: e2e/global-setup.ts (seeds test data via Inngest)
- Run: pnpm exec playwright test --project=renderer-chromium
- See: e2e/README.md for full guide
CI
- GitHub Actions workflows in .github/workflows/:
  - e2e.yml - E2E tests against Vercel preview deployments
  - lint.yml - ESLint checks
  - unit.yml - Jest unit tests
  - sync-ga-views.yml - Google Analytics view sync