From https://chatgpt.com/g/g-p-69456b5516d88191a86c8311ce28592d-substack-for-data/shared/c/696230b9-d7c0-8327-9186-eeb4ca470b5c?owner_user_id=user-NwfJc4WN4M2O0Lnz6rdPWGot

The current https://datahub.io/blog page looks like this because we aren't setting titles from the markdown …

[Image: screenshot of the current blog page]

Goal

A post (or dataset/dashboard page) should reliably have: title, short description, featured image, and body content—without requiring config.json, and with a path to optional UI edits.

V1 principle

Treat README.md / index.md front matter as the single source of truth, and make the UI read-only by default (with an “Override metadata” escape hatch you can add later).

Content extraction contract

Require (or strongly encourage) YAML front matter at the top of README.md / index.md:

---
title: "My post title"
description: "One-sentence tagline shown in cards and headers."
image: "assets/cover.png"   # or absolute URL
slug: "my-post-title"       # optional
type: "post"                # optional: post|dataset|dashboard
---

Fallbacks if front matter is missing:

  • title: first Markdown H1 (# ...), else repo name / directory name
  • description: explicit description, else first non-empty paragraph (trim to N chars)
  • image: explicit image, else first “hero candidate” image in the document (or none)
  • slug: explicit slug, else slugify(title)
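
The fallback chain above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `parse_front_matter`, `extract_metadata`, `slugify`, and the 160-character limit are hypothetical names and values, the parser only understands flat `key: "value"` lines (a real YAML library would replace it), and the "hero candidate" rule is simplified to "first Markdown image".

```python
import re

# Front matter: a block delimited by `---` lines at the very top of the file.
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)^---[ \t]*$\n?", re.DOTALL | re.MULTILINE)


def parse_front_matter(text):
    """Split text into (fields, body). Only flat `key: value` lines are handled."""
    match = FRONT_MATTER.match(text)
    if not match:
        return {}, text
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep what is between the quotes, drop any trailing comment.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            value = value.split(" #", 1)[0].strip()
        fields[key.strip()] = value
    return fields, text[match.end():]


def slugify(title):
    """Lowercase, replace runs of non-alphanumerics with '-', trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def extract_metadata(text, fallback_name, max_description=160):
    fields, body = parse_front_matter(text)

    # title: front matter, else first H1, else repo/directory name.
    h1 = re.search(r"^#[ \t]+(.+?)[ \t]*$", body, re.MULTILINE)
    title = fields.get("title") or (h1.group(1) if h1 else fallback_name)

    # description: front matter, else first non-empty paragraph (trimmed).
    description = fields.get("description")
    if not description:
        for block in re.split(r"\n\s*\n", body):
            block = block.strip()
            if block and not block.startswith(("#", "![")):
                description = " ".join(block.split())[:max_description]
                break

    # image: front matter, else first Markdown image (simplified hero rule).
    first_image = re.search(r"!\[[^\]]*\]\(([^)\s]+)", body)
    image = fields.get("image") or (first_image.group(1) if first_image else None)

    slug = fields.get("slug") or slugify(title)
    return {"title": title, "description": description, "image": image, "slug": slug}
```

With no front matter at all, a file that starts with `# Hello World` followed by a paragraph and an image still yields a usable title, description, image, and slug.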

Precedence rules

Define deterministic precedence so sync is predictable:

  1. UI override fields (if enabled) win.
  2. Front matter fields.
  3. Heuristic extraction (H1 / first paragraph / first image).
  4. Repository/directory fallback.

Also persist source_of_truth per field (override|frontmatter|extracted|fallback) so listings and editors can explain where values came from.
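
As a rough sketch of the precedence rules and the per-field source_of_truth, the resolver below walks the four layers in order and records which one supplied each value. The `resolve_fields` name and the shape of the `layers` dictionary are assumptions made for illustration.

```python
# Sources in descending priority, matching the precedence list above.
PRECEDENCE = ("override", "frontmatter", "extracted", "fallback")


def resolve_fields(layers, field_names):
    """Pick each field from the highest-priority layer that has a value.

    `layers` maps a source name to a dict of candidate values.
    Returns {field: {"value": ..., "source_of_truth": ...}}.
    """
    resolved = {}
    for name in field_names:
        resolved[name] = {"value": None, "source_of_truth": None}
        for source in PRECEDENCE:
            value = layers.get(source, {}).get(name)
            if value not in (None, ""):
                resolved[name] = {"value": value, "source_of_truth": source}
                break
    return resolved
```

Because the source is stored next to the value, a listing or editor can show a hint such as "Title comes from front matter" without re-running extraction.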

Data model changes

Rename “site” to “post” internally (or introduce a Post entity and alias the old name). Either way, the key requirement is a stable metadata table for listing performance and search.

Minimal fields:

  • id, publication_id, type
  • repo_url, repo_ref (branch/commit), content_path (e.g. /README.md)
  • title, description, featured_image_url
  • slug, canonical_url
  • metadata_json (raw front matter + extracted signals)
  • override_json (nullable)
  • updated_at, synced_at, source_hash
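
To make the field list concrete, here is one possible shape for the record, written as a Python dataclass. The field names come from the list above; the types, which fields are nullable, and the `PostRecord` class name are assumptions rather than a settled schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class PostRecord:
    # Identity and ownership
    id: str
    publication_id: str
    type: str  # "post" | "dataset" | "dashboard"
    # Where the content lives
    repo_url: str
    repo_ref: str  # branch or commit
    content_path: str  # e.g. "/README.md"
    # Resolved, query-friendly metadata
    title: str
    slug: str
    description: Optional[str] = None
    featured_image_url: Optional[str] = None
    canonical_url: Optional[str] = None
    # Raw front matter plus extracted signals, kept for future extension
    metadata_json: Dict[str, Any] = field(default_factory=dict)
    # UI overrides, absent until overrides are enabled and used
    override_json: Optional[Dict[str, Any]] = None
    # Sync bookkeeping
    updated_at: Optional[datetime] = None
    synced_at: Optional[datetime] = None
    source_hash: Optional[str] = None
```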

Sync lifecycle

When ingesting from GitHub:

  1. Resolve which file is the “main post file” (convention: README.md at root, else index.md in a folder).
  2. Parse front matter + markdown.
  3. Compute derived fields + slug.
  4. Store metadata into DB for fast listing and routing.
  5. Store referenced images into your object store (or proxy via CDN) to avoid broken relative paths.

Important: keep the raw parsed front matter in metadata_json so you can extend without migrations.
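
Steps 1 and 4 of the lifecycle can be sketched as follows. The sketch assumes the convention means "prefer README.md, then index.md, inside the selected folder" and that source_hash is a hash of the raw file bytes used to skip unchanged files; both are interpretations, and the function names are hypothetical.

```python
import hashlib
import posixpath

# Candidate file names, checked in this order.
MAIN_FILE_NAMES = ("README.md", "index.md")


def pick_main_file(repo_paths, folder=""):
    """Return the main post file inside `folder`, or None if there is none."""
    available = set(repo_paths)
    for name in MAIN_FILE_NAMES:
        candidate = posixpath.join(folder, name)
        if candidate in available:
            return candidate
    return None


def source_hash(raw_bytes):
    """Stable fingerprint of the file content."""
    return hashlib.sha256(raw_bytes).hexdigest()


def needs_sync(stored_hash, raw_bytes):
    """True when the file content differs from what was last synced."""
    return stored_hash != source_hash(raw_bytes)
```

Comparing hashes before parsing keeps repeated syncs cheap when nothing in the main file has changed.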

For the featured image, support three cases:

  • relative path in repo: resolve against the Markdown file directory; upload/cache to object storage; store stable URL
  • absolute URL: accept; optionally proxy/cache for resilience
  • no image: render a neutral placeholder (or publication default)
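
A small classifier shows how the three cases might be told apart before any upload or proxying happens. The `classify_image` name and the returned tags are hypothetical. Treating a leading "/" as repo-root-relative and rejecting paths that escape the repo are assumptions, not decisions taken in this document.

```python
import posixpath
from urllib.parse import urlparse


def classify_image(image, content_path):
    """Return (kind, value) where kind is 'relative', 'absolute' or 'placeholder'.

    For 'relative', value is the image path from the repo root.
    """
    if not image:
        return ("placeholder", None)
    if urlparse(image).scheme in ("http", "https"):
        return ("absolute", image)
    if image.startswith("/"):
        # Assumption: a leading slash means "from the repo root".
        resolved = posixpath.normpath(image.lstrip("/"))
    else:
        # Resolve against the directory that holds the Markdown file.
        base_dir = posixpath.dirname(content_path.lstrip("/"))
        resolved = posixpath.normpath(posixpath.join(base_dir, image))
    if resolved == ".." or resolved.startswith("../"):
        # The path points outside the repo; fall back to the placeholder.
        return ("placeholder", None)
    return ("relative", resolved)
```

The repo-root path returned for the relative case is what an ingest step would fetch and store in object storage before saving the stable URL.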

URL and slug strategy

For V1, compute slugify(title) on the first successful sync and then freeze the result (to avoid breaking links when titles change). If a user later supplies slug in front matter, treat it as an explicit migration (either redirect old slug → new slug, or require manual confirmation before changing).

Practical rule:

  • slug is immutable unless it is explicitly changed (front matter or UI override), and every such change must emit a redirect from the old slug.
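
The freeze-then-redirect rule can be expressed as a single decision function. This is a sketch under the assumption that the stored slug is None before the first successful sync; `next_slug` is a hypothetical helper, and the simple `slugify` shown here is ASCII-only.

```python
import re


def slugify(title):
    """Lowercase, replace runs of non-alphanumerics with '-', trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_slug(stored_slug, title, explicit_slug=None):
    """Return (slug, redirect), where redirect is (old_slug, new_slug) or None."""
    if stored_slug is None:
        # First successful sync: derive the slug once and freeze it.
        return (explicit_slug or slugify(title), None)
    if explicit_slug and explicit_slug != stored_slug:
        # Explicit migration: change the slug and redirect the old one.
        return (explicit_slug, (stored_slug, explicit_slug))
    # Title changes alone never move the slug.
    return (stored_slug, None)
```

For example, a post first synced as "My Post Title" keeps the slug `my-post-title` after a retitle, and only an explicit slug produces a redirect pair.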

UI experience options

V1 (lowest friction; recommended):

  • No metadata editing in UI.
  • UI shows extracted metadata with “Edit in README” hints (e.g., “Title comes from front matter. Edit README to change.”).
  • A “Copy front matter template” helper button is useful.

V1.5 (still simple, but pragmatic for your team):

  • Allow editing only title, description, featured image, and “pinned” status in UI as overrides.
  • Show a toggle: “Override README metadata (advanced)”.
  • Store overrides separately; never write back to GitHub in V1.

Later (full authoring):

  • UI editor that can write back to GitHub via PR/commit, but that’s a separate product decision.

Creating a post flow

If you want “just pull from GitHub and that’s it”:

  1. User connects repo (or selects a folder within a mono-repo).
  2. System auto-detects main file (README.md / index.md).
  3. System syncs and immediately shows the published page.
  4. UI indicates what to change in GitHub to update metadata.

Publication-level config

Move anything that looks like “site config” to the publication:

  • publication title/description/logo/default social card image
  • theme choices
  • default license / contact
  • canonical domain settings

Posts should only carry what’s needed to render and list that post.

A few concrete questions that determine the exact spec

  1. For dashboards specifically: is the “main file” a Markdown wrapper around an embedded dashboard, or is the dashboard itself the primary artifact with Markdown optional?
  2. Do you want stable slugs across title changes without requiring redirects infrastructure, or is implementing redirects acceptable early?
  3. Should posts be allowed inside a single repo as multiple “entries” (folder-per-post), or is it one repo = one post for now?

1. The Three Content Types

  • Dataset:

    • Primary Metadata: datapackage.json or datapackage.yml (Standard Frictionless Data specs).
    • Allow config.json only if you have view-specific metadata that doesn't fit the Frictionless spec.
  • Dashboard:

    • Metadata: config.json. Since these are often JavaScript/Observable apps, a JSON config is the native way to handle parameters.
  • Post:

    • Metadata: Frontmatter (YAML at the top of Markdown).
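
The mapping from content type to metadata source could be captured as a small lookup table. In this sketch the preference for datapackage.json over datapackage.yml when both exist is an assumption, and `metadata_files` is a hypothetical helper name.

```python
# Primary metadata file names per content type, in the order they are checked.
PRIMARY_METADATA = {
    "dataset": ("datapackage.json", "datapackage.yml"),
    "dashboard": ("config.json",),
    "post": (),  # posts carry metadata in Markdown front matter, not a separate file
}

# Optional extra files for metadata that does not fit the primary format.
SUPPLEMENTAL_METADATA = {"dataset": ("config.json",)}


def metadata_files(content_type, filenames):
    """Return (primary_file_or_None, [supplemental files that are present])."""
    present = set(filenames)
    primary = next((n for n in PRIMARY_METADATA[content_type] if n in present), None)
    extras = [n for n in SUPPLEMENTAL_METADATA.get(content_type, ()) if n in present]
    return primary, extras
```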

2. The Complication: UI Editing vs. Source Files

We identified the core tension: If a user changes a title in the Dashboard UI, and we save it to the DB, the DB and the File (Git/Storage) are now out of sync.

Here are the three standard architectural patterns to solve this, ranging from easiest to hardest:

Option A: The "Overlay" Strategy (DB Priority)

How it works: The database is the "master" for the presentation layer. The file is just the "seeder."

  1. On deployment/scan, we read the file metadata and populate the DB.
  2. If the user edits metadata in the UI, we update the DB column.
  3. The Rule: When rendering the page, check the DB first. If the DB has a value (e.g., title), use it. If the DB is empty (or flagged as 'synced'), fall back to the file content.
  • Pros: Fast UI interactions; easiest to implement.
  • Cons: The file in the repository becomes "stale." If a developer edits the file later, you need logic to decide whether the file should overwrite the manual changes in the DB (e.g., Last-Write-Wins).
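
The rendering rule for Option A is short enough to write down directly. In this sketch the 'synced' flag is modelled as a set of field names that still track the file, which is only one possible representation; `render_value` is a hypothetical name.

```python
def render_value(db_values, synced_fields, file_meta, field):
    """Return the value to display for `field` under the overlay rule.

    db_values:     metadata stored in the DB for this page
    synced_fields: fields flagged as 'synced', i.e. still following the file
    file_meta:     metadata read from the source file
    """
    db_value = db_values.get(field)
    if db_value in (None, "") or field in synced_fields:
        # DB is empty or the field is flagged as synced: use the file content.
        return file_meta.get(field)
    return db_value
```

A field edited in the UI would be removed from `synced_fields`, so later file edits stop affecting it until someone re-syncs it on purpose.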

Option B: The "Git-Backed" Strategy (True Sync)

How it works: The UI acts as a Git client.

  1. User updates the title in the UI.
  2. The backend does not just update a DB column; it triggers a commit to the underlying file (e.g., updates the frontmatter in the Markdown file via GitHub API or local FS).
  3. The DB is updated only after the file change is processed.
  • Pros: Single Source of Truth is always the file. No "drift" between DB and File.
  • Cons: Slow (waiting for commits/CI); complex error handling (what if the commit fails?); risk of merge conflicts.

Option C: The "Read-Only / Fork" Strategy

How it works: You strictly separate "Publisher" (Code/File) mode from "Editor" (UI) mode.

  1. Files are read-only in the UI.
  2. If a user wants to edit metadata in the UI, they must "detach" or "eject" the content, effectively telling the system "Ignore the file for this field, I am taking over manually."

3. Recommendation

We are proceeding with the Database Overlay approach (Option A).

Here is the reasoning and the implementation plan:

The "Why": User Experience vs. Legacy

We need to be realistic about our user base. Most of our future users will treat this as a SaaS platform; they expect that when they update a title or description in the UI, it updates immediately. They do not want to manage a Git workflow, edit YAML files, or understand the concept of a commit.

The strict "File-as-Source-of-Truth" approach is primarily a requirement for our own internal legacy content and developer-centric workflows. It should not dictate the architecture for the general user experience.

The "How": Implementation Logic

  1. Ingestion (The "Seeder"): When a project is first connected or deployed, we read the metadata from the files (Frontmatter, datapackage.json, config.json) and populate the database metadata column.
  2. Editing (DB First): When a user edits metadata via the Dashboard UI, we write those changes directly to the Database. We will not attempt to write back to the Git repository or file system.
  3. Display (Precedence): The Database is the single source of truth for the application UI. When rendering the Publication Homepage or sorting lists:
  • We query the Database.
  • If the DB has the metadata, we use it.
  • Files are treated essentially as a "backup" or an initial import mechanism, not the live state.

This simplifies our sorting logic significantly (as agreed, we just query the JSONB column) and removes the complexity of trying to keep Git and DB in perfect bidirectional sync.
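
The three steps above can be illustrated with an in-memory stand-in for the database metadata column. This is a sketch only: `MetadataStore` is hypothetical, a Python dict replaces the JSONB column, and it assumes that re-ingesting an already known project does not overwrite stored values.

```python
class MetadataStore:
    """In-memory stand-in for the database metadata column."""

    def __init__(self):
        self.rows = {}

    def ingest(self, project_id, file_metadata):
        # Seeder: file metadata only populates rows that do not exist yet.
        self.rows.setdefault(project_id, dict(file_metadata))

    def edit(self, project_id, changes):
        # UI edits go straight to the store; nothing is written back to files.
        self.rows[project_id].update(changes)

    def listing(self, sort_key="title"):
        # Display: read only from the store, sorted case-insensitively.
        return sorted(
            self.rows.values(),
            key=lambda row: str(row.get(sort_key, "")).lower(),
        )
```

Sorting the publication homepage then never touches the repository, because every value it needs is already in the store.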