NPM package wrapper for our core datasets
Situation
We have around 150 datasets in our "datasets" organisation. We want to make it easier for people to integrate this data into their codebases.
Requirements
- Always serve the latest data (no package rebuild required): Users always access the latest data. Data updates do not require republishing and upgrading the package.
- Ease of integration and modularity: Users can integrate datasets with minimal setup and can import only the code they need (e.g. only related to data fetching of a single dataset).
- Monetization and authentication: Restrict access to premium datasets or additional, premium data within existing datasets.
Overview
- Single npm package with modules: Single npm package, e.g. datasets, with separate modules for each dataset.
- Serve data from R2: Serve data from R2 storage (separate from the DHC one) rather than directly from GitHub, to manage data access and enable authentication. Create a workflow for syncing the latest data from GitHub to R2.
- Implement authentication mechanism: Introduce an API key or token-based authentication system to control access to premium data. (Probably second iteration.)
- Multiple R2 Buckets: Configure multiple R2 buckets to segregate free and premium datasets.
R2 buckets for free and premium datasets
Categorization (2 separate buckets):
- Free datasets (or free datasets versions): Accessible without authentication. Hosted in the free R2 bucket, e.g. "datasets-free".
- Premium datasets: Require an API key. Hosted in the premium R2 bucket with access controls, e.g. "datasets-premium".
(Probably second iteration; we'd start with free datasets.) Access control implementation:
- R2 bucket policies: Set up bucket policies to require API keys for the premium bucket.
- API key verification: The server serving R2 data should verify the API key and subscription level before granting access.
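The verification step described above could be sketched roughly as follows. This is a minimal illustration only: the key store, key values, and subscription levels are all placeholder assumptions, not the actual mechanism.

```javascript
// Hypothetical sketch of the check the server fronting R2 would run before
// granting access. The key store and level names are illustrative.
const KEYS = new Map([
  ["key-abc", "premium"],
  ["key-free", "free"],
]);

function authorize(apiKey, requiredLevel) {
  const level = KEYS.get(apiKey);
  if (!level) return { ok: false, status: 401 }; // unknown or missing key
  if (requiredLevel === "premium" && level !== "premium") {
    return { ok: false, status: 403 }; // key exists but lacks premium access
  }
  return { ok: true, status: 200 };
}
```

In practice this lookup would hit whatever API key management system we build (see the TBD section below), but the free/premium gating logic would stay this simple.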
Possible bucket structure (same for free and premium):
/{dataset-repo-name}/{last-sync-date | "latest"}/data/{...data-files-from-data-folder}
# Example
#
# /currency-codes
# /latest
# /data/codes-all.csv
# /2024-10-25
# /data/codes-all.csv
# /2024-09-11
# /data/codes-all.csv
# /oil-prices
# /latest
# /data
# brent-daily.csv
# ...
# /2024-10-31
# /data
# brent-daily.csv
# ...
🚧 What about postal-codes? Do we keep them in a separate bucket?
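The layout above maps straightforwardly to object keys. A tiny helper makes the convention explicit (names are taken from the example; the function itself is just a sketch):

```javascript
// Builds an R2 object key following the proposed layout:
// {dataset-repo-name}/{last-sync-date | "latest"}/data/{file}
function objectKey(repo, version, file) {
  return `${repo}/${version}/data/${file}`;
}

objectKey("currency-codes", "latest", "codes-all.csv");
// "currency-codes/latest/data/codes-all.csv"
```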
Datasets repositories adjustments
- Ensure that each dataset repository has a consistent structure, specifically that all data files are in the /data folder.
- Create a reusable workflow in the npm datapackage's repository (or in a new repo in the datasets organisation) that will handle syncing the data from GitHub repositories to R2 buckets.
- Trigger the reusable workflow on any data update (change to its /data folder on the main branch) in any datasets repository.
🚧 How does the workflow know whether to sync to the free or premium bucket? We'd need e.g. an is_premium field in the datapackage.
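If we go with the proposed (not yet decided) is_premium field, the routing decision in the sync workflow would be trivial. Bucket names follow the examples above; the field name is only a proposal:

```javascript
// Hypothetical sketch: pick the target bucket from an `is_premium` field
// in the repository's datapackage. Field and bucket names are proposals.
const FREE_BUCKET = "datasets-free";
const PREMIUM_BUCKET = "datasets-premium";

function targetBucket(datapackage) {
  return datapackage.is_premium ? PREMIUM_BUCKET : FREE_BUCKET;
}

targetBucket({ name: "oil-prices", is_premium: false }); // "datasets-free"
```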
API key management system implementation
🚧 TBD
1. Organize the package repository structure
Create a single npm package, e.g. datasets-js, and structure it to include a separate module for each dataset.
datasets/
├── package.json
├── README.md
├── index.js
├── datasets/
│ ├── airportCodes.js
│ ├── oilPrices.js
│ ├── breastCancerData.js
│ └── ... (other datasets)
└── utils/
└── BaseDataset.js
2. Create a base class for data fetching
- Implement a base class that handles data fetching, caching, and other common functionalities.
- Allow users to configure things like:
- API key,
- cache duration,
- data refresh intervals.
// utils/BaseDataset.js
const axios = require('axios');

class BaseDataset {
  constructor(dataUrl, options = {}) {
    this.dataUrl = dataUrl;
    this.cacheDuration = options.cacheDuration || 24 * 60 * 60 * 1000; // Default: 24 hours
    this.apiKey = options.apiKey || null;
    this.data = null;
    this.lastUpdated = null;
  }

  async fetchData() {
    const now = Date.now();
    if (this.data && now - this.lastUpdated < this.cacheDuration) {
      return this.data; // Serve from in-memory cache while still fresh
    }
    try {
      const headers = this.apiKey ? { 'X-API-Key': this.apiKey } : {};
      const response = await axios.get(this.dataUrl, { headers });
      this.data = response.data; // Basic in-memory data caching
      this.lastUpdated = now;
      return this.data;
    } catch (error) {
      console.error(`Failed to fetch data from ${this.dataUrl}:`, error.message);
      throw error;
    }
  }
}

module.exports = BaseDataset;
3. Implement dataset modules
Create a module for each dataset that extends the base class.
// datasets/airportCodes.js
const BaseDataset = require('../utils/BaseDataset');

class AirportCodes extends BaseDataset {
  constructor(options = {}) {
    super('https://r2.yourdomain.com/free/airport-codes.json', options);
  }

  async getByName(name) {
    const data = await this.fetchData();
    return data.find(item => item.name === name);
  }

  async getByIATA(code) {
    const data = await this.fetchData();
    return data.find(item => item.iata_code === code);
  }

  async getAll() {
    return await this.fetchData();
  }
}

module.exports = AirportCodes;
4. Export modules in the main entry point
// index.js
module.exports = {
  AirportCodes: require('./datasets/airportCodes'),
  OilPrices: require('./datasets/oilPrices'),
  BreastCancerData: require('./datasets/breastCancerData'),
  // ... other datasets
};
5. Allow direct import of individual datasets
To enable users to import datasets directly, configure your package.json to include exports for each module. By allowing direct imports, users include only the code they need.
{
  "name": "@datasets/datasets",
  "version": "1.0.0",
  "main": "index.js",
  "exports": {
    "./airportCodes": "./datasets/airportCodes.js",
    "./oilPrices": "./datasets/oilPrices.js",
    "./breastCancerData": "./datasets/breastCancerData.js",
    // ... other datasets
    ".": "./index.js"
  },
  // ... other fields
}
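The effect of the "exports" map is that each subpath resolves to exactly one file, so a user importing a single dataset never loads the rest. A toy resolver illustrates the lookup (this mirrors the map above; the function itself is illustrative, not how Node's resolver is implemented):

```javascript
// Illustrative sketch of how the "exports" map routes a subpath import
// (e.g. require('@datasets/datasets/airportCodes')) to a single file.
const exportsMap = {
  "./airportCodes": "./datasets/airportCodes.js",
  "./oilPrices": "./datasets/oilPrices.js",
  ".": "./index.js",
};

function resolveExport(subpath) {
  // Subpaths absent from the map are not importable at all,
  // which also keeps internal files like utils/BaseDataset.js private.
  return exportsMap[subpath] ?? null;
}

resolveExport("./airportCodes"); // "./datasets/airportCodes.js"
```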
Usage in user's code
Accessing a free dataset (or a free version of a dataset):
// Using ES Modules
import { AirportCodes } from '@datasets/datasets';

(async () => {
  const airportCodes = new AirportCodes();
  const airport = await airportCodes.getByName('Warsaw Chopin Airport');
  console.log(airport);
})();
Accessing a premium dataset (or a premium version of a dataset):
// Using ES Modules
import { PostalCodes } from '@datasets/datasets';

(async () => {
  const options = {
    apiKey: 'your-api-key', // Required for premium datasets
    cacheDuration: 60 * 60 * 1000, // Optional cache duration
  };
  const postalCodes = new PostalCodes(options);
  const postalCodeInfo = await postalCodes.getByPostalCode('10001');
  console.log(postalCodeInfo);
})();
Notes from Anu
Job stories based on high level understanding of what our client F needs and prospects such as D and Sh:
- When building internal products, I want to have reference data in CSV format so that I can easily import it into my system. [This is very high level and we already can provide it.]
- When building a product for country x, I want to have reference data with administrative divisions and associated postal codes so that I can use it for my own purposes (we don't know exactly how they use it).
- When building a product for country x, I want to have reference data with all essential information about the country so that I can use it for my product.
- Country codes in multiple standards, e.g., 2 letter code (ISO 3166-1) and 3 letter code;
- Country full name in English;
- Telephone code;
- Currency code;
- Domain code;
- … about 30 different values per country based on the geopolitical dataset we provide for them.
- Timezones data for the country;
- note some countries have multiple timezones;
- some countries observe daylight saving time only in certain regions within the country, etc.
- Holidays data for the country;
- some holidays are observed only in certain regions within a country;
- Postal codes data for the country;
- can be on different levels, e.g., take Canada, where we can have city level postal codes but also street level postal codes. Some users want to have aggregated data on city level.
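The Canada case above (street-level rows, city-level wishes) is essentially a group-by. A sketch of that aggregation, assuming a hypothetical row shape with city and postal_code fields (the actual dataset schema may differ):

```javascript
// Hypothetical sketch: roll street-level postal code rows up to city level.
// The { city, postal_code } row shape is an assumption for illustration.
function aggregateByCity(rows) {
  const byCity = new Map();
  for (const { city, postal_code } of rows) {
    if (!byCity.has(city)) byCity.set(city, new Set()); // dedupe per city
    byCity.get(city).add(postal_code);
  }
  return [...byCity].map(([city, codes]) => ({
    city,
    postal_codes: [...codes].sort(),
  }));
}
```

Whether we pre-aggregate in the sync workflow or expose both granularities as separate files in the bucket is an open design question.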
We want to provide both:
- Data API with SDKs in JS and Python for users who want to consume APIs and integrate into their system.
- On the other hand, I have clients who only require access to the entire bucket with raw data so that they can batch-export CSV files and load them into their system for internal purposes.
I asked ChatGPT what we are building based on these job stories:
In this scenario, you’re offering a dual-access geo-referenced data platform designed to meet diverse client needs. This setup provides both developer-focused, API-driven access with SDKs and direct access to raw data for clients requiring bulk downloads. Here’s a breakdown of how each option works within your platform:
1. Data API with SDKs in JavaScript and Python
- Developer-Friendly Integration: The API, paired with SDKs in JavaScript and Python, is ideal for developers who need flexible, programmatic access to reference data. It allows them to query and retrieve data in a structured way, integrating directly with applications or systems in real-time.
- Flexible and Efficient Data Retrieval: By using the API, clients can fetch only the data they need, when they need it, rather than handling full data sets. This approach reduces data overhead and provides more granular, just-in-time data access.
- SDK Benefits: The SDKs abstract API complexity, making it easier for developers to integrate, authenticate, and handle data. They also allow for error handling, caching, and potentially additional features (e.g., offline access or local caching) that further enhance the developer experience.
- Use Cases: Perfect for clients needing location-based data to power live applications, analytics, or for tasks requiring real-time data synchronization.
2. Bulk Data Access to the Entire Bucket
- Direct Access for Bulk Data Needs: For clients who need complete datasets on a regular basis, you provide access to the entire data storage (e.g., blob storage bucket) where they can export data in bulk.
- Regular Batch Exports: These clients can perform periodic downloads to keep their systems up-to-date without relying on API calls, making it ideal for environments where data latency is less critical but comprehensive data coverage is essential.
- Standardized File Formats: By offering data in CSV or similar formats, this option enables easy import into various systems for ETL (Extract, Transform, Load) workflows, internal data warehousing, or business intelligence tasks.
- Use Cases: Suitable for clients who want to maintain their own data infrastructure and update their databases periodically, avoiding dependencies on API availability or usage limits.
Summary: What You’re Building
This dual-access model creates a Geo-Referenced Data Platform with Flexible Access Options:
- The Data API with SDKs provides real-time, on-demand access and is ideal for application integration and live systems.
- Direct Bucket Access for Raw Data offers full, downloadable datasets for batch processing and bulk data management.
By catering to both interactive, developer-friendly use cases and bulk data needs, this platform provides flexibility that can suit a wide range of client workflows and technical capabilities.