                                                                          
    ██╗     ██╗            ████████╗  ██╗
    ███╗   ███║            ██╔═══██║  ██║
    ████╗ ████║  ████████╗ ██║   ██║  ██║
    ██╔████╔██║  ██╔═══██║ ██║   ██║  ██║
    ██║╚██╔╝██║  ██║   ██║ ████████║  ██║
    ██║ ╚═╝ ██║  ████████║ ╚══════██║  ██║
    ╚═╝     ╚═╝  ╚═══════╝        ╚═╝  ╚═╝
                                                                          
    Master AI for Market & Quantitative Investment
    École Polytechnique

What is this?

MaQI is the data infrastructure for the Master "AI for Market and Quantitative Investment" at École Polytechnique. This repo gives every contributor — professor, student, research collaborator — a single place to find the datasets, read the documentation, and start working in 5 minutes.

What's inside:

| What | Where | Description |
|---|---|---|
| Data catalog | docs/providers/ | 28 providers inventoried, classified, with contacts and licensing status |
| Compute landscape | docs/compute/ | 7 compute vendors surveyed (GCP / AWS / Azure / Nebius / OVHcloud / S3NS / FluidStack) + cost model (who charges egress, 2024-2027 market shift) |
| Wasabi S3 storage | docs/wasabi/ | 6 buckets, ~2 TiB live, bucket-by-bucket structure and sync history |
| Colab notebooks | notebooks/ | Open in Colab in one click, read any dataset in 30 seconds with polars |
| Architecture decisions | docs/adr/ | Why things are set up this way |
| CAL reference docs | docs/cal/ | Materialization of Charles-Albert's Google Docs (datasets pipeline + tech constraints) |

Datasets on Wasabi S3

All data lives on Wasabi (eu-central-1, Amsterdam). Zero egress fees. Each contributor gets their own credentials (access key + secret key).

Canonical source for sizes: docs/wasabi/state.md (generated by scripts/wasabi-state.sh).

| Bucket | Source | What | Size |
|---|---|---|---|
| maqi-gdelt | GDELT | Global events 1979–2025, daily | 48 GiB |
| maqi-ravenpack | RavenPack Edge | News sentiment 2011–2025 | 249 GiB |
| maqi-causalitylink | CausalityLink | Causal event graph, snapshot Aug 2021 | 187 GiB |
| maqi-databento | Databento | NASDAQ ITCH L2 order book 2018–2025 | 1.43 TiB |
| maqi-spglobal | S&P Global | Compustat, Transcripts, ESG, Panjiva (streaming) | ~3.75 TiB |
| maqi | | Test data (synthetic OHLCV) | 21 KiB |

Full details: docs/wasabi/ — state, structure, sync history, known anomalies.
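
As a quick sanity check on the "~2 TiB live" figure quoted earlier, the sizes in the table can be summed directly. This is only a sketch: it assumes the live total excludes maqi-spglobal (marked as streaming) and ignores the 21 KiB test bucket; neither assumption is stated in the docs. The authoritative numbers remain those in docs/wasabi/state.md.

```python
GIB_PER_TIB = 1024

# Sizes copied from the bucket table, expressed in GiB.
# Assumption: maqi-spglobal is excluded because it is still streaming.
live_buckets_gib = {
    "maqi-gdelt": 48,
    "maqi-ravenpack": 249,
    "maqi-causalitylink": 187,
    "maqi-databento": 1.43 * GIB_PER_TIB,
}

total_tib = sum(live_buckets_gib.values()) / GIB_PER_TIB
print(f"{total_tib:.2f} TiB")  # prints "1.90 TiB"
```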

Get started

Option A — Google Colab (nothing to install)

Open any notebook directly in Colab. You just need your Wasabi credentials.

| Notebook | Description |
|---|---|
| maqi-colab-setup.ipynb | First-run setup + smoke test |
| test-buckets-access.ipynb | Verify access to all 6 buckets |
| maqi-data-demo.ipynb | Full demo: read a sample from every dataset with polars |

Note: this is a private repo. In Colab, tick "Include private repos" and authorize GitHub access. See docs/colab-setup.md for troubleshooting.

Option B — Local setup

git clone git@github.com:eserie/MaQI.git
cd MaQI
./setup.sh                          # creates .venv, installs polars + s3fs + deps

mkdir -p ~/.config/rclone           # make sure the rclone config directory exists
cp rclone.conf.example ~/.config/rclone/rclone.conf
# Edit ~/.config/rclone/rclone.conf: add your access_key_id + secret_access_key

./test-connection.sh                # smoke test against Wasabi

Detailed platform guides (macOS / Linux / Windows / WSL): docs/setup.md.
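
Once rclone.conf holds your credentials, the same file can feed Python so the keys are not pasted into notebooks. The sketch below parses the INI-style rclone.conf with the standard library and returns a dictionary using the same key names as the storage_options example in this README. The helper name and the remote name "wasabi" are hypothetical (check the section name in your own rclone.conf); the endpoint and region defaults are taken from this README.

```python
import configparser


def storage_options_from_rclone(conf_text: str, remote: str = "wasabi") -> dict:
    """Build a storage_options-style dict from the text of an rclone.conf file.

    `remote` is the [section] name in rclone.conf; "wasabi" is an assumption.
    """
    # interpolation=None keeps '%' characters in secrets from being interpreted
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = parser[remote]  # raises KeyError if the remote is missing

    endpoint = section.get("endpoint", fallback="s3.eu-central-1.wasabisys.com")
    if not endpoint.startswith("http"):
        endpoint = f"https://{endpoint}"

    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": section["access_key_id"],
        "aws_secret_access_key": section["secret_access_key"],
        "region": section.get("region", fallback="eu-central-1"),
    }


# Typical use (reads your real config, so it is left commented out here):
# from pathlib import Path
# conf = (Path.home() / ".config" / "rclone" / "rclone.conf").read_text()
# storage_options = storage_options_from_rclone(conf)
```

Avoid printing the returned dictionary in shared notebooks, since it contains the secret key.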

Quick data access (Python)

import databento as db
import polars as pl

# Wasabi connection settings, passed to polars as storage_options
storage_options = {
    "endpoint_url": "https://s3.eu-central-1.wasabisys.com",
    "aws_access_key_id": "...",        # your Wasabi key
    "aws_secret_access_key": "...",    # your Wasabi secret
    "region": "eu-central-1",
}

# GDELT: global events
gdelt_df = pl.read_csv("s3://maqi-gdelt/1979.zip", storage_options=storage_options)

# Databento: NASDAQ order book
# DBNStore.from_file reads a local file, so copy the object from maqi-databento first
data = db.DBNStore.from_file("local-copy.dbn.zst")
itch_df = pl.from_pandas(data.to_df().reset_index())

More patterns in notebooks/maqi-data-demo.ipynb.
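
If reading a GDELT .zip archive directly does not work in your polars version (its documented transparent decompression covers gzip, zlib and zstd, so a .zip archive may need unpacking first), one workaround is to extract the archive member in memory and hand the bytes to polars. The sketch below assumes each archive holds a single tab-separated file with no header row, which matches the public GDELT event files but has not been checked against this bucket. Both helper names are hypothetical.

```python
import io
import zipfile


def first_member_bytes(archive: bytes) -> bytes:
    """Return the raw bytes of the first file stored in a .zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(zf.namelist()[0])


def read_gdelt_zip(archive: bytes):
    """Parse the first archive member as a tab-separated, header-less table."""
    import polars as pl  # imported lazily so the helper above stays stdlib-only

    return pl.read_csv(first_member_bytes(archive), separator="\t", has_header=False)


# Hypothetical usage with s3fs (installed by ./setup.sh); not verified against the bucket:
# import s3fs
# fs = s3fs.S3FileSystem(
#     key="...", secret="...",
#     client_kwargs={"endpoint_url": "https://s3.eu-central-1.wasabisys.com"},
# )
# gdelt_df = read_gdelt_zip(fs.cat_file("maqi-gdelt/1979.zip"))
```

Without a header row polars names the columns column_1, column_2, and so on; rename them from the GDELT codebook as needed.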

Repo structure

docs/
  cal/               CAL's Google Docs → markdown (resync: scripts/sync-cal-docs.sh)
  providers/         catalog.yaml (machine-readable) + README.md + per-provider fiches
  wasabi/            bucket state, sync history, data structure
  adr/               architecture decision records (ADR-001 … ADR-002)
  colab-setup.md     how to use notebooks in Google Colab
  wasabi/anomalies.md quality report (GDELT 3 files, RavenPack 2020 missing)
notebooks/
  maqi-colab-setup.ipynb       first-run credentials + viz
  test-buckets-access.ipynb    verify all buckets
  maqi-data-demo.ipynb         full demo: load a sample from each dataset with polars
scripts/
  sync-cal-docs.sh             pull CAL's gdrive docs → docs/cal/
  wasabi-state.sh              snapshot bucket sizes

Team

| Role | Person |
|---|---|
| Program lead | Charles-Albert Lehalle |
| Program coordinator | Vianney Perchet (ENSAE/CREST, Master MVA) |
| Infrastructure + scientific collaboration | Emmanuel Sérié |
| Data engineering | Wissal Efdaoui |

Key links

Rules