```
███╗   ███╗  █████╗   ██████╗  ██╗
████╗ ████║ ██╔══██╗ ██╔═══██╗ ██║
██╔████╔██║ ███████║ ██║   ██║ ██║
██║╚██╔╝██║ ██╔══██║ ██║▄▄ ██║ ██║
██║ ╚═╝ ██║ ██║  ██║ ╚██████╔╝ ██║
╚═╝     ╚═╝ ╚═╝  ╚═╝  ╚══▀▀═╝  ╚═╝
```
Master AI for Market & Quantitative Investment
École Polytechnique
## What is this?
MaQI is the data infrastructure for the Master "AI for Market and Quantitative Investment" at École Polytechnique. This repo gives every contributor — professor, student, research collaborator — a single place to find the datasets, read the documentation, and start working in 5 minutes.
## What's inside
| What | Where | Description |
|---|---|---|
| Data catalog | docs/providers/ | 28 providers inventoried, classified, with contacts and licensing status |
| Compute landscape | docs/compute/ | 7 compute vendors surveyed (GCP / AWS / Azure / Nebius / OVHcloud / S3NS / FluidStack) + cost model (who charges egress, 2024-2027 market shift) |
| Wasabi S3 storage | docs/wasabi/ | 6 buckets, ~2 TiB live, bucket-by-bucket structure and sync history |
| Colab notebooks | notebooks/ | Open in Colab in one click, read any dataset in 30 seconds with polars |
| Architecture decisions | docs/adr/ | Why things are set up this way |
| CAL reference docs | docs/cal/ | Materialization of Charles-Albert's Google Docs (datasets pipeline + tech constraints) |
## Datasets on Wasabi S3
All data lives on Wasabi (eu-central-1, Amsterdam). Zero egress fees.
Each contributor gets their own credentials (access key + secret key).
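Since every contributor has a personal key pair, one way to keep keys out of notebooks and scripts is to read them from environment variables. A minimal sketch — the variable names `WASABI_ACCESS_KEY_ID` and `WASABI_SECRET_ACCESS_KEY` are assumptions, not a repo convention:

```python
import os


def wasabi_storage_options() -> dict:
    """Build fsspec/polars storage options for Wasabi eu-central-1.

    Assumes credentials are exported as WASABI_ACCESS_KEY_ID and
    WASABI_SECRET_ACCESS_KEY (hypothetical variable names).
    """
    return {
        "endpoint_url": "https://s3.eu-central-1.wasabisys.com",
        "aws_access_key_id": os.environ["WASABI_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["WASABI_SECRET_ACCESS_KEY"],
        "region": "eu-central-1",
    }
```

The returned dict can be passed directly as `storage_options` to polars readers.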
Canonical source for sizes: `docs/wasabi/state.md` (generated by `scripts/wasabi-state.sh`).
| Bucket | Source | What | Size |
|---|---|---|---|
| `maqi-gdelt` | GDELT | Global events 1979–2025, daily | 48 GiB |
| `maqi-ravenpack` | RavenPack Edge | News sentiment 2011–2025 | 249 GiB |
| `maqi-causalitylink` | CausalityLink | Causal event graph, snapshot Aug 2021 | 187 GiB |
| `maqi-databento` | Databento | NASDAQ ITCH L2 order book 2018–2025 | 1.43 TiB |
| `maqi-spglobal` | S&P Global | Compustat, Transcripts, ESG, Panjiva (streaming) | ~3.75 TiB |
| `maqi` | — | Test data (synthetic OHLCV) | 21 KiB |
Full details: docs/wasabi/ — state, structure, sync history, known anomalies.
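When regenerating the size snapshot, the per-bucket figures above can be sanity-checked with a small parser. A sketch — `parse_size` is a hypothetical helper, not part of `scripts/wasabi-state.sh`:

```python
def parse_size(text: str) -> int:
    """Convert a human-readable size like '1.43 TiB' to bytes (binary units)."""
    units = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
    value, unit = text.lstrip("~").split()
    return int(float(value) * units[unit])


# Sizes from the bucket table above:
sizes = ["48 GiB", "249 GiB", "187 GiB", "1.43 TiB", "~3.75 TiB", "21 KiB"]
total_bytes = sum(parse_size(s) for s in sizes)
```

Summing the table gives roughly 5.6 TiB, useful for capacity planning against the Wasabi quota.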
## Get started
### Option A — Google Colab (nothing to install)
Open any notebook directly in Colab. You just need your Wasabi credentials.
Note: this is a private repo. In Colab, tick "Include private repos" and authorize GitHub access. See `docs/colab-setup.md` for troubleshooting.
### Option B — Local setup
```sh
git clone git@github.com:eserie/MaQI.git
cd MaQI
./setup.sh               # creates .venv, installs polars + s3fs + deps
cp rclone.conf.example ~/.config/rclone/rclone.conf
# Edit: add your access_key_id + secret_access_key
./test-connection.sh     # smoke test against Wasabi
```
Detailed platform guides (macOS / Linux / Windows / WSL): `docs/setup.md`.
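The config you copy above uses rclone's S3 backend. A minimal `rclone.conf` for Wasabi typically looks like the fragment below (the remote name `wasabi` is an assumption; match whatever `rclone.conf.example` uses):

```ini
[wasabi]
type = s3
provider = Wasabi
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
region = eu-central-1
endpoint = s3.eu-central-1.wasabisys.com
```

With that in place, `rclone lsd wasabi:` should list the maqi-* buckets.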
## Quick data access (Python)
```python
import polars as pl

storage_options = {
    "endpoint_url": "https://s3.eu-central-1.wasabisys.com",
    "aws_access_key_id": "...",      # your Wasabi key
    "aws_secret_access_key": "...",  # your Wasabi secret
    "region": "eu-central-1",
}

# GDELT — global events
df = pl.read_csv("s3://maqi-gdelt/1979.zip", storage_options=storage_options)

# Databento — NASDAQ order book
import databento as db

data = db.DBNStore.from_file("local-copy.dbn.zst")
df = pl.from_pandas(data.to_df().reset_index())
```
More patterns in `notebooks/maqi-data-demo.ipynb`.
## Repo structure
```
docs/
  cal/                  CAL's Google Docs → markdown (resync: scripts/sync-cal-docs.sh)
  providers/            catalog.yaml (machine-readable) + README.md + per-provider fact sheets
  wasabi/               bucket state, sync history, data structure
  adr/                  architecture decision records (ADR-001 … ADR-002)
  colab-setup.md        how to use notebooks in Google Colab
  wasabi/anomalies.md   quality report (3 GDELT files, RavenPack 2020 missing)
notebooks/
  maqi-colab-setup.ipynb     first-run credentials + viz
  test-buckets-access.ipynb  verify all buckets
  maqi-data-demo.ipynb       full demo: load each dataset with polars
scripts/
  sync-cal-docs.sh      pull CAL's gdrive docs → docs/cal/
  wasabi-state.sh       snapshot bucket sizes
```
## Team
| Role | Person |
|---|---|
| Program lead | Charles-Albert Lehalle |
| Program coordinator | Vianney Perchet (ENSAE/CREST, Master MVA) |
| Infrastructure + scientific collaboration | Emmanuel Sérié |
| Data engineering | Wissal Efdaoui |
## Key links
- Provider catalog — who we buy from, who we negotiate with
- Compute landscape — where notebooks / GPUs could run
- Wasabi state — what's on S3 right now
- Data quality report — known gaps
- ADR index — architecture decisions
- GitHub Issues — anomaly + task tracker
- Colab setup — notebook access guide
## Rules
- No raw data in this repo. Data lives on Wasabi. Repo stores schemas, summaries, notebooks.
- No credentials. `rclone.conf`, `.env`, and API keys are gitignored. Credentials are distributed out-of-band.
- No MFA on Wasabi. Multi-factor auth blocks S3 API access from Colab. Use strong passwords + key rotation.
- Non-redistribution. Paid feeds (S&P Global, Databento, RavenPack) are under academic license. Never extract raw content into the repo or share outside the program.