stubbi • Oct 7, 2025
Aqora’s Quantum Datasets Hub for Reproducible Benchmarks & QML
Discover, publish, and load quantum datasets with immutable versions, public/private access, and pandas/polars support. Built for reproducible QML.

The Aqora Datasets Hub gives quantum researchers and teams a fast place to publish, discover, and reuse datasets built for reproducible benchmarks and QML experiments. Bring graph sets for MaxCut/QAOA, molecular Hamiltonians for VQE, circuit corpora in OpenQASM 3, and their classical companions. Load with pandas or polars, keep datasets private to your organization or share them publicly, and cite immutable versions so results line up across machines and time.
Featured datasets you can use today
| Dataset | Task | Size | License | Version |
|---|---|---|---|---|
| HamLib Binary Optimization | Hamiltonian Benchmarking, Binary Optimization | 1.29 GB | CC-BY-SA-4.0 | 1.0.0 |
| MNISQ | Quantum Machine Learning | 28.12 GB | CC-BY-SA-4.0 | 1.0.0 |
| Clinical Trial PBC | Pharmacy | 19.22 kB | CC-BY-SA-4.0 | 1.0.0 |
| Pharmacometric Events | Pharmacy | 22.21 kB | MIT | 1.0.0 |
Quick start
URI format: aqora://{publisher}/{name}/v{version}
Load a public Parquet split directly into pandas:
import pandas as pd
URI = "aqora://stubbi/pharmacometric-events/v1.0.0"
df = pd.read_parquet(URI)
print(df.head())
You can also load with polars and push down a column selection + row filter before data transfer:
import polars as pl
from aqora_cli.pyarrow import dataset
df = pl.scan_pyarrow_dataset(dataset("bernalde/hamlib-binary-optimization", "v1.0.0"))
small_max3sat = (
    df.filter(
        (pl.col("problem") == "max3sat") &
        (pl.col("n_qubits") < 100)
    )
    .select(["hamlib_id", "collection", "instance_name", "n_qubits", "n_terms", "operator_format"])
    .collect()
)
print(small_max3sat.head(10))
Why a dedicated hub for quantum-relevant data?
In classical ML, standardized dataset structures and hub-first distribution unlocked discoverability and reproducibility; we're bringing that workflow to quantum.
Quantum research leans on curated datasets: graph families for QAOA/MaxCut, molecular Hamiltonians for VQE, circuit corpora, and re-encoded image/time-series data for QML. But sources are scattered, metadata is inconsistent, and version lineage is often unclear. That makes benchmark results hard to recreate and slows down reproducible research.
The Aqora Datasets Hub formalizes the basics so you can focus on algorithms, not plumbing. Quantum-native artifacts like QASM circuits and Hamiltonians can be embedded directly into the dataset tables that researchers already work with in their code base.
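For example, if a dataset version stores circuits as OpenQASM 3 strings in a column (the column name "qasm" below is illustrative, not a fixed schema), those rows can be turned back into circuit objects right next to your classical features. A minimal sketch, assuming Qiskit with its OpenQASM 3 importer (qiskit-qasm3-import) is installed:
import pandas as pd
from qiskit import qasm3  # loads() requires the qiskit-qasm3-import extra

# Illustrative URI and column name; replace with a real dataset and its schema
URI = "aqora://{publisher}/{name}/v{version}"
df = pd.read_parquet(URI)

# Rebuild QuantumCircuit objects from the serialized OpenQASM 3 column
circuits = [qasm3.loads(src) for src in df["qasm"]]
print(circuits[0])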
What's in v1
- Immutable versions for every dataset
- Public and private publishing for teams and open science
- Clean metadata: license (SPDX), tags, version, and size
- Simple loading with ready-to-copy Python snippets
Reproducible benchmarks and experiments
Benchmarks and experiments only matter if others can rerun them. An immutable remote hub ensures everyone pulls the same bytes every time without local path drifts or silent file edits. Whether you're sweeping QAOA depths, comparing VQE ansätze, or training QML kernels, the recipe is simple:
- Pin a fixed dataset version: publisher/name@version.
- Seed Python/NumPy and your QC stack.
- Log the dataset version and your code commit in every run.
# Minimal, reproducible setup
import os, random, numpy as np, pandas as pd
SEED = 42
URI = "aqora://stubbi/pharmacometric-events/v1.0.0" # pin a fixed version
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED); np.random.seed(SEED)
# Optional: align Qiskit randomness
try:
    from qiskit.utils import algorithm_globals
    algorithm_globals.random_seed = SEED
except Exception:
    pass
# Deterministic shuffle + split
df = pd.read_parquet(URI).sample(frac=1.0, random_state=SEED).reset_index(drop=True)
n = len(df); n_train = int(0.8*n); n_val = int(0.1*n)
train, val, test = df[:n_train], df[n_train:n_train+n_val], df[n_train+n_val:]
print({"uri": URI, "seed": SEED, "n": n})
# → proceed with Qiskit / PennyLane / Cirq, etc.
Best practice for papers/notebooks
- Pin publisher/name@version in the script header.
- Log metrics with the dataset version and code commit hash (see the sketch below).
- Publish your eval script + config; anyone can reproduce by loading the same version.
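A minimal sketch of such a run record, assuming the script runs inside a git checkout (the file name run_record.json and the metric fields are illustrative):
import json, subprocess

URI = "aqora://stubbi/pharmacometric-events/v1.0.0"  # pinned dataset version
SEED = 42

# Code commit hash of the current checkout
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

record = {
    "dataset": URI,
    "seed": SEED,
    "commit": commit,
    "metrics": {},  # fill in with your evaluation results
}
with open("run_record.json", "w") as f:
    json.dump(record, f, indent=2)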
Publish your dataset
Turning a lab folder into a dataset page on Aqora gives it a durable, citable reference. With immutable versions and clean metadata, others can rerun and verify your results without emailing for files or chasing drive links. Keep it private while you iterate, then switch to public when the paper lands; your citation and URI stay stable. This way, you get clear attribution and the community benefits from reproducible research. Follow these steps to create and publish your dataset on Aqora:
1. Create the dataset at https://aqora.io/datasets/new
2. Add a short README (purpose, limits, columns), a license (SPDX), and tags.
3. Choose public or private visibility.
4. Prepare data in CSV/Parquet/JSON. If you upload a directory, Aqora automatically generates a single Parquet file from it.
5. Upload the data in one of two ways:
   - UI: drag and drop your files on the dataset page created in step 1.
   - CLI (example):
     pip install -U aqora-cli
     aqora login
     aqora datasets upload alice/graphzoo-3reg --version 1.0.0 data.csv
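Once the upload finishes, a quick sanity check is to load the published version back through the same URI scheme shown in the quick start (using the dataset name from the CLI example above):
import pandas as pd

# Load the freshly published, immutable version back to verify the upload
URI = "aqora://alice/graphzoo-3reg/v1.0.0"
df = pd.read_parquet(URI)
print(df.shape, list(df.columns))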
FAQ
Is this only for "quantum" data?
No. Most experiments pair classical datasets (graphs, molecules, images, time series) with quantum artifacts. The hub treats both as first-class.
Do I need a special format?
Yes, we currently support CSV, Parquet, and JSON. If you upload a directory, we'll generate a single Parquet file from it.
Can I keep a dataset private to my lab?
Yes. Upload as private, share with your team, and switch to public when ready without changing the version you cite.
Do you issue DOIs per dataset or version?
Not yet. In the meantime, cite the immutable publisher/name@version. DOI support is on our roadmap.
What formats and sizes are supported?
CSV, Parquet, and JSON for tables. If you upload a directory, Aqora automatically generates a single Parquet file from it. Large datasets are supported. Use the CLI for multi-file or multi-GB uploads.