stubbi • Oct 7, 2025

Aqora’s Quantum Datasets Hub for Reproducible Benchmarks & QML

Discover, publish, and load quantum datasets with immutable versions, public/private access, and pandas/polars support. Built for reproducible QML.

Screenshot of Aqora Datasets Hub listing quantum datasets: QAOA/MaxCut graphs, VQE Hamiltonians, and OpenQASM circuits with version, license, and size.


The Aqora Datasets Hub gives quantum researchers and teams a fast place to publish, discover, and reuse datasets built for reproducible benchmarks and QML experiments. Bring graph sets for MaxCut/QAOA, molecular Hamiltonians for VQE, circuit corpora in OpenQASM 3, and their classical companions. Load with pandas or polars, keep datasets private to your organization or share them publicly, and cite immutable versions so results line up across machines and time.
| Dataset | Task | Size | License | Version |
| --- | --- | --- | --- | --- |
| HamLib Binary Optimization | Hamiltonian Benchmarking, Binary Optimization | 1.29 GB | CC-BY-SA-4.0 | 1.0.0 |
| MNISQ | Quantum Machine Learning | 28.12 GB | CC-BY-SA-4.0 | 1.0.0 |
| Clinical Trial PBC | Pharmacy | 19.22 kB | CC-BY-SA-4.0 | 1.0.0 |
| Pharmacometric Events | Pharmacy | 22.21 kB | MIT | 1.0.0 |

Quick start

URI format: aqora://{publisher}/{name}/v{version}
Load a public Parquet split directly into pandas:
import pandas as pd
URI = "aqora://stubbi/pharmacometric-events/v1.0.0"
df = pd.read_parquet(URI)
print(df.head())
You can also load with polars and push down a column selection + row filter before data transfer:
import polars as pl
from aqora_cli.pyarrow import dataset
df = pl.scan_pyarrow_dataset(dataset("bernalde/hamlib-binary-optimization", "v1.0.0"))
small_max3sat = (
    df.filter(
        (pl.col("problem") == "max3sat") &
        (pl.col("n_qubits") < 100)
    )
    .select(["hamlib_id","collection","instance_name","n_qubits","n_terms","operator_format"])
    .collect()
)
print(small_max3sat.head(10))

Why a dedicated hub for quantum-relevant data?

In classical ML, standardized dataset structures and hub-first distribution unlocked discoverability and reproducibility; we're bringing that workflow to quantum. Quantum research leans on curated datasets: graph families for QAOA/MaxCut, molecular Hamiltonians for VQE, circuit corpora, and re-encoded images and time series for QML. But sources are scattered, metadata is inconsistent, and version lineage is often unclear, which makes benchmark results hard to recreate and slows reproducible science. The Aqora Datasets Hub formalizes these basics so you can focus on algorithms, not plumbing. Quantum-native artifacts such as QASM circuits and Hamiltonians can be embedded directly into the dataset tables researchers already work with in their code.
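For example, a circuit corpus can store OpenQASM 3 programs as plain string columns next to metadata. A minimal sketch with a hypothetical two-row table (the `circuit_id` and `qasm` columns are illustrative, not a real Aqora schema):

```python
import pandas as pd

# Hypothetical mini-table mirroring how a circuit corpus might embed
# OpenQASM 3 programs as strings alongside metadata columns.
rows = [
    {"circuit_id": "bell", "n_qubits": 2,
     "qasm": "OPENQASM 3.0;\nqubit[2] q;\nh q[0];\ncx q[0], q[1];"},
    {"circuit_id": "ghz3", "n_qubits": 3,
     "qasm": "OPENQASM 3.0;\nqubit[3] q;\nh q[0];\ncx q[0], q[1];\ncx q[1], q[2];"},
]
df = pd.DataFrame(rows)

# Filter on metadata, then hand the QASM string to your quantum stack,
# e.g. qiskit.qasm3.loads(src) (omitted here to stay dependency-light).
src = df.loc[df["circuit_id"] == "bell", "qasm"].item()
print(src.splitlines()[0])  # → OPENQASM 3.0;
```

The point is that circuits travel inside the same table as their metadata, so one filter expression selects both.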

What's in v1

  • Immutable versions for every dataset
  • Public and private publishing for teams and open science
  • Clean metadata: license (SPDX), tags, version, and size
  • Simple loading with ready-to-copy Python snippets

Reproducible benchmarks and experiments

Benchmarks and experiments only matter if others can rerun them. An immutable remote hub ensures everyone pulls the same bytes every time without local path drifts or silent file edits. Whether you're sweeping QAOA depths, comparing VQE ansätze, or training QML kernels, the recipe is simple:
  1. Pin a fixed dataset version: publisher/name@version.
  2. Seed Python/NumPy and your QC stack.
  3. Log the dataset version and your code commit in every run.
# Minimal, reproducible setup
import os, random, numpy as np, pandas as pd
SEED = 42
URI  = "aqora://stubbi/pharmacometric-events/v1.0.0"   # pin a fixed version
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED); np.random.seed(SEED)
# Optional: align Qiskit randomness (pre-1.0 Qiskit; no-op otherwise)
try:
    from qiskit.utils import algorithm_globals
    algorithm_globals.random_seed = SEED
except Exception:
    pass
# Deterministic shuffle + split
df = pd.read_parquet(URI).sample(frac=1.0, random_state=SEED).reset_index(drop=True)
n = len(df); n_train = int(0.8*n); n_val = int(0.1*n)
train, val, test = df[:n_train], df[n_train:n_train+n_val], df[n_train+n_val:]
print({"uri": URI, "seed": SEED, "n": n})
# → proceed with Qiskit / PennyLane / Cirq, etc.
Best practice for papers/notebooks
  • Pin publisher/name@version in the script header.
  • Log metrics with the dataset version and code commit hash.
  • Publish your eval script + config; anyone can reproduce by loading the same version.
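The logging bullet can be sketched as a small run-record helper; a minimal sketch, assuming your code lives in a git checkout (the metric names and values are illustrative):

```python
import json
import subprocess
import time

URI = "aqora://stubbi/pharmacometric-events/v1.0.0"  # pinned dataset version

def run_record(metrics, uri=URI):
    """Bundle metrics with the dataset version and code commit."""
    # Capture the current git commit so the run is tied to exact code.
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = "unknown"  # e.g. not running inside a git checkout
    return {
        "dataset": uri,
        "code_commit": commit,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metrics": metrics,
    }

record = run_record({"accuracy": 0.91})  # illustrative metric
print(json.dumps(record, indent=2))
```

Writing one such JSON record per run gives anyone enough to rerun the experiment: the same dataset version, the same commit, the reported numbers.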

Publish your dataset

Turning a lab folder into a dataset page on Aqora gives it a durable, citable reference. With immutable versions and clean metadata, others can rerun and verify your results without emailing for files or chasing drive links. Keep it private while you iterate, then switch to public when the paper lands; your citation and URI stay stable. You get clear attribution, and the community benefits from reproducible research. Follow these steps to create and publish your dataset on Aqora:
  1. Create the dataset at https://aqora.io/datasets/new
  2. Add a short README (purpose, limits, columns), a license (SPDX), and tags.
  3. Choose public or private visibility.
  4. Prepare data in CSV/Parquet/JSON. If you upload a directory, Aqora automatically generates a single Parquet file from it.
  5. Upload the data in one of two ways:
  • UI: Drag and drop your files on the dataset page created in step 1.
  • CLI (example):
pip install -U aqora-cli
aqora login
aqora datasets upload alice/graphzoo-3reg --version 1.0.0 data.csv

FAQ

Is this only for "quantum" data?
No. Most experiments pair classical datasets (graphs, molecules, images, time series) with quantum artifacts. The hub treats both as first-class.
Do I need a special format?
Yes. We currently support CSV, Parquet, and JSON. If you upload a directory, we'll generate a single Parquet file from it.
Can I keep a dataset private to my lab?
Yes. Upload as private, share with your team, and switch to public when ready without changing the version you cite.
Do you issue DOIs per dataset or version?
Not yet. In the meantime, cite the immutable publisher/name@version. DOI support is on our roadmap.
What formats and sizes are supported?
CSV, Parquet, and JSON for tables. If you upload a directory, Aqora automatically generates a single Parquet file from it. Large datasets are supported. Use the CLI for multi-file or multi-GB uploads.