HamLib: Binary Optimization Benchmark Suite

This subset of HamLib aggregates Hamiltonians encoding binary optimization problems (e.g., Max-3SAT / SATLIB families and related generators). It targets benchmarking of quantum (QAOA, VQE-style) and classical solvers, enabling reproducible comparisons across structured and random instances at a range of sizes. Typical uses:
  • Compare solver quality and runtime across instance families and sizes
  • Study scaling and instance hardness
  • Extract features (graph/term statistics) for meta-learning or portfolio schedulers

Quick start

Polars (streaming from Aqora)

import polars as pl
from aqora_cli.pyarrow import dataset
df = pl.scan_pyarrow_dataset(
    dataset("bernalde/hamlib-binary-optimization", "v1.0.0")
).collect()

Pandas (Aqora URI)

import pandas as pd
df = pd.read_parquet("aqora://bernalde/hamlib-binary-optimization/v1.0.0")

Replace the version with the one you intend to use.

Schema

One row = one Hamiltonian instance.
Column           Type     Description
hamlib_id        string   Stable 16-char ID derived from domain/problem/collection/file/entry.
domain           string   Dataset domain; for this subset typically binaryoptimization.
problem          string   Problem family (e.g., max3sat).
collection       string   Source bucket (e.g., satlib, random_instances).
instance_name    string   Original archive-derived instance label.
operator_format  string   Encoding of the Hamiltonian payload (e.g., qiskit_sparse_pauli, raw_text, graph, clauses).
payload          struct   Nested representation of the instance; see Payload.
n_qubits         int32    Number of qubits/variables inferred from the operator/attrs.
n_terms          int32    Number of operator terms (linear + pairwise) when available.
one_norm         float64  Sum of absolute coefficients when available.
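
The n_terms and one_norm columns are derived from the payload where possible, so for rows in sparse-Pauli form they can be cross-checked directly. A minimal sketch, assuming the Pandas quick start above and a row that populates coeffs:

import pandas as pd
df = pd.read_parquet("aqora://bernalde/hamlib-binary-optimization/v1.0.0")
row = df[df["operator_format"] == "qiskit_sparse_pauli"].iloc[0]
coeffs = row["payload"]["coeffs"]
# Cross-check the derived columns against the payload (fields may be null for other formats)
print("n_terms column:", row["n_terms"], "| payload terms:", len(coeffs))
print("one_norm column:", row["one_norm"], "| sum |coeff|:", sum(abs(c) for c in coeffs))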

Payload

payload is a nested Arrow/Parquet struct with optional subfields. Only the relevant subfields are populated per row based on operator_format.
payload: {
  paulis:       list<string>              # e.g., ["IZX...","ZII..."] for sparse-Pauli form
  coeffs:       list<float64>             # coefficient per pauli label
  terms:        list<struct{
                   term:  list<struct{qubit:int32, pauli:string}>
                   coeff: float64
                 }>                        # explicit operator terms (alternative representation)
  text:         string                     # raw text operator when structured parse isn't available
  graph_edges:  list<struct{u:int32, v:int32, w:float64}>   # optional graph context
  clauses:      list<list<int32>>          # optional CNF-like clauses for SAT/MaxSAT contexts
}
Mapping by format:
  • qiskit_sparse_pauli → paulis, coeffs
  • raw_text → text
  • graph → graph_edges
  • clauses → clauses
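
For rows in qiskit_sparse_pauli form, the paulis and coeffs lists can be handed back to Qiskit directly. A minimal sketch, assuming Qiskit is installed and that the label strings follow the convention of the original export:

from qiskit.quantum_info import SparsePauliOp

def to_sparse_pauli_op(payload):
    # Rebuild a Qiskit operator from the paulis/coeffs subfields of a row's payload
    return SparsePauliOp(list(payload["paulis"]), coeffs=list(payload["coeffs"]))

# With `row` as in the Pandas example below:
# op = to_sparse_pauli_op(row["payload"])
# print(op.num_qubits, op.size)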

Examples

Filter and inspect (Polars)

import polars as pl
from aqora_cli.pyarrow import dataset
df = pl.scan_pyarrow_dataset(dataset("bernalde/hamlib-binary-optimization", "v1.0.0"))
small_max3sat = (
    df.filter(
        (pl.col("problem") == "max3sat") &
        (pl.col("n_qubits") < 100)
    )
    .select(["hamlib_id","collection","instance_name","n_qubits","n_terms","operator_format"])
    .collect()
)
print(small_max3sat.head(10))

Work with sparse-Pauli form (Pandas)

import pandas as pd
df = pd.read_parquet("aqora://bernalde/hamlib-binary-optimization/v1.0.0")
row = df[df["operator_format"]=="qiskit_sparse_pauli"].iloc[0]
paulis = row["payload"]["paulis"]
coeffs = row["payload"]["coeffs"]
print("n_terms:", len(paulis), "sample:", list(zip(paulis[:3], coeffs[:3])))

Raw-text fallback

row = df[df["operator_format"]=="raw_text"].iloc[0]
print(row["payload"]["text"][:400])

Graph / clauses context (if present)

g = df[df["operator_format"]=="graph"].iloc[0]["payload"]["graph_edges"]
print("first 5 edges:", g[:5])
c = df[df["operator_format"]=="clauses"].iloc[0]["payload"]["clauses"]
print("first clause:", c[0])

Notes and best practices

  • Use operator_format to choose the correct parsing path for each row.
  • Prefer streaming (scan_pyarrow_dataset) to filter and project before materializing.
  • For fair comparisons, stratify by problem, collection, and size (n_qubits, n_terms); a grouping sketch follows below. Report one_norm when normalizing objectives.
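
One way to stratify before benchmarking is to bucket instances by size within each problem/collection pair. A minimal Polars sketch (the 50-qubit bucket width is an arbitrary choice):

import polars as pl
from aqora_cli.pyarrow import dataset
lf = pl.scan_pyarrow_dataset(dataset("bernalde/hamlib-binary-optimization", "v1.0.0"))
strata = (
    lf.with_columns((pl.col("n_qubits") // 50 * 50).alias("qubit_bin"))  # 50-qubit size buckets
      .group_by(["problem", "collection", "qubit_bin"])
      .agg(pl.len().alias("n_instances"))
      .sort(["problem", "collection", "qubit_bin"])
      .collect()
)
print(strata)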

How to cite

Please include both citations: the original paper and the Aqora dataset entry you actually used (with version).

1) Original HamLib paper

@article{Sawaya_2024,
  title     = {HamLib: A library of Hamiltonians for benchmarking quantum algorithms and hardware},
  volume    = {8},
  ISSN      = {2521-327X},
  url       = {http://dx.doi.org/10.22331/q-2024-12-11-1559},
  DOI       = {10.22331/q-2024-12-11-1559},
  journal   = {Quantum},
  publisher = {Verein zur F{\"o}rderung des Open Access Publizierens in den Quantenwissenschaften},
  author    = {Sawaya, Nicolas PD and Marti-Dafcik, Daniel and Ho, Yang and Tabor, Daniel P and Neira, David E Bernal and Magann, Alicia B and Premaratne, Shavindra and Dubey, Pradeep and Matsuura, Anne and Bishop, Nathan and Jong, Wibe A de and Benjamin, Simon and Parekh, Ojas and Tubman, Norm and Klymko, Katherine and Camps, Daan},
  year      = {2024},
  month     = dec,
  pages     = {1559}
}

2) Aqora dataset entry (pin your version)

@misc{aqora_hamlib_binaryoptimization,
  title        = {HamLib — Binary Optimization Benchmark Suite (Aqora mirror)},
  author       = {Sawaya, Nicolas PD and Marti-Dafcik, Daniel and Ho, Yang and Tabor, Daniel P and Neira, David E Bernal and Magann, Alicia B and Premaratne, Shavindra and Dubey, Pradeep and Matsuura, Anne and Bishop, Nathan and Jong, Wibe A de and Benjamin, Simon and Parekh, Ojas and Tubman, Norm and Klymko, Katherine and Camps, Daan},
  howpublished = {\url{https://aqora.io/datasets/bernalde/hamlib-binary-optimization}},
  note         = {Aqora Datasets Hub. Please cite the pinned version you used, e.g., v1.1.1},
  year         = {2025},
  publisher    = {Aqora}
}