About dataset version

Q-AARDVARK — Quantum-Inspired Self-Auditing Vulnerability Dataset

One-liner: A small, reproducible dataset and reference pipeline that augments classic code-vulnerability scoring with a quantum-inspired self-doubt signal derived from spectral randomized Max-Cut over token co-occurrence graphs — inspired by OpenAI’s AARDVARK security agent workflow.

1) Why this exists

Modern agents like AARDVARK (https://x.com/sama/status/1984002552158154905?s=48&t=oN9TNIKsaabOiTF25rJGrQ) analyze repositories, find vulnerabilities, and even open PRs. But agentic systems can be overconfident. Q-AARDVARK introduces a self-audit step: for each function, we compute a self-doubt score that reflects ambiguity in the “attack surface” structure of the code. High doubt ⇒ down-weight the vulnerability probability to reduce false positives.

Highlights
  • Dataset of C functions with labels (vulnerable / safe).
  • Baseline probability from a TF-IDF + Logistic Regression classifier.
  • Self-doubt score from a quantum-inspired Max-Cut heuristic (no quantum hardware required).
  • Adjusted probability = baseline × (1 − α × self_doubt).
  • Ready for Aqora and Hugging Face Spaces posts/demos.

2) What’s inside (artifacts)

  • devign_self_audit.jsonl: Main dataset; one JSON object per function with the fields described below.
  • predictions.json: Same content as the .jsonl, but as a single JSON array (easier for quick inspection).
  • predictions_with_notes.json: Adds patch_notes (heuristic hints for common C risks).
  • prob_shift.png: Scatter plot of baseline vs self-audited probabilities.
  • RESULTS.txt: Short metrics summary (baseline vs self-audited).
  • TOP_RISKY_SAMPLE.txt: A post-ready snippet of the top risky function, with notes and truncated code.
JSON record schema (per line in devign_self_audit.jsonl):
{
  "idx": 2,                       // index within the test split
  "true_label": 1,                // 1 = vulnerable, 0 = safe
  "baseline_prob": 0.819,         // P(vuln) from baseline classifier
  "adjusted_prob": 0.675,         // self-audited probability
  "self_doubt": 0.505,            // ambiguity score in [0, ~1]
  "code": "/* truncated C function up to ~400 chars */",
  "patch_notes": [
    "Use snprintf with explicit buffer sizes to avoid overflow.",
    "Avoid shell invocation; sanitize inputs or use execve/posix_spawn with arg arrays."
  ]
}

3) How the data was generated

3.1 Baseline classifier (classical)

  • Text features: TF-IDF over C tokens (regex: identifiers and alphanumerics).
  • Model: Logistic Regression.
  • Output: baseline_prob = P(vulnerable | features).
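
A minimal sketch of this baseline with scikit-learn (train_codes, test_codes, and train_labels are placeholder variables; the exact token regex and hyperparameters of the released pipeline may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholders: train_codes/test_codes are lists of C source strings,
# train_labels holds 0/1 labels (1 = vulnerable).
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")  # C identifiers
X_train = vectorizer.fit_transform(train_codes)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# baseline_prob = P(vulnerable | features) on the test split
baseline_probs = clf.predict_proba(vectorizer.transform(test_codes))[:, 1]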

3.2 Token graph construction

For each function:
  1. Tokenize to identifiers.
  2. Keep top-k frequent tokens (default k=40).
  3. Build an undirected weighted graph where an edge connects tokens that co-occur within a sliding window (default W=3) and edge weight counts co-occurrence frequency.
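
A sketch of this construction with NetworkX (defaults mirror the values above; the identifier regex is an assumption consistent with 3.1):

import re
from collections import Counter
import networkx as nx

def token_graph(code, k=40, window=3):
    """Undirected weighted co-occurrence graph over the top-k identifiers."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)    # 1. tokenize to identifiers
    keep = {t for t, _ in Counter(tokens).most_common(k)}   # 2. keep top-k frequent tokens
    tokens = [t for t in tokens if t in keep]
    G = nx.Graph()
    for i, t in enumerate(tokens):                          # 3. sliding window of size W
        for u in tokens[i + 1 : i + window]:
            if u == t:
                continue
            if G.has_edge(t, u):
                G[t][u]["weight"] += 1                      # weight counts co-occurrences
            else:
                G.add_edge(t, u, weight=1)
    return G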

3.3 Quantum-inspired self-doubt (spectral randomized Max-Cut)

  • Form the Laplacian L = D − W from the weighted adjacency W.
  • Take the Fiedler vector (the eigenvector associated with the 2nd-smallest eigenvalue of L).
  • Add small Gaussian noise and randomized rounding across many trials (default 64) to produce many candidate cuts.
  • For each cut, compute cut value (sum of weights across the partition).
  • Derive self_doubt from the distribution of cut values:
    • near-optimal fraction: proportion of trials within 5% of the best observed value.
    • variance of cut values.
    • Combine: self_doubt = 0.5 * near + 0.5 * (var / (1 + var))

Intuition: if many different cuts are nearly as good as the best cut, the graph has multiple plausible “threat partitions” → the agent should be less confident.
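
A compact sketch of this recipe in NumPy (defaults mirror the values above; the released pipeline's exact normalization may differ):

import numpy as np
import networkx as nx

def self_doubt(G, trials=64, noise=0.10, tol=0.05, seed=0):
    """Ambiguity score from the cut-value distribution of randomized spectral cuts."""
    nodes = list(G.nodes)
    if len(nodes) < 2:
        return 0.0                                  # degenerate graph: nothing to cut
    rng = np.random.default_rng(seed)
    W = nx.to_numpy_array(G, nodelist=nodes, weight="weight")
    L = np.diag(W.sum(axis=1)) - W                  # Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    fiedler = vecs[:, 1]                            # Fiedler vector
    cuts = []
    for _ in range(trials):
        noisy = fiedler + rng.normal(0.0, noise, size=len(nodes))
        side = noisy >= 0                           # randomized rounding into two sets
        cuts.append(W[np.ix_(side, ~side)].sum())   # cut value across the partition
    cuts = np.asarray(cuts)
    best = cuts.max()
    near = float((cuts >= (1 - tol) * best).mean()) # near-optimal fraction (within 5%)
    var = float(cuts.var())
    return 0.5 * near + 0.5 * (var / (1.0 + var))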

3.4 Self-audited probability

  • Choose penalty strength α ∈ [0,1] (default 0.35).
  • Compute: adjusted_prob = clip( baseline_prob * (1 − α * self_doubt ), 0, 1 )
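
Assuming baseline_probs and doubts are NumPy arrays produced by the previous steps, the adjustment is a one-liner:

import numpy as np

ALPHA = 0.35  # default penalty strength
adjusted_probs = np.clip(baseline_probs * (1 - ALPHA * doubts), 0.0, 1.0)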

4) Quick start (Python)

import json

import matplotlib.pyplot as plt
import pandas as pd

# Load the JSONL dataset
with open("devign_self_audit.jsonl") as f:
    rows = [json.loads(line) for line in f]
df = pd.DataFrame(rows)

# Inspect the top 5 functions by adjusted risk
print(df.sort_values("adjusted_prob", ascending=False).head(5)[
    ["idx", "true_label", "baseline_prob", "self_doubt", "adjusted_prob"]
])

# Plot baseline vs adjusted probabilities
plt.figure(figsize=(6, 4))
plt.scatter(df["baseline_prob"], df["adjusted_prob"], s=10)
plt.xlabel("Baseline probability")
plt.ylabel("Self-audited probability")
plt.title("Effect of Self-Doubt Re-ranking")
plt.grid(True)
plt.show()

# Filter high-doubt functions for manual review
review = df[(df["self_doubt"] >= 0.65) & (df["adjusted_prob"] >= 0.4)]
print(review[["idx", "self_doubt", "baseline_prob", "adjusted_prob"]].head(20))

5) Reproduce end-to-end

  1. Train the TF-IDF + Logistic Regression baseline on your code snippets (or use the synthetic C functions provided).
  2. For each function:
    • Build token graph (top-k=40, window=3).
    • Run spectral randomized Max-Cut with 64 trials, noise=0.10.
    • Compute self_doubt, adjusted_prob.
  3. Export devign_self_audit.jsonl and plot prob_shift.png.
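
Putting the steps together, a minimal export loop might look like this (it reuses the token_graph and self_doubt helpers sketched in Section 3; test_codes, test_labels, and baseline_probs are placeholders for your test split):

import json

ALPHA = 0.35
with open("devign_self_audit.jsonl", "w") as out:
    for idx, (code, label, p) in enumerate(zip(test_codes, test_labels, baseline_probs)):
        doubt = self_doubt(token_graph(code, k=40, window=3), trials=64, noise=0.10)
        adjusted = min(max(p * (1 - ALPHA * doubt), 0.0), 1.0)
        record = {
            "idx": idx,
            "true_label": int(label),
            "baseline_prob": round(float(p), 3),
            "adjusted_prob": round(float(adjusted), 3),
            "self_doubt": round(float(doubt), 3),
            "code": code[:400],  # truncated, as in the schema
        }
        out.write(json.dumps(record) + "\n")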

6) Suggested uses

  • Agent confidence calibration: use adjusted_prob to gate auto-PR creation or to route to human-in-the-loop review.
  • Triage: sort by high adjusted_prob but also high self_doubt to find risky and ambiguous code units.
  • Benchmarking: compare your own static/dynamic analyzers with/without self-audit.
  • Teaching: show students how probabilistic structure (many near-optimal cuts) implies epistemic uncertainty.
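
As an illustration of the first two uses, a simple routing rule over the JSON records could look like this (thresholds are arbitrary examples, not calibrated values):

def route(record, auto_pr_threshold=0.8, doubt_cap=0.5, review_floor=0.4):
    """Illustrative gate for agent actions based on the self-audited scores."""
    if record["adjusted_prob"] >= auto_pr_threshold and record["self_doubt"] < doubt_cap:
        return "auto_pr"       # confident and risky: candidate for automated PR
    if record["adjusted_prob"] >= review_floor:
        return "human_review"  # risky but ambiguous: route to a human
    return "no_action"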

7) Limitations & future work

  • Synthetic bias: the included sample uses pattern-based vulnerable/safe functions (e.g., strcpy, gets, sprintf, system, atoi). Real projects have subtler flows; you should run the same pipeline on your repos.
  • Graph proxy: co-occurrence graphs approximate control/data flow. Consider real CFG/DFG extraction (Clang/LLVM) to improve fidelity.
  • Heuristic notes: patch_notes are deterministic patterns; in production, replace with an LLM + sandbox validation.
  • Quantum proper: this version is quantum-inspired. If you want true quantum evaluation, plug in a QAOA Max-Cut instance via Qiskit or other frameworks and map cut-landscape statistics to self_doubt.

Roadmap
  • CFG/DFG-based graphs with taint analysis.
  • Multi-objective self-doubt (add path coverage / fuzzing uncertainty).
  • Repo-level aggregation (function → file → service).
  • Federated “immune-graph” across many codebases.

8) Acknowledgements

  • Inspired by discussions around AARDVARK (AI security agent workflows).
  • Thanks to the open-source Python ecosystem: scikit-learn, NetworkX, NumPy, Matplotlib.
  • Quantum inspiration from Max-Cut/QAOA literature and spectral relaxations.

Appendix — Minimal loader (CLI)

python - << 'PY'
# Preview the first three records of the dataset
path = "devign_self_audit.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(line.strip())
PY