One-liner: A small, reproducible dataset and reference pipeline that augments classic code-vulnerability scoring with a quantum-inspired self-doubt signal derived from spectral randomized Max-Cut over token co-occurrence graphs — inspired by OpenAI’s AARDVARK security agent workflow.
1) Why this exists
Modern agents like AARDVARK (https://x.com/sama/status/1984002552158154905?s=48&t=oN9TNIKsaabOiTF25rJGrQ) analyze repositories, find vulnerabilities, and even open PRs. But agentic systems can be overconfident.
Q-AARDVARK introduces a self-audit step: for each function we compute a self-doubt score that reflects ambiguity in the “attack surface” structure of the code. High doubt ⇒ down-weight the vulnerability probability to reduce false positives.
Highlights
- Dataset of C functions with labels (vulnerable / safe).
- Baseline probability from a TF-IDF + Logistic Regression classifier.
- Self-doubt score from a quantum-inspired Max-Cut heuristic (no quantum hardware required).
- Adjusted probability = baseline × (1 − α × self_doubt).
- Ready for Aqora and Hugging Face Spaces posts/demos.
2) What’s inside (artifacts)
| File | Description |
|---|---|
| devign_self_audit.jsonl | Main dataset: one JSON object per function with fields described below. |
| predictions.json | Same content as the .jsonl but in a single JSON array (easier for quick inspection). |
| predictions_with_notes.json | Adds patch_notes (heuristic hints for common C risks). |
| prob_shift.png | Scatter plot: baseline vs self-audited probabilities. |
| RESULTS.txt | Short metrics summary (baseline vs self-audited). |
| TOP_RISKY_SAMPLE.txt | A post-ready snippet of the top risky function with notes and truncated code. |

JSON record schema (per line in devign_self_audit.jsonl):
```json
{
  "idx": 2,                       // index within the test split
  "true_label": 1,                // 1 = vulnerable, 0 = safe
  "baseline_prob": 0.819,         // P(vuln) from baseline classifier
  "adjusted_prob": 0.675,         // self-audited probability
  "self_doubt": 0.505,            // ambiguity score in [0, ~1]
  "code": "/* truncated C function up to ~400 chars */",
  "patch_notes": [
    "Use snprintf with explicit buffer sizes to avoid overflow.",
    "Avoid shell invocation; sanitize inputs or use execve/posix_spawn with arg arrays."
  ]
}
```
3) How the data was generated
3.1 Baseline classifier (classical)
- Text features: TF-IDF over C tokens (regex: identifiers and alphanumerics).
- Model: Logistic Regression.
- Output: baseline_prob = P(vulnerable | features).
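The baseline can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the exact training script: the token regex, the toy functions, and all hyperparameters here are assumptions.

```python
# Minimal sketch of the classical baseline: TF-IDF over C identifiers,
# then Logistic Regression. Toy data and hyperparameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

functions = [
    'void f(char *s) { char buf[8]; strcpy(buf, s); }',       # vulnerable
    'void g(char *s) { snprintf(s, 8, "%s", "ok"); }',        # safe
    'int h(char *s) { char b[8]; strcpy(b, s); return 0; }',  # vulnerable
    'int k(void) { return 42; }',                             # safe
]
labels = [1, 0, 1, 0]

# Token pattern: identifiers and alphanumerics, per the description above.
vec = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")
X = vec.fit_transform(functions)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
baseline_prob = clf.predict_proba(X)[:, 1]  # P(vulnerable | features)
```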
3.2 Token graph construction
For each function:
- Tokenize to identifiers.
- Keep top-k frequent tokens (default k=40).
- Build an undirected weighted graph where an edge connects tokens that co-occur within a sliding window (default W=3) and edge weight counts co-occurrence frequency.
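A sketch of that construction, using the defaults stated above (k=40, W=3); the function name and the exact tokenizer regex are illustrative assumptions:

```python
# Build an undirected, weighted token co-occurrence graph for one function.
import re
from collections import Counter
import networkx as nx

def token_graph(code: str, k: int = 40, window: int = 3) -> nx.Graph:
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)  # identifiers only
    top = {t for t, _ in Counter(tokens).most_common(k)}  # keep top-k tokens
    kept = [t for t in tokens if t in top]
    G = nx.Graph()
    G.add_nodes_from(top)
    # Connect tokens that co-occur within the sliding window; the edge
    # weight accumulates the co-occurrence count.
    for i, u in enumerate(kept):
        for v in kept[i + 1 : i + window]:
            if u != v:
                prev = G.get_edge_data(u, v, default={"weight": 0})["weight"]
                G.add_edge(u, v, weight=prev + 1)
    return G

G = token_graph("char buf[8]; strcpy(buf, src); strcpy(buf, src);")
```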
3.3 Quantum-inspired self-doubt (spectral randomized Max-Cut)
- Form the Laplacian L = D − W from the weighted adjacency W.
- Take the Fiedler vector (2nd smallest eigenvector of L).
- Add small Gaussian noise and randomized rounding across many trials (default 64) to produce many candidate cuts.
- For each cut, compute cut value (sum of weights across the partition).
- Derive self_doubt from the distribution of cut values:
  - near-optimal fraction: proportion of trials within 5% of the best observed value.
  - variance of cut values.
- Combine:

self_doubt = 0.5 * near + 0.5 * (var / (1 + var))

Intuition: if many different cuts are nearly as good as the best cut, the graph has multiple plausible “threat partitions” → the agent should be less confident.
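The steps above can be sketched as follows, using the stated defaults (64 trials, noise 0.10, 5% near-optimal band). Rounding by the sign of the perturbed Fiedler entries is one simple, assumed choice of randomized rounding:

```python
# Quantum-inspired self-doubt via spectral randomized Max-Cut.
import numpy as np
import networkx as nx

def self_doubt(G: nx.Graph, trials: int = 64, noise: float = 0.10,
               tol: float = 0.05, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    W = nx.to_numpy_array(G, nodelist=nodes, weight="weight")
    L = np.diag(W.sum(axis=1)) - W             # Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                       # 2nd-smallest eigenvector
    cuts = []
    for _ in range(trials):
        noisy = fiedler + rng.normal(0.0, noise, size=len(nodes))
        side = noisy >= 0                      # round the noisy vector to a cut
        cuts.append(W[np.ix_(side, ~side)].sum())  # weight across the partition
    cuts = np.asarray(cuts)
    near = float(np.mean(cuts >= (1 - tol) * cuts.max()))  # near-optimal fraction
    var = float(cuts.var())
    return 0.5 * near + 0.5 * (var / (1 + var))
```

The score stays in [0, 1): `near` is a fraction and `var / (1 + var)` saturates below 1.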
 
3.4 Self-audited probability
- Choose penalty strength α ∈ [0, 1] (default 0.35).
- Compute:

adjusted_prob = clip(baseline_prob * (1 − α * self_doubt), 0, 1)
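As a one-function sketch (the 0.819 / 0.505 values come from the schema record above; the small difference from its 0.675 is rounding):

```python
# Re-weight the baseline probability by the self-doubt penalty.
import numpy as np

def adjust(baseline_prob: float, doubt: float, alpha: float = 0.35) -> float:
    return float(np.clip(baseline_prob * (1 - alpha * doubt), 0.0, 1.0))

adjust(0.819, 0.505)  # ≈ 0.674
```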
4) Quick start (Python)
```python
import json
import pandas as pd

# Load JSONL
rows = [json.loads(line) for line in open("devign_self_audit.jsonl")]
df = pd.DataFrame(rows)

# Inspect top 5 by adjusted risk
df.sort_values("adjusted_prob", ascending=False).head(5)[
    ["idx", "true_label", "baseline_prob", "self_doubt", "adjusted_prob"]
]
```
Plot baseline vs adjusted:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.scatter(df["baseline_prob"], df["adjusted_prob"], s=10)
plt.xlabel("Baseline probability")
plt.ylabel("Self-audited probability")
plt.title("Effect of Self-Doubt Re-ranking")
plt.grid(True)
plt.show()
```
Filter high-doubt functions (for manual review):

```python
review = df[(df["self_doubt"] >= 0.65) & (df["adjusted_prob"] >= 0.4)]
review[["idx", "self_doubt", "baseline_prob", "adjusted_prob"]].head(20)
```
5) Reproduce end-to-end
- Train the TF-IDF + Logistic Regression baseline on your code snippets (or use the synthetic C functions provided).
- For each function:
  - Build the token graph (top-k=40, window=3).
  - Run spectral randomized Max-Cut with 64 trials, noise=0.10.
  - Compute self_doubt and adjusted_prob.
- Export devign_self_audit.jsonl and plot prob_shift.png.
6) Suggested uses
- Agent confidence calibration: use adjusted_prob to gate auto-PR creation or to route to human-in-the-loop review.
- Triage: sort by high adjusted_prob but also high self_doubt to find risky and ambiguous code units.
- Benchmarking: compare your own static/dynamic analyzers with/without self-audit.
- Teaching: show students how probabilistic structure (many near-optimal cuts) implies epistemic uncertainty.
7) Limitations & future work
- Synthetic bias: the included sample uses pattern-based vulnerable/safe functions (e.g., strcpy, gets, sprintf, system, atoi). Real projects have subtler flows; you should run the same pipeline on your own repos.
- Graph proxy: co-occurrence graphs approximate control/data flow. Consider real CFG/DFG extraction (Clang/LLVM) to improve fidelity.
- Heuristic notes: patch_notes are deterministic patterns; in production, replace them with an LLM + sandbox validation.
- True quantum evaluation: this version is quantum-inspired only. For an actual quantum baseline, plug in a QAOA Max-Cut instance via Qiskit or another framework and map the cut-landscape statistics to self_doubt.
Roadmap
- CFG/DFG-based graphs with taint analysis.
- Multi-objective self-doubt (add path coverage / fuzzing uncertainty).
- Repo-level aggregation (function → file → service).
- Federated “immune-graph” across many codebases.
8) Acknowledgements
- Inspired by discussions around AARDVARK (AI security agent workflows).
- Thanks to the open-source Python ecosystem: scikit-learn, NetworkX, NumPy, Matplotlib.
- Quantum inspiration from Max-Cut/QAOA literature and spectral relaxations.
Appendix — Minimal loader (CLI)
```shell
python - << 'PY'
import json

path = "devign_self_audit.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        rec = json.loads(line)
        print(rec["idx"], rec["baseline_prob"], rec["adjusted_prob"])
PY
```