# OASB Scanner Benchmark: First Ground-Truth F1 for Skill Scanners
We benchmarked HackMyAgent against a 4,245-sample labeled corpus spanning 9 attack categories and compared the results with the 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). NanoMind TME v0.5.0 achieves 89.2% F1 -- the first verified precision/recall score for any AI agent skill scanner.
## The Problem
Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).
The critical finding: of the 27,111 skills flagged by at least one scanner, only 33 (0.12%) were flagged by all scanners, while 71.8% were flagged by only a single scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.
None of these scanners reported precision, recall, or F1 because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"
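For concreteness, here is a minimal TypeScript sketch of what a ground-truth corpus makes computable; the `Counts` interface and function names are illustrative, not taken from the OASB codebase:

```typescript
// Minimal metrics sketch. Without labels, only flagRate is computable --
// which is all the scanners in the paper could report.
interface Counts {
  tp: number; // malicious samples correctly flagged
  fp: number; // benign samples incorrectly flagged
  fn: number; // malicious samples missed
  tn: number; // benign samples correctly passed
}

function metrics({ tp, fp, fn, tn }: Counts) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  const fpr = fp / (fp + tn); // false positive rate
  const flagRate = (tp + fp) / (tp + fp + fn + tn);
  return { precision, recall, f1, fpr, flagRate };
}
```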
## Results
| Scanner | F1 | Precision | Recall | FPR | Flag Rate |
|---|---|---|---|---|---|
| NanoMind TME v0.5.0 | 89.2% | 88.4% | 90.0% | 0.82% | 6.9% |
| HMA Full Pipeline | 81.3% | 68.5% | 100% | 3.20% | 10.3% |
| HMA Static (regex) | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% |
NanoMind TME v0.5.0 offers the best balance of precision and recall: 89.2% F1 with 90.0% recall and a false positive rate under 1%. The ONNX model was trained on a v8 corpus of 4,500 balanced samples drawn from 5+ real-world sources.
The full pipeline (AST compilation + 6 analyzers + NanoMind) achieves 100% recall -- zero missed malicious samples -- at the cost of more false positives (3.2% FPR). This mode is appropriate when missing an attack is unacceptable.
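As a rough sketch of why this mode behaves that way, consider an OR-composed pipeline in which any single analyzer can flag a sample; the `Analyzer` interface below is hypothetical, not HackMyAgent's actual API:

```typescript
// Hypothetical OR-composition: a sample is flagged if any analyzer
// flags it. Each added analyzer can only add flags, so recall rises
// monotonically while false positives accumulate -- the 100% recall /
// 3.2% FPR trade-off described above.
interface Analyzer {
  name: string;
  scan(source: string): boolean; // true = flag as malicious
}

function scanWithPipeline(source: string, analyzers: Analyzer[]): string[] {
  // Returns the names of all analyzers that flagged the sample.
  return analyzers.filter((a) => a.scan(source)).map((a) => a.name);
}
```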
Static patterns have near-perfect precision (99.3%) but miss roughly half of all attacks (51.1% recall). This confirms the field's intuition: regex-only scanning is conservative but incomplete.
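To make the trade-off concrete, here is a hedged sketch of regex-style static rules; these specific patterns are invented examples, not HMA's actual rule set:

```typescript
// Example static patterns (illustrative only). A match is almost always
// a real finding (high precision), but obfuscated or novel attacks
// simply never match (low recall).
const STATIC_PATTERNS: RegExp[] = [
  /curl\s+[^\n]*\|\s*(ba)?sh/,       // piping a download into a shell
  /child_process.*\bexec\(/s,        // dynamic command execution
  /\.ssh\/(id_rsa|authorized_keys)/, // credential file access
];

function staticScan(source: string): boolean {
  return STATIC_PATTERNS.some((p) => p.test(source));
}
```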
## DVAA Ground-Truth Validation
As independent validation, we ran the full pipeline against 70 DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.
Result: 61 out of 70 detected (87.1%). Four attack categories achieved 100% detection: heartbeat/RCE, persistence, social engineering, and unicode steganography. The 9 missed scenarios were predominantly configuration-only or binary files that the text-based compiler could not process.
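For reference, a sketch of how per-category detection rates like these can be tallied; the `ScenarioResult` shape is an assumption, not the actual output format of run-dvaa-benchmark.ts:

```typescript
// Assumed result shape -- not the real benchmark output format.
interface ScenarioResult {
  category: string; // e.g. "persistence", "unicode-steganography"
  detected: boolean;
}

// Returns the detection rate per attack category, e.g. { persistence: 1.0 }.
function detectionByCategory(results: ScenarioResult[]): Record<string, number> {
  const totals: Record<string, { hit: number; total: number }> = {};
  for (const r of results) {
    const c = (totals[r.category] ??= { hit: 0, total: 0 });
    c.total += 1;
    if (r.detected) c.hit += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([cat, c]) => [cat, c.hit / c.total]),
  );
}
```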
## Industry Comparison
HMA's flag rates (3.6% to 10.3%) are in the lower, more conservative range of the scanners evaluated in the paper. The key difference: HMA's numbers are backed by verified ground-truth metrics.
| Scanner | Flag Rate | Precision | Recall |
|---|---|---|---|
| HMA Static | 3.6% | 99.3% | 51.1% |
| Socket | 3.8% | -- | -- |
| NanoMind TME v0.5.0 | 6.9% | 88.4% | 90.0% |
| Snyk | 7.7% | -- | -- |
| HMA Full Pipeline | 10.3% | 68.5% | 100% |
| agent-trust-hub | 13.8% | -- | -- |
| Cisco Skill Scanner | 14-17% | -- | -- |
| GPT 5.3 LLM | 27-39% | -- | -- |
| VirusTotal | 36.2% | -- | -- |
| OpenClaw Scanner | 41.9% | -- | -- |
The paper's scanners were tested on 238K marketplace skills with no ground-truth labels; the HMA adapters were tested on the OASB v2 corpus (4,245 labeled samples). "--" = no ground-truth metrics available.
## Reproducibility
```bash
# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100
```
## References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark dataset and code: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
- HackMyAgent scanner: github.com/opena2a-org/hackmyagent
- Interactive leaderboard: oasb.ai/benchmark