OASB Scanner Benchmark: First Ground-Truth F1 for Skill Scanners
We benchmarked HackMyAgent against a 4,245-sample labeled corpus spanning 9 attack categories and compared it with the 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). NanoMind TME v0.5.0 achieves 89.2% F1, the first verified precision/recall score for any AI agent skill scanner.
The Problem
Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).
The critical finding: only 33 out of 27,111 skills (0.12%) were flagged by all scanners. 71.8% of flagged skills were flagged by only one scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.
None of these scanners reported precision, recall, or F1, because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"
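The agreement figures above (flagged by all scanners versus flagged by only one) can be computed from per-scanner flag sets. The following is a minimal sketch, not the paper's code: the scanner names, skill IDs, and the `agreement` helper are hypothetical, and only the final sanity check uses numbers from the text.

```python
from collections import Counter


def agreement(flags: dict[str, set[str]]) -> tuple[int, int, int]:
    """Return (flagged by any, flagged by all, flagged by exactly one scanner)."""
    # Count how many scanners flagged each skill.
    votes = Counter(skill for flagged in flags.values() for skill in flagged)
    flagged_any = len(votes)
    flagged_all = sum(1 for n in votes.values() if n == len(flags))
    flagged_once = sum(1 for n in votes.values() if n == 1)
    return flagged_any, flagged_all, flagged_once


# Hypothetical toy data: scanner name -> IDs of the skills it flagged.
toy_flags = {
    "scanner_a": {"s1", "s2", "s3"},
    "scanner_b": {"s1", "s4"},
    "scanner_c": {"s1", "s2", "s5", "s6"},
}

any_count, all_count, once_count = agreement(toy_flags)
print(f"flagged by all: {all_count}/{any_count}")
print(f"flagged by one scanner only: {once_count}/{any_count}")

# Sanity check on the consensus figure quoted in the text.
print(f"{33 / 27111:.2%}")
```

Low consensus by itself does not say which scanner is right; it only shows that the scanners disagree, which is why labeled data is needed.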
Results
| Scanner | F1 | Precision | Recall | FPR | Flag Rate |
|---|---|---|---|---|---|
| NanoMind TME v0.5.0 | 89.2% | 88.4% | 90.0% | 0.82% | 6.9% |
| HMA Full Pipeline | 81.3% | 68.5% | 100% | 3.20% | 10.3% |
| HMA Static (regex) | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% |
NanoMind TME v0.5.0 achieves the best balance of precision and recall: 89.2% F1 with 90% recall and a false positive rate below 1%. The ONNX model was trained on a v8 corpus of 4,500 balanced samples from 5+ real-world sources.
The full pipeline (AST compilation + 6 analyzers + NanoMind) achieves 100% recall (zero missed malicious samples) at the cost of more false positives (3.2% FPR). This mode is appropriate when missing an attack is unacceptable.
Static patterns have near-perfect precision (99.3%) but miss about half of attacks. This confirms the field's intuition: regex-only scanning is conservative but incomplete.
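For readers who want to check the table, the metrics can be derived from confusion counts, and F1 is the harmonic mean of precision and recall. The sketch below assumes the standard definitions; the `Confusion` class and its toy counts are hypothetical and are not the benchmark's actual counts. Only the precision/recall pairs are taken from the results table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # malicious samples flagged
    fp: int  # benign samples flagged
    fn: int  # malicious samples missed
    tn: int  # benign samples passed

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def flag_rate(self) -> float:
        return (self.tp + self.fp) / (self.tp + self.fp + self.fn + self.tn)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
toy = Confusion(tp=9, fp=1, fn=1, tn=89)
print(f"toy: P={toy.precision:.1%} R={toy.recall:.1%} "
      f"FPR={toy.fpr:.2%} flag rate={toy.flag_rate:.1%}")

# Precision/recall pairs copied from the results table.
REPORTED = {
    "NanoMind TME v0.5.0": (0.884, 0.900),
    "HMA Full Pipeline": (0.685, 1.000),
    "HMA Static (regex)": (0.993, 0.511),
}

for name, (p, r) in REPORTED.items():
    print(f"{name}: F1 = {f1(p, r):.1%}")
```

Recomputing F1 from the reported precision and recall reproduces the table's 89.2%, 81.3%, and 67.5%. It also shows why the regex-only row scores lowest despite its precision: the harmonic mean is pulled toward the weaker of the two values.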
DVAA ground truth validation
As independent validation, we ran the full pipeline against 70 DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.
Result: 61 out of 70 detected (87.1%). Four attack categories achieved 100% detection: heartbeat/RCE, persistence, social engineering, and unicode steganography. The 9 missed scenarios were predominantly configuration-only or binary files that the text-based compiler could not process.
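A per-category detection rate like the one reported above can be tallied from a list of scenario outcomes. This is a rough sketch under assumptions: the `detection_by_category` helper and the sample outcomes are hypothetical (the "config-only" label in particular is made up for illustration), and the DVAA scripts may structure their results differently.

```python
from collections import defaultdict


def detection_by_category(results):
    """Map each category to its detection rate.

    `results` is an iterable of (category, detected) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [detected, total]
    for category, detected in results:
        totals[category][0] += int(detected)
        totals[category][1] += 1
    return {c: d / t for c, (d, t) in totals.items()}


# Hypothetical outcomes, for illustration only.
sample_results = [
    ("persistence", True),
    ("persistence", True),
    ("config-only", False),
    ("config-only", True),
]

for category, rate in detection_by_category(sample_results).items():
    print(f"{category}: {rate:.0%}")

# Overall detection rate quoted in the text: 61 of 70 scenarios.
print(f"overall: {61 / 70:.1%}")
```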
Industry Comparison
HMA's flag rates (3.6% to 10.3%) are in the lower, more conservative range of the scanners evaluated in the paper. The key difference: HMA's numbers are backed by verified ground-truth metrics.
| Scanner | Flag Rate | Precision | Recall |
|---|---|---|---|
| HMA Static | 3.6% | 99.3% | 51.1% |
| Socket | 3.8% | -- | -- |
| NanoMind TME v0.5.0 | 6.9% | 88.4% | 90.0% |
| Snyk | 7.7% | -- | -- |
| HMA Full Pipeline | 10.3% | 68.5% | 100% |
| agent-trust-hub | 13.8% | -- | -- |
| Cisco Skill Scanner | 14-17% | -- | -- |
| GPT 5.3 LLM | 27-39% | -- | -- |
| VirusTotal | 36.2% | -- | -- |
| OpenClaw Scanner | 41.9% | -- | -- |
The paper's scanners were tested on 238K marketplace skills (no ground-truth labels). HMA was tested on the OASB v2 corpus (4,245 labeled samples). "--" = no ground truth available.
Reproducibility
```bash
# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100
```
References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark dataset and code: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
- HackMyAgent scanner: github.com/opena2a-org/hackmyagent
- Interactive leaderboard: oasb.ai/benchmark