OASB Scanner Benchmark: detection on a ground-truth labeled corpus
We benchmarked HackMyAgent (HMA) against a 4,245-sample labeled corpus spanning 9 attack categories and compared it with the 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). On a faithful re-run, the HMA full pipeline scores 82.9% F1 (82.6% recall, 1.16% false-positive rate).
Correction, June 5, 2026. A faithful re-run on the current build (hackmyagent 0.23.8) routes each sample through the analyzer path for its artifact type, which the prior run did not. The corrected verdict counts high/critical attack findings and excludes posture findings that fire on benign and malicious artifacts alike (missing defenses, and wildcard tool access that 2,900+ benign registry MCP servers also declare) -- the scanner still surfaces these to users, but they do not decide the malicious verdict. The earlier 89.2% F1 (raw classifier intent) and the interim 82.1% F1 / 1.26% FPR (a skill-routing artifact that bypassed the MCP analyzers) are withdrawn.
The Problem
Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).
The critical finding: only 33 out of 27,111 skills (0.12%) were flagged by all scanners. 71.8% of flagged skills were flagged by only one scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.
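Agreement figures like these come from a per-skill table of scanner flags. The sketch below is a toy illustration of that counting, not the paper's code: the `agreementStats` helper, the row layout (one boolean per scanner), and the five sample rows are all assumptions made for the example.

```typescript
// Hypothetical layout: one row per skill, one boolean per scanner (true = flagged).
type FlagRow = boolean[];

interface AgreementStats {
  flaggedByAny: number; // skills flagged by at least one scanner
  flaggedByAll: number; // skills flagged by every scanner
  flaggedByOne: number; // skills flagged by exactly one scanner
}

function agreementStats(rows: FlagRow[]): AgreementStats {
  const stats: AgreementStats = { flaggedByAny: 0, flaggedByAll: 0, flaggedByOne: 0 };
  for (const row of rows) {
    const count = row.filter(Boolean).length;
    if (count === 0) continue; // no scanner flagged this skill
    stats.flaggedByAny += 1;
    if (count === row.length) stats.flaggedByAll += 1;
    if (count === 1) stats.flaggedByOne += 1;
  }
  return stats;
}

// Made-up data: 5 skills, 3 scanners.
const toyRows: FlagRow[] = [
  [true, true, true],
  [true, false, false],
  [false, true, false],
  [false, false, false],
  [true, true, false],
];

const toyStats = agreementStats(toyRows);
console.log(toyStats); // { flaggedByAny: 4, flaggedByAll: 1, flaggedByOne: 2 }
```

Low consensus in such a table says the scanners disagree, but it cannot say which one is right; that requires labels.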
None of these scanners reported precision, recall, or F1 because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"
Results
| Scanner | F1 | Precision | Recall | FPR | Flag Rate |
|---|---|---|---|---|---|
| HMA Full Pipeline | 82.9% | 83.2% | 82.6% | 1.16% | 6.3% |
| HMA Static (regex) | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% |
| NanoMind TME v0.5.0 (model-only ablation) | 14.0% | 7.5% | 93.0% | 79.18% | 79.8% |
The full pipeline (AST compilation + 6 analyzers + NanoMind) scores 82.9% F1 at 82.6% recall and a 1.16% false-positive rate. The verdict counts high/critical attack findings and excludes posture findings that fire on benign and malicious artifacts alike -- most importantly wildcard tool access, which 2,900+ benign registry MCP servers also declare. The scanner still surfaces those findings to users; they just do not decide the malicious verdict. Read recall alongside F1: the pipeline favors precision, and per-category recall (below) shows where coverage is strong vs. weak.
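The verdict rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HackMyAgent's implementation: the `Finding` shape, the `kind` field, and the finding IDs are hypothetical names chosen for the example.

```typescript
// Hypothetical finding shape; the real HackMyAgent types may differ.
type Severity = "low" | "medium" | "high" | "critical";
type FindingKind = "attack" | "posture";

interface Finding {
  id: string;
  severity: Severity;
  kind: FindingKind;
}

// A sample is flagged malicious only if at least one attack finding
// is rated high or critical. Posture findings never decide the verdict.
function isMalicious(findings: Finding[]): boolean {
  return findings.some(
    (f) => f.kind === "attack" && (f.severity === "high" || f.severity === "critical"),
  );
}

// Posture findings are still returned so they can be shown to the user.
function postureFindings(findings: Finding[]): Finding[] {
  return findings.filter((f) => f.kind === "posture");
}

// Made-up examples: a benign server with wildcard tool access,
// and a skill that also carries a critical attack finding.
const benignServer: Finding[] = [
  { id: "wildcard-tool-access", severity: "high", kind: "posture" },
];
const maliciousSkill: Finding[] = [
  { id: "wildcard-tool-access", severity: "high", kind: "posture" },
  { id: "credential-harvest-directive", severity: "critical", kind: "attack" },
];

console.log(isMalicious(benignServer)); // false
console.log(isMalicious(maliciousSkill)); // true
console.log(postureFindings(benignServer).length); // 1
```

Separating the verdict from the report is what keeps a high-severity posture finding visible without turning every benign wildcard server into a false positive.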
Static patterns have near-perfect precision (99.3%) but miss about half of attacks (51.1% recall, 0.03% false-positive rate). This confirms the field's intuition: regex-only scanning is conservative but incomplete.
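For readers checking the table, the reported metrics follow from confusion-matrix counts in the standard way. The sketch below is not the OASB scoring code; `computeMetrics`, `f1FromRates`, and the toy counts are assumptions for illustration. Only the precision and recall values passed to `f1FromRates` come from the table.

```typescript
interface ConfusionCounts {
  tp: number; // malicious, flagged
  fp: number; // benign, flagged
  tn: number; // benign, not flagged
  fn: number; // malicious, not flagged
}

interface Metrics {
  precision: number;
  recall: number;
  f1: number;
  fpr: number;
  flagRate: number;
}

// Guard against division by zero for empty classes.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// F1 is the harmonic mean of precision and recall.
function f1FromRates(precision: number, recall: number): number {
  return ratio(2 * precision * recall, precision + recall);
}

function computeMetrics(c: ConfusionCounts): Metrics {
  const precision = ratio(c.tp, c.tp + c.fp);
  const recall = ratio(c.tp, c.tp + c.fn);
  return {
    precision,
    recall,
    f1: f1FromRates(precision, recall),
    fpr: ratio(c.fp, c.fp + c.tn),
    flagRate: ratio(c.tp + c.fp, c.tp + c.fp + c.tn + c.fn),
  };
}

// Made-up counts, not benchmark data.
const toy = computeMetrics({ tp: 80, fp: 20, tn: 880, fn: 20 });
console.log(toy.precision, toy.recall, toy.flagRate); // 0.8 0.8 0.1

// Consistency check against the table's precision and recall columns.
const fullPipelineF1 = (f1FromRates(0.832, 0.826) * 100).toFixed(1);
const staticF1 = (f1FromRates(0.993, 0.511) * 100).toFixed(1);
console.log(fullPipelineF1); // "82.9"
console.log(staticF1); // "67.5"
```

Because F1 is a harmonic mean, the static row's 99.3% precision cannot compensate for its 51.1% recall, which is why its F1 sits well below the full pipeline's.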
The NanoMind TME row is a model-only ablation, not a scanner verdict. The current classifier uses a whitespace-vocabulary tokenizer that goes out-of-vocabulary on code and skill text, so on its own it over-flags benign inputs (79.2% false-positive rate). A code- and text-aware classifier is in progress.
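To see why a whitespace vocabulary struggles with code, consider the toy example below. It is an illustration only: the vocabulary, the `oovRate` helper, and the sample strings are assumptions, and none of it reflects NanoMind's actual tokenizer or vocabulary.

```typescript
// Hypothetical toy vocabulary of lowercase natural-language words.
const vocab = new Set<string>([
  "ignore", "previous", "instructions", "and", "send", "the", "file", "to",
]);

// Split on runs of whitespace and drop empty tokens.
function whitespaceTokenize(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// Fraction of tokens that are missing from the vocabulary.
function oovRate(text: string, vocabulary: Set<string>): number {
  const tokens = whitespaceTokenize(text);
  if (tokens.length === 0) return 0;
  const unknown = tokens.filter((t) => !vocabulary.has(t.toLowerCase())).length;
  return unknown / tokens.length;
}

const prose = "ignore previous instructions and send the file";
const code = 'const res = await fetch(url, { method: "POST", body: data });';

console.log(oovRate(prose, vocab)); // 0
console.log(oovRate(code, vocab)); // 1
```

In the code sample, punctuation stays glued to identifiers (`fetch(url,`, `});`), so every token is unknown. A classifier that sees only unknown tokens has nothing to separate benign code from malicious code, which is consistent with the over-flagging in the ablation row.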
DVAA ground truth validation
As independent validation, we run the full pipeline against the DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.
Across the full DVAA scenario repo (86 scenarios, real attack files), the structural pipeline detects 29.1% of scenarios under the same verdict rule (a high or critical attack finding). On the config-structural DVAA samples carried in the corpus (91 samples) it reaches 81.3%. The gap is the honest picture: the structural analyzers catch config-encoded attacks (self-escalation, control-bypass and credential-harvest directives) but miss most behavioral and natural-language attacks, which depend on the semantic layer. We lead with the 29.1% full-repo figure. The earlier 87.1% on 70 scenarios is withdrawn.
Industry Comparison
Both HMA rows sit in the lower, more conservative flag-rate range of the scanners evaluated in the paper (Static 3.6% at 99.3% precision; Full Pipeline 6.3% at 83.2% precision, 82.6% recall). Unlike the paper's scanners, these are backed by ground-truth precision and recall on a labeled corpus.
| Scanner | Flag Rate | Precision | Recall |
|---|---|---|---|
| HMA Static | 3.6% | 99.3% | 51.1% |
| Socket | 3.8% | -- | -- |
| HMA Full Pipeline | 6.3% | 83.2% | 82.6% |
| Snyk | 7.7% | -- | -- |
| agent-trust-hub | 13.8% | -- | -- |
| Cisco Skill Scanner | 14-17% | -- | -- |
| GPT 5.3 LLM | 27-39% | -- | -- |
| VirusTotal | 36.2% | -- | -- |
| OpenClaw Scanner | 41.9% | -- | -- |
Paper scanners tested on 238K marketplace skills (no ground truth labels). HMA tested on OASB v2 corpus (4,245 labeled samples). "--" = no ground truth available.
Reproducibility
```bash
# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100
```
References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark dataset and code: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
- HackMyAgent scanner: github.com/opena2a-org/hackmyagent
- Interactive leaderboard: oasb.ai/benchmark