Benchmark · April 2, 2026

OASB Scanner Benchmark: First Ground-Truth F1 for Skill Scanners

We benchmarked HackMyAgent against a 4,245-sample labeled corpus across 9 attack categories and compared with 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). NanoMind TME v0.5.0 achieves 89.2% F1 -- the first verified precision/recall score for any AI agent skill scanner.

  • 89.2% F1 Score (TME v0.5.0)
  • 100% Recall (Full Pipeline)
  • 87.1% DVAA Detection Rate
  • 0.82% False Positive Rate

The Problem

Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).

The critical finding: of the 27,111 flagged skills, only 33 (0.12%) were flagged by all scanners, and 71.8% were flagged by only one scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.

None of these scanners reported precision, recall, or F1 because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"
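With labels, those metrics reduce to confusion-matrix arithmetic. A minimal sketch in TypeScript (generic metric code, not OASB's actual implementation):

// Precision, recall, F1, and FPR from confusion-matrix counts.
// Generic metric code -- not OASB's actual implementation.
interface Counts { tp: number; fp: number; fn: number; tn: number }

function metrics({ tp, fp, fn, tn }: Counts) {
  const precision = tp / (tp + fp); // flagged samples that are truly malicious
  const recall = tp / (tp + fn);    // malicious samples that get flagged
  const f1 = (2 * precision * recall) / (precision + recall);
  const fpr = fp / (fp + tn);       // benign samples wrongly flagged
  return { precision, recall, f1, fpr };
}

// Sanity check of the headline number: 2 * 0.884 * 0.900 / (0.884 + 0.900) ≈ 0.892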

Results

Scanner               F1      Precision  Recall  FPR     Flag Rate
NanoMind TME v0.5.0   89.2%   88.4%      90.0%   0.82%   6.9%
HMA Full Pipeline     81.3%   68.5%      100%    3.20%   10.3%
HMA Static (regex)    67.5%   99.3%      51.1%   0.03%   3.6%

NanoMind TME v0.5.0 achieves the best balance of precision and recall: 89.2% F1 with 90.0% recall and a sub-1% false positive rate. The ONNX model was trained on a v8 corpus of 4,500 balanced samples drawn from 5+ real-world sources.

The full pipeline (AST compilation + 6 analyzers + NanoMind) achieves 100% recall -- zero missed malicious samples -- at the cost of more false positives (3.2% FPR). This mode is appropriate when missing an attack is unacceptable.
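Union-style aggregation explains this trade-off: a sample is flagged if any analyzer fires, so each added analyzer can only raise recall while also accumulating false positives. A sketch of that logic (hypothetical analyzer interface, not HMA's actual code):

// Union aggregation: flag a sample if ANY analyzer flags it.
// Hypothetical interface -- HMA's real pipeline is more involved.
type Analyzer = (sample: string) => boolean;

function fullPipelineVerdict(sample: string, analyzers: Analyzer[]): boolean {
  // Adding analyzers can only increase flags: recall rises, FPR rises with it.
  return analyzers.some((analyze) => analyze(sample));
}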

Static patterns have near-perfect precision (99.3%) but miss about half of attacks. This confirms the field's intuition: regex-only scanning is conservative but incomplete.
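That profile is exactly what pattern matching predicts: each regex targets one concrete attack signature, so a hit is almost always real, but obfuscated or novel payloads slip through. An illustrative sketch (example patterns only, not HMA's actual rule set):

// Regex-only static scan: high precision, limited recall.
// Example patterns only -- not HMA's actual rule set.
const STATIC_PATTERNS: RegExp[] = [
  /curl\s+[^|]*\|\s*(sh|bash)/, // pipe-to-shell download
  /eval\s*\(\s*atob\s*\(/,      // eval of a base64-decoded payload
  /child_process.*exec/,        // shell execution from Node
];

function staticScan(source: string): boolean {
  return STATIC_PATTERNS.some((pattern) => pattern.test(source));
}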

DVAA Ground-Truth Validation

As independent validation, we ran the full pipeline against 70 DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.

Result: 61 out of 70 detected (87.1%). Four attack categories achieved 100% detection: heartbeat/RCE, persistence, social engineering, and unicode steganography. The 9 missed scenarios were predominantly configuration-only or binary files that the text-based compiler could not process.
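Per-category rates like these fall out of a simple tally over scenario results. A sketch (hypothetical result shape, not the DVAA harness's actual output):

// Detection rate per attack category.
// Hypothetical result shape -- not the DVAA harness's actual output.
interface ScenarioResult { category: string; detected: boolean }

function detectionByCategory(results: ScenarioResult[]): Map<string, number> {
  const groups = new Map<string, ScenarioResult[]>();
  for (const r of results) {
    const group = groups.get(r.category) ?? [];
    group.push(r);
    groups.set(r.category, group);
  }
  const rates = new Map<string, number>();
  for (const [category, rs] of groups) {
    rates.set(category, rs.filter((r) => r.detected).length / rs.length);
  }
  return rates;
}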

Industry Comparison

HMA's flag rates (3.6% to 10.3%) are in the lower, more conservative range of the scanners evaluated in the paper. The key difference: HMA's numbers are backed by verified ground-truth metrics.

Scanner               Flag Rate  Precision  Recall
HMA Static            3.6%       99.3%      51.1%
Socket                3.8%       --         --
NanoMind TME v0.5.0   6.9%       88.4%      90.0%
Snyk                  7.7%       --         --
HMA Full Pipeline     10.3%      68.5%      100%
agent-trust-hub       13.8%      --         --
Cisco Skill Scanner   14-17%     --         --
GPT 5.3 LLM           27-39%     --         --
VirusTotal            36.2%      --         --
OpenClaw Scanner      41.9%      --         --

Scanners from the paper were tested on 238K marketplace skills with no ground-truth labels; HMA variants were tested on the OASB v2 corpus (4,245 labeled samples). "--" = no ground truth available.

Reproducibility

# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100

References

  • Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
  • OASB benchmark dataset and code: github.com/opena2a-org/oasb
  • DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
  • HackMyAgent scanner: github.com/opena2a-org/hackmyagent
  • Interactive leaderboard: oasb.ai/benchmark