Benchmark · April 2, 2026

OASB Scanner Benchmark: First Ground-Truth F1 for Skill Scanners

We benchmarked HackMyAgent against a 4,245-sample labeled corpus spanning 9 attack categories and compared its flag rates with those of the 9 industry scanners evaluated in Holzbauer et al. (arXiv:2603.16572). NanoMind TME v0.5.0 achieves 89.2% F1, the first verified precision/recall score for any AI agent skill scanner.

F1 Score (TME v0.5.0): 89.2%
Recall (Full Pipeline): 100%
DVAA Detection Rate: 87.1%
False Positive Rate: 0.82%

The Problem

Holzbauer et al. collected 238,180 skills from three marketplaces (ClawHub, Skills.sh, SkillsDirectory) and GitHub. They ran 9 different scanners and found flag rates ranging from 3.8% (Socket) to 41.9% (OpenClaw Scanner).

The critical finding: only 33 out of 27,111 skills (0.12%) were flagged by all scanners. 71.8% of flagged skills were flagged by only one scanner. After adding repository context, only 0.52% of flagged skills remained in suspicious repositories.
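Agreement figures of this kind can be derived from per-scanner flag sets. The sketch below is illustrative only: the scanner names and skill IDs are hypothetical toy data, not the paper's dataset, and `agreement_stats` is a helper name introduced here rather than anything from the paper's tooling.

```python
from collections import Counter

# Hypothetical toy data: scanner name -> set of skill IDs it flagged.
flags_by_scanner = {
    "scanner_a": {"s1", "s2", "s3"},
    "scanner_b": {"s1", "s4"},
    "scanner_c": {"s1", "s2", "s5"},
}

def agreement_stats(flags):
    """Return (flagged_by_all, flagged_by_exactly_one, total_flagged) counts."""
    # Count how many scanners flagged each skill.
    votes = Counter(skill for flagged in flags.values() for skill in flagged)
    n_scanners = len(flags)
    by_all = sum(1 for count in votes.values() if count == n_scanners)
    by_one = sum(1 for count in votes.values() if count == 1)
    return by_all, by_one, len(votes)

by_all, by_one, total = agreement_stats(flags_by_scanner)
print(f"flagged by all: {by_all}/{total}, by exactly one: {by_one}/{total}")

# The unanimous-agreement share quoted above is the ratio 33 / 27,111.
unanimous_share = 33 / 27111
print(f"{unanimous_share:.2%}")
```

In the toy data only `s1` is flagged by every scanner, while three of the five flagged skills are flagged by a single scanner, which mirrors the low-consensus pattern the paper reports.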

None of these scanners reported precision, recall, or F1, because no ground-truth labeled dataset existed. The field had no way to answer: "How accurate is any of this?"

Results

Scanner              F1     Precision  Recall  FPR    Flag Rate
NanoMind TME v0.5.0  89.2%  88.4%      90.0%   0.82%  6.9%
HMA Full Pipeline    81.3%  68.5%      100%    3.20%  10.3%
HMA Static (regex)   67.5%  99.3%      51.1%   0.03%  3.6%

NanoMind TME v0.5.0 achieves the best balance of precision and recall: 89.2% F1 with 90% recall and a false positive rate under 1%. The ONNX model was trained on a v8 corpus of 4,500 balanced samples from 5+ real-world sources.

The full pipeline (AST compilation + 6 analyzers + NanoMind) achieves 100% recall (zero missed malicious samples) at the cost of more false positives (3.2% FPR). This mode is appropriate when missing an attack is unacceptable.

Static patterns have near-perfect precision (99.3%) but miss about half of attacks. This confirms the field's intuition: regex-only scanning is conservative but incomplete.
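For readers who want to check the arithmetic, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values reported in the results table; the function and variable names are illustrative and are not taken from the OASB codebase.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the results table.
reported = {
    "NanoMind TME v0.5.0": (0.884, 0.900),
    "HMA Full Pipeline": (0.685, 1.000),
    "HMA Static (regex)": (0.993, 0.511),
}

# F1 as a percentage, rounded to one decimal place like the table.
f1_by_scanner = {
    name: round(100 * f1_score(p, r), 1) for name, (p, r) in reported.items()
}

for name, f1 in f1_by_scanner.items():
    print(f"{name}: F1 = {f1}%")
```

The recomputed values (89.2%, 81.3%, 67.5%) match the F1 column. This also shows why the full pipeline scores lower than TME despite perfect recall: the harmonic mean is pulled toward the weaker of the two inputs, here the 68.5% precision.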

DVAA ground truth validation

As independent validation, we ran the full pipeline against 70 DVAA (Damn Vulnerable AI Agent) scenarios. Each scenario is an intentionally vulnerable setup with a known attack type.

Result: 61 out of 70 detected (87.1%). Four attack categories achieved 100% detection: heartbeat/RCE, persistence, social engineering, and unicode steganography. The 9 missed scenarios were predominantly configuration-only or binary files that the text-based compiler could not process.
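A per-category detection rate like the one described here can be tallied from a list of scenario outcomes. The following is a minimal sketch under stated assumptions: the `(category, detected)` record shape, the `detection_rates` helper, and the sample outcomes are all hypothetical and do not reflect the actual output format of the DVAA benchmark script.

```python
from collections import defaultdict

def detection_rates(results):
    """Map each category to (detected, total) from (category, detected) pairs."""
    tally = defaultdict(lambda: [0, 0])
    for category, detected in results:
        tally[category][0] += int(detected)  # count detections
        tally[category][1] += 1              # count scenarios
    return {category: (d, t) for category, (d, t) in tally.items()}

# Hypothetical outcomes for illustration only, not the real DVAA results.
sample = [
    ("persistence", True),
    ("persistence", True),
    ("hypothetical_other", False),
    ("hypothetical_other", True),
]

rates = detection_rates(sample)
for category, (detected, total) in rates.items():
    print(f"{category}: {detected}/{total} ({detected / total:.0%})")

# The overall rate quoted above is 61 detected out of 70 scenarios.
overall = 61 / 70
print(f"overall: {overall:.1%}")
```

Breaking the rate out by category makes it easier to see whether misses cluster in a few scenario types, as the configuration-only and binary-file misses do here.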

Industry Comparison

HMA's flag rates (3.6% to 10.3%) are in the lower, more conservative range of the scanners evaluated in the paper. The key difference: HMA's numbers are backed by verified ground-truth metrics.

Scanner              Flag Rate  Precision  Recall
HMA Static           3.6%       99.3%      51.1%
Socket               3.8%       --         --
NanoMind TME v0.5.0  6.9%       88.4%      90.0%
Snyk                 7.7%       --         --
HMA Full Pipeline    10.3%      68.5%      100%
agent-trust-hub      13.8%      --         --
Cisco Skill Scanner  14-17%     --         --
GPT 5.3 LLM          27-39%     --         --
VirusTotal           36.2%      --         --
OpenClaw Scanner     41.9%      --         --

Scanners from the paper were tested on 238K marketplace skills (no ground-truth labels). HMA was tested on the OASB v2 corpus (4,245 labeled samples). "--" = no ground truth available.

Reproducibility

# Clone and run the benchmark
git clone https://github.com/opena2a-org/oasb.git
cd oasb && npm install

# Full benchmark (all 3 adapters, ~7 minutes)
npx tsx scripts/run-benchmark-v2.ts --categorized-only

# DVAA ground-truth comparison
npx tsx scripts/run-dvaa-benchmark.ts

# Quick test (100 samples, ~30 seconds)
npx tsx scripts/run-benchmark-v2.ts --categorized-only --limit=100

References

  • Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
  • OASB benchmark dataset and code: github.com/opena2a-org/oasb
  • DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent
  • HackMyAgent scanner: github.com/opena2a-org/hackmyagent
  • Interactive leaderboard: oasb.ai/benchmark