Behavioral Sweep Methodology

How ARIAtrap turns honeypot fleet telemetry into Threat Matrix-classified observations and the monthly Behavioral Threat Report.

Data sources

ARIAtrap consumes three sources, in priority order:

  1. TrapMyAgent fleet telemetry — the OpenA2A honeypot fleet ships @opena2a/veil-middleware v0.4.1+ with three event emitters: page visits, canary triggers, and authenticated-event tags. Telemetry lands in the OpenA2A Registry.
  2. AgentPwn scenario telemetry — agentpwn.com scenario completions. Each completion is one observed adversarial workflow.
  3. HoneyFinder scan results — opportunistic scans of public AI-agent-adjacent surfaces matching honeypot-archetype indicators.

Classification rule

Every observation maps to exactly one primary Threat Matrix technique (T-NNNN). Multi-technique observations carry an array of secondary IDs. There are no per-product taxonomies, no soft tags. If an observation does not map to any existing technique, ARIAtrap files a Threat Matrix governance proposal before the observation appears in any published report.
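The one-primary, many-secondary invariant can be sketched as a small value type. Everything here is illustrative, not ARIAtrap's actual schema: the `Classification` class name and the strict six-character T-NNNN format check are assumptions for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Classification:
    primary: str                       # exactly one primary Threat Matrix technique
    secondary: list[str] = field(default_factory=list)  # optional secondary IDs

    def __post_init__(self):
        # Every ID must look like T-NNNN; an observation that maps to no
        # existing technique never reaches this type -- it goes to a
        # governance proposal instead.
        for tid in [self.primary, *self.secondary]:
            if not (len(tid) == 6 and tid.startswith("T-") and tid[2:].isdigit()):
                raise ValueError(f"not a T-NNNN technique ID: {tid}")
        if self.primary in self.secondary:
            raise ValueError("primary may not repeat in the secondary list")
```

Rejecting malformed IDs at construction time keeps the "exactly one primary, no soft tags" rule enforceable in code rather than by convention.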

Evidence classes

  • laboratory-validated — Reproduced in sandbox by ARIAred. No fleet observation yet.
  • observed-single-fleet — Fleet telemetry confirms at least one observation on a single property within the reporting window.
  • observed-cross-fleet — Same payload appears on three or more properties within twenty-four hours. The strongest tier, beyond MITRE ATT&CK's "observed in real ops."
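The tier assignment can be sketched as a pure function over a payload's observation list. The `(property_id, timestamp)` tuple shape and the function name are assumptions for illustration; the real pipeline's schema is not shown here.

```python
from datetime import datetime, timedelta

def evidence_class(observations, window=timedelta(hours=24)):
    """Assign an evidence class to one payload.

    observations: list of (property_id, timestamp); an empty list means
    the payload exists only as an ARIAred sandbox reproduction.
    """
    if not observations:
        return "laboratory-validated"
    obs = sorted(observations, key=lambda o: o[1])
    # Cross-fleet: three or more distinct properties within twenty-four hours.
    for i, (_, t0) in enumerate(obs):
        props = {p for p, t in obs[i:] if t - t0 <= window}
        if len(props) >= 3:
            return "observed-cross-fleet"
    return "observed-single-fleet"
```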

Cross-property attribution

Two correlation primitives:

  • Fingerprint correlation: same JA3 + header hash on multiple fleet properties within a six-hour window is treated as one attacker session.
  • Token correlation: a canary token emitted by property A and resolved by property B is a cross-property reasoning event — the rarest, most diagnostic signal in the fleet. The fleet linkage graph exists specifically to interpret these.
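A minimal sketch of the first primitive, assuming events carry `ja3`, `header_hash`, and `ts` fields (hypothetical names). Gap-based grouping — a session extends while consecutive same-fingerprint events are within six hours of each other — is one plausible reading of the window rule, not necessarily ARIAtrap's exact implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=6)

def correlate_sessions(events):
    """Group fleet events sharing a JA3 + header-hash fingerprint into
    attacker sessions; same-fingerprint events within the six-hour
    window collapse into one session."""
    by_fp = defaultdict(list)
    for e in events:
        by_fp[(e["ja3"], e["header_hash"])].append(e)
    sessions = []
    for evs in by_fp.values():
        evs.sort(key=lambda e: e["ts"])
        current = [evs[0]]
        for e in evs[1:]:
            if e["ts"] - current[-1]["ts"] <= WINDOW:
                current.append(e)   # still the same attacker session
            else:
                sessions.append(current)
                current = [e]       # gap exceeded the window: new session
        sessions.append(current)
    return sessions
```

A session whose events span multiple `property` values is exactly the cross-property case the fleet linkage graph exists to interpret.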

Sophistication scoring (distribution-only)

Rubric in calibration; mean withheld pending classifier validation. ARIAtrap publishes the Sophistication Index as a histogram across the 1–10 scale per session. No mean. No grade. No month-over-month delta on a mean. Once the NanoMind v3.x supervised classifier achieves F1 ≥ 0.7 on a 200-session human-labeled holdout, the index transitions to mean + confidence interval. Until then, distribution shape is the only published statistic.
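The distribution-only constraint is easy to encode: publish per-bucket counts over the full 1–10 scale and nothing else. The function name and input shape below are assumptions for the sketch.

```python
from collections import Counter

def sophistication_histogram(scores):
    """Distribution-only publication: counts per 1-10 bucket.
    Deliberately exposes no mean, grade, or month-over-month delta."""
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("scores must be on the 1-10 scale")
    hist = Counter(scores)
    # Emit every bucket, including empty ones, so distribution shape is legible.
    return {bucket: hist.get(bucket, 0) for bucket in range(1, 11)}
```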

Fleet partitioning

Every observation carries a fleetCoverage class:

  • baseline — Common-Crawl-visible by design. The existing fleet sites.
  • authenticated — per-honeypot bespoke auth flows. No shared scaffolding (each authenticated property has its own login UX so the fleet does not produce a uniform attacker-detectable fingerprint).
  • dynamic — per-fingerprint content variation server.
  • social — Mastodon, Reddit, X, LinkedIn presences.

The Common-Crawl-gap signal is the delta between baseline and the other three classes. Reports always cite numbers partitioned by coverage class.
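The coverage partition and gap delta can be sketched as a count over observation records. The `fleetCoverage` key comes from the text above; the function name and output shape are assumptions.

```python
from collections import Counter

COVERAGE_CLASSES = ("baseline", "authenticated", "dynamic", "social")

def common_crawl_gap(observations):
    """Delta between the Common-Crawl-visible baseline class and the
    other three coverage classes combined."""
    counts = Counter(o["fleetCoverage"] for o in observations)
    non_baseline = sum(counts[c] for c in COVERAGE_CLASSES if c != "baseline")
    return {"baseline": counts["baseline"],
            "non_baseline": non_baseline,
            "gap": non_baseline - counts["baseline"]}
```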

Confidence intervals

ARIApulse computes 95% CIs on every count published in a Behavioral Threat Report — Wilson score for binary metrics, normal-approximation for counts (Poisson exact when n is below one hundred), Wald with continuity correction for cross-partition ratios. Partitions with insufficient data (n below fifty) are reported as "Insufficient data this month. Resumes when n is at least fifty." rather than estimated.
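The Wilson score interval for binary metrics, plus the n ≥ 50 publication gate, can be sketched directly. The formula is the standard Wilson score; the function names are assumptions for the sketch.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

def publishable(n: int, floor: int = 50) -> bool:
    """Partitions below the floor are reported as insufficient data,
    never estimated."""
    return n >= floor
```

Wilson behaves sensibly at the extremes (0 or n successes) where the Wald interval collapses, which matters on small honeypot partitions.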

Reproducibility standard

Every count in a Behavioral Threat Report is reproducible from the public OpenA2A Registry telemetry export bundle for the reporting window. The bar: a researcher with bundle access can reproduce every cited count to within five percent in under two hours.

If a count cannot meet this bar, it does not appear in the report. It sits in our internal observations record as a research-debt entry. An empty cell beats a fabricated number.
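The five-percent tolerance check is a one-liner; the function name and the zero-count convention below are assumptions, since the source does not define how a published count of zero is compared.

```python
def reproducible(published: int, reproduced: int, tolerance: float = 0.05) -> bool:
    """A cited count meets the reproducibility bar when the independently
    reproduced value is within five percent of the published one."""
    if published == 0:
        return reproduced == 0  # assumed convention for zero counts
    return abs(reproduced - published) / published <= tolerance
```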

See also