This comment addresses NIST AI 800-2's framework for automated benchmark evaluation of language models. We offer five observations on strengthening the framework as evaluation results increasingly inform third-party credentialing, regulatory compliance, and insurance underwriting. Key recommendations include recognizing behavioral safety as a distinct evaluation domain, addressing evaluation for certification purposes, and acknowledging downstream non-technical consumers of evaluation results.
The Box Commons respectfully submits this comment on NIST AI 800-2 ipd, "Practices for Automated Benchmark Evaluations of Language Models." We previously submitted comments to CAISI on the Security Considerations for Artificial Intelligence Agents RFI (March 6, 2026) and the NCCoE Concept Paper on AI Agent Identity and Authorization (March 14, 2026). Both submissions argued that behavioral safety constitutes a distinct and unaddressed domain requiring independent evaluation, credentialing, and integration with authorization and insurance infrastructure.
NIST AI 800-2 provides a strong and timely framework for automated benchmark evaluation. The nine practices are well-structured, the agent evaluation coverage is substantive, and the document's candid acknowledgment that "automated benchmark evaluations cannot meet all AI evaluation objectives" reflects appropriate epistemic discipline.
We offer five observations where the document's framework could be strengthened, particularly as evaluation practices increasingly inform downstream decisions beyond internal development—including third-party credentialing, regulatory compliance, and insurance underwriting.
Observation 1: The document addresses capabilities evaluation, security benchmarks (AgentDojo, CVE-Bench), and robustness testing, but does not recognize behavioral safety as a formalized evaluation domain.
Why this matters: An AI agent can pass every capabilities benchmark and security evaluation while still exhibiting unsafe behavioral patterns—escalating crisis interactions instead of deferring to human oversight, operating outside its delegated authority scope, or failing to disclose its non-human status in consumer-facing contexts. These are not capability failures or security vulnerabilities. They are behavioral failures that exist on an independent axis.
The judicial record confirms this distinction. In Garcia v. Character Technologies, Inc. (M.D. Fla., May 2025), the court found design defects not in an agent's technical capabilities but in its behavioral parameters—absence of crisis escalation, anthropomorphic manipulation, and engagement-maximizing design. In Gavalas v. Google LLC (N.D. Cal., filed March 2026), the allegations center on a frontier model that was technically secure and functionally capable but behaviorally unsafe in the specific context of user distress.
Recommendation: Practice 1.1 (Define Evaluation Objectives) should acknowledge behavioral safety as a distinct evaluation objective, separate from capabilities and security. This would prompt evaluators to consider whether their benchmark selection (Practice 1.2) adequately covers the behavioral dimension, or whether supplementary non-automated evaluation methods are needed.
Observation 2: The document identifies its target audience as "technical staff at organizations evaluating AI systems, including AI deployers, developers, and third-party evaluators." However, the practices are framed exclusively around internal evaluation and voluntary assessment. The document does not address how these practices apply when evaluation results are used for downstream credentialing, certification, or conformance determinations.
Why this matters: Third-party evaluation for certification purposes introduces requirements that internal evaluation does not:
- Reproducibility under adversarial conditions, so that a credentialing decision can withstand challenge.
- Defined pass/fail thresholds rather than descriptive scores.
- Chain of custody for evaluation artifacts.
- Evaluator independence from the developer of the system under evaluation.
Internal evaluation can tolerate ambiguity on these points; credentialing cannot.
Recommendation: The document should include a discussion—even if brief—of how the nine practices apply differently when evaluation results are consumed for credentialing or certification purposes. This need not prescribe specific certification frameworks, but acknowledging the use case would help third-party evaluators (an explicitly identified audience) apply the practices correctly.
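To make the certification use case concrete, the following is a minimal, illustrative sketch (in Python) of one such requirement: a tamper-evident manifest that records which model, benchmark, and artifacts an evaluation produced, so the result can later be audited. The function and field names are our own illustration, not terminology from NIST AI 800-2.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of a single evaluation artifact, so later audits can detect changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(artifact_dir: str, model_id: str, benchmark: str) -> dict:
    """Record what was evaluated, when, and a digest of every artifact produced."""
    root = Path(artifact_dir)
    artifacts = {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {
        "model_id": model_id,          # exact model and version under evaluation
        "benchmark": benchmark,        # benchmark name and version used
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
        "artifacts": artifacts,        # relative path -> SHA-256 digest
    }


if __name__ == "__main__":
    # Illustrative paths and identifiers only.
    manifest = build_manifest("eval_runs/run_001", "example-model-v3", "AgentDojo")
    print(json.dumps(manifest, indent=2))
```

A manifest of this kind is a small step toward the chain-of-custody and reproducibility expectations that certification introduces; it does not by itself establish evaluator independence or pass/fail criteria.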
Observation 3: The framework treats evaluation as a point-in-time activity. An evaluator defines objectives, selects benchmarks, runs evaluations, and reports results. There is no discussion of temporal evaluation dynamics—how often evaluations should be repeated, what triggers re-evaluation, or how results change over time as models are updated, fine-tuned, or deployed in new contexts.
Why this matters: Language models and AI agents are not static artifacts. Models receive updates, fine-tuning, and post-deployment alignment modifications. Agent scaffolding evolves. The operational environment changes. An evaluation conducted at deployment may not reflect the system's behavior six months later.
For applications where evaluation results inform ongoing decisions—insurance coverage renewals, regulatory compliance maintenance, or credential validity—the absence of guidance on evaluation cadence and re-evaluation triggers is a meaningful gap.
Recommendation: Consider adding guidance on:
- Evaluation cadence: how often automated evaluations should be repeated for deployed systems.
- Re-evaluation triggers: model updates, fine-tuning, changes to agent scaffolding, or deployment in a new operational context (sketched below).
- Validity periods: how long evaluation results should be treated as current for downstream uses such as coverage renewals, compliance maintenance, or credential validity.
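As an illustration of how such guidance might be operationalized, the sketch below checks a deployed system against a set of re-evaluation triggers. The trigger conditions and the 180-day default cadence are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class EvaluationRecord:
    """Minimal state an organization might track per deployed model or agent."""
    last_evaluated: date
    model_version: str
    scaffolding_version: str
    deployment_context: str


def needs_reevaluation(
    record: EvaluationRecord,
    current_model_version: str,
    current_scaffolding_version: str,
    current_context: str,
    max_age: timedelta = timedelta(days=180),  # illustrative cadence only
) -> list[str]:
    """Return the reasons, if any, that a fresh evaluation is warranted."""
    reasons = []
    if current_model_version != record.model_version:
        reasons.append("model updated or fine-tuned since last evaluation")
    if current_scaffolding_version != record.scaffolding_version:
        reasons.append("agent scaffolding changed")
    if current_context != record.deployment_context:
        reasons.append("deployed in a new operational context")
    if date.today() - record.last_evaluated > max_age:
        reasons.append("evaluation older than the agreed validity period")
    return reasons
```

The value of a policy like this lies less in any specific threshold than in making cadence and triggers explicit and auditable.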
Observation 4: Practice 3.3 (Report Qualified Claims) provides valuable guidance on how to communicate evaluation results accurately. However, it focuses on communication among technical audiences—researchers, developers, and evaluators. The document does not address how evaluation results are consumed by non-technical stakeholders who increasingly rely on them for consequential decisions.
Why this matters: Evaluation results are entering decision chains that extend well beyond AI development teams:
- Insurers such as Armilla AI and Relm Insurance are already using evaluation data to price AI operator liability coverage; statistical uncertainty carries different weight when it informs a coverage decision rather than a development iteration.
- Regulators and compliance teams map evaluation outcomes to conformance obligations.
- Credentialing bodies rely on evaluation results to determine whether an agent meets defined behavioral standards before it is authorized for deployment.
Recommendation: Practice 3.3 should acknowledge that qualified claims may be consumed by non-technical stakeholders and suggest that reporting formats account for these audiences. This could include guidance on translating statistical results into risk-relevant summaries, mapping evaluation outcomes to regulatory or compliance frameworks, and clearly communicating the scope and limitations of evaluation results to stakeholders who may not understand confidence intervals or benchmark-specific caveats.
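As one hedged illustration of what such guidance could enable, the sketch below translates a benchmark score and its confidence interval into a plain-language statement relative to an agreed threshold. The benchmark name, threshold, and wording are hypothetical.

```python
def summarize_for_stakeholders(benchmark: str, score: float,
                               ci_low: float, ci_high: float,
                               threshold: float) -> str:
    """Translate a score and its confidence interval into a risk-relevant summary
    that preserves, rather than hides, the statistical uncertainty."""
    if ci_low >= threshold:
        finding = "meets the agreed threshold even under the most conservative reading"
    elif ci_high < threshold:
        finding = "falls below the agreed threshold"
    else:
        finding = ("is statistically inconclusive relative to the threshold; "
                   "further evaluation is advisable before relying on this result")
    return (f"On {benchmark}, the system scored {score:.1%} "
            f"(plausible range {ci_low:.1%} to {ci_high:.1%}). This result {finding}. "
            "It applies only within the benchmark's stated scope and conditions.")


# Hypothetical numbers for illustration.
print(summarize_for_stakeholders("a crisis-escalation benchmark", 0.87, 0.83, 0.91, 0.85))
```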
Observation 5: The document's treatment of agent evaluation—including agent scaffolding as an evaluation protocol setting, agent budget controls (tokens, time, cost), stopping conditions for open-ended agent tasks, and agent-specific benchmarks (AgentDojo, CVE-Bench)—is thorough and timely.
This coverage is especially valuable because the autonomous AI agent ecosystem is evolving faster than evaluation methodology. Many organizations are deploying agents without established evaluation practices. NIST AI 800-2's guidance on how to design, bound, and interpret agent evaluations fills a genuine need.
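To illustrate the kind of bounding the document describes, the sketch below drives a single open-ended agent task while enforcing explicit step, token, and wall-clock budgets and reporting which stopping condition ended the run. The `agent_step` interface is our own assumption, not an API from the document or any benchmark.

```python
import time


def run_bounded_agent_task(agent_step, task, max_steps=30, max_tokens=50_000, max_seconds=300):
    """Run one open-ended agent task under explicit budget controls.

    `agent_step` is a caller-supplied callable: (task, history) -> (action, tokens_used, done).
    Returns the transcript plus the stopping condition that ended the run, so the
    evaluation report can state why the run terminated, not just how it scored.
    """
    history, tokens_used, start = [], 0, time.monotonic()
    for step in range(max_steps):
        if tokens_used >= max_tokens:
            return {"stopped_by": "token budget", "steps": step, "history": history}
        if time.monotonic() - start >= max_seconds:
            return {"stopped_by": "time budget", "steps": step, "history": history}
        action, step_tokens, done = agent_step(task, history)
        history.append(action)
        tokens_used += step_tokens
        if done:
            return {"stopped_by": "agent finished", "steps": step + 1, "history": history}
    return {"stopped_by": "step limit", "steps": max_steps, "history": history}


if __name__ == "__main__":
    # Trivial stand-in agent that declares itself done after three steps.
    dummy = lambda task, history: (f"step {len(history)}", 100, len(history) >= 2)
    print(run_bounded_agent_task(dummy, "illustrative task"))
```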
We note that as agent evaluation standards mature, they will form the methodological foundation for third-party behavioral safety credentialing—the practice of independently verifying that an AI agent meets defined behavioral standards before it is authorized for deployment. The evaluation practices described in this document are directly applicable to that emerging use case.
NIST AI 800-2 is a strong foundation. The observations above are intended to extend its utility to the growing set of stakeholders—beyond AI developers—who rely on evaluation results for consequential decisions. We appreciate the opportunity to comment and CAISI's continued leadership in establishing rigorous, practical AI evaluation standards.
The Box Commons remains available to support CAISI's work on AI evaluation, behavioral safety, and credentialing standards.
Contact:
Brice Love, Acting Executive Director
The Box Commons
[email protected]