Technology

CasabaCyberBench

Measure what AI can actually do in offensive cyber, with evidence

CasabaCyberBench is our platform for evaluating frontier-model offensive cyber capability. It runs models and agents against real targets in live environments, scores them with deterministic grading, and produces evidence you can defend. It covers both public benchmark suites and private, customer-specific operational scenarios.

Talk to us about benchmarking
How it works

The problem

Self-reported evals don't tell you what a model can do to your environment

Model cards and headline solve-rates are a starting point, not an answer. Cyber capability moves every quarter, public benchmarks saturate, and a score with no transcript behind it is impossible to defend in a safety case or a deployment decision. CasabaCyberBench exists to produce capability measurements that are repeatable, environment-realistic, and backed by evidence a reviewer can inspect.

Comparative results

A leaderboard you can scrutinize

Results are normalized by task and rolled up across categories and suites, so models are compared on the same footing. Every number traces back to the runs that produced it.

CasabaCyberBench, cross-run results. Illustrative interface; the data shown is not published benchmark results.
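
As a rough illustration of what "normalized by task and rolled up" can mean, the sketch below computes a pass rate per task, averages tasks into a category score, and averages categories into a suite score. This is an assumption made for illustration, not the CasabaCyberBench scoring method: the record layout, the `rollup` function, the model, suite, and task names, and the choice of unweighted averaging are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (model, suite, category, task, passed).
RUNS = [
    ("model-a", "suite-1", "web", "t1", True),
    ("model-a", "suite-1", "web", "t1", False),
    ("model-a", "suite-1", "web", "t2", True),
    ("model-a", "suite-1", "pwn", "t3", False),
    ("model-a", "suite-1", "pwn", "t3", False),
]


def rollup(runs):
    """Return (task_scores, category_scores, suite_scores) as dicts."""
    # Step 1: normalize by task, so a task attempted many times
    # counts the same as a task attempted once.
    per_task = defaultdict(list)
    for model, suite, category, task, passed in runs:
        per_task[(model, suite, category, task)].append(1.0 if passed else 0.0)
    task_scores = {key: mean(vals) for key, vals in per_task.items()}

    # Step 2: a category score is the unweighted mean of its task scores.
    per_category = defaultdict(list)
    for (model, suite, category, _task), score in task_scores.items():
        per_category[(model, suite, category)].append(score)
    category_scores = {key: mean(vals) for key, vals in per_category.items()}

    # Step 3: a suite score is the unweighted mean of its category scores.
    per_suite = defaultdict(list)
    for (model, suite, _category), score in category_scores.items():
        per_suite[(model, suite)].append(score)
    suite_scores = {key: mean(vals) for key, vals in per_suite.items()}

    return task_scores, category_scores, suite_scores


task_scores, category_scores, suite_scores = rollup(RUNS)
print(suite_scores[("model-a", "suite-1")])  # 0.375
```

Because every level is keyed by the records beneath it, any rolled-up number can be traced back to the individual runs that produced it.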

How it's built

One substrate, three evaluation layers

From repeatable public suites to bespoke operational scenarios, every layer shares the same execution, grading, and evidence pipeline, so results stay comparable as fidelity increases.

Layer 01 / Baseline

Public benchmark suites

A stable harness runs established public suites for repeatable, comparable frontier-model testing. This is the reproducible baseline every engagement starts from.

CTF - real-vulnerability - bounded command execution

Layer 02 / Higher fidelity

Real-tooling evaluation packs

Higher-fidelity packs put models in front of real tooling and containerized or lab-backed targets, closing the gap between a benchmark score and operational reality.

real tools - containerized targets - lab-backed runs

Layer 03 / Operational

Private operational scenarios

Multi-step scenarios model realistic operations end to end. Runs are graded at the milestone, path, and campaign level, with checkpointing and uplift measurement for models and agents.

milestone - path - campaign scoring
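
One way to picture the three grading levels is sketched below. The definitions are assumptions chosen for illustration, since this page does not specify them: the milestone score is the fraction of expected milestones reached in any order, the path score is the fraction completed in the expected order, and campaign success means the final objective was reached. Uplift is taken here as a simple difference between two scores. All names (`ScenarioGrade`, `grade`, `uplift`, and the milestone labels) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioGrade:
    milestone_score: float  # fraction of milestones reached, in any order
    path_score: float       # fraction of the expected path completed in order
    campaign_success: bool  # whether the final objective was reached


def grade(expected_path, reached_events):
    """Grade one run.

    expected_path: ordered milestone ids; the last one is the objective.
    reached_events: milestone ids in the order the run reached them.
    """
    reached = set(reached_events)
    milestone_score = sum(m in reached for m in expected_path) / len(expected_path)

    # Walk the observed events and advance only when the next expected
    # milestone appears, so out-of-order steps do not count toward the path.
    matched = 0
    for event in reached_events:
        if matched < len(expected_path) and event == expected_path[matched]:
            matched += 1
    path_score = matched / len(expected_path)

    campaign_success = expected_path[-1] in reached
    return ScenarioGrade(milestone_score, path_score, campaign_success)


def uplift(assisted_score: float, baseline_score: float) -> float:
    """Difference between an assisted run and a baseline run (assumed metric)."""
    return assisted_score - baseline_score


expected = ["initial_access", "credential_recovery", "lateral_movement", "data_access"]
observed = ["initial_access", "lateral_movement", "credential_recovery", "data_access"]
result = grade(expected, observed)
print(result.milestone_score, result.path_score, result.campaign_success)
```

In this example the run reaches every milestone and the final objective, but only half of the expected path in order, which is the kind of distinction a single solve-rate cannot show.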

The operator console

Launch evaluations and review the results

One console spans the benchmark catalog: select a suite and target model, inspect a task definition down to its validator and runtime, and launch a run. The same workflow supports public benchmark tasks, private scenarios, and targeted safety tests.

CasabaCyberBench, benchmark catalog. Illustrative interface.
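
To make "a task definition down to its validator" concrete, here is a minimal sketch of a deterministic validator. It assumes a task is solved when the submitted proof string hashes to a stored value, which is a common pattern for CTF-style tasks. The `TaskDefinition` fields, the `validate` function, and the example values are hypothetical and are not the CasabaCyberBench schema.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    task_id: str
    runtime: str          # hypothetical label for where the task runs
    expected_sha256: str  # hex digest of the proof that counts as a solve


def validate(task: TaskDefinition, submitted: str) -> bool:
    """Return True only if the submission hashes to the expected digest.

    The check is deterministic: the same submission always gets the same
    verdict, with no model or human judgment in the loop.
    """
    digest = hashlib.sha256(submitted.strip().encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest, task.expected_sha256)


demo_task = TaskDefinition(
    task_id="demo-001",
    runtime="container",
    expected_sha256=hashlib.sha256(b"flag{example}").hexdigest(),
)

print(validate(demo_task, " flag{example}\n"))  # True
print(validate(demo_task, "flag{wrong}"))       # False
```

Storing only the digest means the task definition can be inspected by a reviewer without exposing the proof itself.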

When a run finishes, the result is not just a score. Reviewers can see what the model was asked to do, how it approached the task, and why the outcome counted.

CasabaCyberBench, completed run review and transcript. Illustrative interface.

Capability coverage

What CyberBench can measure

CyberBench is built to measure the cyber and safety-relevant capabilities that matter as models become more autonomous, more tool-using, and more deeply integrated into enterprise workflows.

Discovery

Vulnerability discovery and exploitation

Measure whether a model can find, reason about, patch, or exploit vulnerabilities in realistic source-code and service environments. This includes public benchmark integrations, real-CVE tasks, CTF-style challenges, and source-level bug discovery suites.

Progression

Enterprise attack progression

Can the model connect evidence across identity, cloud, developer, workflow, and collaboration systems to move from initial access toward higher-value objectives?

Assistance

Harmful technical assistance

Will the model provide useful assistance for malware creation, evasion, persistence, abuse enablement, or other restricted cyber activity when prompted?

Agency

Autonomous agent misuse

If the model is acting as an agent with permissions and operational context, will it stay within bounds, or will it conceal state, bypass oversight, preserve access, or take actions against operator intent?

Consequence

Data access and consequence proof

Can the model turn access into a concrete consequence, such as reaching protected data, replaying recovered credentials, or proving control over a sensitive internal path?

Benchmark programs

Public baselines and private risk scenarios

Organizations can start with public benchmark suites to establish a baseline, then add private scenarios that reflect the systems, workflows, and harm categories that matter to them. Public results help compare models broadly. Private scenarios help answer the more important question: what could this model do in our environment?

Programs CyberBench supports

Recurring model comparisons
Deployment gates
Red-team evaluations
Frontier-model tracking
Custom scenario development

The output can be summarized for leadership while preserving enough detail for security teams to review the evidence behind each result.

Public benchmarks measure the field. Private scenarios measure the risk to you.

Published research

Regular reports and a public leaderboard

We publish recurring frontier-model cyber capability reports with a comparative leaderboard across models and suites. The data is reproducible and the methodology is open: measurement you can scrutinize, not marketing.

View published reports

Public benchmarks. Private scenarios. Evidence either way.

CasabaCyberBench powers our frontier-model cyber capability assessments and custom enterprise scenario programs. It is part of our technology alongside Nemesis. Talk to us about a benchmarking engagement.

Get in touch