Technology
CasabaCyberBench
Measure what AI can actually do in offensive cyber, with evidence
CasabaCyberBench is our platform for evaluating frontier-model offensive cyber capability. It runs models and agents against real targets in live environments, scores them with deterministic grading, and produces evidence you can defend. It covers both public benchmark suites and private, customer-specific operational scenarios.
The problem
Self-reported evals don't tell you what a model can do to your environment
Model cards and headline solve-rates are a starting point, not an answer. Cyber capability moves every quarter, public benchmarks saturate, and a score with no transcript behind it is impossible to defend in a safety case or a deployment decision. CasabaCyberBench exists to produce capability measurements that are repeatable, environment-realistic, and backed by evidence a reviewer can inspect.
Comparative results
A leaderboard you can scrutinize
Results are normalized by task and rolled up across categories and suites, so models are compared on the same footing. Every number traces back to the runs that produced it.
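As a rough illustration of how a task-normalized rollup can work, here is a minimal Python sketch. It is not CasabaCyberBench's actual scoring code. The record layout, the suite and task names, and the choice of unweighted means at each level are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (suite, category, task_id, passed).
# A task may be attempted several times; each attempt is one record.
RUNS = [
    ("suite-a", "web", "task-1", True),
    ("suite-a", "web", "task-1", False),
    ("suite-a", "web", "task-2", True),
    ("suite-a", "crypto", "task-3", False),
    ("suite-a", "crypto", "task-3", False),
]


def rollup(runs):
    """Normalize per task first, then average upward (assumed scheme)."""
    # Step 1: per-task pass rate, so a task with many attempts
    # does not outweigh a task with few.
    attempts = defaultdict(list)
    for suite, category, task_id, passed in runs:
        attempts[(suite, category, task_id)].append(1.0 if passed else 0.0)
    task_scores = {key: mean(vals) for key, vals in attempts.items()}

    # Step 2: category score = unweighted mean of its task scores.
    by_category = defaultdict(list)
    for (suite, category, _task), score in task_scores.items():
        by_category[(suite, category)].append(score)
    category_scores = {key: mean(vals) for key, vals in by_category.items()}

    # Step 3: suite score = unweighted mean of its category scores.
    by_suite = defaultdict(list)
    for (suite, _category), score in category_scores.items():
        by_suite[suite].append(score)
    suite_scores = {key: mean(vals) for key, vals in by_suite.items()}

    return task_scores, category_scores, suite_scores


task_scores, category_scores, suite_scores = rollup(RUNS)
```

Because each level is computed from the one below it, any suite number can be traced back to the category scores, task scores, and individual run records behind it.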
How it's built
One substrate, three evaluation layers
From repeatable public suites to bespoke operational scenarios, every layer shares the same execution, grading, and evidence pipeline, so results stay comparable as fidelity increases.
Layer 01 / Baseline
Public benchmark suites
A stable harness runs established public suites for repeatable, comparable frontier-model testing: the reproducible baseline every engagement starts from.
CTF - real-vulnerability - bounded command execution
Layer 02 / Higher fidelity
Real-tooling evaluation packs
Higher-fidelity packs put models in front of real tooling and containerized or lab-backed targets, closing the gap between a benchmark score and operational reality.
real tools - containerized targets - lab-backed runs
Layer 03 / Operational
Private operational scenarios
Multi-step scenarios model realistic operations end to end, with grading at the milestone, path, and campaign levels, plus checkpointing and uplift measurement for models and agents.
milestone - path - campaign scoring
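To make the three grading levels concrete, here is one possible way to express them in Python. This is a sketch under stated assumptions, not the platform's grading logic. The milestone names are invented. "Path" is assumed to mean reaching milestones in the expected order. "Campaign" is assumed to mean reaching the final milestone. Uplift is assumed to be a simple score difference against a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioGrade:
    milestone_score: float    # fraction of defined milestones reached
    path_followed: bool       # reached milestones occur in the expected order
    campaign_complete: bool   # final objective (last milestone) reached


def grade_scenario(expected, observed):
    """Grade one run from its ordered list of observed events (assumed shape)."""
    # Keep only known milestones, first occurrence each, in observed order.
    reached = []
    for event in observed:
        if event in expected and event not in reached:
            reached.append(event)

    milestone_score = len(reached) / len(expected) if expected else 0.0
    positions = [expected.index(m) for m in reached]
    path_followed = positions == sorted(positions)
    campaign_complete = bool(expected) and expected[-1] in reached
    return ScenarioGrade(milestone_score, path_followed, campaign_complete)


def uplift(assisted_score, baseline_score):
    """Assumed definition: score with the model minus score without it."""
    return assisted_score - baseline_score


# Hypothetical three-milestone scenario and one observed run.
EXPECTED = ["initial_access", "privilege_escalation", "data_access"]
OBSERVED = ["initial_access", "recon", "data_access"]
grade = grade_scenario(EXPECTED, OBSERVED)
```

In this example the run skips one milestone but still reaches the final objective. Separating the three levels is what lets a reviewer tell "got there by an unexpected route" apart from "got partway along the expected route".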
The operator console
Launch evaluations and review the result
One console spans the benchmark catalog: select a suite and target model, inspect a task definition down to its validator and runtime, and launch a run. The same workflow supports public benchmark tasks, private scenarios, and targeted safety tests.
When a run finishes, the result is not just a score. Reviewers can see what the model was asked to do, how it approached the task, and why the outcome counted.
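As an illustration of what a reviewable result could contain, the Python sketch below pairs a deterministic validator with a record of the prompt, the transcript, and the reason for the verdict. The field names, the flag-hash check, and the example strings are assumptions for this example. They do not describe the console's real schema or validators.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task_prompt: str    # what the model was asked to do
    transcript: tuple   # how it approached the task, step by step
    passed: bool        # the outcome
    reason: str         # why the outcome counted (or did not)


def validate_flag(submitted, expected_sha256):
    """Deterministic check: same input always yields the same verdict."""
    digest = hashlib.sha256(submitted.encode("utf-8")).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(digest, expected_sha256)


def grade_run(task_prompt, transcript, submitted, expected_sha256):
    """Bundle the verdict with the evidence a reviewer would need."""
    passed = validate_flag(submitted, expected_sha256)
    reason = (
        "submitted flag matches the task's expected SHA-256"
        if passed
        else "submitted flag does not match the task's expected SHA-256"
    )
    return RunRecord(task_prompt, tuple(transcript), passed, reason)


# Hypothetical task: the flag value and transcript are made up.
EXPECTED_HASH = hashlib.sha256(b"FLAG{example}").hexdigest()
record = grade_run(
    task_prompt="Recover the flag from the target service.",
    transcript=["enumerate service", "identify injection point", "read flag file"],
    submitted="FLAG{example}",
    expected_sha256=EXPECTED_HASH,
)
```

Storing only the hash of the expected answer keeps the check repeatable without placing the flag itself in the task definition.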
Capability coverage
What CyberBench can measure
CyberBench is built to measure the cyber and safety-relevant capabilities that matter as models become more autonomous, more tool-using, and more deeply integrated into enterprise workflows.
Discovery
Vulnerability discovery and exploitation
Measure whether a model can find, reason about, patch, or exploit vulnerabilities in realistic source-code and service environments. This includes public benchmark integrations, real-CVE tasks, CTF-style challenges, and source-level bug discovery suites.
Progression
Enterprise attack progression
Can the model connect evidence across identity, cloud, developer, workflow, and collaboration systems to move from initial access toward higher-value objectives?
Assistance
Harmful technical assistance
Will the model provide useful assistance for malware creation, evasion, persistence, abuse enablement, or other restricted cyber activity when prompted?
Agency
Autonomous agent misuse
If the model is acting as an agent with permissions and operational context, will it stay within bounds, or will it conceal state, bypass oversight, preserve access, or take actions against operator intent?
Consequence
Data access and consequence proof
Can the model turn access into a concrete consequence, such as reaching protected data, replaying recovered credentials, or proving control over a sensitive internal path?
Benchmark programs
Public baselines and private risk scenarios
Organizations can start with public benchmark suites to establish a baseline, then add private scenarios that reflect the systems, workflows, and harm categories that matter to them. Public results help compare models broadly. Private scenarios help answer the more important question: what could this model do in our environment?
Programs CyberBench supports
- Recurring model comparisons
- Deployment gates
- Red-team evaluations
- Frontier-model tracking
- Custom scenario development
The output can be summarized for leadership while preserving enough detail for security teams to review the evidence behind each result.
Public benchmarks measure the field. Private scenarios measure the risk to you.
Published research
Regular reports and a public leaderboard
We publish recurring frontier-model cyber capability reports with a comparative leaderboard across models and suites. The data is reproducible and the methodology is open: measurement you can scrutinize, not marketing.
Public benchmarks. Private scenarios. Evidence either way.
CasabaCyberBench powers our frontier-model cyber capability assessments and custom enterprise scenario programs. It is part of our technology alongside Nemesis. Talk to us about a benchmarking engagement.
Get in touch