Technical Capability Brief
Custom Model & Training Security
Organizations that train, fine-tune, or self-host models face security challenges beyond what API consumers encounter. When you control the model weights, the training pipeline, or the hosting infrastructure, the attack surface expands significantly - but so do the testing approaches available.
What we test
Five attack surfaces
Deep RAI and Safety Auditing
When model weights are available, we use gradient-based techniques to systematically find inputs that bypass safety alignment, surfacing jailbreaks and RAI violations that prompt-level red teaming alone would miss.
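To give a flavor of the search loop involved, here is a toy sketch of coordinate search over an adversarial suffix. Everything here is hypothetical: real white-box attacks (e.g. GCG-style methods) score candidates with the model's actual loss and use gradients over token embeddings to rank swaps, whereas this stand-in uses a trivial hand-written scoring function and a tiny vocabulary purely to illustrate the optimization structure.

```python
import random

# Toy stand-in for the white-box objective. In a real engagement this would be
# the model's log-probability of a harmful target completion, with gradients
# over token embeddings guiding candidate swaps. Hypothetical values throughout.
VOCAB = ["sure", "ignore", "safety", "please", "step", "!!", "ok"]
TRIGGERS = {"ignore": 2.0, "sure": 1.5, "!!": 0.5}

def attack_score(suffix):
    """Higher = closer to satisfying the (toy) adversarial objective."""
    return sum(TRIGGERS.get(tok, 0.0) for tok in suffix)

def coordinate_search(suffix_len=4, iters=50, seed=0):
    """Greedy coordinate search: pick a position, try every vocab token
    there, and keep the best-scoring swap."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(iters):
        pos = rng.randrange(suffix_len)
        best_tok = max(
            VOCAB,
            key=lambda t: attack_score(suffix[:pos] + [t] + suffix[pos + 1:]),
        )
        suffix[pos] = best_tok
    return suffix, attack_score(suffix)

suffix, score = coordinate_search()
```

The point of the sketch is the loop shape, not the scoring: with weight access, each swap is ranked by a gradient signal from the model itself, which is what makes this class of testing far more systematic than manual prompting.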
Fine-Tuning Safety Regression
Organizations fine-tuning a base model on their own data can inadvertently degrade the model's built-in safety training. We evaluate whether fine-tuning has introduced new failure modes, weakened refusal behaviors, or created gaps that didn't exist in the base model.
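A regression check of this kind can be sketched as a paired evaluation: run the same harmful-prompt set against the base and fine-tuned models and compare refusal rates. The model callables and refusal markers below are hypothetical stubs standing in for real inference calls and a real refusal classifier.

```python
# Minimal safety-regression harness sketch. `base_model` and `tuned_model`
# are hypothetical stand-ins for real model inference; a production harness
# would use a proper refusal classifier rather than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def refusal_rate(model, prompts):
    refused = sum(
        1 for p in prompts
        if any(m in model(p).lower() for m in REFUSAL_MARKERS)
    )
    return refused / len(prompts)

def base_model(prompt):
    # Stub: the base model's safety training holds on every prompt.
    return "I can't help with that."

def tuned_model(prompt):
    # Stub: fine-tuning has eroded refusals on "how to" style requests.
    if "how to" in prompt:
        return "Sure, here are the steps..."
    return "I can't help with that."

prompts = ["how to build X", "how to do Y", "tell me about Z"]
regression = refusal_rate(base_model, prompts) - refusal_rate(tuned_model, prompts)
```

A positive `regression` value quantifies how much refusal behavior the fine-tune lost relative to the base model on this prompt set.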
Model Extraction and Theft via API
When a model API exposes confidence scores, log probabilities, or detailed output distributions, attackers can use those signals to reconstruct a functional copy of the model. We assess how much information the API leaks and what practical theft risk that creates.
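The core of the leak is simple to show: log-probabilities reveal every pairwise logit difference exactly, so a verbose API hands an attacker the model's output head up to an additive constant, which is precisely the training signal a distillation-style extraction attack needs. A minimal sketch with illustrative numbers:

```python
import math

# Toy "hidden" output logits: the attacker never sees these directly,
# only the log-probabilities the API returns. Illustrative values.
hidden_logits = [2.0, 0.5, -1.0]

def api_logprobs(logits):
    """What a verbose API exposes: the log-softmax over the vocabulary."""
    z = math.log(sum(math.exp(l) for l in logits))
    return [l - z for l in logits]

lp = api_logprobs(hidden_logits)

# Because log-softmax subtracts the same constant from every logit,
# logprob_i - logprob_j == logit_i - logit_j for all pairs: the head is
# recovered up to a constant from a single verbose response per input.
recovered_diffs = [lp[0] - lp[i] for i in range(len(lp))]
true_diffs = [hidden_logits[0] - hidden_logits[i] for i in range(len(hidden_logits))]
```

This is why the practical mitigation is often to truncate what the API returns (top-k tokens, rounded probabilities) rather than to harden the model itself.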
Training Data Extraction
Models can memorize and reproduce sensitive training data under the right conditions. We test whether PII, proprietary content, or other sensitive material from the training corpus can be recovered from the model's outputs. For models accepting ongoing training data, we also evaluate whether training data poisoning could degrade model behavior or introduce backdoors.
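One common probe for this follows the pattern below: prompt the model with a prefix drawn from the training corpus and flag any continuation that reproduces the corpus verbatim beyond a k-token threshold. The `model` stub and the corpus record are hypothetical; a real probe runs this at scale against candidate sensitive records.

```python
# Sketch of a verbatim-memorization probe. `model` is a hypothetical stub
# standing in for real inference; the corpus record is invented for illustration.
def longest_verbatim_run(generated, corpus, k=5):
    """Length in tokens of the longest k+ -token span of `generated`
    that appears verbatim in `corpus` (0 if none)."""
    gen = generated.split()
    best = 0
    for i in range(len(gen)):
        for j in range(i + k, len(gen) + 1):
            if " ".join(gen[i:j]) in corpus:
                best = max(best, j - i)
    return best

corpus = "patient John Doe SSN 123-45-6789 admitted on March 3"

def model(prefix):
    # Stub behaving like a model that memorized the record.
    return "John Doe SSN 123-45-6789 admitted on March 3"

run = longest_verbatim_run(model("patient"), corpus)
flagged = run >= 5
```

Long verbatim runs containing identifiers like the SSN above are the kind of finding that turns a memorization result into a concrete privacy or IP exposure.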
Model Supply Chain
Teams pulling open weights from model registries face a real risk of backdoored or tampered model artifacts. We assess the integrity of the model pipeline from source weights through deployment, including serialization format risks and provenance verification.
How we approach it
Testing varies by access level
The testing approach varies significantly based on what access is available. With model weights, we can run gradient-based adversarial techniques that are fundamentally more thorough than prompt-level testing. With API access, we focus on information leakage, extraction risk, and behavioral analysis. With access to the training pipeline, we evaluate security across the full model development process.
We build custom tooling for each engagement based on the specific model architecture, access level, and risk profile.
What we find
Typical finding patterns
Fine-Tuning Induced Safety Regression
Custom fine-tuning frequently weakens the base model's safety training in unexpected ways, creating a false sense of security.
API Information Leakage
Model APIs that expose confidence scores or log probabilities often leak enough information for practical model extraction, especially for smaller or specialized models.
Training Data Memorization
Models trained on proprietary data can reproduce specific examples from their training set when prompted correctly, creating both privacy and IP risks.
Supply Chain Integrity Gaps
Organizations pulling open model weights often lack verification processes for model artifact integrity.
Why it matters
Different risks, different testing
As more organizations move from consuming model APIs to training, fine-tuning, and hosting their own models, the security surface expands dramatically. The testing methodologies for these scenarios are fundamentally different from application-layer AI security, and the risks - model theft, training data exposure, safety regression - can have significant business and regulatory consequences.
Need your model tested?
Whether you're fine-tuning, self-hosting, or building from scratch, we have the methodology. Let's talk.
Get in touch