Custom Model & Training Security

Organizations that train, fine-tune, or self-host models face security challenges beyond what API consumers encounter. When you control the model weights, the training pipeline, or the hosting infrastructure, the attack surface expands significantly - but so do the testing approaches available.

Five attack surfaces

Deep RAI and Safety Auditing

When model weights are available, we can use gradient-based techniques to systematically find inputs that bypass safety alignment, surfacing jailbreaks and RAI violations that prompt-level probing would miss. Much more thorough than prompt-level red teaming alone.

Fine-Tuning Safety Regression

Organizations fine-tuning a base model on their own data can inadvertently degrade the model's built-in safety training. We evaluate whether fine-tuning has introduced new failure modes, weakened refusal behaviors, or created gaps that didn't exist in the base model.

Model Extraction and Theft via API

When a model API exposes confidence scores, log probabilities, or detailed output distributions, attackers can use those signals to reconstruct a functional copy of the model. We assess how much information the API leaks and what practical theft risk that creates.

Training Data Extraction

Models can memorize and reproduce sensitive training data under the right conditions. We test whether PII, proprietary content, or other sensitive material from the training corpus can be recovered from the model's outputs. For models accepting ongoing training data, we also evaluate whether training data poisoning could degrade model behavior or introduce backdoors.

Model Supply Chain

For teams pulling open weights from model registries: backdoored or tampered model artifacts are a real risk. We assess the integrity of the model pipeline from source weights through deployment, including serialization format risks and provenance verification.

Testing varies by access level

The testing approach varies significantly based on what access is available. With model weights, we can run gradient-based adversarial techniques that are fundamentally more thorough than prompt-level testing. With API access, we focus on information leakage, extraction risk, and behavioral analysis. With access to the training pipeline, we evaluate security across the full model development process.

We build custom tooling for each engagement based on the specific model architecture, access level, and risk profile.

Typical finding patterns

Fine-Tuning Induced Safety Regression

Custom fine-tuning frequently weakens the base model's safety training in unexpected ways, creating a false sense of security.

API Information Leakage

Model APIs that expose confidence scores or log probabilities often leak enough information for practical model extraction, especially for smaller or specialized models.

Training Data Memorization

Models trained on proprietary data can reproduce specific examples from their training set when prompted correctly, creating both privacy and IP risks.

Supply Chain Integrity Gaps

Organizations pulling open model weights often lack verification processes for model artifact integrity.

Different risks, different testing

As more organizations move from consuming model APIs to training, fine-tuning, and hosting their own models, the security surface expands dramatically. The testing methodologies for these scenarios are fundamentally different from application-layer AI security, and the risks - model theft, training data exposure, safety regression - can have significant business and regulatory consequences.

Need your model tested?

Whether you're fine-tuning, self-hosting, or building from scratch, we have the methodology. Let's talk.

Get in touch