What is data leakage?
Data leakage is the unchecked exposure or exfiltration of organizational data to an unauthorized third party. It occurs through various means, such as misconfigured databases, poorly protected network servers, phishing attacks, or careless data handling.
Data leakage can happen accidentally: 82% of all organizations give third parties wide read access to their environments, which poses major security risks and serious privacy concerns. However, data leakage also occurs due to malicious activities including hacking, phishing, or insider threats where employees intentionally steal data.

The rising threat of data leakage in machine learning (ML)
In machine learning (ML) and artificial intelligence (AI) systems, including large language models (LLMs), data leakage takes on a second meaning: information that should not be available during training makes its way into the training process and skews the model. This typically arises from mismanagement of the preprocessing phase of ML development. A classic example is normalizing features using the mean and standard deviation of the entire dataset instead of only the training subset.
Data leakage occurs in machine learning models through target leakage or train-test contamination. In the latter, the data intended for testing the model leaks into the training set. If a model is exposed to test data during training, its performance metrics will be misleadingly high.
In target leakage, subsets used for training include information that is unavailable at the prediction phase of ML development. Although models appear to perform well under this scenario, the results give stakeholders a false sense of model efficacy, leading to poor performance in real-world applications.
Types and Examples of Data Leakage in Machine Learning
Data leakage occurs when information that shouldn’t be available at prediction time is used to train a machine learning model, leading to overly optimistic performance and poor generalization.
Target Leakage
Occurs when the training data includes features that reveal the target label, even indirectly. This gives the model access to future information it wouldn't have in a real-world scenario.
Example: Including "payment status" when predicting loan default.
Train-Test Contamination
Happens when test data influences the training process, often due to improper data splitting. This leads to inflated performance metrics.
Example: Randomly splitting time-series data, allowing future values into the training set.
Preprocessing Leakage
Occurs when transformations like scaling or imputation are applied before splitting the data. This causes information from the test set to influence the training set.
Example: Normalizing the full dataset before splitting, which leaks statistical properties.
Feature Leakage
Happens when engineered features use information unavailable at prediction time. This includes temporal leakage, where features rely on future events.
Example: Creating a feature like "average spend over the past year" using future transactions.
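To make the preprocessing case concrete, here is a minimal Python sketch (using scikit-learn on a synthetic placeholder dataset) contrasting the leaky pattern of normalizing the full dataset with the correct approach of fitting the scaler on the training split only:

```python
# Minimal sketch of preprocessing leakage vs. the correct workflow.
# scikit-learn and the synthetic dataset are assumptions for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))       # placeholder features
y = rng.integers(0, 2, size=1000)    # placeholder labels

# Leaky: scaling statistics (mean, std) are computed over the full dataset,
# so information about the test rows bleeds into the training data.
X_scaled_leaky = StandardScaler().fit_transform(X)
X_train_leaky, X_test_leaky, *_ = train_test_split(X_scaled_leaky, y, random_state=0)

# Correct: split first, then fit the scaler on the training subset only
# and reuse those statistics to transform the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```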
Common causes of data leakage
Data leakage occurs for a variety of reasons; the following are some of the most common.
1. Misconfigured storage buckets
Cloud storage services like Amazon S3 or Azure Blob Storage can be unintentionally left open to the public, exposing sensitive files to anyone with the URL.
2. Over-permissioned identities
Users, services, or roles often have more access than they need. If compromised, these identities can be used to exfiltrate data across environments.
3. Hardcoded secrets in code
Secrets, API keys, and credentials stored in source code or configuration files can be easily discovered if that code is leaked or reused in unsafe ways.
4. Lack of encryption at rest or in transit
Data that is not properly encrypted is vulnerable to interception, especially when moving between services or across networks.
5. Shadow data stores
Untracked or forgotten databases and storage assets often go unmonitored and unpatched, leaving them vulnerable to exposure.
6. Publicly exposed services and APIs
Applications or APIs that are unintentionally accessible from the internet can leak data if they lack proper authentication or access controls.
7. Third-party integrations
External services with weak security practices or misconfigured access can become a path for data to leak out of your environment.
8. Insecure AI and ML pipelines
Training data containing sensitive information may be stored or processed in insecure locations, increasing the risk of leaks during experimentation or deployment.
How to prevent leakage
1. Data Preprocessing and Sanitization
Anonymization and Redaction
Anonymization involves altering or removing personally identifiable information (PII) and sensitive data to prevent it from being linked back to individuals. Redaction is a more specific process that involves removing or obscuring sensitive parts of the data, such as credit card numbers, Social Security numbers, or addresses.
Without proper anonymization and redaction, AI models can "memorize" sensitive data from the training set, which could be inadvertently reproduced in model outputs. This is especially dangerous if the model is used in public or client-facing applications.
Best Practices:
Use tokenization, hashing, or encryption techniques to anonymize data.
Ensure that any redacted data is permanently removed from both structured (e.g., databases) and unstructured (e.g., text files) datasets before training.
Implement differential privacy (discussed later) to further reduce the risk of individual data exposure.
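As an illustration of the tokenization and hashing practice above, the following Python sketch pseudonymizes a direct identifier with a keyed hash before the data reaches a training pipeline. The field names and key handling are assumptions for illustration, not a prescribed design:

```python
# Minimal sketch: pseudonymize a PII field with a keyed hash before training.
# The field names and the key source are illustrative assumptions.
import hashlib
import hmac
import os

# In practice the key would come from a secrets manager rather than an
# environment variable with a fallback value.
HASH_KEY = os.environ.get("PII_HASH_KEY", "replace-me").encode()

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return hmac.new(HASH_KEY, value.encode(), hashlib.sha256).hexdigest()

records = [
    {"email": "alice@example.com", "purchase_total": 120.0},
    {"email": "bob@example.com", "purchase_total": 85.5},
]

# Keep the analytical signal (purchase_total) but drop the raw identifier.
sanitized = [
    {"user_token": pseudonymize(r["email"]), "purchase_total": r["purchase_total"]}
    for r in records
]
```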
Data Minimization
Data minimization involves only collecting and using the smallest necessary dataset to achieve the AI model’s objective. The less data collected, the lower the risk of sensitive information being leaked.
Collecting excessive data increases the risk surface for breaches and the chances of leaking sensitive information. By using only what's necessary, you also ensure compliance with privacy regulations like GDPR or CCPA.
Best Practices:
Conduct a data audit to assess which data points are essential for training.
Implement policies to discard non-essential data early in the preprocessing pipeline.
Regularly review the data collection process to ensure that no unnecessary data is being retained.
2. Model Training Safeguards
Proper Data Splitting
Data splitting separates the dataset into training, validation, and test sets. The training set teaches the model, while the validation and test sets measure how well it generalizes to data it has not seen, guarding against overfitting.
If data is improperly split (e.g., the same data is present in both the training and test sets), the model can effectively “memorize” the test set, leading to overestimation of its performance and potential exposure of sensitive information in both training and prediction phases.
Best Practices:
Randomize datasets during splitting to ensure no overlap between the training, validation, and test sets.
Use techniques like k-fold cross-validation to robustly assess model performance without data leakage.
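A minimal scikit-learn sketch of both practices follows; the dataset and model are synthetic placeholders. Keeping the scaler inside the pipeline means each cross-validation fold re-fits preprocessing on its own training portion, so no validation data leaks into the transform:

```python
# Minimal sketch: leak-free train/test split plus k-fold cross-validation.
# The dataset and model choice are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set the model never sees during training or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler sits inside the pipeline, so its statistics are re-fit on each
# training fold and never computed over the fold being evaluated.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print(f"Cross-validated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final evaluation on the untouched test set.
model.fit(X_train, y_train)
print(f"Held-out test accuracy: {model.score(X_test, y_test):.3f}")
```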
Regularization Techniques
Regularization techniques are employed during training to prevent overfitting, where the model becomes too specific to the training data and learns to “memorize” rather than generalize from it. Overfitting increases the likelihood of data leakage since the model can memorize sensitive information from the training data and reproduce it during inference.
Best Practices:
Dropout: Randomly drop certain units (neurons) from the neural network during training, forcing the model to generalize rather than memorize patterns.
Weight Decay (L2 Regularization): Penalize large weights during training to prevent the model from fitting too closely to the training data.
Early Stopping: Monitor model performance on a validation set and stop training when the model's performance starts to degrade due to overfitting.
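The sketch below shows one way these three techniques can be combined in PyTorch; the architecture, hyperparameters, and synthetic data are illustrative assumptions rather than recommended settings:

```python
# Minimal sketch combining dropout, weight decay (L2), and early stopping.
# Architecture, hyperparameters, and the synthetic data are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic placeholder data so the sketch runs end to end.
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))
train_loader = DataLoader(TensorDataset(X[:800], y[:800]), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X[800:], y[800:]), batch_size=64)

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout: randomly zero units during training
    nn.Linear(64, 2),
)
# weight_decay applies an L2 penalty to the weights during optimization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_val_loss, patience, stale_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)

    # Early stopping: halt once validation loss stops improving.
    if val_loss < best_val_loss:
        best_val_loss, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break
```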
Differential Privacy
Differential privacy adds controlled noise to the data or model outputs, making it difficult for attackers to extract information about any individual data point in the dataset.
By applying differential privacy, AI models are less likely to leak details of specific individuals during training or prediction, providing a layer of protection against adversarial attacks or unintended data leakage.
Best Practices:
Add Gaussian or Laplace noise to training data, model gradients, or final predictions to obscure individual data contributions.
Use frameworks like TensorFlow Privacy or PySyft to apply differential privacy in practice.
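As a simplified illustration of the idea (not a calibrated privacy guarantee), the following PyTorch sketch clips a batch gradient and perturbs it with Gaussian noise before the update. Production use should rely on a dedicated framework such as TensorFlow Privacy, noted above, which clips per-example gradients and tracks the privacy budget; the values below are assumptions:

```python
# Simplified DP-style update: clip the gradient norm, then add Gaussian noise.
# Clip bound, noise multiplier, and the toy model/data are assumptions.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

clip_norm = 1.0          # bounds each update's sensitivity
noise_multiplier = 1.1   # noise scale relative to the clip bound
batch_size = 64

xb = torch.randn(batch_size, 20)          # placeholder batch
yb = torch.randint(0, 2, (batch_size,))

optimizer.zero_grad()
loss_fn(model(xb), yb).backward()

# Clip, then perturb each gradient so no single record dominates the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
for p in model.parameters():
    if p.grad is not None:
        p.grad += torch.randn_like(p.grad) * noise_multiplier * clip_norm / batch_size
optimizer.step()
```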
3. Secure Model Deployment
Tenant isolation
In a multi-tenant environment, tenant isolation creates a logical or physical boundary between each tenant's data, preventing one tenant from accessing or manipulating another's sensitive information. By isolating each tenant's data, businesses can prevent unauthorized access, reduce the risk of data breaches, and ensure compliance with data protection regulations.
Tenant isolation provides an additional layer of security, giving organizations peace of mind knowing that their sensitive AI training data is protected from potential leaks or unauthorized access.
Best Practices:
Logical Separation: Use virtualization techniques like containers or virtual machines (VMs) to ensure that each tenant’s data and processing are isolated from one another.
Access Controls: Implement strict access control policies to ensure that each tenant can only access their own data and resources.
Encryption and Key Management: Use tenant-specific encryption keys to further segregate data, ensuring that even if a breach occurs, data from other tenants remains secure.
Resource Throttling and Monitoring: Prevent tenants from exhausting shared resources by enforcing resource limits and monitoring for anomalous behavior that might compromise the system’s isolation.
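To make the key-management point concrete, here is a hedged Python sketch using the cryptography library's Fernet primitive. The in-memory dictionary stands in for a real key management service, and all names are illustrative:

```python
# Minimal sketch of tenant-specific encryption keys. The in-memory dictionary
# stands in for a real key management service (KMS); names are illustrative.
from cryptography.fernet import Fernet

# One key per tenant, generated and stored separately from tenant data.
tenant_keys = {
    "tenant-a": Fernet.generate_key(),
    "tenant-b": Fernet.generate_key(),
}

def encrypt_for_tenant(tenant_id: str, plaintext: bytes) -> bytes:
    """Encrypt data with the calling tenant's key only."""
    return Fernet(tenant_keys[tenant_id]).encrypt(plaintext)

def decrypt_for_tenant(tenant_id: str, ciphertext: bytes) -> bytes:
    """Decryption succeeds only with the key that produced the ciphertext."""
    return Fernet(tenant_keys[tenant_id]).decrypt(ciphertext)

blob = encrypt_for_tenant("tenant-a", b"tenant-a training records")
# decrypt_for_tenant("tenant-b", blob) would raise InvalidToken:
# compromising tenant-b's key never exposes tenant-a's data.
print(decrypt_for_tenant("tenant-a", blob))
```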
Output Sanitization
Output sanitization involves implementing checks and filters on model outputs to prevent the accidental exposure of sensitive data, especially in natural language processing (NLP) and generative models.
In some cases, the model might reproduce sensitive information it encountered during training (e.g., names or credit card numbers). Sanitizing outputs ensures that no sensitive data is exposed.
Best Practices:
Use pattern-matching algorithms to identify and redact PII (e.g., email addresses, phone numbers) in model outputs.
Set thresholds on probabilistic outputs to prevent the model from making overly confident predictions that could expose sensitive details.
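A minimal Python sketch of the pattern-matching approach is shown below; the regular expressions are simple illustrations and would typically be supplemented by dedicated PII-detection tooling:

```python
# Minimal sketch of pattern-based output sanitization. The patterns are
# illustrative and intentionally simple.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def sanitize_output(text: str) -> str:
    """Redact common PII patterns from a model response before returning it."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(sanitize_output("Contact Jane at jane.doe@example.com or 555-123-4567."))
# -> "Contact Jane at [REDACTED EMAIL] or [REDACTED PHONE]."
```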
4. Organizational Practices
Employee Training
Employee training ensures that all individuals involved in the development, deployment, and maintenance of AI models understand the risks of data leakage and the best practices to mitigate them. Many data breaches occur due to human error or oversight. Proper training can prevent accidental exposure of sensitive information or model vulnerabilities.
Best Practices:
Provide regular cybersecurity and data privacy training for all employees handling AI models and sensitive data.
Update staff on emerging AI security risks and new preventive measures.
Data Governance Policies
Data governance policies set clear guidelines for how data should be collected, processed, stored, and accessed across the organization, ensuring that security practices are consistently applied.
A well-defined governance policy ensures that data handling is standardized and compliant with privacy laws like GDPR or HIPAA, reducing the chances of leakage.
Best Practices:
Define data ownership and establish clear protocols for handling sensitive data at every stage of AI development.
Regularly review and update governance policies to reflect new risks and regulatory requirements.
5. Leverage AI Security Posture Management (AI-SPM) tools
AI-SPM solutions provide visibility and control over critical components of AI security, including the data used for training/inference, model integrity, and access to deployed models. By incorporating an AI-SPM tool, organizations can proactively manage the security posture of their AI models, minimizing the risk of data leakage and ensuring robust AI system governance.
How AI-SPM helps prevent ML model leakage:
Discover and inventory all AI applications, models, and associated resources
Identify vulnerabilities in the AI supply chain and misconfigurations that could lead to data leakage
Monitor for sensitive data across the AI stack, including training data, libraries, APIs, and data pipelines
Detect anomalies and potential data leakage in real time
Implement guardrails and security controls specific to AI systems
Conduct regular audits and assessments of AI applications
Preventing data leakage with Wiz
Data leakage in the cloud often results from a combination of overlooked risks – exposed storage, overly permissive identities, hardcoded secrets, and unmonitored shadow data. As cloud environments grow and organizations adopt AI, these risks only multiply.
Wiz helps teams stay ahead with built-in Data Security Posture Management (DSPM) that continuously discovers and classifies sensitive data across cloud workloads, storage, and applications. Wiz DSPM doesn’t stop at visibility – it maps sensitive data to real exposure paths, identifying which assets are publicly accessible, unencrypted, or connected to over-permissioned identities. This context helps teams prioritize and fix what matters most.
As organizations begin training and deploying AI models, Wiz DSPM extends protection to AI pipelines as well – detecting sensitive training data in cloud environments and flagging leaks across services, APIs, and dev workflows.
By unifying data discovery with real-world risk context, Wiz enables security teams to proactively prevent data leakage – whether the risk originates in cloud infrastructure, developer code, or AI workloads.