SecureML is an open-source Python library that integrates with popular machine learning frameworks like TensorFlow and PyTorch. It provides developers with easy-to-use utilities to ensure that AI agents handle sensitive data in compliance with data protection regulations.
- Data Anonymization Utilities:
  - K-anonymity implementation with adaptive generalization
  - Pseudonymization with format-preserving encryption
  - Configurable data masking with statistical property preservation
  - Hierarchical data generalization with taxonomy support
  - Automatic sensitive data detection
- Privacy-Preserving Training Methods:
  - Differential privacy integration with PyTorch (via Opacus) and TensorFlow (via TensorFlow Privacy)
  - Federated learning with Flower, allowing training on distributed data without centralization (see the client sketch after this list)
  - Support for secure aggregation and privacy-preserving federated learning
- Compliance Checkers: Tools to analyze datasets and model configurations for potential privacy risks
- Synthetic Data Generation: Utilities to create synthetic datasets that mimic real data
- Regulation-Specific Presets:
  - Pre-configured YAML settings aligned with major regulations (GDPR, CCPA, HIPAA)
  - Detailed compliance requirements for each regulation
  - Customizable identifiers for personal data and sensitive information
  - Integration with compliance checking functionality
- Audit Trails and Reporting: Automatic logging of privacy measures and model decisions
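The federated learning support mentioned above builds on Flower (flwr). As a rough sketch of what a participating client looks like, the snippet below is written directly against Flower's NumPyClient interface rather than SecureML's own wrapper; the class name, placeholder weights, and server address are illustrative assumptions, and exact Flower signatures can vary between versions.

import flwr as fl
import numpy as np

class SketchClient(fl.client.NumPyClient):
    # Hypothetical client holding one participant's local data
    def __init__(self):
        self.weights = [np.zeros(10)]  # placeholder model parameters

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        # A real client would run local training here before returning its update
        self.weights = parameters
        return self.weights, 1, {}

    def evaluate(self, parameters, config):
        # Returns (loss, number of local examples, metrics)
        return 0.0, 1, {}

# To join a running Flower server (address is illustrative):
# fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=SketchClient())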
Disclaimer: Due to TensorFlow Privacy compatibility issues, SecureML currently supports Python versions up to 3.11. We will update as soon as TensorFlow Privacy releases a version compatible with Python 3.12+.
With pip (Python 3.11):
pip install secureml
# For generating PDF reports for compliance and audit trails
pip install secureml[pdf]
# For secure key management with HashiCorp Vault
pip install secureml[vault]
# For all optional components
pip install secureml[pdf,vault]
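After installing, you can confirm the package resolved correctly. This check uses only the Python standard library and makes no assumptions about SecureML's own API:

import importlib.metadata
print(importlib.metadata.version("secureml"))  # prints the installed version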
Anonymizing a dataset to comply with privacy regulations:
import pandas as pd
from secureml import anonymize

# Load your dataset
data = pd.DataFrame({
    "name": ["John Doe", "Jane Smith", "Bob Johnson"],
    "age": [32, 45, 28],
    "email": ["john.doe@example.com", "jane.smith@example.com", "bob.j@example.com"],
    "ssn": ["123-45-6789", "987-65-4321", "456-78-9012"],
    "zip_code": ["10001", "94107", "60601"],
    "income": [75000, 82000, 65000]
})

# Anonymize using k-anonymity: with k=2, each released record is
# indistinguishable from at least one other record on its generalized attributes
anonymized_data = anonymize(
    data,
    method="k-anonymity",
    k=2,
    sensitive_columns=["name", "email", "ssn"]
)

print(anonymized_data)
SecureML includes built-in presets for major regulations (GDPR, CCPA, HIPAA) that define the compliance requirements specific to each regulation:
import pandas as pd
from secureml import check_compliance

# Load your dataset
data = pd.read_csv("your_dataset.csv")

# Model configuration
model_config = {
    "model_type": "neural_network",
    "input_features": ["age", "income", "zip_code"],
    "output": "purchase_likelihood",
    "training_method": "standard_backprop"
}

# Check compliance with GDPR
report = check_compliance(
    data=data,
    model_config=model_config,
    regulation="GDPR"
)

# View compliance issues
if report.has_issues():
    print("Compliance issues found:")
    for issue in report.issues:
        print(f"- {issue['component']}: {issue['issue']} ({issue['severity']})")
        print(f"  Recommendation: {issue['recommendation']}")
Train a model with differential privacy guarantees:
import torch.nn as nn
import pandas as pd
from secureml import differentially_private_train

# Create a simple PyTorch model
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
    nn.Softmax(dim=1)
)

# Load your dataset
data = pd.read_csv("your_dataset.csv")

# Train with differential privacy
private_model = differentially_private_train(
    model=model,
    data=data,
    epsilon=1.0,  # Privacy budget
    delta=1e-5,   # Privacy delta parameter
    epochs=10,
    batch_size=64
)
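As a quick sanity check, the returned model can be used like any other PyTorch module for inference. The snippet below assumes differentially_private_train hands back a standard nn.Module and uses a random tensor in place of real features:

import torch

sample = torch.randn(1, 10)        # one row with the 10 input features the model expects
with torch.no_grad():
    probs = private_model(sample)  # class probabilities from the Softmax output layer
print(probs)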
Generate synthetic data that maintains the statistical properties of the original data:
import pandas as pd
from secureml import generate_synthetic_data

# Load your dataset
data = pd.read_csv("your_dataset.csv")

# Generate synthetic data
synthetic_data = generate_synthetic_data(
    template=data,
    num_samples=1000,
    method="statistical",  # Options: simple, statistical, sdv-copula, gan
    sensitive_columns=["name", "email", "ssn"]
)

print(synthetic_data.head())
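A simple way to judge whether the synthetic sample preserves the original's statistical properties is to compare summary statistics side by side. This uses pandas only and assumes no SecureML-specific evaluation API:

# Compare marginal statistics of the numeric columns
print(data.describe())
print(synthetic_data.describe())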
For detailed documentation, examples, and API reference, visit our documentation.
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue. Our current focus is expanding support to regulations beyond GDPR, CCPA, and HIPAA, and you can help us with that!
This project is licensed under the MIT License - see the LICENSE file for details.