Privacy Model
EdgeML is built so that sensitive data never leaves the device. The platform is designed around data minimization and local training.
Data Flow Architecture
Privacy Guarantees
What stays on-device
- Raw user data (texts, images, sensor data)
- Feature extraction results
- Local training batches
What is shared
- Model weight updates (or deltas)
- Training metadata (sample counts, basic metrics)
Update Privacy Comparison
Threat model (MVP)
- Prevent server-side collection of raw data
- Reduce data exposure by keeping training local
- Provide audit trails for model changes
Privacy-Preserving Mechanisms
1. Local Training
All training happens on the device, never in the cloud:
```python
def train_locally(base_model):
    """Training function runs on-device only."""
    model = load_model(base_model)

    # Local data NEVER leaves the device
    local_data = load_local_dataset()  # Private user data

    # Train locally
    for epoch in range(3):
        for batch in local_data:
            loss = model.train_step(batch)

    # Return only model weights, NOT data
    return model.state_dict()
```
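For context, here is one hypothetical way such a function could be wired into the client API shown later on this page; the exact arguments that `local_train_fn` is called with are an assumption, not documented behavior.

```python
# Hypothetical wiring; assumes the callback receives the downloaded base model,
# which this page does not specify.
client = FederatedClient(api_key="ek_live_...")
client.train_from_remote(
    model="fraud-detector",
    local_train_fn=train_locally  # runs on-device and returns only weights
)
```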
Privacy benefit: Raw data (images, text, sensor readings) is never transmitted over the network.
2. Weight Updates Only
EdgeML transmits only model parameters, not training data:
```python
# What gets uploaded:
{
    "model_id": "fraud-detector",
    "version": "1.0.0",
    "updates": {
        "layer1.weight": [...],  # Numerical weights only
        "layer1.bias": [...],
        "layer2.weight": [...]
    },
    "sample_count": 1000,  # Aggregate count, no individual records
    "metrics": {
        "loss": 0.42,
        "accuracy": 0.89
    }
}

# What NEVER gets uploaded:
# - Individual training examples
# - User identifiers
# - Sensitive features (names, addresses, photos)
```
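When updates are sent as deltas rather than full weights (see "What is shared" above), the payload is just a per-parameter difference. A minimal sketch, assuming state dicts of NumPy-compatible arrays; this is not the EdgeML wire format:

```python
import numpy as np

def compute_delta(base_state: dict, trained_state: dict) -> dict:
    """Subtract the base weights so only the change is uploaded."""
    return {
        name: np.asarray(trained_state[name]) - np.asarray(base_state[name])
        for name in base_state
    }
```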
Privacy benefit: Even if network traffic is intercepted, raw data is not exposed.
3. Differential Privacy (Planned)
Differential privacy adds calibrated noise to model updates to prevent individual data points from being reverse-engineered.
Example:
```python
import numpy as np

epsilon = 1.0      # privacy budget (ε)
sensitivity = 0.1  # illustrative bound on a single example's influence

# Without differential privacy
original_weight_update = 0.453

# With differential privacy (ε=1.0)
noisy_update = original_weight_update + np.random.laplace(0, sensitivity / epsilon)
# → e.g. 0.467 (slightly perturbed)
```
Privacy benefit:
- Individual training examples cannot be recovered from model updates
- Provides mathematical privacy guarantees (ε-differential privacy)
- Protects against membership inference attacks
Trade-off: Adding noise reduces model accuracy, but empirically the impact is small (1-3% accuracy loss) for reasonable privacy budgets [Abadi et al., 2016].
EdgeML's planned implementation (not in MVP):
- User-configurable privacy budget (ε)
- Automatic noise calibration based on model architecture
- Privacy accounting across multiple rounds
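As a rough illustration of what such a mechanism could look like (not the EdgeML API), the sketch below clips an update's L1 norm and adds Laplace noise. The clipping bound is an assumed hyperparameter, and a real deployment would also need privacy accounting across rounds (Abadi et al., 2016).

```python
import numpy as np

def privatize_update(update: dict, clip_norm: float = 1.0, epsilon: float = 1.0) -> dict:
    """Clip the update's L1 norm, then add per-coordinate Laplace noise with
    scale clip_norm / epsilon (a simplified ε-DP mechanism)."""
    flat = np.concatenate([np.ravel(v) for v in update.values()])
    scale = max(1.0, np.sum(np.abs(flat)) / clip_norm)  # bound the sensitivity
    noisy = {}
    for name, value in update.items():
        clipped = np.asarray(value) / scale
        noise = np.random.laplace(0.0, clip_norm / epsilon, size=clipped.shape)
        noisy[name] = clipped + noise
    return noisy
```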
4. Secure Aggregation (Planned)
Secure aggregation uses cryptographic techniques to compute the average of updates without the server seeing individual contributions.
How it works (cryptographic protocol of Bonawitz et al., 2017):
- Devices generate pairwise shared secrets using Diffie-Hellman key exchange
- Each device masks its update with these secrets
- The server sums the masked updates
- The masks cancel out in aggregation, revealing only the sum
Privacy benefit:
- The server cannot see individual device updates
- Even a compromised server cannot isolate a single device's contribution
- Protects against "honest-but-curious" server attacks
EdgeML's planned implementation (not in MVP):
- Based on Google's secure aggregation protocol [Bonawitz et al., 2017]
- Threshold cryptography for fault tolerance
- Efficient for 100-10,000 devices
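To make the cancellation concrete, here is a toy, insecure illustration of pairwise masking: a seeded NumPy RNG stands in for the Diffie-Hellman shared secrets, and dropout handling and finite-field arithmetic are omitted.

```python
import numpy as np

updates = {
    "device_a": np.array([1.0, 2.0]),
    "device_b": np.array([3.0, 4.0]),
    "device_c": np.array([5.0, 6.0]),
}
devices = sorted(updates)

def pair_mask(i: str, j: str, shape) -> np.ndarray:
    # Stand-in for a Diffie-Hellman shared secret: both devices in the pair
    # derive the same seed, so they generate the same mask independently.
    seed = abs(hash((min(i, j), max(i, j)))) % (2**32)
    return np.random.default_rng(seed).normal(size=shape)

masked = {}
for i in devices:
    masked_i = updates[i].copy()
    for j in devices:
        if j == i:
            continue
        mask = pair_mask(i, j, updates[i].shape)
        masked_i += mask if i < j else -mask  # one side adds, the other subtracts
    masked[i] = masked_i

# Each masked update on its own looks like noise to the server,
# but the pairwise masks cancel in the sum:
aggregate = sum(masked.values())
assert np.allclose(aggregate, sum(updates.values()))
```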
Privacy Attacks and Defenses
Model Inversion Attacks
Attack: Reconstruct training data by analyzing model parameters.
Example: Given a facial recognition model, can you generate faces it was trained on?
EdgeML's defense:
- Aggregate updates from 100+ devices → individual contributions are diluted
- Differential privacy (planned) adds noise to prevent reconstruction
- Regular model retraining prevents memorization
Research: Model inversion is most effective on models trained on very few samples (<100). Federated learning with large cohorts (1000+ devices) makes this attack impractical [Fredrikson et al., 2015].
Membership Inference Attacks
Attack: Determine if a specific data point was in the training set.
Example: Given a medical model, can you determine if Alice's health record was used for training?
EdgeML's defense:
- Aggregation across many devices reduces signal
- Differential privacy (planned) provides provable protection
- Limit local epochs to prevent overfitting
Research: Membership inference success drops from 80% (centralized) to 50-60% (federated with 100+ devices) [Shokri et al., 2017].
Poisoning Attacks
Attack: Malicious device submits corrupted updates to degrade model performance or inject backdoors.
Example: Attacker sends updates that cause the model to misclassify specific inputs.
EdgeML's defense:
- Statistical outlier detection: Reject updates far from the median
- Byzantine-robust aggregation: Use the median or a trimmed mean instead of a simple average (see the sketch after this list)
- Reputation systems: Track device reliability over time
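A minimal sketch of the trimmed-mean idea (illustrative only, not the EdgeML server implementation), assuming each device's update has been flattened into a NumPy vector:

```python
import numpy as np

def trimmed_mean(updates: list[np.ndarray], trim_fraction: float = 0.1) -> np.ndarray:
    """Average each parameter after dropping the smallest and largest
    trim_fraction of device values, limiting the pull of outliers."""
    stacked = np.stack(updates)          # shape: (num_devices, num_params)
    k = int(trim_fraction * len(updates))
    ordered = np.sort(stacked, axis=0)   # sort each coordinate across devices
    kept = ordered[k:len(updates) - k] if k > 0 else ordered
    return kept.mean(axis=0)
```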
Planned enhancements:
- Secure aggregation prevents attacker from seeing other updates
- Differential privacy limits impact of individual malicious updates
Compliance and Regulations
GDPR (General Data Protection Regulation)
Key requirements:
- Data minimization: Only collect necessary data → ✅ EdgeML keeps data on-device
- Right to erasure: Users can delete their data → ✅ Data stays local, user controls deletion
- Data portability: Users can export their data → ✅ Data never centralized
- Purpose limitation: Data used only for specified purpose → ✅ Model training only
EdgeML's approach:
- Raw data never leaves the device → no central data processing
- Model updates are not "personal data" under GDPR (anonymized aggregates)
- Users can opt out without affecting others
HIPAA (Health Insurance Portability and Accountability Act)
Key requirements:
- Protected Health Information (PHI): Must be secured and not disclosed
- Minimum necessary: Only access minimum data needed
EdgeML's approach:
- PHI stays on-device (e.g., patient records remain in hospital systems)
- Only model updates (not PHI) transmitted to server
- Enables multi-hospital collaboration without centralizing patient data
Use case: A hospital consortium training a disease prediction model:
```python
# Hospital A
client_a = FederatedClient(api_key="...")
client_a.train_from_remote(
    model="disease-predictor",
    local_train_fn=train_on_local_patients  # PHI stays local
)

# Hospital B
client_b = FederatedClient(api_key="...")
client_b.train_from_remote(
    model="disease-predictor",
    local_train_fn=train_on_local_patients  # PHI stays local
)

# Server aggregates WITHOUT seeing patient data
federation.train(model="disease-predictor", min_updates=10)
```
CCPA (California Consumer Privacy Act)
Key requirements:
- Right to know: Users can see what data is collected
- Right to delete: Users can request data deletion
- Right to opt-out: Users can opt out of data "sale"
EdgeML's approach:
- Transparent: Users see that only model updates (not data) are shared
- Deletion: User data stays local, can be deleted without affecting system
- No "sale": Model updates are not sold or shared with third parties
Privacy Configuration
Client-Side Controls
```python
client = FederatedClient(
    api_key="ek_live_...",
    privacy_budget=1.0,    # Differential privacy ε (planned)
    max_local_epochs=3,    # Limit overfitting
    sample_fraction=0.1,   # Use only 10% of local data
    opt_in_required=True   # User must explicitly consent
)
```
Server-Side Controls
```python
federation = Federation(
    api_key="ek_live_...",
    min_updates=100,         # Require many devices for aggregation
    outlier_threshold=3.0,   # Reject updates >3σ from the median
    secure_aggregation=True  # Enable cryptographic aggregation (planned)
)
```
Privacy vs. Utility Trade-offs
Guidelines for balancing privacy and utility:
| Scenario | Privacy Settings | Expected Impact |
|---|---|---|
| Public dataset (MNIST) | No DP, plain aggregation | No accuracy loss |
| Internal company data | Light DP (ε=10), plain aggregation | <1% accuracy loss |
| Healthcare data (HIPAA) | Strong DP (ε=1), secure aggregation | 2-5% accuracy loss |
| Financial data (PCI-DSS) | Strong DP (ε=0.5), secure aggregation | 5-10% accuracy loss |
Tuning recommendations (a possible workflow is sketched after this list):
- Start without differential privacy to establish baseline accuracy
- Gradually increase privacy (reduce ε) while monitoring model quality
- Use more devices and more rounds to compensate for privacy overhead
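The sketch below shows that workflow in code; train_and_evaluate is a placeholder with a fabricated return value so the snippet runs, not an EdgeML API.

```python
def train_and_evaluate(epsilon):
    # Placeholder for a real federated training + evaluation run with the
    # given privacy budget (None = no differential privacy). The accuracy
    # returned here is fake, purely so the loop below executes.
    return 0.90 if epsilon is None else 0.90 - 0.02 / epsilon

baseline = train_and_evaluate(epsilon=None)   # 1. establish a no-DP baseline
for epsilon in [10.0, 5.0, 1.0, 0.5]:         # 2. tighten the budget gradually
    accuracy = train_and_evaluate(epsilon=epsilon)
    print(f"ε={epsilon}: accuracy drop {baseline - accuracy:.3f} vs. baseline")
```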
References
- Abadi, M., et al. (2016). "Deep Learning with Differential Privacy." ACM CCS. [arXiv:1607.00133]
  - Practical differential privacy for deep learning
- Bonawitz, K., et al. (2017). "Practical Secure Aggregation for Privacy-Preserving Machine Learning." ACM CCS. [arXiv:1611.04482]
  - Cryptographic protocol for secure aggregation at scale
- Fredrikson, M., et al. (2015). "Model Inversion Attacks that Exploit Confidence Information." ACM CCS.
  - Demonstrates privacy risks of model inversion
- Shokri, R., et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P. [arXiv:1610.05820]
  - Quantifies privacy leakage through membership inference
- Kairouz, P., et al. (2021). "Advances and Open Problems in Federated Learning." [arXiv:1912.04977]
  - Comprehensive survey, including privacy challenges
Next Steps
- Federated Learning - Core concepts
- Training Rounds - How rounds work
- Model Lifecycle - Model versioning and deployment
- Quickstart Guide - Build your first federated app