Privacy Model

EdgeML is built so that raw, sensitive data never leaves the device. The platform is designed around data minimization: training happens locally, and only model updates are shared with the server.

Data Flow Architecture

Privacy Guarantees

What stays on-device

  • Raw user data (texts, images, sensor data)
  • Feature extraction results
  • Local training batches

What is shared

  • Model weight updates (or deltas)
  • Training metadata (sample counts, basic metrics)

Update Privacy Comparison

Threat model (MVP)

  • Prevent server-side collection of raw data
  • Reduce data exposure by keeping training local
  • Provide audit trails for model changes

Privacy-Preserving Mechanisms

1. Local Training

All training happens on the device, never in the cloud:

def train_locally(base_model):
    """Training function runs on-device only"""
    model = load_model(base_model)

    # Local data NEVER leaves the device
    local_data = load_local_dataset()  # Private user data

    # Train locally
    for epoch in range(3):
        for batch in local_data:
            loss = model.train_step(batch)

    # Return only model weights, NOT data
    return model.state_dict()

Privacy benefit: Raw data (images, text, sensor readings) is never transmitted over the network.
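
For context, here is a minimal usage sketch showing how a function like train_locally might be registered with the client API that appears in the HIPAA example later on this page. The import path is an assumption; treat it as illustrative rather than a confirmed EdgeML interface.

from edgeml import FederatedClient  # assumed import path (illustrative)

client = FederatedClient(api_key="ek_live_...")
client.train_from_remote(
    model="fraud-detector",
    local_train_fn=train_locally  # runs on-device; only the returned weights are uploaded
)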

2. Weight Updates Only

EdgeML transmits only model parameters, not training data:

# What gets uploaded:
{
    "model_id": "fraud-detector",
    "version": "1.0.0",
    "updates": {
        "layer1.weight": [...],  # Numerical weights only
        "layer1.bias": [...],
        "layer2.weight": [...]
    },
    "sample_count": 1000,  # Aggregate count, no individual records
    "metrics": {
        "loss": 0.42,
        "accuracy": 0.89
    }
}

# What NEVER gets uploaded:
# - Individual training examples
# - User identifiers
# - Sensitive features (names, addresses, photos)

Privacy benefit: Even if network traffic is intercepted, raw data is not exposed.
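
To illustrate the principle, here is a hedged sketch of how an upload payload could be assembled so that it contains only numeric parameters and aggregate metadata. The helper name build_update_payload and the PyTorch-style tensors are assumptions for illustration, not a confirmed EdgeML API.

def build_update_payload(model, model_id, version, sample_count, metrics):
    """Illustrative helper: package only numeric weights and aggregate stats."""
    updates = {
        name: tensor.detach().cpu().tolist()  # plain numbers only, never raw examples
        for name, tensor in model.state_dict().items()  # assumes PyTorch-style tensors
    }
    return {
        "model_id": model_id,
        "version": version,
        "updates": updates,
        "sample_count": sample_count,  # aggregate count, no individual records
        "metrics": metrics             # e.g. {"loss": 0.42, "accuracy": 0.89}
    }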

3. Differential Privacy (Planned)

Differential privacy adds calibrated noise to model updates to prevent individual data points from being reverse-engineered.

Example:

import numpy as np

# Without differential privacy
original_weight_update = 0.453

# With differential privacy (ε=1.0)
epsilon = 1.0        # privacy budget
sensitivity = 0.01   # maximum influence of a single example (illustrative value)
noisy_update = original_weight_update + np.random.laplace(0, sensitivity / epsilon)
# → e.g. 0.467 (slightly perturbed)

Privacy benefit:

  • Individual training examples cannot be recovered from model updates
  • Provides mathematical privacy guarantees (ε-differential privacy)
  • Protects against membership inference attacks

Trade-off: Adding noise slightly reduces model accuracy, but empirically the impact is small (1-3% accuracy loss) for reasonable privacy budgets [Abadi et al., 2016].

EdgeML's planned implementation (not in MVP):

  • User-configurable privacy budget (ε)
  • Automatic noise calibration based on model architecture
  • Privacy accounting across multiple rounds
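
Because differential privacy is not yet implemented, the following is only a sketch of one common approach (clip each update's L2 norm, then add Gaussian noise calibrated to the clipping norm, as in Abadi et al., 2016). The function name and constants are illustrative, not EdgeML's final design.

import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1):
    """Illustrative DP step: clip the update's global L2 norm, then add Gaussian noise."""
    # Compute a single global L2 norm over all parameters
    flat = np.concatenate([w.ravel() for w in update.values()])
    scale = min(1.0, clip_norm / (np.linalg.norm(flat) + 1e-12))

    noisy = {}
    for name, w in update.items():
        clipped = w * scale  # bound each device's influence
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
        noisy[name] = clipped + noise
    return noisy

# Usage: weight updates as numpy arrays keyed by layer name
update = {"layer1.weight": np.random.randn(4, 4), "layer1.bias": np.random.randn(4)}
private_update = privatize_update(update)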

4. Secure Aggregation (Planned)

Secure aggregation uses cryptographic techniques to compute the average of updates without the server seeing individual contributions.

Privacy benefit:

  • Server cannot see individual device updates
  • Even a compromised server cannot isolate a single device's contribution
  • Protects against "honest-but-curious" server attacks

Cryptographic protocol (Bonawitz et al., 2017):

  1. Devices generate pairwise shared secrets using Diffie-Hellman
  2. Each device masks its update with these secrets
  3. Server sums masked updates
  4. Masks cancel out in aggregation, revealing only the sum
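
To make the cancellation concrete, here is a hedged toy sketch of pairwise masking in which random vectors stand in for the Diffie-Hellman-derived secrets. A production protocol (as in Bonawitz et al., 2017) also handles key agreement, dropouts, and finite-field arithmetic, all of which are omitted here.

import numpy as np

rng = np.random.default_rng(0)
num_devices, dim = 3, 4

# Each device's true (private) update
true_updates = [rng.normal(size=dim) for _ in range(num_devices)]

# Pairwise masks: for i < j, device i adds +mask[(i, j)] and device j subtracts it
masks = {(i, j): rng.normal(size=dim)
         for i in range(num_devices) for j in range(i + 1, num_devices)}

masked_updates = []
for i in range(num_devices):
    masked = true_updates[i].copy()
    for (a, b), m in masks.items():
        if a == i:
            masked += m
        elif b == i:
            masked -= m
    masked_updates.append(masked)

# The server only ever sees masked updates, yet the masks cancel in the sum
assert np.allclose(sum(masked_updates), sum(true_updates))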

EdgeML's planned implementation (not in MVP):

  • Based on Google's secure aggregation protocol [Bonawitz et al., 2017]
  • Threshold cryptography for fault tolerance
  • Efficient for 100-10,000 devices

Privacy Attacks and Defenses

Model Inversion Attacks

Attack: Reconstruct training data by analyzing model parameters.

Example: Given a facial recognition model, can you generate faces it was trained on?

EdgeML's defense:

  • Aggregate updates from 100+ devices → individual contributions are diluted
  • Differential privacy (planned) adds noise to prevent reconstruction
  • Regular model retraining prevents memorization

Research: Model inversion is most effective on models trained on very few samples (<100). Federated learning with large cohorts (1000+ devices) makes this attack impractical [Fredrikson et al., 2015].

Membership Inference Attacks

Attack: Determine if a specific data point was in the training set.

Example: Given a medical model, can you determine if Alice's health record was used for training?

EdgeML's defense:

  • Aggregation across many devices reduces signal
  • Differential privacy (planned) provides provable protection
  • Limit local epochs to prevent overfitting

Research: Membership inference success drops from 80% (centralized) to 50-60% (federated with 100+ devices) [Shokri et al., 2017].

Poisoning Attacks

Attack: Malicious device submits corrupted updates to degrade model performance or inject backdoors.

Example: Attacker sends updates that cause the model to misclassify specific inputs.

EdgeML's defense:

  • Statistical outlier detection: Reject updates far from the median
  • Byzantine-robust aggregation: Use coordinate-wise median or trimmed mean instead of a simple average (see the sketch after this list)
  • Reputation systems: Track device reliability over time
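
A hedged sketch of the first two defenses, combining a robust outlier filter with a coordinate-wise trimmed mean; the threshold mirrors the outlier_threshold server setting shown later, but the function itself is illustrative rather than EdgeML's shipped aggregator.

import numpy as np

def robust_aggregate(updates, outlier_threshold=3.0, trim_fraction=0.1):
    """Illustrative defense: drop outlier updates, then take a coordinate-wise trimmed mean."""
    stacked = np.stack(updates)  # shape: (num_devices, num_parameters)

    # 1. Statistical outlier detection: reject updates far from the median update
    dists = np.linalg.norm(stacked - np.median(stacked, axis=0), axis=1)
    mad = np.median(np.abs(dists - np.median(dists))) + 1e-12
    kept = stacked[np.abs(dists - np.median(dists)) / mad < outlier_threshold]

    # 2. Byzantine-robust aggregation: trim the extremes of each coordinate, then average
    k = int(len(kept) * trim_fraction)
    sorted_vals = np.sort(kept, axis=0)
    trimmed = sorted_vals[k:len(kept) - k] if k > 0 else sorted_vals
    return trimmed.mean(axis=0)

# Usage: flattened weight updates from 20 honest devices plus one poisoned update
updates = [np.random.randn(10) for _ in range(20)]
updates.append(np.random.randn(10) * 50)  # out-of-range update, rejected by the filter
aggregated = robust_aggregate(updates)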

Planned enhancements:

  • Secure aggregation prevents attacker from seeing other updates
  • Differential privacy limits impact of individual malicious updates

Compliance and Regulations

GDPR (General Data Protection Regulation)

Key requirements:

  • Data minimization: Only collect necessary data → ✅ EdgeML keeps data on-device
  • Right to erasure: Users can delete their data → ✅ Data stays local, user controls deletion
  • Data portability: Users can export their data → ✅ Data never centralized
  • Purpose limitation: Data used only for specified purpose → ✅ Model training only

EdgeML's approach:

  • Raw data never leaves the device → no central data processing
  • Model updates are not "personal data" under GDPR (anonymized aggregates)
  • Users can opt out without affecting others

HIPAA (Health Insurance Portability and Accountability Act)

Key requirements:

  • Protected Health Information (PHI): Must be secured and not disclosed
  • Minimum necessary: Only access minimum data needed

EdgeML's approach:

  • PHI stays on-device (e.g., patient records remain in hospital systems)
  • Only model updates (not PHI) transmitted to server
  • Enables multi-hospital collaboration without centralizing patient data

Use case: Hospital consortium training disease prediction model:

# Hospital A
client_a = FederatedClient(api_key="...")
client_a.train_from_remote(
    model="disease-predictor",
    local_train_fn=train_on_local_patients  # PHI stays local
)

# Hospital B
client_b = FederatedClient(api_key="...")
client_b.train_from_remote(
    model="disease-predictor",
    local_train_fn=train_on_local_patients  # PHI stays local
)

# Server aggregates WITHOUT seeing patient data
federation.train(model="disease-predictor", min_updates=10)

CCPA (California Consumer Privacy Act)

Key requirements:

  • Right to know: Users can see what data is collected
  • Right to delete: Users can request data deletion
  • Right to opt-out: Users can opt out of data "sale"

EdgeML's approach:

  • Transparent: Users see that only model updates (not data) are shared
  • Deletion: User data stays local, can be deleted without affecting system
  • No "sale": Model updates are not sold or shared with third parties

Privacy Configuration

Client-Side Controls

client = FederatedClient(
    api_key="ek_live_...",
    privacy_budget=1.0,    # Differential privacy ε (planned)
    max_local_epochs=3,    # Limit overfitting
    sample_fraction=0.1,   # Use only 10% of local data
    opt_in_required=True   # User must explicitly consent
)

Server-Side Controls

federation = Federation(
    api_key="ek_live_...",
    min_updates=100,          # Require many devices for aggregation
    outlier_threshold=3.0,    # Reject updates >3σ from median
    secure_aggregation=True   # Enable cryptographic aggregation (planned)
)

Privacy vs. Utility Trade-offs

Guidelines for balancing privacy and utility:

| Scenario | Privacy Settings | Expected Impact |
|---|---|---|
| Public dataset (MNIST) | No DP, plain aggregation | No accuracy loss |
| Internal company data | Light DP (ε=10), plain aggregation | <1% accuracy loss |
| Healthcare data (HIPAA) | Strong DP (ε=1), secure aggregation | 2-5% accuracy loss |
| Financial data (PCI-DSS) | Strong DP (ε=0.5), secure aggregation | 5-10% accuracy loss |

Tuning recommendations:

  • Start without differential privacy to establish baseline accuracy
  • Gradually increase privacy (reduce ε) while monitoring model quality
  • Use more devices and more rounds to compensate for privacy overhead
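
One hedged way to follow these recommendations is a small ε sweep that tightens the privacy budget only while accuracy stays within an acceptable margin of the non-private baseline. run_federated_round and the accuracy numbers below are placeholders for illustration, not EdgeML APIs or measured results.

# Placeholder accuracies for illustration only; in practice each value comes
# from a real federated training run at the given privacy budget.
simulated_accuracy = {None: 0.90, 10.0: 0.897, 5.0: 0.893, 1.0: 0.884, 0.5: 0.85}

def run_federated_round(epsilon):
    """Stand-in for one federated round at privacy budget ε (None = no DP); returns accuracy."""
    return simulated_accuracy[epsilon]

baseline_accuracy = run_federated_round(None)  # establish the no-DP baseline first
max_acceptable_drop = 0.02                     # e.g. tolerate up to 2 points of accuracy loss

chosen_epsilon = None
for epsilon in [10.0, 5.0, 1.0, 0.5]:          # decreasing ε = stronger privacy
    accuracy = run_federated_round(epsilon)
    if baseline_accuracy - accuracy <= max_acceptable_drop:
        chosen_epsilon = epsilon               # tightest budget that still meets the quality bar
    else:
        break                                  # accuracy dropped too far; stop tightening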

References

  1. Abadi, M., et al. (2016). "Deep Learning with Differential Privacy." ACM CCS. [arXiv:1607.00133]

    • Practical differential privacy for deep learning
  2. Bonawitz, K., et al. (2017). "Practical Secure Aggregation for Privacy-Preserving Machine Learning." ACM CCS. [arXiv:1611.04482]

    • Cryptographic protocol for secure aggregation at scale
  3. Fredrikson, M., et al. (2015). "Model Inversion Attacks that Exploit Confidence Information." ACM CCS. [PDF]

    • Demonstrates privacy risks of model inversion
  4. Shokri, R., et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P. [arXiv:1610.05820]

    • Quantifies privacy leakage through membership inference
  5. Kairouz, P., et al. (2021). "Advances and Open Problems in Federated Learning." [arXiv:1912.04977]

    • Comprehensive survey including privacy challenges
