Model Lifecycle
Models move through a consistent lifecycle in EdgeML: creation, training, versioning, deployment, monitoring, and rollback.
Lifecycle State Machine
Lifecycle Stages
- Create: register a model in the catalog.
- Train: generate a new artifact via federated rounds.
- Version: publish a semantic version (e.g., 1.0.0).
- Deploy: ship to device cohorts with rollout controls.
- Monitor: track performance and stability.
- Rollback: revert to a safe version if needed.
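The stages above can be read as a small state machine. The sketch below is illustrative only: the state names Draft, Training, Trained, Published, and Rollback are the ones used in the stage breakdown later in this guide, while `DEPLOYED`, the `TRANSITIONS` table, and the `advance` helper are assumptions made for the example and are not part of the EdgeML SDK.

```python
from enum import Enum


class ModelState(Enum):
    DRAFT = "Draft"
    TRAINING = "Training"
    TRAINED = "Trained"
    PUBLISHED = "Published"
    DEPLOYED = "Deployed"  # assumed name; the guide does not name this state
    ROLLBACK = "Rollback"


# Allowed transitions, inferred from the stage descriptions (an assumption).
TRANSITIONS = {
    ModelState.DRAFT: {ModelState.TRAINING},
    ModelState.TRAINING: {ModelState.TRAINED},
    ModelState.TRAINED: {ModelState.PUBLISHED},
    ModelState.PUBLISHED: {ModelState.DEPLOYED},
    ModelState.DEPLOYED: {ModelState.ROLLBACK},
    ModelState.ROLLBACK: set(),
}


def advance(current: ModelState, target: ModelState) -> ModelState:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the transitions in one table makes illegal jumps, such as publishing a model that was never trained, fail loudly instead of silently.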
Progressive Rollout Flow
Figure 2: The progressive rollout strategy gradually increases the deployment percentage while monitoring metrics at each stage. An issue detected at any stage triggers an automatic rollback to the previous stable version.
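The control flow in the figure caption can be sketched as a loop over rollout stages. This is a minimal illustration, not EdgeML's implementation: `run_rollout` and the `is_healthy` callback are hypothetical names, and a real rollout would wait and collect metrics at each stage before moving on.

```python
from typing import Callable, List, Tuple


def run_rollout(stages: List[int], is_healthy: Callable[[int], bool]) -> Tuple[str, int]:
    """Walk through rollout percentages, stopping at the first unhealthy stage.

    Returns ("completed", last_stage) or ("rolled_back", failing_stage).
    """
    if not stages:
        raise ValueError("stages must not be empty")
    for pct in stages:
        # A real system would push the version to `pct` percent of devices
        # here and observe metrics for a while before calling is_healthy.
        if not is_healthy(pct):
            return ("rolled_back", pct)
    return ("completed", stages[-1])
```

For example, `run_rollout([10, 25, 50, 100], lambda pct: pct < 25)` stops at the 25% stage and reports a rollback.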
Why it matters
- Enables safe, repeatable model releases
- Improves auditability and compliance
- Provides clear ownership of model changes
Detailed Stage Breakdown
1. Create: Model Registration
Register a new model in EdgeML's catalog:
import edgeml

registry = edgeml.ModelRegistry(api_key="ek_live_...")
model = registry.ensure_model(
    name="fraud-detector",
    framework="pytorch",
    use_case="fraud_detection",
    description="Credit card fraud detection model"
)
At this stage:
- Model metadata is stored (name, framework, use case)
- Unique model ID is generated
- No trained artifacts exist yet
- Model is in Draft state
2. Train: Federated Training
Generate a new model version through federated rounds:
federation = edgeml.Federation(api_key="ek_live_...")
result = federation.train(
    model="fraud-detector",
    rounds=20,
    min_updates=100,
    base_version="1.0.0",
    new_version="1.1.0"
)
During training:
- Devices participate in multiple rounds
- Server aggregates updates using FedAvg
- Training metrics (loss, accuracy) are tracked
- Model transitions to Training state
After training completes:
- New model artifact is stored in S3/MinIO
- Training metadata is persisted (round count, participant count, convergence metrics)
- Model moves to Trained state
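To make the aggregation step concrete, here is a minimal sketch of FedAvg-style weighted averaging over flat weight vectors. It is a teaching example under simplifying assumptions (plain Python lists, one sample count per device update); the `fed_avg` name is hypothetical, and EdgeML's server-side aggregation may differ in detail.

```python
from typing import List, Sequence, Tuple


def fed_avg(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Average weight vectors, weighting each by its local sample count.

    Each update is (weights, num_samples) as reported by one device.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one update with a positive sample count")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all updates must have the same shape")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged
```

Weighting by sample count means a device that trained on 3 examples pulls the average three times harder than a device that trained on 1.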
3. Version: Publication
Publish a trained model to make it available for deployment:
registry.publish_version(
    model_id=model['id'],
    version="1.1.0"
)
Publishing:
- Marks the version as production-ready
- Converts model to deployment formats (ONNX, TFLite, CoreML)
- Computes model checksums for integrity verification
- Model transitions to Published state
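The integrity check mentioned above can be illustrated with a standard-library checksum. This guide does not say which hash EdgeML uses, so SHA-256 is an assumption here, and `sha256_of_file` and `verify_artifact` are hypothetical helper names.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model artifacts are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 matches the published checksum."""
    return sha256_of_file(path) == expected_hex.lower()
```

A device would typically run a check like `verify_artifact` after download and refuse to load the model if it returns False.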
Semantic versioning guidelines:
- Major (2.0.0): Breaking changes to input/output schema
- Minor (1.1.0): New features, backward compatible
- Patch (1.0.1): Bug fixes, no new features
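The guidelines above map directly onto a version-bump helper. The sketch below is plain Python with hypothetical function names (`parse_version`, `bump`); it is not part of the EdgeML SDK.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split "MAJOR.MINOR.PATCH" into a tuple of ints."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a "major", "minor", or "patch" change."""
    major, minor, patch = parse_version(version)
    if change == "major":   # breaking change to the input/output schema
        return f"{major + 1}.0.0"
    if change == "minor":   # new features, backward compatible
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # bug fix only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Comparing the parsed tuples rather than the raw strings also orders versions correctly, since `(1, 10, 0) > (1, 9, 0)` while the string `"1.10.0"` sorts before `"1.9.0"`.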
4. Deploy: Progressive Rollout
Deploy a published version to edge devices with gradual rollout:
deployment = federation.deploy(
    model_id=model['id'],
    version="1.1.0",
    rollout_percentage=10,   # Start at 10%
    target_percentage=100,   # Goal: 100%
    increment_step=10,       # Increase by 10% each step
    start_immediately=True
)
Deployment strategy:
- Canary (10%): Deploy to 10% of devices, monitor for 24-48 hours
- Gradual increase (e.g., 25%, 50%, 75%): Raise the rollout percentage only while metrics stay healthy
- Full rollout (100%): Deploy to all devices once validated
Rollout controls:
- Device sampling: Random selection or targeted cohorts (e.g., iOS only, specific regions)
- Health checks: Automatic monitoring of error rates, latency, accuracy
- Pause/resume: Manual controls to halt rollout if issues arise
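Two of the ideas above, the stepped schedule and random device sampling, can be sketched in a few lines. Both helpers are hypothetical illustrations: `rollout_schedule` mirrors the `rollout_percentage`, `target_percentage`, and `increment_step` parameters from the deploy call, and `in_rollout` uses a hash bucket as one plausible sampling scheme. How EdgeML actually selects devices is not described here.

```python
import hashlib
from typing import List


def rollout_schedule(start: int, target: int, step: int) -> List[int]:
    """List the percentages a rollout passes through, ending exactly at target."""
    if not (0 < start <= target <= 100) or step <= 0:
        raise ValueError("expected 0 < start <= target <= 100 and step > 0")
    stages = list(range(start, target, step))
    stages.append(target)
    return stages


def in_rollout(device_id: str, percentage: int) -> bool:
    """Deterministically place a device in one of 100 buckets by hashing its ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percentage
```

Because a device's bucket never changes, a device included at 10% stays included at every later stage, which avoids flipping devices between versions as the rollout grows.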
5. Monitor: Health Tracking
EdgeML automatically tracks deployment health:
Key metrics:
- Download success rate: % of devices that successfully downloaded the model
- Inference latency: Model prediction time (p50, p95, p99)
- Error rate: % of predictions that failed
- Accuracy drift: Comparison to validation set accuracy
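As an illustration of the latency metric, the sketch below computes p50, p95, and p99 with the nearest-rank method. The function names are hypothetical, and the dashboard may use a different percentile definition (for example, interpolation).

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # ceil(p / 100 * n) computed with integers to avoid float rounding
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize inference latencies as p50, p95, and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles matter because a model can look fast on average while a small share of devices sees much slower predictions.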
Alerting thresholds:
# Example: Alert if error rate > 5%
if error_rate > 0.05:
    trigger_rollback()
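The single-threshold check above generalizes to several metrics. In the sketch below, only the 5% error-rate limit comes from this guide; the other limits, the metric names, and the `violations` helper are hypothetical values chosen for illustration.

```python
from typing import Dict, List

# Only the error-rate limit is taken from the text; the rest are made up.
THRESHOLDS: Dict[str, float] = {
    "error_rate": 0.05,
    "p95_latency_ms": 100.0,          # hypothetical limit
    "download_failure_rate": 0.02,    # hypothetical limit
}


def violations(metrics: Dict[str, float],
               thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of metrics that exceed their threshold, sorted.

    Metrics missing from the input are treated as 0.0 (not violating);
    a stricter policy could instead treat missing data as a failure.
    """
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )
```

A non-empty result from `violations` would be the signal to pause the rollout or start a rollback.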
Monitor through the dashboard:
- Real-time deployment progress (10% → 25% → 50% → 100%)
- Per-version metrics comparison
- Device cohort breakdown
6. Rollback: Safe Recovery
Roll back to a previous stable version if issues are detected:
# Automatic rollback triggered by health checks
# Or manual rollback:
federation.rollback(
    model_id=model['id'],
    target_version="1.0.0"  # Revert to stable version
)
Rollback triggers:
- Error rate spike: Sudden increase in prediction failures
- Latency degradation: Model slower than previous version
- User reports: Manual intervention based on user feedback
- Manual override: Developer-initiated rollback
Rollback process:
- Pause current deployment
- Mark the problematic version as being in the Rollback state
- Re-deploy previous stable version
- Notify team via alerts
- Post-mortem to identify root cause
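Choosing which version to re-deploy is the one step in this process that needs a rule. A minimal sketch, assuming the release history is available as (version, is_stable) pairs in release order; `pick_rollback_target` is a hypothetical helper, not an EdgeML API.

```python
from typing import List, Optional, Tuple


def pick_rollback_target(history: List[Tuple[str, bool]], failing: str) -> Optional[str]:
    """Return the most recent stable version released before `failing`.

    `history` lists (version, is_stable) pairs, oldest first.
    Returns None if no earlier stable version exists.
    """
    versions = [v for v, _ in history]
    if failing not in versions:
        raise ValueError(f"unknown version: {failing}")
    idx = versions.index(failing)
    for version, is_stable in reversed(history[:idx]):
        if is_stable:
            return version
    return None
```

The returned value is what would be passed as `target_version` in a manual rollback call.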
Version Management Best Practices
Version Naming Convention
Use semantic versioning (MAJOR.MINOR.PATCH):
1.0.0 → Initial release
1.1.0 → Added new fraud detection features (backward compatible)
1.1.1 → Fixed bug in preprocessing (no new features)
2.0.0 → Changed input schema (breaking change)
Model Lineage Tracking
EdgeML automatically tracks:
- Parent version: Which version this was trained from
- Training rounds: How many rounds were used
- Participant count: How many devices contributed
- Convergence metrics: Final loss and accuracy
Example lineage:
v1.0.0 (base model)
└─ v1.1.0 (20 rounds, 1000 devices)
   ├─ v1.2.0 (10 rounds, 1500 devices)
   └─ v1.1.1 (5 rounds, 500 devices, hotfix)
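Lineage like this is naturally stored as records with a parent pointer. The sketch below encodes the example lineage, assuming v1.2.0 and v1.1.1 both descend from v1.1.0; the `VersionRecord` class and `ancestry` function are hypothetical and do not reflect EdgeML's storage format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VersionRecord:
    version: str
    parent: Optional[str]                # None for the base model
    rounds: Optional[int] = None         # not stated for the base model
    participants: Optional[int] = None


RECORDS: Dict[str, VersionRecord] = {
    "1.0.0": VersionRecord("1.0.0", None),
    "1.1.0": VersionRecord("1.1.0", "1.0.0", rounds=20, participants=1000),
    "1.2.0": VersionRecord("1.2.0", "1.1.0", rounds=10, participants=1500),
    "1.1.1": VersionRecord("1.1.1", "1.1.0", rounds=5, participants=500),
}


def ancestry(records: Dict[str, VersionRecord], version: str) -> List[str]:
    """Return [version, parent, grandparent, ...] back to the base model."""
    chain: List[str] = []
    current: Optional[str] = version
    while current is not None:
        if current in chain:
            raise ValueError("cycle in lineage")
        chain.append(current)
        current = records[current].parent
    return chain
```

Walking the parent pointers answers audit questions such as which base model a deployed version ultimately came from.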
A/B Testing Models
Deploy multiple versions simultaneously to compare performance:
# Deploy v1.0.0 to 50% of devices
federation.deploy(model_id=model['id'], version="1.0.0", rollout_percentage=50)
# Deploy v1.1.0 to other 50%
federation.deploy(model_id=model['id'], version="1.1.0", rollout_percentage=50)
# Compare metrics after 1 week
# Winner becomes the new stable version
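The comparison step can be sketched as a simple ranking over collected metrics. This is a deliberately naive illustration: `pick_winner` and the metric names are hypothetical, and a real A/B test should also check that the difference is statistically significant before promoting a version.

```python
from typing import Dict


def pick_winner(metrics_by_version: Dict[str, Dict[str, float]]) -> str:
    """Prefer the lower error rate; break ties with the lower p95 latency."""
    if not metrics_by_version:
        raise ValueError("no versions to compare")
    return min(
        metrics_by_version,
        key=lambda v: (
            metrics_by_version[v]["error_rate"],
            metrics_by_version[v]["p95_latency_ms"],
        ),
    )
```

The metrics dictionary would be filled from whatever the monitoring stage reports for each cohort after the test period.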
Real-World Example: Mobile Keyboard
Initial Launch (v1.0.0)
- Create: Register "next-word-predictor" model
- Train: 50 rounds with 10,000 devices
- Publish: Version 1.0.0 with ONNX/CoreML formats
- Deploy: Canary to 5%, then full rollout over 2 weeks
- Monitor: 99.9% download success, <50ms latency
Feature Update (v1.1.0)
- Train: Add emoji prediction, 30 rounds with 15,000 devices
- Publish: Version 1.1.0
- Deploy: Gradual rollout starting at 10%
- Issue detected at 25%: Emoji suggestions causing app crashes on iOS 14
- Rollback: Automatic revert to v1.0.0
- Root cause: Bug in emoji tokenization for older iOS versions
Bug Fix (v1.1.1)
- Train: Fix iOS 14 compatibility, 10 rounds with 5,000 devices
- Publish: Version 1.1.1
- Deploy: Successful rollout to 100%
- Monitor: No issues, becomes new stable version
Next Steps
- Federated Learning - Understand the fundamentals
- Training Rounds - Deep dive into round mechanics
- Privacy Model - Privacy guarantees and threat model
- Python SDK Reference - Complete API documentation