Model Lifecycle

Models move through a consistent lifecycle in EdgeML: creation, training, versioning, deployment, monitoring, and rollback.

Lifecycle State Machine

Lifecycle Stages

Create: register a model in the catalog.
Train: generate a new artifact via federated rounds.
Version: publish a semantic version (e.g., 1.0.0).
Deploy: ship to device cohorts with rollout controls.
Monitor: track performance and stability.
Rollback: revert to a safe version if needed.
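
The stages above imply a small state machine. The sketch below is one way to model it. The state names (Draft, Training, Trained, Published, Rollback) appear on this page, but the `ModelState` enum, the `ALLOWED` transition table, and the `transition` helper are illustrative assumptions, not part of the EdgeML SDK.

```python
from enum import Enum


class ModelState(Enum):
    DRAFT = "draft"
    TRAINING = "training"
    TRAINED = "trained"
    PUBLISHED = "published"
    ROLLBACK = "rollback"


# Assumed transition table; the real service may permit other moves.
ALLOWED = {
    ModelState.DRAFT: {ModelState.TRAINING},
    ModelState.TRAINING: {ModelState.TRAINED},
    ModelState.TRAINED: {ModelState.TRAINING, ModelState.PUBLISHED},
    ModelState.PUBLISHED: {ModelState.TRAINING, ModelState.ROLLBACK},
    ModelState.ROLLBACK: {ModelState.TRAINING},
}


def transition(current: ModelState, target: ModelState) -> ModelState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the allowed moves in a single table makes it easy to reject shortcuts such as publishing a model that was never trained.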

Progressive Rollout Flow

Figure 2: Progressive rollout strategy gradually increases deployment percentage while monitoring metrics at each stage. Any issues trigger automatic rollback to the previous stable version.
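
The control flow described in the caption can be sketched as a loop over rollout percentages that stops at the first unhealthy stage. This is a simplified model, not EdgeML's implementation: `progressive_rollout`, the `is_healthy` callback, and the result dictionary are hypothetical names chosen for illustration.

```python
from typing import Callable, List


def progressive_rollout(
    stages: List[int],
    is_healthy: Callable[[int], bool],
) -> dict:
    """Walk through rollout percentages, stopping at the first unhealthy stage."""
    reached = 0
    for pct in stages:
        if not is_healthy(pct):
            # Unhealthy metrics at this stage: revert to the previous stable version.
            return {"status": "rolled_back", "failed_at": pct, "last_healthy": reached}
        reached = pct
    return {"status": "complete", "failed_at": None, "last_healthy": reached}


# Hypothetical health check: problems only appear once 50% of devices are reached.
result = progressive_rollout([10, 25, 50, 75, 100], lambda pct: pct < 50)
```

In this example the rollout passes the 10% and 25% stages and is rolled back at 50%.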

Why it matters

Enables safe, repeatable model releases
Improves auditability and compliance
Provides clear ownership of model changes

Detailed Stage Breakdown

1. Create: Model Registration

import edgeml

registry = edgeml.ModelRegistry(api_key="ek_live_...")
model = registry.ensure_model(
    name="fraud-detector",
    framework="pytorch",
    use_case="fraud_detection",
    description="Credit card fraud detection model"
)

At this stage:

Model metadata is stored (name, framework, use case)
Unique model ID is generated
No trained artifacts exist yet
Model is in Draft state

2. Train: Federated Training

Generate a new model version through federated rounds:

federation = edgeml.Federation(api_key="ek_live_...")
result = federation.train(
    model="fraud-detector",
    rounds=20,
    min_updates=100,
    base_version="1.0.0",
    new_version="1.1.0"
)

During training:

Devices participate in multiple rounds
Server aggregates updates using FedAvg
Training metrics (loss, accuracy) are tracked
Model transitions to Training state
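
For readers unfamiliar with FedAvg, the aggregation step is a weighted average of client weights, where each client's contribution is proportional to its number of local training samples. The sketch below shows that arithmetic on plain Python lists. The `fed_avg` function and its input shape are assumptions for illustration; the server-side implementation in EdgeML is not shown on this page.

```python
from typing import List, Sequence, Tuple


def fed_avg(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Average client weight vectors, weighted by each client's sample count."""
    total = sum(n for _, n in updates)
    if total == 0:
        raise ValueError("no samples to aggregate")
    dim = len(updates[0][0])
    aggregated = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            aggregated[i] += w * n / total
    return aggregated


# Two clients: one with 10 samples, one with 30 samples.
global_weights = fed_avg([([1.0, 2.0], 10), ([3.0, 4.0], 30)])
```

The client with 30 samples pulls the result three times as hard as the client with 10, so the aggregate is `[2.5, 3.5]` rather than the unweighted mean `[2.0, 3.0]`.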

After training completes:

New model artifact is stored in S3/MinIO
Training metadata is persisted (round count, participant count, convergence metrics)
Model moves to Trained state

3. Version: Publication

Publish a trained model to make it available for deployment:

registry.publish_version(
    model_id=model['id'],
    version="1.1.0"
)

Publishing:

Marks the version as production-ready
Converts model to deployment formats (ONNX, TFLite, CoreML)
Computes model checksums for integrity verification
Model transitions to Published state
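
Checksum verification on the device side could look like the sketch below. It assumes SHA-256, which this page does not specify, and the helper names `sha256_checksum` and `verify_artifact` are hypothetical.

```python
import hashlib
import hmac


def sha256_checksum(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected: str) -> bool:
    """Compare the computed digest with the published one in constant time."""
    return hmac.compare_digest(sha256_checksum(data), expected)


artifact = b"example model bytes"
published_checksum = sha256_checksum(artifact)
```

A device would call `verify_artifact` on the downloaded bytes and refuse to load the model if it returns `False`.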

Semantic versioning guidelines:

Major (2.0.0): Breaking changes to input/output schema
Minor (1.1.0): New features, backward compatible
Patch (1.0.1): Bug fixes, no new features
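
These rules translate directly into a version bump helper. The functions below are an illustrative sketch, not part of the SDK, and they handle only plain `MAJOR.MINOR.PATCH` strings without pre-release or build suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```

Comparing the parsed tuples rather than the raw strings gives the correct ordering, for example `parse_version("1.10.0") > parse_version("1.9.0")`.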

4. Deploy: Progressive Rollout

Deploy a published version to edge devices with gradual rollout:

deployment = federation.deploy(
    model_id=model['id'],
    version="1.1.0",
    rollout_percentage=10,      # Start at 10%
    target_percentage=100,       # Goal: 100%
    increment_step=10,           # Increase by 10% each step
    start_immediately=True
)

Deployment strategy:

Canary (10%): Deploy to 10% of devices, monitor for 24-48 hours
Gradual increase (25%, 50%, 75%): Increase rollout if metrics are healthy
Full rollout (100%): Deploy to all devices once validated

Rollout controls:

Device sampling: Random selection or targeted cohorts (e.g., iOS only, specific regions)
Health checks: Automatic monitoring of error rates, latency, accuracy
Pause/resume: Manual controls to halt rollout if issues arise
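
One common way to implement random device sampling is to hash each device ID into a stable bucket and compare the bucket with the current rollout percentage. The sketch below shows that technique; it is an assumption about how sampling might work, and `rollout_bucket` and `in_rollout` are hypothetical names.

```python
import hashlib


def rollout_bucket(device_id: str, model_id: str) -> int:
    """Map a device to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{model_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id: str, model_id: str, rollout_percentage: int) -> bool:
    """A device is included once the rollout percentage exceeds its bucket."""
    return rollout_bucket(device_id, model_id) < rollout_percentage
```

Because the bucket depends only on the IDs, a device that received the model at 10% stays included at 25% and beyond, so raising the percentage only adds devices.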

5. Monitor: Health Tracking

EdgeML automatically tracks deployment health:

Key metrics:

Download success rate: % of devices that successfully downloaded the model
Inference latency: Model prediction time (p50, p95, p99)
Error rate: % of predictions that failed
Accuracy drift: Comparison to validation set accuracy
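
To make the latency percentiles concrete, the sketch below computes p50, p95, and p99 from a list of samples using the nearest-rank method. The method and the sample values are illustrative assumptions; EdgeML's own aggregation may differ.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up inference latencies in milliseconds.
latencies_ms = [12.0, 15.0, 11.0, 48.0, 14.0, 13.0, 16.0, 90.0, 12.5, 13.5]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With these samples the median is 13.5 ms while p95 and p99 are both 90 ms, which shows why tail percentiles surface slow outliers that the median hides.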

Alerting thresholds:

# Example: trigger a rollback if the error rate exceeds 5%
if error_rate > 0.05:
    trigger_rollback()

Monitor through the dashboard:

Real-time deployment progress (10% → 25% → 50% → 100%)
Per-version metrics comparison
Device cohort breakdown

6. Rollback: Safe Recovery

Roll back to a previous stable version if issues are detected:

# Automatic rollback triggered by health checks
# Or manual rollback:
federation.rollback(
    model_id=model['id'],
    target_version="1.0.0"  # Revert to stable version
)

Rollback triggers:

Error rate spike: Sudden increase in prediction failures
Latency degradation: Model slower than previous version
User reports: Manual intervention based on user feedback
Manual override: Developer-initiated rollback
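
The two automatic triggers can be expressed as a comparison between the candidate version's metrics and those of the previous stable version. In the sketch below, the 5% error-rate limit matches the alerting example on this page, while the 1.2x latency ratio, the `HealthSnapshot` type, and the `rollback_reasons` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSnapshot:
    error_rate: float       # fraction of failed predictions, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile inference latency


def rollback_reasons(
    candidate: HealthSnapshot,
    baseline: HealthSnapshot,
    max_error_rate: float = 0.05,
    max_latency_ratio: float = 1.2,
) -> List[str]:
    """Return the triggers that fired; an empty list means the version is healthy."""
    reasons = []
    if candidate.error_rate > max_error_rate:
        reasons.append("error_rate_spike")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("latency_degradation")
    return reasons


stable = HealthSnapshot(error_rate=0.01, p95_latency_ms=40.0)
candidate = HealthSnapshot(error_rate=0.08, p95_latency_ms=45.0)
reasons = rollback_reasons(candidate, stable)
```

Returning the list of reasons rather than a boolean gives the alert and the post-mortem something specific to record.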

Rollback process:

Pause current deployment
Move the problematic version to the Rollback state
Re-deploy previous stable version
Notify team via alerts
Post-mortem to identify root cause

Version Management Best Practices

Version Naming Convention

Use semantic versioning (MAJOR.MINOR.PATCH):

1.0.0 → Initial release
1.1.0 → Added new fraud detection features (backward compatible)
1.1.1 → Fixed bug in preprocessing (no new features)
2.0.0 → Changed input schema (breaking change)

Model Lineage Tracking

EdgeML automatically tracks:

Parent version: Which version this was trained from
Training rounds: How many rounds were used
Participant count: How many devices contributed
Convergence metrics: Final loss and accuracy

Example lineage:

v1.0.0 (base model)
  └─ v1.1.0 (20 rounds, 1000 devices)
      ├─ v1.2.0 (10 rounds, 1500 devices)
      └─ v1.1.1 (5 rounds, 500 devices, hotfix)
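
Lineage of this kind can be represented as a mapping from each version to its parent. The sketch below encodes the example tree and walks from a version back to the base model. The `PARENTS` mapping and the `ancestors` helper are hypothetical and do not reflect how EdgeML stores lineage.

```python
from typing import Dict, List, Optional

# Parent of each version, mirroring the example lineage above.
PARENTS: Dict[str, Optional[str]] = {
    "1.0.0": None,
    "1.1.0": "1.0.0",
    "1.2.0": "1.1.0",
    "1.1.1": "1.1.0",
}


def ancestors(version: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of parent versions, nearest first."""
    chain = []
    current = parents[version]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


hotfix_lineage = ancestors("1.1.1", PARENTS)
```

Walking the chain answers audit questions such as which base model a hotfix ultimately derives from.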

A/B Testing Models

Deploy multiple versions simultaneously to compare performance:

# Deploy v1.0.0 to 50% of devices
federation.deploy(model_id=model['id'], version="1.0.0", rollout_percentage=50)

# Deploy v1.1.0 to other 50%
federation.deploy(model_id=model['id'], version="1.1.0", rollout_percentage=50)

# Compare metrics after 1 week
# Winner becomes the new stable version
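
Choosing a winner is more reliable with a significance check than with a raw comparison of averages. The sketch below applies a two-proportion z-test to the error counts of the two arms. It is one possible approach rather than something EdgeML is documented to do, and `two_proportion_z` and `pick_winner` are hypothetical helpers.

```python
import math


def two_proportion_z(errors_a: int, total_a: int, errors_b: int, total_b: int) -> float:
    """z-statistic for the difference in error rates between arms A and B."""
    p_a = errors_a / total_a
    p_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0
    return (p_a - p_b) / se


def pick_winner(errors_a: int, total_a: int, errors_b: int, total_b: int,
                z_critical: float = 1.96) -> str:
    """Return 'A', 'B', or 'inconclusive' at roughly the 5% two-sided level."""
    z = two_proportion_z(errors_a, total_a, errors_b, total_b)
    if z > z_critical:
        return "B"  # A has a significantly higher error rate
    if z < -z_critical:
        return "A"  # B has a significantly higher error rate
    return "inconclusive"


# Made-up counts: v1.0.0 (arm A) vs v1.1.0 (arm B), 10,000 predictions each.
winner = pick_winner(errors_a=500, total_a=10_000, errors_b=300, total_b=10_000)
```

When the result is inconclusive, extending the test period or keeping the current stable version is usually safer than promoting the candidate.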

Real-World Example: Mobile Keyboard

Initial Launch (v1.0.0)

Create: Register "next-word-predictor" model
Train: 50 rounds with 10,000 devices
Publish: Version 1.0.0 with ONNX/CoreML formats
Deploy: Canary to 5%, then full rollout over 2 weeks
Monitor: 99.9% download success, <50ms latency

Feature Update (v1.1.0)

Train: Add emoji prediction, 30 rounds with 15,000 devices
Publish: Version 1.1.0
Deploy: Gradual rollout starting at 10%
Issue detected at 25%: Emoji suggestions causing app crashes on iOS 14
Rollback: Automatic revert to v1.0.0
Root cause: Bug in emoji tokenization for older iOS versions

Bug Fix (v1.1.1)

Train: Fix iOS 14 compatibility, 10 rounds with 5,000 devices
Publish: Version 1.1.1
Deploy: Successful rollout to 100%
Monitor: No issues, becomes new stable version

Next Steps

Federated Learning - Understand the fundamentals
Training Rounds - Deep dive into round mechanics
Privacy Model - Privacy guarantees and threat model
Python SDK Reference - Complete API documentation
