Model Lifecycle

Models move through a consistent lifecycle in EdgeML: creation, training, versioning, deployment, monitoring, and rollback.

Lifecycle State Machine

Lifecycle Stages

  1. Create: register a model in the catalog.
  2. Train: generate a new artifact via federated rounds.
  3. Version: publish a semantic version (e.g., 1.0.0).
  4. Deploy: ship to device cohorts with rollout controls.
  5. Monitor: track performance and stability.
  6. Rollback: revert to a safe version if needed.
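The stages above correspond to the states named in the stage breakdown further down this page (Draft, Training, Trained, Published, Rollback). The following is a minimal sketch of that state machine. The transition table is an assumption made for illustration, and `ModelState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names, not part of the EdgeML SDK.

```python
from enum import Enum


class ModelState(Enum):
    DRAFT = "Draft"
    TRAINING = "Training"
    TRAINED = "Trained"
    PUBLISHED = "Published"
    ROLLBACK = "Rollback"


# Assumed transition table; the rules EdgeML enforces may differ.
ALLOWED_TRANSITIONS = {
    ModelState.DRAFT: {ModelState.TRAINING},
    ModelState.TRAINING: {ModelState.TRAINED},
    ModelState.TRAINED: {ModelState.PUBLISHED, ModelState.TRAINING},
    ModelState.PUBLISHED: {ModelState.ROLLBACK, ModelState.TRAINING},
    ModelState.ROLLBACK: {ModelState.TRAINING},
}


def transition(current: ModelState, target: ModelState) -> ModelState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Keeping the allowed moves in one table makes it easy to reject shortcuts such as publishing a model that was never trained.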

Progressive Rollout Flow

Figure 2: Progressive rollout strategy gradually increases deployment percentage while monitoring metrics at each stage. Any issues trigger automatic rollback to the previous stable version.

Why it matters

  • Enables safe, repeatable model releases
  • Improves auditability and compliance
  • Provides clear ownership of model changes

Detailed Stage Breakdown

1. Create: Model Registration

Register a new model in EdgeML's catalog:

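import edgeml

# Connect to the model registry with your API key
registry = edgeml.ModelRegistry(api_key="ek_live_...")

# Register the model's metadata in the catalog
model = registry.ensure_model(
    name="fraud-detector",
    framework="pytorch",
    use_case="fraud_detection",
    description="Credit card fraud detection model"
)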

At this stage:

  • Model metadata is stored (name, framework, use case)
  • Unique model ID is generated
  • No trained artifacts exist yet
  • Model is in Draft state

2. Train: Federated Training

Generate a new model version through federated rounds:

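federation = edgeml.Federation(api_key="ek_live_...")
result = federation.train(
    model="fraud-detector",
    rounds=20,              # number of federated rounds
    min_updates=100,
    base_version="1.0.0",   # parent version to train from
    new_version="1.1.0"     # version label for the resulting artifact
)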

During training:

  • Devices participate in multiple rounds
  • Server aggregates updates using FedAvg
  • Training metrics (loss, accuracy) are tracked
  • Model transitions to Training state
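To make the aggregation step concrete, here is a minimal sketch of Federated Averaging (FedAvg): each device's weights are averaged, weighted by the number of local samples that device trained on. It uses plain Python lists for clarity and is an illustration of the algorithm only, not EdgeML's server implementation; `fed_avg` is a hypothetical helper name.

```python
from typing import List, Tuple


def fed_avg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average client weight vectors, weighted by each client's sample count.

    Each update is a (weights, num_samples) pair from one device.
    """
    total_samples = sum(n for _, n in updates)
    if total_samples == 0:
        raise ValueError("no samples contributed")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


# Two hypothetical devices: one with 1 local sample, one with 3.
print(fed_avg([([0.0, 4.0], 1), ([4.0, 0.0], 3)]))  # → [3.0, 1.0]
```

The device with three samples pulls the average three times as hard as the device with one, which is the defining property of FedAvg.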

After training completes:

  • New model artifact is stored in S3/MinIO
  • Training metadata is persisted (round count, participant count, convergence metrics)
  • Model moves to Trained state

3. Version: Publication

Publish a trained model to make it available for deployment:

registry.publish_version(
    model_id=model['id'],
    version="1.1.0"
)

Publishing:

  • Marks the version as production-ready
  • Converts model to deployment formats (ONNX, TFLite, CoreML)
  • Computes model checksums for integrity verification
  • Model transitions to Published state
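The checksum step can be sketched as follows. This page does not say which hash algorithm EdgeML uses, so SHA-256 is an assumption here, and `sha256_of_file` and `verify_artifact` are hypothetical helper names. A device could run a check like this after downloading an artifact and before loading it.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 matches the published checksum."""
    return sha256_of_file(path) == expected_hex.lower()
```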

Semantic versioning guidelines:

  • Major (2.0.0): Breaking changes to input/output schema
  • Minor (1.1.0): New features, backward compatible
  • Patch (1.0.1): Bug fixes, no new features
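The guidelines above can be expressed as a small helper that computes the next version from the kind of change. This is a sketch that handles plain MAJOR.MINOR.PATCH strings only (no pre-release or build suffixes); `parse_version` and `bump` are hypothetical names.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump("1.0.0", "minor"))  # → 1.1.0
print(bump("1.1.0", "patch"))  # → 1.1.1
print(bump("1.1.1", "major"))  # → 2.0.0
```

Comparing the parsed tuples rather than the raw strings avoids ordering mistakes such as "1.10.0" sorting before "1.9.0".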

4. Deploy: Progressive Rollout

Deploy a published version to edge devices with gradual rollout:

deployment = federation.deploy(
    model_id=model['id'],
    version="1.1.0",
    rollout_percentage=10,   # Start at 10%
    target_percentage=100,   # Goal: 100%
    increment_step=10,       # Increase by 10% each step
    start_immediately=True
)

Deployment strategy:

  1. Canary (10%): Deploy to 10% of devices, monitor for 24-48 hours
  2. Gradual increase (e.g., 25%, 50%, 75%): Widen the rollout at each step only if metrics are healthy; the exact percentages depend on increment_step
  3. Full rollout (100%): Deploy to all devices once validated
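As an illustration of how the three percentage parameters in the deploy call might combine, the sketch below lists the stages a rollout would pass through. It assumes the rollout grows by a fixed step and is capped at the target; `rollout_schedule` is a hypothetical helper, and EdgeML's actual scheduling may differ.

```python
from typing import List


def rollout_schedule(start: int, target: int, step: int) -> List[int]:
    """List the rollout percentages from start to target in fixed steps."""
    if not (0 < start <= target <= 100) or step <= 0:
        raise ValueError("expected 0 < start <= target <= 100 and step > 0")
    stages = list(range(start, target, step))
    stages.append(target)  # always finish exactly at the target
    return stages


print(rollout_schedule(10, 100, 10))  # → [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(rollout_schedule(10, 100, 25))  # → [10, 35, 60, 85, 100]
```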

Rollout controls:

  • Device sampling: Random selection or targeted cohorts (e.g., iOS only, specific regions)
  • Health checks: Automatic monitoring of error rates, latency, accuracy
  • Pause/resume: Manual controls to halt rollout if issues arise
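One common way to implement random device sampling for a percentage rollout is to hash each device ID into a stable bucket from 0 to 99 and include the device when its bucket is below the current percentage. The sketch below shows that technique. It is an assumption about how sampling could work, not a description of EdgeML's internals, and `rollout_bucket` and `in_rollout` are hypothetical names.

```python
import hashlib


def rollout_bucket(device_id: str, salt: str) -> int:
    """Map a device to a stable bucket in [0, 100) for a given deployment."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id: str, salt: str, percentage: int) -> bool:
    """Return True if the device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, salt) < percentage
```

Because the bucket is fixed per device, raising the percentage only adds devices; a device that received the model at 10% still has it at 25%. Using a per-deployment salt (for example the model name and version) keeps the same devices from always being the canaries.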

5. Monitor: Health Tracking

EdgeML automatically tracks deployment health:

Key metrics:

  • Download success rate: % of devices that successfully downloaded the model
  • Inference latency: Model prediction time (p50, p95, p99)
  • Error rate: % of predictions that failed
  • Accuracy drift: Comparison to validation set accuracy
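For reference, the latency percentiles and error rate above can be computed from raw samples as shown below. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions; `percentile` and `error_rate` are hypothetical helpers, not EdgeML functions.

```python
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Fraction of predictions that failed; 0.0 when there is no traffic."""
    return failed / total if total else 0.0


latencies_ms = [float(v) for v in range(1, 101)]  # hypothetical samples: 1..100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
# → 50.0 95.0 99.0
```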

Alerting thresholds:

# Example: Alert if error rate > 5%
if error_rate > 0.05:
    trigger_rollback()

Monitor through the dashboard:

  • Real-time deployment progress (10% → 25% → 50% → 100%)
  • Per-version metrics comparison
  • Device cohort breakdown

6. Rollback: Safe Recovery

Roll back to a previous stable version if issues are detected:

# Automatic rollback triggered by health checks
# Or manual rollback:
federation.rollback(
    model_id=model['id'],
    target_version="1.0.0"  # Revert to stable version
)

Rollback triggers:

  • Error rate spike: Sudden increase in prediction failures
  • Latency degradation: Model slower than previous version
  • User reports: Manual intervention based on user feedback
  • Manual override: Developer-initiated rollback
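The two automatic triggers can be sketched as a comparison between the version being rolled out and the previous stable version. The 5% error-rate threshold matches the alerting example earlier on this page; the 1.2x latency ratio is a hypothetical value chosen for illustration, and `VersionHealth` and `rollback_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VersionHealth:
    error_rate: float      # fraction of failed predictions, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile inference latency


def rollback_reasons(
    candidate: VersionHealth,
    baseline: VersionHealth,
    max_error_rate: float = 0.05,    # threshold from the alerting example
    max_latency_ratio: float = 1.2,  # hypothetical: allow up to 20% slower
) -> List[str]:
    """Return the automatic triggers that fired; an empty list means healthy."""
    reasons = []
    if candidate.error_rate > max_error_rate:
        reasons.append("error rate spike")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("latency degradation")
    return reasons
```

Returning the list of reasons rather than a bare boolean gives the alert and the post-mortem something concrete to report.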

Rollback process:

  1. Pause current deployment
  2. Mark the problematic version with the Rollback state
  3. Re-deploy previous stable version
  4. Notify team via alerts
  5. Post-mortem to identify root cause

Version Management Best Practices

Version Naming Convention

Use semantic versioning (MAJOR.MINOR.PATCH):

1.0.0 → Initial release
1.1.0 → Added new fraud detection features (backward compatible)
1.1.1 → Fixed bug in preprocessing (no new features)
2.0.0 → Changed input schema (breaking change)

Model Lineage Tracking

EdgeML automatically tracks:

  • Parent version: Which version this was trained from
  • Training rounds: How many rounds were used
  • Participant count: How many devices contributed
  • Convergence metrics: Final loss and accuracy

Example lineage:

v1.0.0 (base model)
└─ v1.1.0 (20 rounds, 1000 devices)
   ├─ v1.2.0 (10 rounds, 1500 devices)
   └─ v1.1.1 (5 rounds, 500 devices, hotfix)
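Lineage of this kind can be modeled as a record per version with a pointer to its parent. The sketch below walks those pointers to recover a version's ancestry. The record layout is an assumption, `VersionRecord` and `ancestry` are hypothetical names, and the parent link for 1.2.0 is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VersionRecord:
    version: str
    parent: Optional[str]              # None for the base model
    rounds: Optional[int] = None       # training rounds, if known
    participants: Optional[int] = None # contributing devices, if known


def ancestry(records: Dict[str, VersionRecord], version: str) -> List[str]:
    """Return the version followed by its parents, newest first."""
    chain = []
    current: Optional[str] = version
    while current is not None:
        chain.append(current)
        current = records[current].parent
    return chain


records = {
    "1.0.0": VersionRecord("1.0.0", parent=None),
    "1.1.0": VersionRecord("1.1.0", parent="1.0.0", rounds=20, participants=1000),
    # Assumed parent link, for illustration only.
    "1.2.0": VersionRecord("1.2.0", parent="1.1.0", rounds=10, participants=1500),
}
print(ancestry(records, "1.2.0"))  # → ['1.2.0', '1.1.0', '1.0.0']
```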

A/B Testing Models

Deploy multiple versions simultaneously to compare performance:

# Deploy v1.0.0 to 50% of devices
federation.deploy(model_id=model['id'], version="1.0.0", rollout_percentage=50)

# Deploy v1.1.0 to other 50%
federation.deploy(model_id=model['id'], version="1.1.0", rollout_percentage=50)

# Compare metrics after 1 week
# Winner becomes the new stable version
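When comparing the two arms, it helps to check that a difference in error rate is larger than chance would explain. One standard option is a two-proportion z-test, sketched below with only the standard library. This is a suggested analysis, not something EdgeML is stated to run, and `two_proportion_z_test` is a hypothetical helper; the counts in the usage line are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    failures_a: int, total_a: int, failures_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in failure rates."""
    p_a = failures_a / total_a
    p_b = failures_b / total_b
    pooled = (failures_a + failures_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # no failures (or all failures) in both arms
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: v1.0.0 failed 50 of 1000 predictions, v1.1.0 failed 20 of 1000.
z, p = two_proportion_z_test(50, 1000, 20, 1000)
print(f"z={z:.2f}, significant at 1%: {p < 0.01}")
```

A small p-value suggests the difference between versions is unlikely to be noise; with few predictions per arm, the normal approximation is unreliable and the test should be run for longer.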

Real-World Example: Mobile Keyboard

Initial Launch (v1.0.0)

  • Create: Register "next-word-predictor" model
  • Train: 50 rounds with 10,000 devices
  • Publish: Version 1.0.0 with ONNX/CoreML formats
  • Deploy: Canary to 5%, then full rollout over 2 weeks
  • Monitor: 99.9% download success, <50ms latency

Feature Update (v1.1.0)

  • Train: Add emoji prediction, 30 rounds with 15,000 devices
  • Publish: Version 1.1.0
  • Deploy: Gradual rollout starting at 10%
  • Issue detected at 25%: Emoji suggestions causing app crashes on iOS 14
  • Rollback: Automatic revert to v1.0.0
  • Root cause: Bug in emoji tokenization for older iOS versions

Bug Fix (v1.1.1)

  • Train: Fix iOS 14 compatibility, 10 rounds with 5,000 devices
  • Publish: Version 1.1.1
  • Deploy: Successful rollout to 100%
  • Monitor: No issues, becomes new stable version

Next Steps