Outage Detection and Recovery

Drove continuously tracks local service instances and reconciles actual state against desired service state.

Instance health detection and tracking

The executor runs periodic readiness and health checks as defined in the local service specification.

  • Readiness checks gate the transition into the ready/healthy states.
  • Health checks run periodically against instances that are already running.

The result of each check (success or failure) is reported to the controller and can trigger stop, replace, or reconcile flows, depending on the service state and the active operation.
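The gating described above can be pictured with a small state tracker. This is only an illustrative sketch, not Drove's implementation: the `State` names, the `InstanceTracker` class, and the `max_failures` threshold are all hypothetical and do not come from the Drove codebase or its service specification.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    STARTING = "starting"    # readiness not yet confirmed
    HEALTHY = "healthy"      # readiness passed, health checks succeeding
    UNHEALTHY = "unhealthy"  # too many consecutive failures


@dataclass
class InstanceTracker:
    """Hypothetical per-instance tracker fed with check results."""
    max_failures: int = 3          # assumed threshold, not a Drove default
    state: State = State.STARTING
    failures: int = 0              # consecutive failures seen so far

    def record(self, success: bool) -> State:
        # Once unhealthy, the instance stays that way until it is replaced.
        if self.state is State.UNHEALTHY:
            return self.state
        if success:
            # A passing check resets the counter and gates the move to HEALTHY.
            self.failures = 0
            self.state = State.HEALTHY
            return self.state
        self.failures += 1
        if self.failures >= self.max_failures:
            self.state = State.UNHEALTHY
        return self.state
```

In this sketch a controller-side consumer would treat a transition to `UNHEALTHY` as the signal to start a stop or replace flow; how Drove actually maps check results to operations depends on the service state and the active operation.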

Container crash

If a local service instance crashes, Drove's reconciliation detects the drift and attempts to restore the desired instances on the affected host.

Executor node failure

When an executor becomes unavailable, the instances on that node are marked as lost from the cluster's perspective. When the executor recovers, executor-controller reconciliation decides whether to keep or clean up its containers based on the current cluster state.

Executor service temporary unavailability

On restart, the executor recovers container metadata and reports back to the controller. Any stale or unexpected containers are handled through reconciliation.

Zombie container detection and cleanup

The executor periodically reconciles local containers against the controller's state:

  • Containers running without corresponding desired metadata are cleaned up.
  • Desired instances missing on the host are marked as lost and replaced according to service policy.
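The two rules above amount to a pair of set differences between what the controller expects on a host and what is actually running there. The following is a minimal sketch under that assumption; the `reconcile` function and the instance IDs are hypothetical and are not part of Drove's API.

```python
from __future__ import annotations


def reconcile(desired: set[str], running: set[str]) -> tuple[set[str], set[str]]:
    """Compare desired instance IDs with containers found on the host.

    Returns (zombies, lost):
      zombies: running locally but unknown to the controller, to be cleaned up
      lost:    expected by the controller but not running, to be replaced
    """
    zombies = running - desired
    lost = desired - running
    return zombies, lost


# Hypothetical instance IDs, for illustration only.
zombies, lost = reconcile(desired={"ls-1", "ls-2"}, running={"ls-2", "ls-9"})
```

Here `ls-9` would be treated as a zombie and `ls-1` as a lost instance. Whether and when a lost instance is actually replaced is governed by the service policy, which this sketch does not model.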

Behavior in maintenance and host blacklisting windows

  • In maintenance mode, the leader controller rejects new write operations.
  • While an executor is being blacklisted, local service placement and reconciliation may defer changes on the affected nodes until the cluster is in a stable state.

Note

For planned node work, deactivate or adjust local services in advance to reduce unnecessary churn.