Graceful Restarts at Scale
Cloudflare operates network services that handle millions of requests per second globally, so even brief downtime is catastrophic. Critical services that perform traffic routing, TLS lifecycle management, and firewall enforcement cannot afford interruptions. Updates and security patches must therefore deploy without dropping connections or failing requests.
The Problem
The naive approach of stopping the old process and then starting a new one creates a window in which incoming connections are refused with ECONNREFUSED. For services handling thousands of requests per second across hundreds of data centers, a 100ms gap translates to hundreds of dropped connections. Stopping the old process also terminates every established connection immediately, abruptly disconnecting clients that are uploading files, streaming video, or holding open WebSockets and gRPC streams.
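The size of that window can be estimated with simple arithmetic. The sketch below assumes a constant arrival rate and one new connection per request. The 5,000 per second figure is purely illustrative, not a measured value.

```rust
/// Estimate how many connections are refused during a restart gap,
/// assuming connections arrive at a constant rate.
fn refused_during_gap(conns_per_sec: u64, gap_ms: u64) -> u64 {
    conns_per_sec * gap_ms / 1000
}

fn main() {
    // 5,000 connections/s is an illustrative figure, not a measured one.
    let refused = refused_during_gap(5_000, 100);
    println!("refused during a 100 ms gap: {refused}");
}
```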
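Binding a second process to the same port with SO_REUSEPORT before shutting down the first does not solve this. With SO_REUSEPORT, each process has its own listening socket with its own accept queue, and the kernel distributes incoming connections among those sockets. If the old process exits before accepting connections already queued on its socket, those connections are orphaned and terminated.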
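A toy model makes the difference visible. This is not kernel code: the queues, the even/odd "hash", and the function names are all assumptions made for illustration. It contrasts two independent listening sockets, as with SO_REUSEPORT, against a single socket that two processes hold handles to.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;

/// Toy stand-in for a listening socket: only its queue of
/// connections that have arrived but not yet been accepted.
type AcceptQueue = Rc<RefCell<VecDeque<u32>>>;

fn new_queue() -> AcceptQueue {
    Rc::new(RefCell::new(VecDeque::new()))
}

/// SO_REUSEPORT style: two independent sockets bound to one port.
/// Returns (connections the new process can accept, connections lost).
fn separate_sockets(incoming: &[u32]) -> (usize, usize) {
    let old = new_queue();
    let new = new_queue();
    for &conn in incoming {
        // Stand-in for the kernel's choice of socket for each connection.
        let target = if conn % 2 == 0 { &old } else { &new };
        target.borrow_mut().push_back(conn);
    }
    // The old process exits without accepting: its queue goes with its socket.
    drop(old);
    let accepted = new.borrow().len();
    (accepted, incoming.len() - accepted)
}

/// Inherited-descriptor style: one socket, two handles to it.
fn shared_socket(incoming: &[u32]) -> (usize, usize) {
    let parent = new_queue();
    let child = Rc::clone(&parent); // models an inherited copy of the descriptor
    for &conn in incoming {
        parent.borrow_mut().push_back(conn);
    }
    // The parent closes its copy; the queue survives through the child's handle.
    drop(parent);
    let accepted = child.borrow().len();
    (accepted, incoming.len() - accepted)
}

fn main() {
    let conns: Vec<u32> = (0..10).collect();
    println!("separate sockets: {:?}", separate_sockets(&conns));
    println!("shared socket:    {:?}", shared_socket(&conns));
}
```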
How ecdysis Works
ecdysis uses an approach pioneered by NGINX:
- The parent process forks a child process, which replaces itself with the new code version via execve()
- Socket file descriptors are inherited via a named pipe shared with the parent
- Both processes temporarily share the listening socket, allowing the parent to continue accepting connections during the child's initialization
- The child signals readiness to the parent, which then closes its socket copy and drains remaining connections
- If the child crashes during initialization, the parent never stopped listening, so no connections are dropped
This architecture eliminates coverage gaps while providing a safe initialization window for new code.
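The shared-socket step can be demonstrated inside a single process using only the Rust standard library. The sketch below is not the ecdysis API. It uses `TcpListener::try_clone`, which duplicates the descriptor, as a stand-in for the child's inherited copy. The helper names are hypothetical, and no fork or execve takes place.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Accept one connection on `listener` and return the byte the client sent.
fn accept_one(listener: &TcpListener) -> std::io::Result<u8> {
    let (mut conn, _peer) = listener.accept()?;
    let mut buf = [0u8; 1];
    conn.read_exact(&mut buf)?;
    Ok(buf[0])
}

/// Connect to `addr` and send a single tag byte.
fn send_tag(addr: SocketAddr, tag: u8) -> std::io::Result<()> {
    let mut client = TcpStream::connect(addr)?;
    client.write_all(&[tag])
}

/// Returns the tags served through the "parent" and "child" handles.
fn simulate_handoff() -> std::io::Result<(u8, u8)> {
    let parent = TcpListener::bind("127.0.0.1:0")?;
    let addr = parent.local_addr()?;

    // Second handle to the same listening socket; models the inherited copy.
    let child = parent.try_clone()?;

    // While the child is still initializing, the parent keeps serving.
    send_tag(addr, b'A')?;
    let served_by_parent = accept_one(&parent)?;

    // The child has signalled readiness: the parent closes its copy only.
    drop(parent);

    // The socket is still open through the child's handle, so nothing is refused.
    send_tag(addr, b'B')?;
    let served_by_child = accept_one(&child)?;

    Ok((served_by_parent, served_by_child))
}

fn main() -> std::io::Result<()> {
    let (a, b) = simulate_handoff()?;
    println!("parent served {}, child served {}", a as char, b as char);
    Ok(())
}
```

In a real upgrade the two handles live in different processes, and the parent would also finish its in-flight connections before exiting.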
Features and Integration
ecdysis provides native support for modern Rust development:
- Tokio integration: Async stream wrappers allow inherited sockets to become listeners without additional glue code
- systemd support: Optional systemd-notify integration for process lifecycle management
- Crash safety: Failed initializations don't interrupt the running service
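For a sense of what the first two integrations involve underneath, here is a standard-library-only sketch. It is not ecdysis code, and `adopt_listener` and `notify_ready` are hypothetical names. The first function prepares an inherited descriptor the way `tokio::net::TcpListener::from_std` expects, by switching it to non-blocking mode. The second sends the `READY=1` datagram that systemd's notify protocol uses, handling only filesystem socket paths and not abstract ones.

```rust
use std::io;
use std::net::TcpListener;
use std::os::fd::OwnedFd;
use std::os::unix::net::UnixDatagram;
use std::path::Path;

/// Wrap an inherited descriptor in a std listener and make it non-blocking,
/// the state `tokio::net::TcpListener::from_std` expects of its argument.
fn adopt_listener(fd: OwnedFd) -> io::Result<TcpListener> {
    let listener = TcpListener::from(fd);
    listener.set_nonblocking(true)?;
    Ok(listener)
}

/// Send the readiness datagram to the socket at `notify_path`, which a
/// service would normally read from the NOTIFY_SOCKET environment variable.
fn notify_ready(notify_path: &Path) -> io::Result<()> {
    let sock = UnixDatagram::unbound()?;
    sock.send_to(b"READY=1", notify_path)?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Stand-in for inheritance: reduce a listener to a bare descriptor and back.
    let original = TcpListener::bind("127.0.0.1:0")?;
    let adopted = adopt_listener(OwnedFd::from(original))?;
    println!("adopted listener on {}", adopted.local_addr()?);

    // A throwaway datagram socket plays the role of the service manager.
    let path = std::env::temp_dir().join(format!("notify-demo-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path);
    let manager = UnixDatagram::bind(&path)?;
    notify_ready(&path)?;
    let mut buf = [0u8; 16];
    let n = manager.recv(&mut buf)?;
    println!("manager received {:?}", String::from_utf8_lossy(&buf[..n]));
    std::fs::remove_file(&path)
}
```

In a real child process the descriptor would arrive as a raw integer, so turning it into an `OwnedFd` would need an `unsafe` call to `OwnedFd::from_raw_fd`, with the caller guaranteeing that the descriptor is open and not owned elsewhere.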
Availability
The library is now available on GitHub, crates.io, and docs.rs under an open-source license, enabling any organization running critical Rust services to implement zero-downtime upgrades.