It's October 2018. I'm a Docker Swarm fan. I use it. My clients use it. I even made the world's best online and real-world courses on it. This is a comprehensive review of the past, present, and future of Swarm.
You may have seen my article from five months ago, "It's 2018, Is Swarm Dead?" In it, I address the [false] assumption that Docker was replacing Swarm with Kubernetes. On the contrary, Docker added Kubernetes as an orchestrator option to Docker's growing list.
2019 Update: Docker again confirmed Swarm's future at DockerCon 2018 EU and DockerCon 2019: 1. New Swarm features, 2. A large majority of Docker's 700+ customers use Swarm, 3. The Swarm team is hiring, 4. Swarm usage is increasing in every industry report I've seen in the last year.
Now that we're past the release of Docker Enterprise 2.0 with Kubernetes support, I think it's time to revisit the SwarmKit and libnetwork projects and see what's been going on for the last twelve months.
This article is a superset of detailed information from my video interview below where I got to sit down with Docker Swarm engineer Anshul Pundir and talk about the past, present, and future of the Docker Swarm container orchestrator.
Reminder: "Docker Swarm Mode" is the feature set we all use in Docker Engine from the Docker CLI or various GUIs. Those features are made possible mostly by the SwarmKit and libnetwork open source libraries. For us to eventually see a feature in "Docker Engine Release 18.XX," the maintainers first need to add the functionality to the library repos. This means that if you see a new feature in SwarmKit/libnetwork, it may take months or more for it to surface in the next Docker Engine release.
Finished Work in Late 2017 and 2018
Let's talk about what's been done over the last twelve months. The major efforts have been around fixing bugs, improving scalability, paying down technical debt, and making Swarm easier to troubleshoot and monitor. Here are some major themes and related merged pull requests.
- A series of SwarmKit and libnetwork fixes to provide the possibility for true zero-downtime rolling updates of Swarm services (assuming you're using healthchecks and proper app connection management). Mostly tracked in moby#30321, and the fix for the last known issue (that I'm aware of) was merged for 18.05 with PR moby#36638.
- The above makes it even easier to run a one-node Swarm in those cases where you don't need hardware fault-tolerance but still want zero-downtime rolling updates. I always recommend Swarm for single-node Docker setups over using scripts or the docker-compose CLI.
- Adding a network troubleshooting client/server tool that includes a diagnostic API endpoint and CLI client. libnetwork#2032
- Adding metrics for lots of Swarm objects. swarmkit#2673
- Improving scale of VIP load balancing with epic moby#37372, which closes the issue for this famous early attempt to scale a production-level Swarm to 50k services. Includes improving CPU performance of large task deployments swarmkit#2675.
- Windows Server 1709 and 1803 now support full Swarm networking including overlay networks, routing mesh, and VIPs. Here's Microsoft's announcement about it. Note that you need to run Server Core 1709 or newer VMs and also Docker 18.03 or newer for this to work. Also be aware of Windows container version compatibility limits. Swarm can also create tasks that use Hyper-V isolation or Process isolation in docker run, service create, and stack deploy. moby#34424
- Changing raft replication to a streaming snapshot model for improving scale and performance of large swarms swarmkit#2458.
- US Government NIST FIPS 140-2 security standard validation is ongoing, but the best case here is the latest Docker Enterprise release 18.03.1, which supports using Docker and Swarm with this FIPS standard. You can see the related PRs in SwarmKit happening in 2018. Eventually Kubernetes will get validated on Docker Enterprise FIPS 140-2 as well, according to the Docker blog post.
- A TLA+ specification for SwarmKit. TLA+ is a formal specification language for concurrent systems, and a design written in it can be checked with tools to help validate that the design behaves as intended. It's too smart for me, but I hear it's important for ongoing scale and quality improvements.
- Feature parity in Docker CLI for Swarm/Kubernetes stacks. Use Docker CLI to manage apps in both orchestrators. cli#1031
- Adding Docker Engine configs for Swarm raft heartbeat and election waits; I assume to improve resiliency on high-latency networks and prevent "manager flip/flops." moby#36726
- Adds SCTP protocol support to Swarm overlay networks. It's a different IP protocol than the already supported TCP/UDP/ICMP. moby#33922
- Improving details in Swarm debug logs. swarmkit#2486
- Create service tasks (containers) with custom hostnames based on node's hostname and other Swarm environment values (using templates). via moby#34686
That's definitely not all the work, so if you're looking for something specific, check the full list of closed PRs in moby/moby related to Swarm.
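To make two of the items above concrete, here is a minimal sketch of a stack file that pairs a healthcheck with start-first rolling updates and uses a templated hostname. The service name, image, healthcheck command, and timings are my own illustrative assumptions, not values from any of the PRs, and it assumes a Compose file format of 3.4 or newer (needed for the update order setting).

```shell
# Write a hypothetical stack file. The service name "web", the nginx:alpine
# image, and all timings are illustrative assumptions.
cat > stack.yml <<'EOF'
version: "3.4"
services:
  web:
    image: nginx:alpine
    # Swarm template placeholders, filled in per task at creation time
    hostname: "{{.Node.Hostname}}-{{.Service.Name}}"
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1/"]
      interval: 10s
      timeout: 3s
      retries: 3
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        # start the replacement task before stopping the old one
        order: start-first
EOF

# Then, on a Swarm manager (a single node after "docker swarm init" is fine):
#   docker stack deploy -c stack.yml demo
```

As I understand it, the start-first order combined with a healthcheck is what lets Swarm bring up a healthy replacement before it removes the old task, which is the piece that matters most on a one-node Swarm. Your app still needs to handle connection draining on its own.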
Other Works In Progress (WIP)
Next, let's look at what activities the Swarm team has publicly stated they are working on, or that they intend to work on.
- WIP for setting sysctl (Linux kernel parameters) in Swarm services. Custom sysctl parameters are often used for kernel settings like network optimization. This works in docker run today and is part of the ongoing effort to bring all security/kernel options from "run" to "services". See here for a full tracking list of docker service create parity. moby#37701 and swarmkit#2729
- Device support: The most common use case here is GPU support, so you can schedule services that require/use one or more GPUs and Swarm will know where to schedule the task and how to use the resource. swarmkit#2682
- Kubemark-like performance testing and validation for Swarm clusters. We first saw some new metrics added to SwarmKit in swarmkit#2673, but I haven't seen any movement specifically on a user-focused testing tool. This idea of a benchmarking and testing tool for clusters has been mentioned by several engineers as something they'd like to see for Swarm, but no code yet that I see.
- New network and IP allocator (IPAM, port management). It turns out it's a lot of work to have a group of servers all creating virtual networks, issuing subnets and IPs that don't conflict, and keeping that state accurate on every server when you have containers spinning up and down every second. Over the last few years we've seen various issues that, as far as I've seen in PRs, are now resolved. It sounds like this work will pave the way for future features and fewer edge cases. swarmkit#2686
- More advancements in load balancing. I assume this means more knobs to turn on the routing mesh and VIP features of the overlay network driver in Swarm. I don't have specifics.
- Cluster volumes. Docker doesn't have a built-in Swarm-aware volume driver yet. That work has been in discussion since early 2017. Today you can use REX-Ray and others as 3rd-party Docker Plug-ins with Swarm. There's also Docker's CloudStor driver, but that isn't yet open-sourced or built-in. The SwarmKit team is interested in picking this work up again and possibly supporting CSI, the Container Storage Interface, which other orchestrators now support.
- Custom default IP subnets are now available for bridge networks, and will hopefully come to overlay networks. moby#36396
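For reference, the docker run side of the sysctl item above already works today. This is a minimal sketch: the alpine image and the net.core.somaxconn value are illustrative assumptions on my part, and the guard simply skips the command on machines without Docker installed.

```shell
# Assumed example setting; net.core.somaxconn is a namespaced kernel
# parameter, so it can be set per container.
SYSCTL_SETTING="net.core.somaxconn=1024"

if command -v docker >/dev/null 2>&1; then
  # Start a throwaway container with the setting applied and read it back
  docker run --rm --sysctl "$SYSCTL_SETTING" alpine sysctl net.core.somaxconn \
    || echo "docker run failed (is the daemon running?)" >&2
else
  echo "docker not found; skipping" >&2
fi
```

The WIP work tracked in the PRs above is about exposing the same kind of setting on Swarm services, so treat this as the single-container behavior only.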
Docker Swarm Team Is Hiring
In my video interview with Anshul above, he points out that they are looking for Swarm engineers. See "Orchestration Engineers" on their careers listing for more info.