I will tell you the story starting from the very beginning...
In a project I'm working on at Voverc, we run a bunch of microservices on Docker Swarm, and at the beginning we decided to use Rexray (rather than Flocker, by then a dead-end project) to manage Docker volumes and their orchestration. As you certainly know, Docker Swarm is very good at orchestrating stateless containers, but when a container has storage attached (e.g. a MySQL database) and you want to move it to another machine, you have to manage volume replication, mounting and failover by yourself: there is no integrated, automated mechanism for this. For this reason we chose a volume orchestrator like Rexray; it looked promising at the beginning, given the size of its community and its wide support for cloud storage providers (AWS, Google, Azure, etc). Long story short, after some months of production use we faced a lot of issues, in my opinion mainly because the Azure integration was not yet ready for production. For example, we had some long downtimes when Docker Swarm and Rexray got stuck and were unable to move a container after a machine restart.
After long troubleshooting sessions and community help, we decided to give up on this solution and put a MySQL Galera cluster in place to manage our MySQL databases; instead of having one MySQL container per microservice, we put one schema per microservice on the same MySQL cluster (read this post if you want to know more). We wanted a fast transition to the new database, so we provisioned a three-node cluster in the same Azure region, accessible via public IP to avoid a VLAN networking setup; we don't have a huge load, so we figured the performance loss from using public interfaces would be minimal. And indeed, the new architecture performed very well: we stress-tested it, simulated failovers and monitored load balancing. Good, ready for production then!
Here the pain begins...
After some days of production use, we started to face strange, apparently random receive timeouts on upstream services calling microservices in the Docker Swarm. By strange I mean that some requests, when debugged with Postman, took exactly 15 minutes and a few seconds (in production we obviously use low timeouts to fail fast). At first we investigated to rule out Docker Swarm routing issues or Galera cluster replication misconfigurations. At some point I started wondering whether the problem could come from the connection pool, so I switched from Tomcat to Hikari, which has better debug logging. I noticed these strange logs:
DEBUG com.zaxxer.hikari.pool.PoolBase - Closing connection com.mysql.jdbc.JDBC4Connection@229960c: (connection is evicted or dead)
Well, dropped connections! Seeing that, I ran tcpdump on both sides to check whether something strange was happening with the MySQL TCP connections, and that's precisely where the inspiration struck: what if everything here is fucked up by the NAT idle timeout? So I checked Azure's NAT idle timeout (4 minutes) and reduced the connection pool's maxLifetime and idleTimeout below that number, to make the pool kill connections before the NAT silently cut them for inactivity.
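Concretely, the resulting pool settings looked something like this. This is a sketch with illustrative values, not our exact production config: the HikariCP property names are real, but the numbers and the JDBC URL are assumptions, chosen only to stay safely below Azure's 4-minute limit.

```properties
# HikariCP settings -- illustrative values, not the exact production config.
# The point: keep idleTimeout and maxLifetime below the 4-minute (240000 ms)
# Azure NAT idle timeout, so the pool retires connections before the NAT
# silently drops them.
jdbcUrl=jdbc:mysql://db.example.com:3306/app
maximumPoolSize=10
# Evict connections idle for more than 2 minutes
idleTimeout=120000
# Retire every connection after at most 3 minutes, even if busy in rotation
maxLifetime=180000
# Fail fast when a connection can't be obtained, instead of hanging
connectionTimeout=5000
```

The exact numbers matter less than the invariant: both timeouts must be shorter than whatever idle timeout sits between you and the database.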
Magically, everything worked with these settings!
Lesson learned: never access a MySQL database in the cloud via a public IP, or at the very least remember that there is a NAT in between, with its own idle and keep-alive timeouts!