Sunday, September 24, 2017

Connection pools and the cloud...how it can go wrong!

I will tell you the story starting from the very beginning...

In a project I'm working on at Voverc, we run a bunch of microservices on Docker Swarm, and at the beginning we decided to use Rexray (rather than Flocker, a dead-end project) to manage Docker volumes and their orchestration. As you certainly know, Docker Swarm is very good at orchestrating stateless containers, but when a container has storage attached (e.g. a MySQL database) and you want to move it to another machine, you have to manage volume replication, mounting and failover yourself. There is no integrated, automated mechanism for this. For this reason we chose a volume orchestrator like Rexray; it looked promising at the beginning, given the size of the community and the broad support for cloud storage providers (AWS, Google, Azure, etc). In short, after some months of production use we faced a lot of issues, in my opinion mainly due to Azure cloud and its integration not yet being ready for production use. For example, we had some long downtimes caused by Docker Swarm and Rexray getting stuck, unable to move a container after a machine restart.
After long troubleshooting sessions and help from the community, we decided to give up on this solution and set up a MySQL Galera cluster to manage our MySQL databases; instead of having one MySQL container per microservice, we put one schema per microservice on the same MySQL cluster (read this post if you want to know more). We wanted a fast transition to the new database, so we provisioned a three-node cluster in the same Azure region, accessible via public IP to avoid setting up VLAN networking; we don't have a huge load, so we thought the performance loss from using public interfaces would be minimal. Indeed, the new architecture performed very well: we stress-tested it, simulated failovers and monitored load balancing. Good, ready for production then!

Here the pain begins...

After some days of production use we started to see strange, apparently random, receive timeouts on upstream services calling microservices in the Docker Swarm. By strange I mean that some requests, when debugged with Postman, took exactly 15 minutes and a few seconds (in production we obviously use low timeouts to fail fast). At first, we investigated to rule out Docker Swarm routing issues or Galera cluster replication misconfigurations. At some point I wondered whether the problem could be coming from the connection pool, so I switched from the Tomcat pool to Hikari, which has better debug logging. I noticed these strange logs:

DEBUG com.zaxxer.hikari.pool.PoolBase - Closing connection com.mysql.jdbc.JDBC4Connection@229960c: (connection is evicted or dead)

Well, dropped connections! Having seen that, I ran tcpdump on both sides to check whether something strange was going on with the MySQL TCP connections, and it was precisely there that I got the inspiration...what if everything here is messed up by the NAT keep-alive timeout? So I checked the Azure NAT keep-alive timeout (4 minutes) and reduced the connection pool (CP) maxLifetime and idleTimeout below this value, so that the CP closes connections before the NAT cuts them for inactivity. Magically, everything was fine with these settings!
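To make the arithmetic concrete, here is a minimal sketch of how one might derive pool timeouts that stay below the NAT timeout. It is not the configuration we actually deployed: the function name and the two safety margins (60 and 30 seconds) are assumptions chosen for illustration. Only the 4-minute NAT timeout and the two HikariCP property names come from the story above. HikariCP expects both values in milliseconds.

```python
# Sketch: pick connection pool timeouts that stay below a NAT timeout.
# The margins below are illustrative assumptions, not recommended values.

NAT_TIMEOUT_S = 4 * 60  # Azure NAT timeout mentioned in the post (4 minutes)


def pool_timeouts_ms(nat_timeout_s, margin_s=60, idle_gap_s=30):
    """Return HikariCP-style timeouts (in milliseconds) below the NAT timeout.

    margin_s   -- how far below the NAT timeout maxLifetime should sit
    idle_gap_s -- how far below maxLifetime idleTimeout should sit
    """
    max_lifetime_s = nat_timeout_s - margin_s
    idle_timeout_s = max_lifetime_s - idle_gap_s
    if idle_timeout_s <= 0:
        raise ValueError("NAT timeout too short for the chosen margins")
    return {
        "maxLifetime": max_lifetime_s * 1000,
        "idleTimeout": idle_timeout_s * 1000,
    }


settings = pool_timeouts_ms(NAT_TIMEOUT_S)
for name, value in settings.items():
    print(f"{name}={value}")
# prints:
# maxLifetime=180000
# idleTimeout=150000
```

The point is only that both values end up comfortably under 240000 ms, so the pool retires a connection before the NAT silently drops it. Check the HikariCP documentation for the minimum values it accepts before copying numbers like these.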

Lesson learned: never access a MySQL database in the cloud via public IP, or at least keep in mind that there is a NAT in between, with its keep-alive timeouts!

Friday, August 4, 2017

How to open your gate with your smartphone

How many times do you forget your external gate remote, especially when you go out running or cycling? Well, for me it happens so often that I decided to find a way to open the gate using my iPhone, considering that I'm far less likely to forget it than the remote :)

Wow, I have a new Home Automation project to implement...let's do it!

Basically, my conditions / requirements are the following:

  • I don't want to tweak my gate board, considering my limited knowledge of its circuitry
  • I have Wi-Fi network coverage at the gate
  • I want to integrate this system with my existing HA webapp (SmartHome)
  • Everything should be cheaper than any other existing solution
  • Bonus point: I would like to also use Siri on my iPhone and say "Hey Siri, open the gate"
Given the previous points, the first idea that came to my mind was to use my existing remote and control its activation with a relay driven by a tiny Raspberry Pi Zero W (yes, W stands for wireless, cool!). Activating a relay with a Raspberry Pi is quite straightforward using Python, and you can find a lot of tutorials on the internet. The only part that requires a bit of effort is soldering wires to bring the battery connection out of the remote, so that the relay can switch the remote on. The gate button can easily be held in the pushed position using a nylon cable tie.

Once you are able to toggle the relay, you can expose this operation using the Python Flask web framework and integrate the API into your HA platform. In my case I can open the gate using the following shell command:

curl -X POST 'localhost:5000/toggle-relay?delay=1'

As you can see I also added the delay parameter to control the duration of the relay activation (see the code here).
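For readers who just want the shape of it, here is a rough, dependency-free sketch of such an endpoint. It is not the code linked above: it swaps Flask for Python's standard-library http.server so it runs anywhere, and the pin number, the set_relay placeholder, the 5-second cap on delay and the serve helper are all assumptions made for illustration. On a real Raspberry Pi you would replace set_relay with an actual GPIO write.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

RELAY_PIN = 17  # hypothetical BCM pin number, adjust to your wiring


def set_relay(on):
    """Placeholder for the GPIO write (e.g. RPi.GPIO.output(RELAY_PIN, on))."""
    print("relay pin", RELAY_PIN, "ON" if on else "OFF")


def pulse_relay(delay_s, write=set_relay, sleep=time.sleep):
    """Close the relay, wait delay_s seconds, then always open it again."""
    write(True)
    try:
        sleep(delay_s)
    finally:
        write(False)


def parse_delay(query, default=1.0, maximum=5.0):
    """Read ?delay=<seconds> from a query string; raise ValueError if invalid."""
    raw = parse_qs(query).get("delay", [str(default)])[0]
    delay = float(raw)  # raises ValueError on non-numeric input
    if not 0 < delay <= maximum:
        raise ValueError("delay out of range")
    return delay


class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        url = urlparse(self.path)
        if url.path != "/toggle-relay":
            self.send_error(404)
            return
        try:
            delay = parse_delay(url.query)
        except ValueError:
            self.send_error(400, "invalid delay")
            return
        pulse_relay(delay)
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=5000):
    """Start the HTTP server on all interfaces (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), RelayHandler).serve_forever()
```

Calling serve() starts the server, after which the curl command above should pulse the relay for one second. Capping delay is a deliberate choice: it keeps a typo in the URL from holding the remote's button down for minutes.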
At this point the project is almost complete but, wait, there's the bonus point, and I want it :) After googling a bit I found the promising HomeBridge project. Essentially, it allows you to integrate whatever device you want with iOS HomeKit. I looked a bit at the code, and writing a custom accessory / platform is really easy: in my case, apart from the boilerplate code, it's less than 10 lines of JS (see the code here)!



You can build this smart gate opener for less than 10 euros (the price doesn't include the remote and its battery :D)...a very easy and useful project! See you at the next one!