
Building a small all-in-one high-availability stack


Goals

I started off with a basic goal: host a small Python website, which uses a MySQL-like database, on a highly available cluster.

After recently moving from a rack in a datacenter (and fortunately saving ~£250/month), I started looking at hosting some web applications on VPSs from a couple of providers.

Of course, this means:

  • Every instance costs money - I can no longer spin up 10 extra VMs just because the hardware is already running… each additional instance adds to the bill.
  • I no longer have control of the underlying hardware - if something goes wrong, I am 100% reliant on someone else fixing it (hopefully within a reasonable time)
  • The VPSs all have public IP addresses and I have no private network for communications - although some providers give you a private network between instances, from my perspective, this should never be treated as ‘secure’.
  • I’m a cheapskate - each VPS only has 4GB RAM, so the entire setup must be slim.
  • To protect myself from the loss of a provider, I also spread across multiple providers. This means that the whole setup must be provider-agnostic (so I won’t be able to use things like provider firewalls etc.)

With all this in mind, I set sail to create a setup that would handle the failure of any instance (or any service on any instance) and use as few VPSs as possible.

VPN

Peer-to-peer

Each VPS instance is configured with tinc (https://tinc-vpn.org/) - an open source VPN supporting mesh communication.

This allows each VPS to communicate without a centralised VPN server (or even meta-data server) - as long as two VPSs are running, the VPN will connect and they will be able to communicate.

The initial setup requires generating a key pair for each instance and distributing the public keys, with each host assigned a pre-defined private IP within the mesh network.
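
As a rough sketch (node names, IPs and paths here are illustrative, not my actual configuration), each node has a tinc.conf naming the peers it should connect to, plus a host file per node containing its public address, mesh subnet and public key:

# /etc/tinc/mesh/tinc.conf on "vps1"
Name = vps1
ConnectTo = vps2
ConnectTo = vps3

# /etc/tinc/mesh/hosts/vps2 (a copy of this is distributed to every node)
Address = <vps2 public IP>
Subnet = 10.10.0.2/32

-----BEGIN RSA PUBLIC KEY-----
...
-----END RSA PUBLIC KEY-----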

Home-lab communications

To communicate back to servers hosted in the home-lab (and to cope with internet connection issues, changes to public addresses etc.), OpenVPN is used to establish connections.

Each VPS is configured as an OpenVPN server and a VPN VM (“vps-vpn-interconnect”) within the home-lab connects out to each VPS to establish a connection. This means the vps-vpn-interconnect VM has an independent network for each of the VPSs, and the home-lab firewall routes traffic destined for the OpenVPN IPs of the VPSs via the vps-vpn-interconnect VM.
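
A minimal sketch of the client side on the vps-vpn-interconnect VM (one config per VPS; hostnames, ports and key paths are placeholders):

# /etc/openvpn/client/vps1.conf - one tunnel per VPS
client
dev tun1
proto udp
remote <vps1 public hostname> 1194
ca   /etc/openvpn/keys/vps1/ca.crt
cert /etc/openvpn/keys/vps1/interconnect.crt
key  /etc/openvpn/keys/vps1/interconnect.key
persist-key
persist-tun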

Bind

For public DNS, I use bind - this is something I’ve used at work for very high traffic use-cases and it’s something that I’m used to. It runs on each of the VPSs and each of the domains is configured with the VPSs as its nameservers.

It is configured using a “hidden master” setup, with the master running on a home-lab instance, which communicates via the OpenVPN connection. This allows configuration to be deployed to a single place; bind then notifies each of the VPSs, which obtain the latest zone configurations automatically.
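
A rough sketch of the named.conf side of this (the zone name is real; IPs and file paths are illustrative) - the hidden master notifies and allows transfers to the VPSs over OpenVPN, while each VPS carries the zone as a slave:

// On the hidden master (home-lab)
zone "mattsbit.co.uk" {
        type master;
        file "/etc/bind/zones/db.mattsbit.co.uk";
        also-notify { <VPS OpenVPN IPs>; };
        allow-transfer { <VPS OpenVPN IPs>; };
};

// On each VPS
zone "mattsbit.co.uk" {
        type slave;
        masters { <hidden master OpenVPN IP>; };
        file "/var/cache/bind/db.mattsbit.co.uk";
};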

This works absolutely perfectly for static DNS entries, but has no dynamic capabilities, meaning that if a VPS goes down, DNS will still hand out that host in responses.

Consul to the rescue

Consul saved me from this issue… I attach a secondary public IP to each of the instances and bind consul to this, while using the internal VPN IPs for consul’s cluster communication.

Each VPS is configured with services for each of its IPs (both public and private). A-records for each of the consul public IPs are then configured and bind delegates a sub-domain to each of these.

Each “service” record (e.g. blog.mattsbit.co.uk) is then CNAMEd (in bind) to the relevant consul service record. The consul agents on the VPSs monitor one another and, when an instance fails, consul removes that instance from DNS.
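
A sketch of one of these consul service definitions (the service name feeds the DNS recap below; the port and health check are placeholders):

{
  "service": {
    "name": "web",
    "port": 443,
    "check": {
      "tcp": "localhost:443",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}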

To recap on DNS, the current structure is this (a zone-file sketch follows the list):

  • Attempt to resolve: blog.mattsbit.co.uk
  • Name servers for mattsbit.co.uk -> VPS (bind)
  • Record for blog.mattsbit.co.uk -> CNAME to web.service.cluster.dockstudios.co.uk
  • cluster.dockstudios.co.uk delegated to consul IPs
  • consul resolves working hosts providing web.service.cluster.dockstudios.co.uk service
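
In zone-file terms, the chain above boils down to something like this (record data illustrative):

; dockstudios.co.uk zone - delegate the consul sub-domain
cluster.dockstudios.co.uk.    NS      consul1.dockstudios.co.uk.
cluster.dockstudios.co.uk.    NS      consul2.dockstudios.co.uk.
consul1.dockstudios.co.uk.    A       <VPS1 secondary public IP>
consul2.dockstudios.co.uk.    A       <VPS2 secondary public IP>

; mattsbit.co.uk zone - point the site at the consul service record
blog.mattsbit.co.uk.          CNAME   web.service.cluster.dockstudios.co.uk.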

Docker

To host actual applications from within the cluster, I’m using Docker with Docker Swarm (yes, I know it’s not hip anymore.. but try running kubernetes on machines with <2GB available memory and 1-2CPU cores!).

Docker management is bound to the internal VPN IPs and the OpenVPN IPs - docker swarm is advertised over the tinc IPs, allowing the instances to communicate with one another privately, and the OpenVPN IPs allow them to be configured via Portainer on a home-lab VM.
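
In practice this is just a case of pointing swarm at the tinc addresses when initialising and joining (IPs and the token are placeholders):

# On the first VPS - advertise and listen on the tinc IP only
docker swarm init --advertise-addr 10.10.0.1 --listen-addr 10.10.0.1:2377

# On the remaining VPSs, using the join token printed by the command above
docker swarm join --advertise-addr 10.10.0.2 --listen-addr 10.10.0.2:2377 \
        --token <manager join token> 10.10.0.1:2377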

Each service running in docker is bound to the tinc IPs of the VM. A port range has been allocated to the services and is explicitly denied in IPTables, to avoid any accidental forwarding from the public IPs.
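
As a sketch (the interface name and port range are placeholders for whatever range is allocated), the belt-and-braces rule on the public interface is simply:

# Never accept the docker service port range via the public interface
iptables -A INPUT -i eth0 -p tcp --dport 10000:10999 -j REJECT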

Docker images are hosted on a docker registry VM inside the home-lab, which each of the VPSs use, reducing the reliance on public services (such as DockerHub, which is renowned for its rate limiting and not something you want to face during a host failure).

HAProxy (RIP NGinx)

Originally, nginx was being used as the primary reverse proxy, running on each of the VPSs.

However, the configurations became quite long (though this could have been improved with some includes) - the result was thousands of lines of configuration. Routing across VPSs wasn’t fantastic either: either nginx would forward to the local docker daemon, which would fail if it wasn’t running, or nginx would round-robin across all of them, meaning performance would be degraded for 2/3 of the requests.

Instead, I switched to HAProxy and, in retrospect, this was a fantastic decision.

The configuration for a site was very basic:

An HTTP redirect in the front-end block, ignoring ACME certificate validation requests (more on that later):

        redirect scheme https code 302 if { hdr(Host) -i blog.mattsbit.co.uk } !{ ssl_fc } !url_acme

A HTTPS front-end block:

        acl host_blog_mattsbit_co_uk hdr(host) -i blog.mattsbit.co.uk
        use_backend blog_mattsbit_co_uk if host_blog_mattsbit_co_uk

And, finally, a backend block for the service:

backend blog_mattsbit_co_uk
        mode http
        server blog_mattsbit_co_uk 127.0.0.1:<Docker Port> check
        server error_backend 127.0.0.1:<Maintenance page port> backup

Though this example uses just the local VPS as the backend, it can easily be extended: the backend uses the local VPS as the primary and the other VPSs as backup servers, with some simple front-end logic to fail over to an “error” backend if the main backend has no healthy hosts:

 acl blog_mattsbit_co_uk_unhealthy nbsrv(blog_mattsbit_co_uk) le 1
 use_backend error_backend if blog_mattsbit_co_uk_unhealthy
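
A sketch of that extended backend (ports and tinc IPs are placeholders) - the local docker daemon is preferred and the other VPSs only take traffic if it is down:

backend blog_mattsbit_co_uk
        mode http
        server local 127.0.0.1:<Docker Port> check
        server vps2 <VPS2 tinc IP>:<Docker Port> check backup
        server vps3 <VPS3 tinc IP>:<Docker Port> check backup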

Avoiding round trips

Since publicly accessible services that are hosted within the home-lab are used by other services within the home-lab, the flow of traffic would end up:

  • Internet -> VPS -> OpenVPN -> home-lab instance -> internet -> VPS -> OpenVPN -> Another home-lab instance

Since this is obviously incredibly wasteful, a VPS “shim”, running some of the VPS services (haproxy, xinetd and other forwarding services), runs on a VM within the home-lab. Historically, the home-lab DNS servers would need to rewrite individual domains or subdomains to point to this local proxy. However, by adding additional consul services and splitting them based on whether they are hosted directly inside the VPS cluster or forwarded to the home-lab, the consul service record can be overridden using split-DNS within the home-lab to point directly at the local proxy, whilst VPS-run services still resolve to the internal OpenVPN IPs of the VPSs (using consul):

  • vps-run-service.mattsbit.co.uk -> CNAME: internal-vps-web.service.cluster.dockstudios.co.uk -> home-lab DNS -> VPS Consul -> Resolve: VPS OpenVPN IPs

  • home-lab-run-service.mattsbit.co.uk -> CNAME: web.service.cluster.dockstudios.co.uk -> home-lab DNS -> Resolve to local proxy

SSL Certificates

HAProxy makes SSL certificate management much easier - it can be provided with a simple list file, containing a mapping of SSL certificate to domain:

/etc/letsencrypt/some/ssl/path.pem mydomain.com

SSL certificates are managed by a central VM, which handles generating the HAProxy configurations for each of the VPSs, renewing certificates and generating the SSL list file. These are then pushed out to each of the VPSs, where the configuration is validated and HAProxy is reloaded. The HAProxy configuration contains a backend for this centralised VM, meaning all letsencrypt domain verification requests are sent to the central VM.
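
Tying this back to the url_acme ACL used in the front-end earlier, the relevant HAProxy pieces look roughly like this (file paths and the backend address are placeholders):

frontend https-in
        bind :443 ssl crt-list /etc/haproxy/ssl/crt-list.txt

frontend http-in
        bind :80
        # Send all Let's Encrypt HTTP-01 validation requests to the central certificate VM
        acl url_acme path_beg /.well-known/acme-challenge/
        use_backend letsencrypt_backend if url_acme

backend letsencrypt_backend
        mode http
        server cert_vm <certificate VM OpenVPN IP>:80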

IPTables

The IPTables rules become crucial in this configuration: since IPTables is used to forward some traffic from the primary interfaces to other interfaces, ip_forwarding has to be enabled on the primary interfaces. To ensure that all private interfaces are kept private, IPTables is used as a barrier.

Since the list of ports for valid external connections is relatively small (80, 443, 25, 22), the IPTables rules can begin with simple rules to allow these ports and then reject all other NEW connections via the external interfaces. The VPN interfaces are a little more relaxed, allowing additional ports for docker and for communications from HAProxy to machines via the OpenVPN connections.
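
A rough sketch of the external-interface portion of this (the interface name is a placeholder):

# Allow replies to connections we initiated
iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Allow the small set of publicly exposed services
iptables -A INPUT -i eth0 -p tcp -m multiport --dports 22,25,80,443 -j ACCEPT
# Reject anything else arriving new via the external interface
iptables -A INPUT -i eth0 -m conntrack --ctstate NEW -j REJECT
iptables -A FORWARD -i eth0 -m conntrack --ctstate NEW -j REJECT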

To add more protection, all internal applications (VPNs, OpenSSH etc.) use random ports and fail2ban quickly blocks any potential attacks on these.

MariaDB

As previously stated, the applications I was looking to host weren’t completely stateless and some sort of database was required.

The majority of the applications support MySQL/MariaDB and, historically, a lot of the applications that I had written (and would be hosting) also support it.

MariaDB is set up in multi-master mode: each VPS is configured with MySQL running on the host, using wsrep (Galera) clustering.
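
A minimal sketch of the wsrep settings involved (the cluster name and IPs are placeholders; replication traffic goes over the tinc IPs):

# /etc/mysql/conf.d/galera.cnf
[mysqld]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = vps-cluster
wsrep_cluster_address    = gcomm://10.10.0.1,10.10.0.2,10.10.0.3
wsrep_node_address       = 10.10.0.1
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2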

To ensure performance and high availability with applications communicating with MySQL, each docker stack that requires access to MySQL is deployed with a proxy. Originally, this used proxysql, but due to long-running issues, I eventually switched to HAProxy.

This is configured with:

  • Single backend with primary server, which points to the local machine
  • Backup hosts are configured with the remaining VPSs.

The container has a volume mount of a file that contains the details of the current VPS (IP, hostname etc.) and the entrypoint uses this to generate the haproxy configuration with the main and backup servers. This means that, when the application attempts to contact the MySQL server, it connects to the proxy. The proxy then primarily attempts to connect to the local VPS, but falls back to the others.
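
The generated configuration ends up looking something like this (IPs and ports are placeholders; mode tcp since this is raw MySQL traffic):

listen mysql
        bind 127.0.0.1:3306
        mode tcp
        option tcp-check
        # Prefer the MySQL instance on the local VPS...
        server local <local VPS tinc IP>:3306 check
        # ...and only fall back to the other cluster members if it is down
        server vps2 <VPS2 tinc IP>:3306 check backup
        server vps3 <VPS3 tinc IP>:3306 check backup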

Backups

Backups of the MySQL cluster are critical, so to achieve a high granularity, several layers of backups are performed:

  • Daily full dumps are performed on each host.
  • Transaction archive logs are transferred to a backup directory every 5 minutes. A cron job rotates the transaction archive log, meaning that the archive file is always generated/rotated at this interval (sketched after this list).
  • Using a custom backup tool, a remote backup server takes snapshots every 15 minutes from each of the VPSs, capturing the full daily backup, as well as the archive logs. These are also stored independently in different directories, using hard links between the backups, meaning that any changes to these files will not overwrite pre-existing versions.
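
As a sketch of the 5-minute rotation, assuming the transaction archive logs are MariaDB binary logs (paths and the binlog basename are placeholders):

# /etc/cron.d/mysql-binlog-archive - rotate and copy the binary logs every 5 minutes
*/5 * * * * root mysql -e "FLUSH BINARY LOGS" && rsync -a /var/lib/mysql/mariadb-bin.[0-9]* /var/backups/mysql/binlogs/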

GlusterFS

Due to some applications requiring shared data (such as wordpress assets) and other applications needing alternative databases (which end up being run within docker), there is a need for shared storage across the VPSs.

To achieve this, GlusterFS is used within the cluster. The setup is very simple: GlusterFS is configured on each VPS, connects to each of the other VPSs (using the VPN) and the volume is mounted on each of the VPSs for use by the docker containers.
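
The whole setup is only a handful of commands (host names, brick paths and the mount point are placeholders):

# Join the other VPSs into the trusted pool (over the VPN)
gluster peer probe vps2
gluster peer probe vps3

# Create and start a 3-way replicated volume
gluster volume create shared replica 3 \
        vps1:/data/gluster/shared vps2:/data/gluster/shared vps3:/data/gluster/shared
gluster volume start shared

# Mount it on each VPS for the docker containers to use
mount -t glusterfs localhost:/shared /mnt/gluster/shared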

Writes on this are relatively slow, but this is generally used for static assets (which are read much more frequently than they are written to) or databases that perform more read operations than writes. Note that these databases run scaled to a single container.

xinetd

For non-HTTP connections, such as those for Git over SSH, xinetd is used to forward TCP connections from each of the VPSs to the relevant server over the OpenVPN connection.
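
A sketch of one of these forwards (the service name, port and target address are placeholders):

# /etc/xinetd.d/git-ssh - forward incoming TCP connections to the git server over the VPN
service git-ssh
{
        type        = UNLISTED
        socket_type = stream
        protocol    = tcp
        port        = 2222
        wait        = no
        user        = nobody
        redirect    = <git server OpenVPN IP> 22
}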

Whilst this in itself works fine, the result is that the traffic has both SNAT and DNAT applied to it. The SNAT is required because, without it, the destination server would attempt to route the return traffic through the firewall’s main gateway (being a completely separate internet connection). This means that the destination server sees all traffic as coming from one of the VPS instances, so employing tools like fail2ban is very difficult.

To combat this, GRE tunnels were implemented. Each VPS is configured with a GRE interface for each of the services that it provides over xinetd.

The “destination” server (the server running the service that is being forwarded to) runs a GRE interface per-VPS. Each of these GRE tunnels uses a small /30 subnet and, on the destination server, has an independent routing table for each of the interfaces, with a “default gateway” (i.e. 0.0.0.0/0) of the GRE interface on the relevant VPS, e.g.:

# GRE interface on "destination server" for VPS1
ip tunnel add gre1 mode gre local $DESTINATION_SERVER_IP remote $VPS_OPENVPN_IP ttl 255
ip addr add $DESTINATION_SERVER_GRE_IP/30 dev gre1
ip link set gre1 up
# Policy routing: anything sourced from this tunnel's /30 uses its own table ("GRE1"
# must exist in /etc/iproute2/rt_tables), whose default route is the VPS end of the tunnel
ip rule add from $GRE_NETWORK_ADDRESS/30 table GRE1
ip route add default via $VPS_GRE_INTERFACE_IP table GRE1
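
For completeness, the matching end on the VPS is the mirror image (a sketch, reusing the variable names from the block above):

# GRE interface on VPS1 towards the "destination server"
ip tunnel add gre1 mode gre local $VPS_OPENVPN_IP remote $DESTINATION_SERVER_IP ttl 255
ip addr add $VPS_GRE_INTERFACE_IP/30 dev gre1
ip link set gre1 up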

Xinetd then forwards traffic to the GRE interface address of the destination server.

As a result, the traffic no longer needs SNATing - the destination server is able to route the return traffic back through the same GRE interface, which then routes it back to the VPS that the traffic came from. Since the traffic now appears to come from its real source IP, fail2ban can start denying requests from these IPs.

Useful tools

Icinga

Icinga (or more specifically icinga2) is used for monitoring the VPSs, ensuring each service is running, checking that DNS records return the expected results, disk space, MySQL replication, GlusterFS replication etc.

This has proved to be a very useful tool - it allows templating hundreds of checks against a “VPS host template” and setting up each VPS with a set of variables (public IP addresses, VPN IP addresses etc.), which the checks are then performed against.
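
As a sketch of what that templating looks like (the check command comes from the standard Icinga Template Library; the host variable names are placeholders of my own):

apply Service "dns-blog" {
  import "generic-service"
  check_command = "dns"

  // Check the public record resolves to this VPS's secondary (consul) public IP
  vars.dns_lookup           = "blog.mattsbit.co.uk"
  vars.dns_expected_answers = host.vars.public_ip_secondary

  assign where host.vars.role == "vps"
}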

Though not specific to Icinga, checks are also set up for services that should not work - for example, checking that MySQL is not accessible via the public IPs. This makes performing package updates and IPTables changes less risky, with sanity checks to ensure something hasn’t been opened up to the world.

Conclusion

This setup has gradually evolved over the past 15 years and, touch wood, works quite well. VPS instance failure is fairly common (every couple of months) and does show that this high availability works. Recovery from instance failure generally requires no manual intervention (both from short connection issues to full reboots).

Entire cluster reboots also generally work well - though docker swarm has seen a corruption in metadata, resulting in a new swarm needing to be created (though not much more than some pipeline executions to get the stacks rebuilt).

Working with iptables with 6+ interfaces on a host is certainly sometimes interesting, but I’ve found having critical paths (allowing particular public ports and then denying all other traffic from those interfaces) allows for a much safer solution.

I think this stack certainly pushes the VPSs to their limit and, for a total of ~$25/month, I’m certainly getting my money’s worth!
