Systems Engineering an agent thanks to the world of Golang
My homelab and work life have followed very different trajectories, though they often influence one another. I like to try out interesting thought experiments at home and see how they work out, to decide whether they’re worth investing in.
Right now, my homelab consists of a load of VMs, and my initial goal was to find a new way of monitoring them (as so often happens, this small task led to a big spray of other tasks).
I had decided to ditch Icinga and custom monitoring solutions and unify on node_exporter. Not because it’s better; the actual reason isn’t too important for this post. What it takes to get there, however, is what I want to talk about. My existing Prometheus-esque stack leans heavily on Consul service discovery to find node_exporter targets, and I liked it. The problem was that most of my VMs weren’t integrated with Consul (until now, it was only used in a specific HashiCorp corner of my stack).
So, my standard approach to this had been very static: deploy consul-template, which bootstrapped a Vault agent, which in turn allowed another consul-template to bootstrap a Consul client.
Whilst deploying this iteration of the stack, I had moved heavily to Docker. And this isn’t “Docker for deploying applications”; this is using Docker to deploy everything after the initial cloud-init bootstrap of a VM. However, the setup was pretty static and quite cumbersome: tonnes of roles and policies, then more roles and more policies (completely isolated for each application: consul-template, the Vault agent, etc.). Pre-provisioning one of these VMs meant literally hundreds of Terraform resources and, whilst that was fine for that part of the stack, for a simple monitoring integration I wanted something more dynamic.
So… after being jealous of the fact that cloud providers give VMs identity management that can then be used in all sorts of ways, I wrote a small JWT service that exposes a metadata endpoint to the VMs. This meant every VM could query a well-known IP for a token and use it to bootstrap itself further.
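To make that concrete, here’s a minimal sketch of the VM-side call. The IP and path are stand-ins of my own invention; the only real point is that it’s a plain HTTP GET against a well-known address:

```go
import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchJWT asks the metadata service for an identity token.
// The address and path are illustrative, not the real endpoint.
func fetchJWT() (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://169.254.169.254/v1/identity/token")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("metadata endpoint returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}
```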
Once this was done, I set up basic configuration in Vault and Consul to let a VM authenticate with that JWT, obtain a Consul token, and register itself and, for now, node_exporter as a service.
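The exchange itself is just the standard Vault JWT login followed by a read from the Consul secrets engine. Here’s a rough sketch using the official `github.com/hashicorp/vault/api` client; the mount paths and role names (`vm-bootstrap`, `consul/creds/vm-agent`) are assumptions, not my actual layout:

```go
import (
	"fmt"

	vault "github.com/hashicorp/vault/api"
)

// exchangeJWT trades the metadata token for a Consul ACL token.
// Mount paths and role names below are placeholders.
func exchangeJWT(jwt string) (string, error) {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		return "", err
	}

	// Log in via Vault's JWT auth method.
	secret, err := client.Logical().Write("auth/jwt/login", map[string]interface{}{
		"role": "vm-bootstrap",
		"jwt":  jwt,
	})
	if err != nil {
		return "", err
	}
	client.SetToken(secret.Auth.ClientToken)

	// The Consul secrets engine hands out a short-lived ACL token.
	creds, err := client.Logical().Read("consul/creds/vm-agent")
	if err != nil {
		return "", err
	}
	token, ok := creds.Data["token"].(string)
	if !ok {
		return "", fmt.Errorf("no token in Consul secrets response")
	}
	return token, nil
}
```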
VM Setup
Now, we come to the part I want to talk about… the VM. As I said, historically this meant deploying a bunch of containers and, for what I was trying to achieve, also node_exporter. This had a lot of downsides:
- lots of Terraform to maintain (especially deploying to ~50 separate machines)
- lots of dependencies: it meant pre-provisioning configs onto the machine using Terraform, which I’m not that keen on (a means to an end)
- tonnes of inter-dependencies — all of the containers relied on each other, in a long chain
This seems to be a common thing, though: people layer tonnes of custom services onto a machine just to bootstrap the basic base services, before even setting it up for the purpose it’s being deployed for. Running everything under systemd seems no better: lots of packages deploying interconnected services that end up barely tested (or even testable!), where health checking means watching systemd units and pulling their logs, and where most of these services were written by someone else.
So so so! After reading about Microsoft’s mess, and probably taking inspiration on a complete tangent from the author’s intent, I decided: you don’t need to be a big corporation to own your own agents.
Goal
Since more and more of the tools I use are written in Golang, I started wondering: why not just write a single application—a single binary(!)—that could handle all the grunt work for a VM? One binary, one log stream, one thing to keep up—simple.
This is what happens when a VM starts:
VM starts -> Calls JWT metadata endpoint -> gets JWT -> Authenticates to Vault -> Fetches Consul token -> Starts embedded Consul agent -> Registers itself + node_exporter -> Starts token renewal loop
After a bit of experimenting (and a little help from AI; my personal time has become limited), I ended up with a single deployable binary. It exposes exactly one configuration option: whether to register node_exporter in Consul. Everything else (domains, datacenters, SSL, etc.) is hardcoded. Why? Because these aren’t variables; they’re constants. This massively reduced configuration drift, Terraform complexity, and failure modes.
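To give a feel for how narrow that surface ends up, here’s a sketch of the single knob (the flag name is illustrative, not the real one):

```go
import "flag"

// The only knob the binary exposes; everything else is a constant.
// flag.Parse() happens in main().
var registerNodeExporter = flag.Bool(
	"register-node-exporter", true,
	"register node_exporter as a Consul service on this VM",
)
```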
| Problem | Old Approach | New Approach |
|---|---|---|
| Bootstrap complexity | Terraform + consul-template + Vault agent | Single Go binary |
| Dependencies | Multiple containers / services | None (self-contained) |
| Identity | Pre-provisioned | Dynamic (JWT endpoint) |
| Failure modes | Chain failures | Single process |
| Observability | Many logs | One log stream |
Opinionated by design
One of the biggest shifts: I stopped making everything configurable. I no longer have configs littered everywhere with domains, datacenters and TLS settings. It’s all hardcoded, which is great, because my stack (and, in general, most stacks) is configured in one particular way, so within the system these values are constants. This decision alone simplified deployment, reduced errors, and made the agent much easier to reason about.
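In code, that amounts to nothing more exciting than a const block (values here are placeholders rather than my real ones):

```go
// These never vary across the stack, so they live in the binary
// rather than in config files scattered over every VM.
const (
	vaultAddr  = "https://vault.example.internal:8200"
	consulDC   = "dc1"
	baseDomain = "example.internal"
	caCertPath = "/etc/agent/ca.pem"
)
```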
What’s even better: I design stacks where each part performs a purpose. The purpose of consul-consul-template-vault-proxy is to provide a token for the consul-template that bootstraps Consul, and the only way you’d ever know is that the containers share a mount point… oh yes, super obvious. But now, the Go version:
```go
if err := a.fetchJWT(); err != nil {
	return err
}
if err := a.authenticateVault(); err != nil {
	return err
}
if err := a.fetchConsulSecrets(); err != nil {
	return err
}
if err := a.generateClientCert(); err != nil {
	return err
}
if err := a.setupConsulAgent(); err != nil {
	return err
}
// ...
```
This transparency in how we’re using tools and how we expect them to work is incomparable.
Embedding services instead of deploying them
At this point, I had a machine registered in Consul, but node_exporter wasn’t running yet. Normally, I’d spin up another container, write a systemd unit or handle a bunch of configs. But… I just couldn’t bring myself to do it.
Instead, I realised: why not embed node_exporter directly into my single Go binary? After a quick experiment:
```go
go nodeexporter.Main()
```
That’s it. No containers, no systemd units, no config files. node_exporter just runs as part of the agent.
And this is where I feel blessed that node_exporter and so many other tools in the industry have been converging on Go. Most seem to have fairly obvious and public entrypoints, meaning they can easily be called directly in a goroutine, letting an agent orchestrate them.
So instead of fighting with interdependent services, Docker mounts, or Terraform-heavy pre-provisioning, I just composed the service into my agent with everything in one binary, one process, one log stream.
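The orchestration doesn’t need to be clever either. This isn’t my actual supervisor, just a minimal sketch of the pattern: run each embedded service as a goroutine and surface the first failure:

```go
import "fmt"

// runAll starts each embedded service in its own goroutine and
// blocks until the first one fails. Deliberately tiny: a real
// agent would layer restarts and backoff on top of this.
func runAll(services map[string]func() error) error {
	errs := make(chan error, len(services))
	for name, run := range services {
		go func(name string, run func() error) {
			if err := run(); err != nil {
				errs <- fmt.Errorf("%s: %w", name, err)
			}
		}(name, run)
	}
	return <-errs
}
```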
Before vs After
Before, deploying node_exporter looked like this:
```
VM
├── consul-template (configs + Docker)
├── Vault agent (configs + Docker)
├── Consul client (configs + Docker)
├── node_exporter (separate)
└── Docker glue
```
After embedding:
```
VM
└── Single Go binary
    ├── JWT auth
    ├── Vault login
    ├── Consul agent (embedded goroutine)
    ├── node_exporter (embedded goroutine)
    ├── Token renewal loop (embedded goroutine)
    └── Supervisor monitoring the goroutines
```
Not only this… I literally had a single ~100MB binary (without even trying to strip unnecessary dependencies), which is already several times smaller than the combination of tools I was using before. And as soon as anything expired, it just rotated and kept working. It didn’t need Docker bind mounts that sometimes cause issues because of rshared/rprivate blah blah… it just worked.
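For the curious, the renewal side is equally unexciting. A simplified sketch of the kind of loop involved, again using the Vault API client, with the interval and backoff trimmed down for illustration:

```go
import (
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

// renewLoop keeps the Vault token alive until told to stop.
// The fixed interval and lack of backoff are simplifications.
func renewLoop(client *vault.Client, stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, err := client.Auth().Token().RenewSelf(0); err != nil {
				log.Printf("vault token renewal failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}
```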
Next steps
So I want to go further… I want to pull my logging stack in too (hey, half of those tools are written in Go, so why not).
But before then, I’ll package it nicely, add some solid ways of deploying it (somewhere between cloud-init and Ansible), build in good monitoring and quality-of-life features to make sure it stays up, and look at the failure modes.
But hey, I’ve gone from managing 5 daemons (hopefully soon growing coverage to 6–8) to a single one, all tied to an authentication endpoint that is now natively available on every VM.