Occasional blog posts from a random systems engineer

Two-Rant-Saturday: Golang, dependencies and Vault

Over the past couple of days, a couple of things have annoyed me, and they have resulted in one lesson learned and one success!

Dependency OOMs

I’ve been doing a bit more work on my virtual machine agent, and when I tried to build it I began getting OOM errors.

I followed some of the failures (it was a strong assumption that the library it was failing to build was the culprit, rather than the straw that broke the camel’s back). Among other things, the agent embeds a HashiCorp consul client. However, to create a consul client, the application needs to build most of the consul codebase… which includes a tool called go-discover (also a HashiCorp library). This has quite a few modules named “aws”, “gcp”, … the list goes on. Each of these pulls in the SDK for its cloud provider.

The dependency chain was:

vm agent
  → consul/agent
    → go-discover
      → provider/aws
        → aws-sdk-go-v2/service/ec2
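
A chain like this can be dug out of a dependency graph mechanically. Below is a minimal sketch, assuming the graph is available as "parent child" pairs, one per line, which is the shape `go mod graph` prints at module level. The function names and the sample edges are made up for illustration and simply mirror the chain above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseGraph reads "parent child" pairs, one per line (the shape printed
// by `go mod graph`), into an adjacency list.
func parseGraph(text string) map[string][]string {
	graph := make(map[string][]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		graph[fields[0]] = append(graph[fields[0]], fields[1])
	}
	return graph
}

// findChain does a breadth-first search from root and returns the shortest
// chain ending at the first node whose name contains target, or nil if
// nothing matches.
func findChain(graph map[string][]string, root, target string) []string {
	parent := map[string]string{root: ""}
	queue := []string{root}
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		if strings.Contains(node, target) {
			// Walk the parent links back to the root to rebuild the chain.
			var chain []string
			for n := node; n != ""; n = parent[n] {
				chain = append([]string{n}, chain...)
			}
			return chain
		}
		for _, next := range graph[node] {
			if _, seen := parent[next]; !seen {
				parent[next] = node
				queue = append(queue, next)
			}
		}
	}
	return nil
}

func main() {
	// Illustrative edges only; real input would come from `go mod graph`.
	sample := `vm-agent consul/agent
vm-agent some/other/dep
consul/agent go-discover
go-discover provider/aws
go-discover provider/gcp
provider/aws aws-sdk-go-v2/service/ec2`
	chain := findChain(parseGraph(sample), "vm-agent", "aws-sdk-go-v2")
	fmt.Println(strings.Join(chain, " -> "))
}
```

It only reports the first (shortest) chain it finds, so a dependency that is reachable through several parents will still show up after one route has been cut.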

So I spent a solid two hours trying to exclude just the AWS SDK. First attempt: run go mod vendor and stub out the AWS provider in the vendor directory. I created a simple stub file that just returned “provider is disabled in this build”.

When I ran go mod vendor to check what was happening, the session died from OOM. That should have warned me what was coming.

Next, I decided to try a different approach. I removed the vendor directory and created a third_party directory instead, using Go’s replace directive to point to my stubbed version:

replace github.com/hashicorp/go-discover/provider/aws => ./third_party/github.com/hashicorp/go-discover/provider/aws
replace github.com/hashicorp/go-discover/provider/gcp => ./third_party/github.com/hashicorp/go-discover/provider/gcp

I stubbed out both AWS and GCE providers in the third_party directory.

But the build still failed. The problem was that my third_party directory didn’t have proper go.mod files—it was just source files. So I cloned the upstream go-discover repo instead to use as a proper module replacement.

Here’s where things went wrong. The cloned repo contained a full vendor directory with all the cloud provider SDKs. I now had a 5.1MB vendor directory with 200+ AWS SDK v1 files, plus the AWS SDK v2 that consul was pulling in. The replace directive was redirecting the package path for my code, but Go was still compiling everything in the vendor directory of the third_party package.

I’d made the problem worse. The replace directive affects my codebase, but it doesn’t affect third_party packages or those included in the build tree. The AWS SDK was still being compiled because go-discover’s vendor directory was being included in the build.

Next approach: vendor directory stubbing. I figured I could stub out the AWS packages directly in the vendor directory. Created stubs for the go-discover provider and a few consul AWS dependencies.

The build failed immediately. I found out that consul itself has direct AWS imports. It’s not just go-discover—consul’s CA providers, auth methods, and envoy extensions all import AWS SDK directly. Stubbing one level wasn’t enough.

Next attempt: build tags, Go’s idiomatic way to exclude code. I created a //go:build noaws tag, updated the Makefile, and created stub modules with their own go.mod files for all the AWS dependencies. Then I added replace directives to go.mod pointing to my stubs.

It still failed. The build kept compiling AWS SDK packages and eventually died with OOM. When I checked the dependency graph, the packages were still there.

At that point I started wondering why the AWS SDK was still being compiled. I discovered that consul v1.22.6 has direct imports of AWS SDK packages. When code imports “github.com/aws/aws-sdk-go-v2/aws”, Go must compile that package. Replace directives change where a module’s source comes from, but they can’t make a package disappear when something in the build imports it directly.

I found 7 consul files with direct AWS SDK imports—CA providers, auth methods, Lambda extensions. Build tags on my stub files don’t affect consul’s source files. The only way to exclude them is to add build tags to consul’s source directly.
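
Finding those direct imports doesn’t need a full build. Here is a rough sketch, assuming you feed it the contents of each source file: it uses the standard go/parser package in imports-only mode and reports any import path with a given prefix. The function name and the sample file are hypothetical and not copied from consul.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// importsWithPrefix parses only the import block of a Go source file and
// returns the import paths that start with prefix.
func importsWithPrefix(src, prefix string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, spec := range file.Imports {
		// Import paths are stored as quoted string literals.
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		if strings.HasPrefix(path, prefix) {
			hits = append(hits, path)
		}
	}
	return hits, nil
}

func main() {
	// Hypothetical file contents, made up for illustration.
	src := `package example

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)
`
	hits, err := importsWithPrefix(src, "github.com/aws/")
	if err != nil {
		panic(err)
	}
	fmt.Println(hits)
}
```

Run over every .go file in a source tree, something like this gives the list of files that would each need their own build tag.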

So I downloaded the full consul v1.22.6 source (~200MB) and added build tags to the AWS-dependent files. But even that didn’t work because I found more uses of AWS and Azure packages throughout the codebase.

After messing about for quite some time, I eventually decided to rip consul out entirely (the functionality was mostly being replaced in a newer iteration anyway). I had originally built the PoC of the agent with the Consul agent code included; with all of it removed, the binary dropped from ~120MB to 8MB. Geesh.


Vault HCL Templates

I’ve been setting up Vault PKI with JWT authentication, and I wanted to do something that seemed reasonable: use JWT claims to dynamically shape PKI issuance. JWT identifies a machine and carries hostname and IP address, Vault stores these claims in identity metadata, and the PKI role uses that identity to issue a certificate with the right common_name and SANs.

Seemed straightforward. I tried putting this in the PKI role:

allowed_parameters = {
  common_name = ["{{identity.entity.metadata.hostname}}.example.com"]
  ip_sans = ["{{identity.entity.metadata.ipv4_address}}", "127.0.0.1"]
}

It didn’t work. The values resolved to empty strings.

Then I tried identity.entity.name instead of metadata. That was also empty. I tried alias metadata—empty. Token metadata—also empty.

Looking at the entity behind the token:

{
  "request_id": "b511d99e-a3f0-3669-1d25-b2b5c696b487",
  "lease_id": "",
  "lease_duration": 0,
  "renewable": false,
  "data": {
    "aliases": [
      {
        "canonical_id": "35bc7868-d567-5da7-0ff0-c996111d5ac3",
        "creation_time": "2026-04-10T04:34:31.333776115Z",
        "custom_metadata": null,
        "id": "7b4aaadb-1944-891a-7817-4901772f10c7",
        "last_update_time": "2026-06-13T11:11:01.471764394Z",
        "local": false,
        "merged_from_canonical_ids": null,
        "metadata": {
          "hostnamen": "myvm",
          "ipaddr": "172.16.54.23",
          "role": "vm-basic"
        },
        "mount_accessor": "auth_jwt_24631e32",
        "mount_path": "auth/virtualmachine_auth/",
        "mount_type": "jwt",
        "name": "3d0603fa-6698-4e0f-80da-53977297838b"
      }
    ],
    "creation_time": "2026-04-10T04:34:31.333757574Z",
    "direct_group_ids": [],
    "disabled": false,
    "group_ids": [],
    "id": "35bc7868-d567-5da7-0ff0-c996111d5ac3",
    "inherited_group_ids": [],
    "last_update_time": "2026-04-10T04:34:31.333757574Z",
    "merged_entity_ids": null,
    "metadata": null,
    "name": "entity_6bf0b8f3",
    "namespace_id": "root",
    "policies": []
  },
  "warnings": [],
  "mount_type": "identity"
}

The metadata was clearly there, at least on the alias; the entity’s own metadata field was null.
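
To make it obvious where the claims actually ended up, here is a small sketch that decodes a trimmed copy of that response with encoding/json and prints the entity-level and alias-level metadata separately. The struct and function names are my own, and only the fields of interest are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entityResponse models just the parts of the response we care about.
type entityResponse struct {
	Data struct {
		Name     string            `json:"name"`
		Metadata map[string]string `json:"metadata"`
		Aliases  []struct {
			MountType string            `json:"mount_type"`
			Metadata  map[string]string `json:"metadata"`
		} `json:"aliases"`
	} `json:"data"`
}

// parseEntity decodes the raw JSON response into entityResponse.
func parseEntity(raw []byte) (entityResponse, error) {
	var resp entityResponse
	err := json.Unmarshal(raw, &resp)
	return resp, err
}

func main() {
	// Trimmed copy of the response above.
	raw := []byte(`{"data": {"name": "entity_6bf0b8f3", "metadata": null,
	  "aliases": [{"mount_type": "jwt",
	    "metadata": {"hostnamen": "myvm", "ipaddr": "172.16.54.23", "role": "vm-basic"}}]}}`)

	resp, err := parseEntity(raw)
	if err != nil {
		panic(err)
	}
	// A JSON null decodes to a nil map, so this reports zero keys.
	fmt.Println("entity metadata keys:", len(resp.Data.Metadata))
	for _, alias := range resp.Data.Aliases {
		for key, value := range alias.Metadata {
			fmt.Printf("alias (%s) metadata: %s=%s\n", alias.MountType, key, value)
		}
	}
}
```

It prints the keys exactly as they are stored, which makes them easy to compare against the names used in a template.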

The only thing that worked was hardcoding:

allowed_parameters = {
  common_name = ["mymachine.example.com"]
}

This works, but defeats the entire purpose. I can’t hardcode every hostname for hundreds of VMs, and the whole point of using JWT tokens, writing an API on the hypervisor to generate them, and building an agent was that it should be deployable to any VM.

I started digging. I found PR #10682 from HashiCorp, which attempted to add token metadata interpolation in ACL policy templates but was closed without being merged. I tried to follow it and some other issues where HashiCorp seemed to be against it, but I’d woken up at 4AM and couldn’t get my head around their objections.

The maintainer’s reasoning seemed to be: “This change is dangerous… a user can create child tokens with arbitrary metadata.”

Turns out HashiCorp explicitly rejected the feature I was trying to use because of security concerns. They decided token metadata is not trustworthy for policy evaluation — it’s user-influenced and can be copied into child tokens, which would break ACL trust boundaries.

Fair enough. But then I discovered another limitation. Vault policy templating isn’t just restricted—it’s split into two fundamentally different worlds.

This is what worked:

path "pki/issue/{{identity.entity.name}}" {
  capabilities = ["update"]
}

But this didn’t work:

allowed_parameters = {
  common_name = ["{{identity.entity.name}}.example.com"]
}

Identity interpolation is only supported in policy path keys; it’s not a general expression system for all policy fields. This is documented as “Policy authors can pass in a policy path containing double curly braces as templating delimiters”, but it’s easy to miss the implication that nothing else does.
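
To illustrate the behaviour I ran into, here is a toy model in Go. To be clear, this is not Vault’s code and the type and function names are invented; it only mimics what I observed, which is that the template is substituted in the path and everything under allowed_parameters is treated as a literal string.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a toy stand-in for one policy stanza. It is NOT Vault's type.
type rule struct {
	Path              string
	AllowedParameters map[string][]string
}

// render substitutes {{key}} placeholders in the path only, mimicking the
// observed behaviour. AllowedParameters is deliberately left untouched.
func render(r rule, identity map[string]string) rule {
	pairs := make([]string, 0, len(identity)*2)
	for key, value := range identity {
		pairs = append(pairs, "{{"+key+"}}", value)
	}
	out := r
	out.Path = strings.NewReplacer(pairs...).Replace(r.Path)
	return out
}

func main() {
	r := rule{
		Path: "pki/issue/{{identity.entity.name}}",
		AllowedParameters: map[string][]string{
			"common_name": {"{{identity.entity.name}}.example.com"},
		},
	}
	// Hypothetical identity values for illustration.
	rendered := render(r, map[string]string{"identity.entity.name": "myvm"})

	fmt.Println("path:       ", rendered.Path)
	fmt.Println("common_name:", rendered.AllowedParameters["common_name"][0])
}
```

In this model the path comes out as pki/issue/myvm while the common_name constraint still contains the raw braces, so it can never match a real hostname.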

I’ve hit this exact issue multiple times over the past 3 years. Every time I want to do something secure and scalable, I discover that Vault’s policy templating is not suitable — no matter how much it looks like it could/should be.

Granted, I didn’t spend much time reading about their rationale this time around, because I’d already spent too long trying to fix my policy before realising the limitation, but my understanding is:

  • Policy content should be “static guardrails”
  • Metadata can’t be trusted because it can be user-defined when creating tokens

The latter doesn’t seem right: if metadata really couldn’t be trusted, allowing its use within paths would be absurd too. And JWT claims are there to be trusted (otherwise they are pointless), so to me it’s either a design flaw in Vault (not having a trustable set of attributes that JWT claims can be associated with) or a lack of trust in policy authors. If the policies with the dynamic content are only assignable to JWT auth mechanisms and don’t allow the creation of tokens, or a policy is only usable because it hard-references the accessor of the JWT auth mount, then I can design my policies and roles in such a way that it is secure.

I can’t see this being much different from AWS IAM, where policies can use conditional statements that reference the IAM entity. Of course, if you give users enough permissions to create their own identities and fake the attributes, then you have a problem, but the feature should not be absent for such a reason.

Given this is purely for my homelab, I think this feature would be immensely useful.

So I did what I tend to do when I disagree… I forked it.

Using AI assistance, I was able to add support for identity interpolation in PKI role parameter constraints in a couple of minutes, including the missing interpolation logic and test coverage.

Now I have a fork, a CI/CD pipeline, a compiled binary and a changeset that supports the thing I’ve been trying to do for three years.

The question is… do I just replace it on one of the live servers?