
Securing Gitlab Nomad deployments using Vault, Consul and Traefik


Preamble

This blog post was written in-flight during a quest to create a secure mechanism for deploying Terraform projects to Vault, Consul and Nomad.

The beginning portion was written whilst attempting to use a technique that ended up failing.

Feel free to skip this portion and jump to “Using JWT authentication”

Intro

For the Hashicorp stack of my homelab, I have:

  • Vault cluster
  • Consul cluster (single DC)
  • Nomad servers
  • Nomad clients using multiple datacenters

An offline root CA, intermediate CA and terraform state are stored/managed by Minio (a local S3-compatible alternative) and a separate, isolated Vault cluster.

Since most of the deployment requires a high level of privilege - generating certificates from the intermediate PKI, handling root/bootstrap tokens, etc. - the core of this stack is deployed using Terraform… locally.

Given that Vault allows internal certificates to easily be signed securely (which was harder to do with the pre-existing PKI provided by FreeIPA), I can now use verifiable end-to-end encryption for internal traffic.

Historically, web traffic for normal applications running on legacy (Rancher-based) Kubernetes clusters would be:

Inbound Traffic --(letsencrypt cert)--> Internal/External HA proxy --(HTTP)--> Rancher node ingress (same host header match) --> application

Goals

With the cluster running, I wish to deploy a proxy (Traefik) and applications on top of the nomad clients.

I will split this into two parts - specific configuration for Traefik and then configurations for general application deployment, though these will have overlap.

The deployments of these should be capable of being automated using either a Terraform Cloud alternative, such as Terrarun (https://github.com/MatthewJohn/terrarun) or Terrakube (https://github.com/AzBuilder/terrakube), or something like Jenkins/Gitlab pipelines. These must be given as few privileges as possible, but should also be confined at deploy time to the current deployment context.

The entrypoint of the deployment should require as few customisations as possible - for example: the deployment has a single secret (or a vault agent with a pre-defined approle) and the vault URL. Using this, everything else (consul/nomad details and tokens) will be obtained.

In an example of this stack (https://github.com/MatthewJohn/vault-nomad-consul-terraform), I created a pre-deployment “service role” module (https://github.com/MatthewJohn/vault-nomad-consul-terraform/tree/main/modules/service_role), which would be a controlled way of defining permissions for application deployments - each one is given an approle with specific permissions for the deployment. The actual deployment is given the approle details and everything else is obtained from this. During this investigation, I would like to avoid doing this, so that new applications can be deployed without prior setup.

The final goal around traffic will be to ensure that all traffic is end-to-end encrypted, with all traffic verifying the SSL certs.

Vault policy metadata

  • Create an approle for the deployment
  • Provide metadata to the approle when obtaining the secret_id, containing the cluster and the service name
  • Use a single templated policy (sketched below)
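
The idea, sketched below, is to lean on Vault's ACL policy templating, so that a single policy scopes access using the metadata attached to the approle's secret_id. The mount accessor and secret path here are illustrative, not taken from the real deployment:

# A single templated policy: metadata set when generating the approle
# secret_id surfaces as entity alias metadata, scoping each deployment
# to its own secret paths.
# "auth_approle_xxxxxxxx" stands in for the real approle mount accessor.
path "service_secrets_kv/data/global/dc1/{{identity.entity.aliases.auth_approle_xxxxxxxx.metadata.DeploymentServiceName}}/*" {
  capabilities = ["read"]
}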

Getting started

I’ll start by defining some common templates, which makes this easier:

  • Vault deployment policy/approle: nomad-deployment-job-${var.nomad_region.name}-${var.nomad_datacenter.name}-${var.name}, e.g.: nomad-deployment-job-global-dc1-my-test-app
  • Vault nomad policy/approle: nomad-job-${var.nomad_region.name}-${var.nomad_datacenter.name}, e.g.: nomad-job-global-dc1
  • Vault secrets: ${var.vault_cluster.service_secrets_mount_path}/${var.nomad_region.name}/${var.nomad_datacenter.name}/${var.name}, e.g.: service_secrets_kv/global/dc1/my-test-app/some-secret
  • Consul policy: nomad-job-${var.nomad_region.name}-${var.nomad_datacenter.name}-${var.name}, e.g.: nomad-job-global-dc1-my-test-app
  • Nomad deployment job policy/role: nomad-deployment-job-${var.nomad_region.name}-${var.nomad_datacenter.name}-${var.name}

The flow will be:

  • Terraform uses a generic approle to authenticate to vault
  • Terraform uses this approle to authenticate to the new approle, specifying metadata for the deployment context whilst generating the secret_id
  • Terraform generates a token using the token role, which is provided to nomad during deployment
  • Nomad is provided with the vault policy for the application, which it is able to generate tokens for

Let’s start with the vault policy. We’ll take arguments defining the nomad region/datacenter, since we’ll have an approle/policy for each of these:

locals {
  vault_deployment_policy_role = "nomad-deployment-${var.nomad_region.name}-${var.nomad_datacenter.name}"
}

resource "vault_policy" "deployment_policy" {
  name = local.vault_deployment_policy_role

  policy = <<EOF
# Allow tokens to look up their own properties
path "auth/token/lookup-self" {
    capabilities = ["read"]
}

# Allow tokens to renew themselves
path "auth/token/renew-self" {
    capabilities = ["update"]
}

# Allow tokens to revoke themselves
path "auth/token/revoke-self" {
    capabilities = ["update"]
}

# Allow a token to look up its own capabilities on a path
path "sys/capabilities-self" {
    capabilities = ["update"]
}

# Provide privileges for Terraform to be able to create a vault token
# as per https://registry.terraform.io/providers/hashicorp/vault/latest/docs/resources/token
path "auth/token/lookup-accessor" {
  capabilities = ["update"]
}
path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}
EOF
}

Next, let’s define some metadata for the approle:

{
    "DeploymentServiceName": "my-test-app"
}

Simple!
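
In Terraform, this metadata can be attached when generating the secret_id - a minimal sketch, assuming a corresponding deployment approle exists with the name from the locals above (the approle resource itself isn't shown here):

resource "vault_approle_auth_backend_role_secret_id" "deployment" {
  backend   = var.nomad_region.approle_mount_path
  role_name = local.vault_deployment_policy_role

  # Metadata is tied to this secret_id and surfaces as entity alias
  # metadata on tokens issued with it
  metadata = jsonencode({
    DeploymentServiceName = "my-test-app"
  })
}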

Now we can create the vault roles and policies for the initial terraform authentication and for the deployment:

locals {
  vault_terraform_policy_role  = "nomad-terraform-${var.nomad_region.name}-${var.nomad_datacenter.name}"
  vault_deployment_policy_role = "nomad-deployment-${var.nomad_region.name}-${var.nomad_datacenter.name}"
  vault_nomad_policy_role      = "nomad-submit-${var.nomad_region.name}-${var.nomad_datacenter.name}"
  vault_job_policy_role        = "nomad-job-${var.nomad_region.name}-${var.nomad_datacenter.name}"
}

# Policy that will be attached to the application
resource "vault_policy" "application_policy" {
  name = local.vault_job_policy_role

  policy = <<EOF

EOF
}

# Policy for token that will be provided to nomad to perform deployment
resource "vault_policy" "nomad_policy" {
  name = local.vault_nomad_policy_role

  policy = <<EOF
path "auth/token/lookup-self"
{
  capabilities = [ "read" ]
}

EOF
}

# Create vault auth role to allow the token
# to pass the role to nomad for the application
resource "vault_token_auth_backend_role" "nomad" {
  role_name              = local.vault_nomad_policy_role

  allowed_policies       = [
    vault_policy.nomad_policy.name
  ]
  orphan                 = true
  token_period           = "86400"
  renewable              = true
  path_suffix            = local.vault_nomad_policy_role
}

# Create vault policy for deployment
resource "vault_policy" "deployment_policy" {
  name = local.vault_deployment_policy_role

  policy = <<EOF
...

# Allow creation of token using role
path "auth/token/create/${vault_token_auth_backend_role.nomad.role_name}"
{
  capabilities = [ "create", "update", "sudo" ]
}

EOF
}

resource "vault_token_auth_backend_role" "deployment" {
  role_name              = local.vault_deployment_policy_role

  allowed_policies       = [
    vault_policy.deployment_policy.name
  ]
  orphan                 = true
  token_period           = "86400"
  renewable              = true
  path_suffix            = local.vault_deployment_policy_role
}

resource "vault_policy" "terraform_policy" {
  name = local.vault_terraform_policy_role

  policy = <<EOF
# Allow creation of token using role
path "auth/token/create/${vault_token_auth_backend_role.deployment.role_name}"
{
  capabilities = [ "update" ]
}

# Required to allow terraform to generate token.
path "auth/token/lookup-accessor" {
  capabilities = ["update"]
}
path "auth/token/revoke-accessor" {
  capabilities = ["update"]
}
EOF
}

resource "vault_approle_auth_backend_role" "terraform" {
  backend        = var.nomad_region.approle_mount_path
  role_name      = local.vault_terraform_policy_role
  token_policies = [
    # Policies provided to deployment role
    # to perform actions in terraform to perform deployment
    vault_policy.terraform_policy.name,
  ]
}

At this point, I wanted to lock down the deployment token to only be able to generate a “nomad submit” token with the same metadata:

resource "vault_policy" "deployment_policy" {
  name = local.vault_deployment_policy_role

  policy = <<EOF
...

# Allow creation of token using role
path "auth/token/create/${vault_token_auth_backend_role.nomad.role_name}"
{
  capabilities = [ "update" ]

  # Limit to only being able to generate token, providing the same meta information as the approle
  # required_parameters = ["meta"]
  allowed_parameters = {
    "meta" = [{"DeploymentServiceName" = "{{identity.entity.metadata.DeploymentServiceName}}"}]
  }
}

EOF
}

However, it appears this is not supported and will not be supported: https://github.com/hashicorp/vault/pull/13715

Realising this, the whole plan had come to an end…

Using JWT authentication

Gitlab natively supports OIDC job authentication (https://docs.gitlab.com/ee/ci/secrets/id_token_authentication.html).

I haven’t come across anyone talking about this (outside of specifically searching for it) and it’s an obscenely underrated feature.

This completely removes the chicken-and-egg problem from handling secure authentication during deployments.

It’s worth noting that Gitlab does have a “vault” pipeline integration (https://docs.gitlab.com/ee/ci/examples/authenticating-with-hashicorp-vault/), which is an enterprise-only feature. But don’t let this put you off - since we’re using Terraform for deployments, there’s no need for the native “secret” integration! Though, I must say that Gitlab are kind enough to provide me with an Enterprise license for the open-source projects that I work on, so thank you for this!! They are a great company and provide a wonderful product, which is well worth paying for (no, they didn’t ask me to say this)!

No longer do you need:

  • Shared secrets that would allow any deployment to interact with resources created by other projects
  • To create secrets for each deployment and then move these into each application’s project variables

The basic way that this works is:

  • You provide Gitlab with the URL of the audience (aud) for the OIDC token
  • Gitlab signs the JWT for a deployment, providing this as an environment variable
  • (In our case), the job authenticates to vault using this token
  • Vault verifies the token against Gitlab’s public keys

Gitlab config

In each of the Gitlab projects, we add the following templated configuration, integrating with the Gitlab Terraform template:

variables:
  VAULT_ROLE: my-test-app-deployment
  TF_STATE_NAME: my-test-app

terraform_deploy:
  variables:
    VAULT_ADDR: https://vault.svc.example.internal:8200
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://vault.svc.example.internal:8200
  before_script:
    # Install vault
    - wget "http://archives.example.internal/hashicorp/vault_1.16.1_linux_amd64.zip"
    - unzip vault_1.16.1_linux_amd64.zip
    # Vault login
    - export VAULT_TOKEN="$(./vault write -field=token auth/gitlab_jwt/login role=$VAULT_ROLE jwt=$VAULT_ID_TOKEN)"
  extends: .terraform:deploy
  script:
    - gitlab-terraform plan
    - gitlab-terraform plan-json
    - gitlab-terraform apply
  environment:
    name: $TF_STATE_NAME
    action: start

Initially, for simplicity, vault is downloaded at runtime, so that the upstream docker images can be used as-is. A role in vault is defined per application, which is authenticated against during deployment. This generates a vault token, which is exposed to the Terraform plan/apply steps.

Vault config

To set up vault for this, we need to configure a new JWT auth backend:

resource "vault_jwt_auth_backend" "gitlab" {
  description = "Gitlab JWT auth backend"
  path        = "gitlab_jwt"
  type        = "jwt"

  oidc_discovery_url = "https://mygitlab.example.internal"
  bound_issuer       = "https://mygitlab.example.internal"
}

We simply provide the Gitlab URL and that’s it!

Project configuration

Whilst it would be possible to automatically detect a claim from the token and use this as a variable in a set of pre-defined generic policies, this doesn’t help with providing minimal access - some applications require more or fewer permissions, and this would only be achievable by granting a wider permission set to all applications.

Instead, I’ve opted to still create a set of policies for each application. As before, this would include:

  • Vault token obtained by Gitlab, which would be used by Terraform for authenticating to vault
  • A consul backend role and policy, which could be used by Terraform to obtain a token for setting up the service (service intentions etc.) - see the sketch after this list
  • A nomad backend role and policy, which would be used by Terraform for creating the job
  • A consul and vault backend role and policy, which would be provided to Nomad during the job submission.
  • A consul and vault policy that would be provided to the application for runtime permissions.
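
As an example of one of these, the consul backend role used by the Terraform deployment might look roughly like the following - the mount path, naming and policy are assumptions for illustration:

resource "vault_consul_secret_backend_role" "terraform" {
  backend = "consul_dc1" # hypothetical consul secrets engine mount

  name = "nomad-deployment-${var.nomad_datacenter.name}-${var.service_name}"

  # Consul ACL policies granted to tokens generated through this role
  consul_policies = ["nomad-job-${var.nomad_datacenter.name}-${var.service_name}"]

  # Keep generated tokens short-lived
  ttl = 300
}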

The terraform for managing these project permissions sets up the required roles/policies and generates a role in the Gitlab JWT backend:

resource "vault_jwt_auth_backend_role" "gitlab" {
  count = var.vault_cluster.gitlab_jwt_auth_backend_path != null ? 1 : 0

  backend        = var.vault_cluster.gitlab_jwt_auth_backend_path
  role_name      = "${var.nomad_datacenter.name}-${var.service_name}"
  token_policies = [vault_policy.terraform_policy.name]

  bound_claims = {
    project_path = var.gitlab_project_path
  }
  user_claim             = "user_email"
  role_type              = "jwt"
  token_explicit_max_ttl = 300
}

I opted to use the project_path claim for authentication rather than project_id. The reasons for this are that:

  • the configuration is less prone to error - I can see that “my-project/my-app” is correct for the service role for “my-app”; project_id = 5231, on the other hand, is not obviously correct,
  • all of these projects are internal,
  • the owner of the project is aware of the importance of their project name/path,
  • project paths cannot be stolen unless the original project changes its name (which is covered by the previous point).

I created a simple-to-use Terraform module, which is called for each project to create these. Additional policy statements can be passed in to provide extra permissions on a per-application basis.

After creating the required resources, the module creates a vault secret, containing all of the required details, including application name, vault mounts, role/policy names, domain names. The initial Terraform vault token uses this to bootstrap the remainder of the deployment configuration:

my-application/terraform/backend.tf:
provider "vault" {
  address          = "https://vault.svc.example.internal:8200"
  ca_cert_file     = data.external.root_cert.result.path
  skip_child_token = true
}

data "vault_kv_secret_v2" "config" {
  mount = "deployment_secrets_kv"
  name  = "konvad/services/global/dc1/auth-proxy-int-jenkins"
}

locals {
  config = nonsensitive(merge(
    data.vault_kv_secret_v2.config.data,
    {
      "consul" = jsondecode(data.vault_kv_secret_v2.config.data.consul)
      "nomad"  = jsondecode(data.vault_kv_secret_v2.config.data.nomad)
      "vault"  = jsondecode(data.vault_kv_secret_v2.config.data.vault)
    }
  ))
}

# Obtain consul token from vault consul engine
data "vault_generic_secret" "consul_token" {
  path = "${local.config.vault_consul_engine_path}/creds/${local.config.vault_consul_role_name}"
}

# Login to consul using token from consul engine from vault
provider "consul" {
  address    = local.config.consul.address
  datacenter = local.config.consul.datacenter
  token      = data.vault_generic_secret.consul_token.data["token"]
  ca_pem     = local.config.consul.root_cert_public_key
}

# Obtain nomad token from vault nomad engine
data "vault_generic_secret" "nomad_token" {
  path = "${local.config.vault_nomad_engine_path}/creds/${local.config.vault_nomad_role_name}"
}

provider "nomad" {
  address   = local.config.nomad.address
  region    = local.config.nomad.region
  secret_id = data.vault_generic_secret.nomad_token.data["secret_id"]
  ca_pem    = local.config.nomad.root_cert_public_key
}

From here, we’re able to perform the remainder of the deployment, setting up intentions, Vault secrets (etc.) and creating the Nomad job.
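
As an illustration of the intentions step, a service intention allowing traefik to reach the application over the mesh might look like this - the service names here are assumptions rather than the module's actual output:

# Allow the traefik proxy to connect to the application over the mesh
resource "consul_config_entry" "intention" {
  kind = "service-intentions"
  name = "nomad-job-dc1-test-app" # hypothetical application service name

  config_json = jsonencode({
    Sources = [{
      Name   = "traefik" # hypothetical traefik consul service name
      Action = "allow"
    }]
  })
}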

Traefik - allowing Host headers and protecting service communication

The Host header should be configurable by the application, as the backing application may require the Host header to be set to the front-end domain.

End user --SSL--> HA Proxy --SSL--> traefik --mesh--> App

DNS Configuration

To get DNS working, a separate domain was set up, configured as alt_domain on the consul hosts. On the core internal DNS servers, the root domain (matching the root of the alt_domain) was set up with NS records for each of the consul hosts, and a forwarder was configured (since the core DNS servers are recursive nameservers).
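
For reference, a minimal sketch of the relevant Consul agent settings - the ingress domain value is an assumption based on the rest of this post:

# Excerpt from the Consul agent configuration (HCL)
# The primary DNS domain remains the default
domain = "consul"

# Dedicated ingress domain, answered by Consul DNS alongside the primary domain
alt_domain = "app.svc.example.internal"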

PKI configuration

Since we’re now using an alternative domain for applications, which is dedicated to ingress, PKI roles for each of the traefik instances can be created, which allow certificates to be issued for the alt_domain:
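
A minimal sketch of such a role, assuming a hypothetical intermediate PKI mount and datacenter-scoped naming:

resource "vault_pki_secret_backend_role" "traefik" {
  backend = "pki_svc_intermediate"                 # hypothetical PKI mount path
  name    = "traefik-${var.nomad_datacenter.name}" # hypothetical role name

  # Only allow certificates to be issued under the dedicated ingress alt_domain
  allowed_domains             = ["service.${var.nomad_datacenter.name}.app.svc.example.internal"]
  allow_subdomains            = true
  allow_wildcard_certificates = true

  max_ttl = "168h"
}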


Proxy configurations

Traefik config

locals {
  service_domain = "web.service.dc1.consul.example.internal"
}

job "${var.service_role.name}" {
  # Configure as a system job, so that
  # it runs on each of the nomad clients,
  # meaning that we can use DNS for each of the services
  # running on the nomad clients and traefik will, naturally,
  # be running on the host on a static port.
  type        = "system"
  ...
  group "traefik" {
    ...
    task "server" {
      ...
      config {
        ...
        args = [
          ...

          # Expose all consul services by default, using consul connect
          "--providers.consulcatalog.connectByDefault=true",
          "--providers.consulcatalog.connectAware=true",
          "--providers.consulcatalog.exposedByDefault=true",

          # Specify the traefik service name and the tag prefix for traefik configurations
          "--providers.consulcatalog.servicename=${var.service_role.consul_service_name}",
          "--providers.consulcatalog.prefix=traefik",
          # Watch for changes
          "--providers.consulcatalog.watch=true",

          # Connection details to consul
          "--providers.consulcatalog.endpoint.address=${var.service_role.consul.address_wo_protocol}",
          "--providers.consulcatalog.endpoint.scheme=https",
          "--providers.consulcatalog.endpoint.datacenter=${var.service_role.consul.datacenter}",

          # Allow applications to determine if they are published via traefik.
          "--providers.consulcatalog.constraints=Tag(`traefik-routing`)",
          # We allow the "authoritative" domain, using the designated consul service
          # and the main "service" domain OR allowing a custom header,
          # which uses the "authoritative" consul name.
          # Since services cannot specify their own consul service name,
          # this isolates routing and stops services from overlapping (intentionally or otherwise).
          "--providers.consulcatalog.defaultRule=Host(`{{ .Name }}.${local.service_domain}`) || Headers(`consulservice`, `{{ .Name }}`)",
        ]
      }
    }
  }
}
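
For an application to be picked up by this, its Nomad service just needs to carry the constraint tag - a minimal sketch, with an illustrative service name and port label:

service {
  name = "nomad-job-dc1-test-app" # illustrative consul service name
  port = "http"

  # Matches the consulcatalog constraint configured above
  tags = ["traefik-routing"]

  # Register into the service mesh, since traefik is connect-aware
  connect {
    sidecar_service {}
  }
}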

With traefik now successfully deployed, along with a test application, I can access the application from the HA proxy instance:

user@internal-ha-proxy:~# curl https://nomad-job-dc1-test-app.service.dc1.app.svc.example.internal/headers -k
{"Host": "nomad-job-dc1-test-app.service.dc1.app.svc.example.internal", "User-Agent": "curl/7.52.1", "Accept": "*/*", "X-Forwarded-For": "172.16.2.2", "X-Forwarded-Host": "nomad-job-dc1-test-app.service.dc1.app.svc.example.internal", "X-Forwarded-Port": "443", "X-Forwarded-Proto": "https", "X-Forwarded-Server": "07c3b9df4e61", "X-Real-Ip": "172.16.2.2", "Accept-Encoding": "gzip"}

Next is to set up a test HAProxy config… in this example, the test application is test-app:

  • consul service: nomad-job-external-cluster-test-app
  • consul DNS domain: service.dc1.app.svc.example.internal
  • front-end domain: testapp.dockstudios.co.uk

HAProxy config:

backend nomad-test-app
    http-request set-header Connection keep-alive
    # This is optional, as it will be the host header set by the HAProxy frontend
    http-request set-header Host testapp.dockstudios.co.uk
    http-request set-header X-Forwarded-Proto https
    http-request set-header consulservice nomad-job-external-cluster-test-app
    option ssl-hello-chk

    option httpchk GET /health
    http-check send hdr Host nomad-job-dc1-test-app.service.dc1.app.svc.example.internal
    http-check expect status 200

    server traefik nomad-job-dc1-test-app.service.dc1.app.svc.example.internal:443 ssl verify required ca-file /etc/haproxy/internal_root_cert.pem check

Initially, I was seeing:

Apr 11 06:51:47 inthetz haproxy[9837]: Server nomad-test-app/test-app-traefik is DOWN, reason: Layer6 invalid response, info: "SSL handshake failure", check duration: 9ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

Using curl and openssl, I verified that the certificate was correct:

root@inthetz:~# openssl s_client -showcerts -servername nomad-job-dc1-test-app.service.dc1.app.svc.example.internal -connect nomad-job-dc1-test-app.service.dc1.app.svc.example.internal:443 -CAfile /etc/haproxy/internal_root_cert.pem </dev/null | grep Verification
depth=3 O = DockStudios Ltd, CN = DockStudios Root CA
verify return:1
depth=2 CN = Dockstudios Vault Intermediate
verify return:1
depth=1 C = GB, ST = Hampshire, L = United Kingdom, O = Dock Studios Ltd, OU = SVC, CN = DockStudios SVC Intermediate CA
verify return:1
depth=0 CN = *.service.dc1.app.svc.example.internal
verify return:1
Verification: OK
DONE
root@inthetz:~# curl https://nomad-job-dc1-test-app.service.dc1.app.svc.example.internal/health --cacert /etc/haproxy/internal_root_cert.pem
ok

After removing the option ssl-hello-chk, combining the host header into the httpchk option (due to an older version of HAProxy) and disabling SSL verification, it started working:

backend nomad-test-app
    http-request set-header Connection keep-alive
    # This is optional, as it will be the host header set by the HAProxy frontend
    http-request set-header Host testapp.dockstudios.co.uk
    http-request set-header X-Forwarded-Proto https
    http-request set-header consulservice nomad-job-dc1-test-app

    option httpchk GET /health HTTP/1.1\r\nHost:\ nomad-job-dc1-test-app.service.dc1.app.svc.example.internal
    http-check expect status 200

    server test-app-traefik nomad-job-dc1-test-app.service.dc1.app.svc.example.internal:443 ssl verify none ca-file /etc/haproxy/internal_root_cert.pem check

After some work, I realised the check must use check-ssl, and during testing I also added the sni configuration:

	server test-app-traefik nomad-job-dc1-test-app.service.dc1.app.svc.example.internal:443 ssl verify required ca-file /etc/haproxy/internal_root_cert.pem check-ssl sni str(nomad-job-dc1-test-app.service.dc1.app.svc.example.internal)

Now, removing the ca-file causes the check to fail - ensuring that the SSL certificate is correctly being validated.

Checking host headers

At this point, we can access the application through the chain of proxies and take a look at a header test endpoint:

➜  test-app git:(main) ✗ curl https://testapp.dockstudios.co.uk/headers
{"Host": "testapp.dockstudios.co.uk", "User-Agent": "curl/7.68.0", "Accept": "*/*", "consulservice": "nomad-job-dc1-test-app", "X-Forwarded-For": "172.16.85.12", "X-Forwarded-Host": "testapp.dockstudios.co.uk", "X-Forwarded-Port": "443", "X-Forwarded-Proto": "https", "X-Forwarded-Server": "07c3b9df4e61", "X-Real-Ip": "172.16.85.12", "Accept-Encoding": "gzip"}

PERFECT! :D

If you’d like to see more about my Hashicorp setup with Vault, Consul and Nomad, see https://github.com/MatthewJohn/vault-nomad-consul-terraform
