Vault anti-patterns

Introduction

The Vault anti-patterns highlighted in this document are sourced from lessons learned by practitioners operating Vault in the field. As a Vault administrator, you can help keep your Vault environments healthy by avoiding these anti-patterns.

Anti-patterns

Description	Applicable Vault edition
Not adjusting the default lease time	All
Not using entities for accurate client count	Enterprise, HCP
Limiting IOPS	Enterprise, Community
Production clusters with no disaster recovery	Enterprise
Not testing disaster recovery solution	Enterprise
Slow upgrade cadence	Enterprise, Community
Upgrading Vault without proper testing	Enterprise, Community
Not rotating audit device logs	Enterprise, Community
Poor metrics or no telemetry data	Enterprise, Community
No baseline of activity or usage data	Enterprise, Community
Using the root token for routine actions	All
Not rekeying Vault after key-holders exit	All

Not adjusting the default lease time

The default lease time in Vault is 32 days or 768 hours. This time allows for some operations, such as re-authentication or renewal. See lease documentation for more information.

Potential issue:

If you create leases without changing the default time-to-live (TTL), leases will live in Vault until the default lease time is up. Depending on your infrastructure and available system memory, using the default or long TTL may cause performance issues as Vault stores leases in memory.

Solution:

You should tune the lease TTL value for your needs. Vault holds leases in memory until the lease expires. We recommend keeping TTLs as short as the use case will allow.

Note

Tuning or adjusting TTLs does not retroactively affect tokens that were issued. New tokens must be issued after tuning TTLs.
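To see why shorter TTLs reduce memory pressure, the following is a rough back-of-the-envelope sketch, not a Vault API call. It assumes leases are created at a constant rate and are never revoked early, so the number of live leases is roughly the creation rate multiplied by the TTL. The function name and the workload of 10 leases per minute are hypothetical.

```python
# Rough steady-state estimate of how many leases Vault holds at once.
# Assumption: leases are created at a constant rate and none are revoked
# early, so live leases ~= creation rate * TTL.

DEFAULT_TTL_HOURS = 768  # Vault's default lease time (32 days)


def estimated_live_leases(leases_per_minute: float, ttl_hours: float) -> int:
    """Return the approximate number of unexpired leases at steady state."""
    return round(leases_per_minute * 60 * ttl_hours)


if __name__ == "__main__":
    rate = 10  # hypothetical workload: 10 new leases per minute
    for ttl in (DEFAULT_TTL_HOURS, 24, 1):
        leases = estimated_live_leases(rate, ttl)
        print(f"TTL {ttl:>3}h -> ~{leases:,} live leases")
```

Under these assumptions, the same workload holds about 460,800 leases at the default TTL and about 600 leases at a one-hour TTL.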

Not using entities for accurate client count

Each Vault client may have multiple accounts with the auth methods enabled on the Vault server.

Potential issue:

A client is counted as a new identity each time it logs in with another auth method that is not linked to the user's entity.

Solution:

Since each token adds to the client count, and each unique authentication issues a token, it is best to use identity entities to create aliases that connect each login to a single identity.
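The effect of aliases on the client count can be illustrated with a toy model. This sketch does not call Vault; the logins, entity names, and the `count_clients` helper are all hypothetical, and the model only captures the idea that logins mapped to the same entity count once.

```python
# Toy model of client counting; it does not call Vault.
# Hypothetical logins: (auth method, account name on that method).
logins = [
    ("userpass", "bob"),
    ("ldap", "bsmith"),
    ("github", "bob-smith"),
    ("userpass", "alice"),
]

# Hypothetical alias table: each (auth method, account) maps to one entity.
aliases = {
    ("userpass", "bob"): "entity-bob",
    ("ldap", "bsmith"): "entity-bob",
    ("github", "bob-smith"): "entity-bob",
    ("userpass", "alice"): "entity-alice",
}


def count_clients(logins, aliases):
    """Count distinct identities; unmapped logins each count separately."""
    identities = set()
    for login in logins:
        # Fall back to the login itself when no alias links it to an entity.
        identities.add(aliases.get(login, login))
    return len(identities)


without_aliases = count_clients(logins, {})
with_aliases = count_clients(logins, aliases)
print(f"without aliases: {without_aliases}, with aliases: {with_aliases}")
```

In this model the four logins count as four clients without aliases and as two clients once Bob's three accounts are linked to a single entity.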

Limiting IOPS

IOPS (input/output operations per second) measures performance for Vault cluster members. Vault is bound by the IO limits of the storage backend rather than the compute requirements.

Potential issue:

Limiting IOPS can have a significant performance impact.

Solution:

Use the HashiCorp reference guidelines for hardware sizing and network considerations for Vault servers.

Note

The Transform (Enterprise) and Transit secret engines can be resource intensive depending on the client count.

Production clusters with no disaster recovery

HashiCorp Vault's highly available (HA) Integrated Storage (Raft) backend provides intra-cluster data replication across cluster members. Integrated Storage provides Vault with horizontal scalability and failure tolerance, but it does not provide backup for the entire cluster. Not utilizing disaster recovery for your production environment will negatively impact your organization's Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

Potential issue:

If catastrophic failure occurs, there will be downtime and cost associated with not serving Vault clients in your environment.

Solution:

For cluster-wide issues (e.g., loss of network connectivity), Vault Enterprise Disaster Recovery (DR) replication provides a warm standby cluster containing all primary cluster data. The DR cluster does not service reads or writes, but you can promote it to replace the primary cluster when needed.

We also recommend that you periodically create data snapshots to protect against data corruption.
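One way to reason about snapshot frequency is to compare the age of the most recent snapshot with your RPO. The sketch below is an illustration only: the function name, the four-hour RPO, and the timestamps are hypothetical, and it assumes you record when each snapshot was taken.

```python
from datetime import datetime, timedelta, timezone


def snapshot_meets_rpo(last_snapshot, rpo, now=None):
    """Return True if the latest snapshot is recent enough to satisfy the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_snapshot <= rpo


# Hypothetical values: a 4-hour RPO and a snapshot taken 6 hours ago.
rpo = timedelta(hours=4)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_snapshot = now - timedelta(hours=6)

if not snapshot_meets_rpo(last_snapshot, rpo, now):
    print("Latest snapshot is older than the RPO; take snapshots more often.")
```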

Not testing disaster recovery solution

Your disaster recovery (DR) solution is a key part of your overall disaster recovery plan. Designing and configuring your Vault disaster recovery solution is only the first step. You also need to validate the DR solution. Not doing so can negatively impact your organization's Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

Potential issue:

If you don't test your disaster recovery solution, your key stakeholders will not feel confident that they can effectively perform the disaster recovery plan. Testing the DR solution removes uncertainty about whether the DR plan will recover the system during an outage.

Solution:

Vault's Disaster Recovery (DR) replication mode provides a warm standby for failover if the primary cluster experiences catastrophic failure. You should periodically test the disaster recovery replication cluster by completing the failover and failback procedure.

It is important to have standard operating procedures for restoring a Vault cluster from a snapshot. The restoration methods following a DR situation would be in response to data corruption or sabotage, which Disaster Recovery Replication might be unable to protect against.

Standard procedure for restoring a Vault cluster

Slow upgrade cadence

While it might be easy to upgrade Vault whenever you have capacity, not having a frequent upgrade cadence can impact your Vault performance and security.

Potential issue:

Missing patches for bugs or vulnerabilities as documented in the CHANGELOG.
Missing new features that improve workflows.
Having to use version-specific documentation rather than the latest documentation.
Some educational resources require a specific minimum Vault version.
Updates may require a stepped approach that uses an intermediate version before installing the latest binary.

Solution:

We recommend upgrading to our latest version of Vault. Subscribing to the releases in Vault's GitHub repository and to notifications from HashiCorp Vault Discuss will notify you when a new version of Vault is available.

Upgrading Vault without proper testing

We recommend testing Vault in a sandbox environment before deploying to production. Although it might be faster to upgrade immediately in production, testing will help identify any compatibility issues.

Be aware of the CHANGELOG and account for any new features, improvements, known issues and bug fixes in your testing.

Potential issue:

Without adequate testing before upgrading in production, you risk compatibility and performance issues. This could lead to downtime or degradation in your production Vault environment.

Solution:

Test new Vault versions in sandbox environments before upgrading in production and follow our upgrading documentation. We recommend adding a testing phase to your standard upgrade procedure.

Not rotating audit device logs

Audit devices in Vault maintain a detailed log of every authenticated request and response. If you allow the logs for audit devices to grow perpetually without rotating them, you may face a blocked audit device.

Potential issue:

Vault will not respond to requests when no available (enabled) audit devices can record them. If the audit log is not maintained and rotated over time, it can consume all available local storage.

Solution:

Inspect and rotate audit logs periodically.
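As a minimal sketch of what rotation involves for a file audit device on a Unix-like host, the example below renames the current log and then sends SIGHUP to the Vault process so that it reopens the log file. The function name, the timestamp suffix, and the way the process ID is obtained are assumptions; in practice a tool such as logrotate usually performs these steps.

```python
import os
import signal
import time
from pathlib import Path
from typing import Optional


def rotate_audit_log(log_path: Path, vault_pid: Optional[int] = None) -> Path:
    """Rename the audit log with a timestamp suffix and ask Vault to reopen it.

    Assumptions: a file audit device writes to log_path, and vault_pid is the
    Vault server's process ID. Both are supplied by the caller.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    rotated = log_path.with_name(f"{log_path.name}.{stamp}")
    # After the rename, Vault keeps writing to the renamed file until it
    # reopens the original path.
    os.rename(log_path, rotated)
    if vault_pid is not None:
        # SIGHUP asks Vault to close and reopen the audit log file,
        # so new entries go to a fresh file at log_path.
        os.kill(vault_pid, signal.SIGHUP)
    return rotated
```

The rotated file can then be compressed or shipped to long-term storage. The signal is optional in this sketch so that the rename logic can be tried on a scratch file without a running Vault server.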

Poor metrics or no telemetry data

Solely relying on Vault operational logs and data in Vault UI will give you a partial picture of how the cluster performs.

Potential issue:

Having a partial insight into cluster activity can leave the business in a reactive state.

Solution:

Continuous monitoring will allow organizations to detect minor problems and promptly resolve them. Migrating from reactive to proactive monitoring will help to prevent system failures. Vault has multiple outputs that help monitor the cluster's activity: audit logs, operational logs, and telemetry data. This data can work with a SIEM (security information and event management) tool for aggregation, inspection, and alerting capabilities.

Adding a monitoring solution:

Note

Vault logs to standard output and standard error by default. This is automatically captured by the systemd journal. Vault operational logs can also be directed to any file.

No baseline of activity or usage data

A baseline can provide insight into current utilization and thresholds. Telemetry metrics are valuable, especially when monitored over time. You can use telemetry metrics to gather a baseline of cluster activity, while alerts allow you to see when abnormal activity is present.

Potential issue:

This issue is closely linked to the poor metrics anti-pattern. Telemetry data is only held in memory for a short period of time.

Solution:

Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions and saved for aggregation and inspection.
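Once telemetry is stored outside Vault, a baseline can be as simple as the mean and standard deviation of a metric over a reference period. The sketch below is illustrative only: the sample values, the three-sigma threshold, and the helper names are assumptions, and real alerting is usually configured in the metrics platform itself.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, sigmas=3.0):
    """Flag values more than `sigmas` standard deviations from the mean."""
    avg, sd = baseline
    return abs(value - avg) > sigmas * sd


# Hypothetical requests-per-second readings exported from a metrics store.
history = [120, 118, 125, 130, 122, 119, 127, 124]
baseline = build_baseline(history)

for reading in (126, 400):
    status = "abnormal" if is_abnormal(reading, baseline) else "normal"
    print(f"{reading} req/s is {status}")
```

With this sample history, a reading of 126 falls within the baseline and a reading of 400 is flagged as abnormal.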

Using the root token for routine actions

When you initialize a Vault server, it emits an initial root token that gives root-level access across all Vault features.

Potential issue:

Root tokens can perform all actions within Vault and never expire. Unrestricted access can give users higher privileges than necessary to all Vault operations and paths. There is a security risk with sharing and providing access to a root token.

Solution:

We recommend revoking the root token after initializing Vault within your environment. If elevated access is required, create policies that grant access to the proper paths in Vault. If the root token is required, only keep the token for the shortest time needed to operate.

Not rekeying Vault after key-holders exit

Vault's unseal keys are distributed to stakeholders. A quorum of keys is needed to unlock Vault based on your initialization settings.

Potential issue:

If multiple stakeholders leave the organization, there is a risk that the remaining key-holders will not have enough keys to meet the quorum.

Solution:

Vault supports rekeying. Depending on the seal type, the process will differ.
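The quorum risk can be tracked with simple arithmetic. In the sketch below, the key-holder names, the 5-share and 3-key threshold setup, and the `remaining_margin` helper are hypothetical. It assumes each holder has exactly one key share.

```python
def remaining_margin(threshold: int, holders: set, departed: set) -> int:
    """Return how many more key-holders can leave before quorum is lost.

    A negative result means the remaining holders can no longer reach quorum.
    """
    available = holders - departed
    return len(available) - threshold


# Hypothetical setup: 5 key shares, 3 required to unseal.
holders = {"ana", "ben", "chen", "dee", "eli"}

print(remaining_margin(3, holders, {"ben"}))
print(remaining_margin(3, holders, {"ben", "dee"}))
print(remaining_margin(3, holders, {"ben", "dee", "eli"}))
```

Because rekeying itself requires a quorum of the existing keys, it should happen while the margin is still zero or greater.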
