The DevOps Toolchain
The toolchain is not DevOps. You can buy every tool on this page and still have a dysfunctional organization. But you cannot practice DevOps without tools -- automation requires something to automate with. This page maps the major categories of the DevOps toolchain, the dominant tools in each category, and the principles for choosing between them.
The Categories
Every DevOps toolchain covers the same set of concerns, regardless of which specific tools are used:
Category             Purpose                          Key Tools
--------------------------------------------------------------------------
Source Control       Track code changes               Git, GitHub, GitLab
CI/CD                Build, test, deploy              GitHub Actions, GitLab CI, Jenkins
Containers           Package applications             Docker, Podman
Orchestration        Run containers at scale          Kubernetes, Nomad, ECS
IaC                  Provision infrastructure         Terraform, Pulumi, CloudFormation
Configuration Mgmt   Configure servers                Ansible, Chef, Puppet
Monitoring           Collect & visualize metrics      Prometheus, Grafana, Datadog
Logging              Aggregate & search logs          ELK Stack, Loki, Splunk
Tracing              Follow requests across services  Jaeger, Zipkin, OpenTelemetry
Secrets              Manage sensitive data            HashiCorp Vault, AWS Secrets Manager
Artifact Storage     Store build outputs              Docker registries, Nexus, Artifactory
Source Control: Git
Git is the universal standard. The question is not whether to use Git but where to host it.
GitHub is the default for open source and increasingly for enterprises. GitHub Actions (CI/CD) is tightly integrated. GitHub Copilot and the broader ecosystem make it the center of many developer workflows.
GitLab is a single platform that bundles source control, CI/CD, container registry, and more. Organizations that want a single vendor for the entire pipeline often choose GitLab.
Bitbucket is common in Atlassian shops (paired with Jira) but has less momentum than GitHub or GitLab.
# The basics never change
git clone git@github.com:org/repo.git
git checkout -b feature/add-caching
# ... make changes ...
git add -A
git commit -m "Add Redis caching layer"
git push origin feature/add-caching
# Open a pull request, get it reviewed, merge
The branching strategy matters more than the hosting platform. Trunk-based development (short-lived branches, frequent merges to main) is strongly correlated with high-performing teams in the DORA research.
CI/CD: GitHub Actions & GitLab CI
Continuous Integration (CI) means every commit is built and tested automatically. Continuous Delivery (CD) means every commit that passes CI is deployable. Continuous Deployment goes further: every passing commit is automatically deployed.
GitHub Actions
# .github/workflows/ci.yml
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go build ./...
      - run: go test ./...
GitLab CI
# .gitlab-ci.yml
stages:
  - build
  - test
build:
  stage: build
  image: golang:1.22
  script:
    - go build ./...
test:
  stage: test
  image: golang:1.22
  script:
    - go test ./...
Jenkins still exists in many organizations but is losing ground. It requires significant maintenance (plugins, updates, server management) and its Groovy-based pipeline syntax is notoriously painful. If you are starting fresh, choose GitHub Actions or GitLab CI.
Containers: Docker
Docker packages an application and its dependencies into a portable image. Containers are not virtual machines -- they share the host kernel and are much lighter weight.
FROM golang:1.22-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /server .
FROM alpine:3.19
COPY --from=build /server /server
EXPOSE 8080
CMD ["/server"]
Podman is a Docker-compatible alternative that runs without a central daemon and can run containers as an unprivileged (rootless) user. It is gaining adoption in security-conscious environments.
Docker images are stored in registries: Docker Hub (public), Amazon ECR, Google Artifact Registry, GitHub Container Registry, or self-hosted registries like Harbor.
Orchestration: Kubernetes
Kubernetes manages containers at scale: scheduling, scaling, networking, load balancing, and self-healing. It is the dominant orchestration platform.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: ghcr.io/org/api-server:v1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
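The `100m` and `128Mi` values use Kubernetes quantity notation: `m` means thousandths of a CPU core, and `Mi` means mebibytes (2^20 bytes). The Go sketch below converts the values from the manifest above. It is deliberately simplified: the helper names `parseMilliCPU` and `parseBytes` are hypothetical, and the real Kubernetes parser also accepts decimal suffixes such as `M` and `G`, fractional values like `0.5`, and more.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMilliCPU converts a CPU quantity such as "100m" or "2" to millicores.
// Simplified: only whole numbers, with or without the "m" suffix.
func parseMilliCPU(q string) (int64, error) {
	if n, ok := strings.CutSuffix(q, "m"); ok {
		return strconv.ParseInt(n, 10, 64)
	}
	cores, err := strconv.ParseInt(q, 10, 64)
	if err != nil {
		return 0, err
	}
	return cores * 1000, nil
}

// binaryUnits maps power-of-two memory suffixes to their multipliers.
var binaryUnits = map[string]int64{"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30}

// parseBytes converts a memory quantity such as "128Mi" to bytes.
// Simplified: whole numbers with an optional Ki/Mi/Gi suffix.
func parseBytes(q string) (int64, error) {
	for suffix, mult := range binaryUnits {
		if n, ok := strings.CutSuffix(q, suffix); ok {
			v, err := strconv.ParseInt(n, 10, 64)
			return v * mult, err
		}
	}
	return strconv.ParseInt(q, 10, 64)
}

func main() {
	req, _ := parseMilliCPU("100m")
	lim, _ := parseMilliCPU("500m")
	mem, _ := parseBytes("128Mi")
	fmt.Printf("cpu: request %.1f cores, limit %.1f cores\n", float64(req)/1000, float64(lim)/1000)
	fmt.Printf("memory: request %d bytes\n", mem)
}
```

For the manifest above this reports a CPU request of 0.1 cores, a limit of 0.5 cores, and a memory request of 134217728 bytes. The scheduler uses requests to decide which node has room for the pod. At runtime, a container that exceeds its CPU limit is throttled, while one that exceeds its memory limit is OOM-killed.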
Alternatives: HashiCorp Nomad (simpler, supports non-container workloads), Amazon ECS (AWS-native, less operational overhead), and Docker Swarm (simpler, but with little momentum today). For most organizations, the choice is between Kubernetes and a managed container service like ECS or Cloud Run.
Infrastructure as Code: Terraform
Terraform lets you define infrastructure (servers, databases, networks, DNS records) in declarative configuration files. You describe the desired state; Terraform figures out how to get there.
# main.tf
provider "aws" {
  region = "us-east-1"
}
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0" # AMI IDs are region-specific; look up a current one for your region
  instance_type = "t3.micro"
  tags = {
    Name = "web-server"
  }
}
resource "aws_s3_bucket" "assets" {
  bucket = "my-app-assets" # S3 bucket names must be globally unique
}
terraform init # Download providers
terraform plan # Preview changes
terraform apply # Apply changes
terraform destroy # Tear down everything
Alternatives: Pulumi (IaC using real programming languages like TypeScript or Python), AWS CloudFormation (AWS-only), and Google Cloud Deployment Manager (GCP-only). Terraform's strength is multi-cloud support and a massive provider ecosystem.
Monitoring: Prometheus & Grafana
Prometheus collects time-series metrics by scraping HTTP endpoints. Grafana visualizes them. Together they form the most common open-source monitoring stack.
# prometheus.yml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['api-server:8080']
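The scrape config above assumes the service exposes metrics at `/metrics` (Prometheus's default `metrics_path`) in the Prometheus text exposition format. Most Go services would use the official `prometheus/client_golang` library for this. To stay dependency-free, the sketch below hand-writes the format for a single counter. The metric name `http_requests_total` and the `server`/`routes` names are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// server holds metric state; the counter is atomic because handlers run concurrently.
type server struct {
	requests atomic.Int64
}

// routes wires a sample API route plus the /metrics endpoint Prometheus scrapes.
func (s *server) routes() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		s.requests.Add(1)
		fmt.Fprintln(w, "ok")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		fmt.Fprintln(w, "# HELP http_requests_total Total HTTP requests handled.")
		fmt.Fprintln(w, "# TYPE http_requests_total counter")
		fmt.Fprintf(w, "http_requests_total %d\n", s.requests.Load())
	})
	return mux
}

// get issues an in-process GET request and returns the response body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	// A real service would run: http.ListenAndServe(":8080", (&server{}).routes())
	mux := (&server{}).routes()
	get(mux, "/")
	get(mux, "/")
	fmt.Print(get(mux, "/metrics")) // the text Prometheus receives on each scrape
}
```

Prometheus stores each scraped value with a timestamp, and Grafana queries that time series. Because the counter only ever increases, a dashboard would typically graph its rate rather than its raw value.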
Datadog and New Relic are popular SaaS alternatives. They cost more but require less operational effort. For small teams, a managed service is usually worth the money. For large organizations, Prometheus + Grafana (often via the Grafana Cloud managed service) is the standard.
Logging: The ELK Stack & Loki
The ELK stack (Elasticsearch, Logstash, Kibana) was the standard for years. Elasticsearch indexes logs for search; Logstash processes them; Kibana visualizes them.
Grafana Loki is a newer alternative that stores logs more efficiently by indexing only metadata (labels), not the full log content. It pairs naturally with Grafana and Prometheus.
# Structured logging makes searching easier
echo '{"level":"error","msg":"payment failed","user_id":"abc123","error":"timeout"}' >> /var/log/app.log
# vs unstructured logging
echo "ERROR: payment failed for user abc123: timeout" >> /var/log/app.log
Always use structured logging (JSON). A question like "show every error for user abc123" then becomes an exact match on the user_id field instead of a fragile regular expression over free-form text.
Secrets: HashiCorp Vault
Secrets (API keys, database passwords, TLS certificates) must never be committed to source control. Vault is one of the most widely used secrets management tools.
# Store a secret
vault kv put secret/myapp db_password="s3cret"
# Retrieve a secret
vault kv get -field=db_password secret/myapp
Alternatives: AWS Secrets Manager, GCP Secret Manager, Azure Key Vault (cloud-native options), and SOPS (encrypts secrets in files, works well with Git). For Kubernetes, the External Secrets Operator syncs secrets from Vault or cloud providers into Kubernetes Secrets.
Choosing Tools: Boring Is Better
The most important principle for toolchain selection is: choose boring technology. A tool that has been around for years, has extensive documentation, a large community, and known failure modes is almost always better than the latest thing on Hacker News.
Decision framework:
1. Does the tool solve a real problem you have today?
(Not a problem you might have at 10x scale.)
2. Is the tool well-documented and widely adopted?
(Can you hire people who know it?)
3. Does the tool integrate with your existing stack?
(Or does it require rearchitecting everything?)
4. Can you operate the tool?
(Self-hosting Elasticsearch requires expertise. Can you afford that?)
5. Is the tool still actively maintained?
(Check commit history, release cadence, community activity.)
The toolchain is a means, not an end. The goal is to ship software reliably and quickly. If your current tools are doing that, resist the urge to replace them with something newer.
Common Pitfalls
- Tool sprawl -- Every team picks their own CI system, their own monitoring stack, their own secret management. The result is an un-debuggable mess. Standardize on one tool per category.
- Resume-driven decisions -- Choosing Kubernetes because it looks good on a resume when Docker Compose on a single server handles your load. Choose tools based on the problem, not the hype.
- Ignoring operational cost -- Self-hosting Prometheus, Elasticsearch, and Vault requires a team to maintain them. If you have 5 engineers total, use managed services.
- Conflating tools with practices -- Having GitHub Actions does not mean you do CI. CI means every commit is tested. The tool enables the practice; the tool is not the practice.
- Migrating too often -- Switching from Jenkins to GitHub Actions to GitLab CI in two years means you never get good at any of them. Pick a tool and invest in mastering it.
Key Takeaways
- The DevOps toolchain covers well-defined categories, among them source control, CI/CD, containers, orchestration, IaC, monitoring, logging, and secrets.
- Git is universal. For hosting, GitHub and GitLab dominate. For CI/CD, choose whichever is native to your source control platform.
- Docker is the standard for containers. Kubernetes is the standard for orchestration -- but not every organization needs it.
- Terraform is the most widely adopted IaC tool. Prometheus + Grafana is the standard open-source monitoring stack.
- Choose boring technology: widely adopted, well-documented, and maintainable by your team.
- Standardize within your organization. One tool per category is almost always better than letting every team choose their own.