Kubernetes has become the de facto standard for container orchestration, but managing complex stateful applications on Kubernetes often requires more than just Deployments and Services. That's where Kubernetes Operators come in — they encode human operational knowledge into software that extends the Kubernetes API itself.
What is a Kubernetes Operator?
A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application using custom resources (CRs) and custom controllers. Think of it as a robot SRE that watches your cluster and takes actions to reconcile the actual state with the desired state you've declared.
The Operator pattern was introduced by CoreOS in 2016 and has since become a widely used way to manage complex workloads. Popular operators include the Prometheus Operator, cert-manager, and several PostgreSQL operators.
Core Concepts
Before building an operator, you need to understand these key concepts:
- Custom Resource Definition (CRD): Extends the Kubernetes API with your own resource types. For example, you might define a Database resource with fields like engine, version, and replicas.
- Controller: A loop that watches for changes to resources and takes action to move the current state toward the desired state. This is the brain of your operator.
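- Reconciliation Loop: The core logic of a controller. The reconciler is called whenever a watched resource changes, and again on periodic resyncs or after a failed attempt, so each call must check the current state and ensure reality matches the spec.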
- Finalizers: Special metadata that prevent a resource from being deleted until cleanup logic has completed.
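To make the controller and reconciliation-loop ideas concrete, here is a minimal sketch that uses only the Go standard library. It does not use controller-runtime or talk to a cluster: DatabaseSpec, Cluster, Reconcile, and RunUntilConverged are hypothetical names invented for this illustration, and the "cluster" is just a struct in memory.

```go
package main

import "fmt"

// DatabaseSpec is a hypothetical desired state, loosely modeled on the
// Database example above. It is not a real Kubernetes type.
type DatabaseSpec struct {
	Replicas int
}

// Cluster is an in-memory stand-in for the actual cluster state.
type Cluster struct {
	RunningReplicas int
}

// Reconcile moves the cluster one step toward the spec and reports whether
// it changed anything. Once the states match it does nothing, which is the
// idempotency property a real reconciler needs.
func Reconcile(spec DatabaseSpec, c *Cluster) bool {
	switch {
	case c.RunningReplicas < spec.Replicas:
		c.RunningReplicas++ // start one replica
		return true
	case c.RunningReplicas > spec.Replicas:
		c.RunningReplicas-- // stop one replica
		return true
	default:
		return false // actual state already matches desired state
	}
}

// RunUntilConverged plays the role of the controller: it keeps calling
// Reconcile until nothing changes and returns how many calls made a change.
func RunUntilConverged(spec DatabaseSpec, c *Cluster) int {
	steps := 0
	for Reconcile(spec, c) {
		steps++
	}
	return steps
}

func main() {
	cluster := &Cluster{RunningReplicas: 0}
	spec := DatabaseSpec{Replicas: 3}

	steps := RunUntilConverged(spec, cluster)
	fmt.Println("steps:", steps, "running:", cluster.RunningReplicas)

	// A second pass finds nothing left to do.
	fmt.Println("changed again:", Reconcile(spec, cluster))
}
```

A real controller is driven by watch events rather than a tight loop, but the shape is the same: compare desired and actual state, take one corrective step, and be prepared to be called again.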
Setting Up Your Environment
To build an operator in Go, you'll use the Operator SDK, which provides scaffolding, code generation, and testing utilities.
# Install the Operator SDK CLI
brew install operator-sdk
# Or download the binary directly
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
curl -LO https://github.com/operator-framework/operator-sdk/releases/latest/download/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk
# Verify installation
operator-sdk version
You'll also need Go 1.21+, Docker, kubectl, and access to a Kubernetes cluster (minikube or kind works great for development).
Scaffolding Your Operator Project
Let's build an operator that manages a custom AppService resource — a simplified application deployment manager.
# Create a new project
mkdir appservice-operator && cd appservice-operator
operator-sdk init --domain example.com --repo github.com/yourname/appservice-operator
# Create an API (CRD + Controller)
operator-sdk create api --group apps --version v1alpha1 --kind AppService --resource --controller
This generates the project structure with boilerplate code, including the CRD types, controller skeleton, and Makefile targets.
Defining Your Custom Resource
Edit the generated types file at api/v1alpha1/appservice_types.go:
type AppServiceSpec struct {
	// Size is the number of replicas for the deployment
	Size int32 `json:"size"`
	// Image is the container image to deploy
	Image string `json:"image"`
	// Port is the port the application listens on
	Port int32 `json:"port,omitempty"`
}
type AppServiceStatus struct {
	// Conditions represent the latest available observations
	Conditions []metav1.Condition `json:"conditions,omitempty"`
	// AvailableReplicas is the number of pods ready
	AvailableReplicas int32 `json:"availableReplicas,omitempty"`
}
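The json struct tags decide how each field appears in the serialized resource. As a small standalone illustration, the sketch below copies the AppServiceSpec fields and encodes them with the standard library; EncodeSpec is a hypothetical helper written for this example, and the image name is made up. It only shows the tag behavior, not how the API server stores objects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppServiceSpec repeats the spec fields from above so the example is standalone.
type AppServiceSpec struct {
	Size  int32  `json:"size"`
	Image string `json:"image"`
	Port  int32  `json:"port,omitempty"`
}

// EncodeSpec returns the JSON form of a spec.
func EncodeSpec(s AppServiceSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	withPort, err := EncodeSpec(AppServiceSpec{Size: 3, Image: "nginx:1.25", Port: 8080})
	if err != nil {
		panic(err)
	}
	fmt.Println(withPort) // {"size":3,"image":"nginx:1.25","port":8080}

	// Port is zero and tagged omitempty, so it is dropped from the output.
	// Size has no omitempty, so a zero value would still be written.
	withoutPort, err := EncodeSpec(AppServiceSpec{Size: 3, Image: "nginx:1.25"})
	if err != nil {
		panic(err)
	}
	fmt.Println(withoutPort) // {"size":3,"image":"nginx:1.25"}
}
```

One consequence worth noting: with omitempty, a field set to its zero value is indistinguishable from a field that was never set, so it suits optional fields like Port better than required ones like Size.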
After modifying the types, regenerate the manifests:
make generate
make manifests
Implementing the Reconciliation Loop
The reconciler is where all the magic happens. Edit internal/controller/appservice_controller.go:
func (r *AppServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)
	// 1. Fetch the AppService instance
	appService := &appsv1alpha1.AppService{}
	if err := r.Get(ctx, req.NamespacedName, appService); err != nil {
		if apierrors.IsNotFound(err) {
			log.Info("AppService resource not found; it was probably deleted")
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}
	// 2. Check if the Deployment already exists, create if not
	deployment := &appsv1.Deployment{}
	err := r.Get(ctx, types.NamespacedName{
		Name:      appService.Name,
		Namespace: appService.Namespace,
	}, deployment)
	if err != nil && apierrors.IsNotFound(err) {
		dep := r.createDeployment(appService)
		log.Info("Creating a new Deployment", "Name", dep.Name)
		if err = r.Create(ctx, dep); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		// Any other error (for example a transient API failure): return it so
		// the request is retried instead of continuing with an empty Deployment.
		return ctrl.Result{}, err
	}
	// 3. Ensure the replica count matches the spec
	// (Spec.Replicas is a pointer and may be nil, so guard the dereference)
	if deployment.Spec.Replicas == nil || *deployment.Spec.Replicas != appService.Spec.Size {
		deployment.Spec.Replicas = &appService.Spec.Size
		if err = r.Update(ctx, deployment); err != nil {
			return ctrl.Result{}, err
		}
	}
	// 4. Update status
	appService.Status.AvailableReplicas = deployment.Status.AvailableReplicas
	if err := r.Status().Update(ctx, appService); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
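Steps 2 and 3 of the reconciler boil down to a small decision: create the Deployment, scale it, or leave it alone. One way to keep that decision easy to test without a cluster is to factor it into a pure function. The sketch below is an assumption about how you might do that, not part of the generated scaffolding: Deployment here is a simplified stand-in for appsv1.Deployment, and Action and Plan are hypothetical names.

```go
package main

import "fmt"

// Deployment is a simplified stand-in for appsv1.Deployment. Only the
// optional replica count is modeled; a nil pointer means "unset".
type Deployment struct {
	Replicas *int32
}

// Action names the step a reconciler would take next.
type Action string

const (
	ActionCreate Action = "create"
	ActionScale  Action = "scale"
	ActionNone   Action = "none"
)

// Plan decides the next step from the observed and desired state.
// existing is nil when the Deployment was not found.
func Plan(existing *Deployment, desired int32) Action {
	if existing == nil {
		return ActionCreate
	}
	// Guard the dereference: an unset replica count counts as a mismatch.
	if existing.Replicas == nil || *existing.Replicas != desired {
		return ActionScale
	}
	return ActionNone
}

func main() {
	two := int32(2)

	fmt.Println(Plan(nil, 3))                         // Deployment missing
	fmt.Println(Plan(&Deployment{Replicas: &two}, 3)) // replica count differs
	fmt.Println(Plan(&Deployment{Replicas: &two}, 2)) // already in sync
}
```

With the decision isolated like this, the Reconcile method is left with the API calls, and the branching logic can be covered by ordinary table-driven tests.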
Testing Your Operator
The Operator SDK generates test scaffolding using Ginkgo and envtest:
# Run unit tests
make test
# Run the operator locally against your cluster
make install # Install CRDs
make run # Run the controller locally
# In another terminal, create a sample resource
kubectl apply -f config/samples/apps_v1alpha1_appservice.yaml
# Watch it work
kubectl get appservice
kubectl get deployments
kubectl get pods
Deploying to Production
When you're ready to deploy the operator to a real cluster:
# Build and push the operator image
make docker-build docker-push IMG=yourregistry/appservice-operator:v0.1.0
# Deploy to the cluster
make deploy IMG=yourregistry/appservice-operator:v0.1.0
# Verify it's running
kubectl get pods -n appservice-operator-system
Best Practices
- Idempotency: Your reconciler will be called multiple times for the same resource. Every operation must be safe to repeat: running it twice should leave the cluster in the same state as running it once.
- Status Subresource: Always update status via the status subresource, not the main resource. This avoids conflicts and follows Kubernetes conventions.
- Owner References: Set owner references on child resources so they're automatically garbage collected when the parent is deleted.
- RBAC: Follow the principle of least privilege. Only request the permissions your operator actually needs.
- Error Handling: Return errors from the reconciler to trigger automatic requeue with exponential backoff.
- Finalizers: Use finalizers for cleanup logic that must run before deletion (e.g., deleting external cloud resources).
- Observability: Add metrics, structured logging, and events to make your operator debuggable in production.
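The finalizer practice is the easiest one to get subtly wrong, so here is a rough sketch of the usual flow using only the standard library. Everything in it is a stand-in: Object models just the finalizers list and a Deleting flag (in place of a real deletion timestamp), and finalizerName, errCleanupFailed, and HandleFinalizer are hypothetical names chosen for this example. A real operator would work with the object's metadata through its client library instead.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a hypothetical finalizer key chosen for this sketch.
const finalizerName = "example.com/cleanup"

// errCleanupFailed simulates a failure while deleting external resources.
var errCleanupFailed = errors.New("external cleanup failed")

// Object is a simplified stand-in for Kubernetes object metadata.
type Object struct {
	Finalizers []string
	Deleting   bool // stands in for a non-nil deletion timestamp
}

func hasFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, name string) {
	kept := make([]string, 0, len(o.Finalizers))
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// HandleFinalizer registers the finalizer on live objects and, once deletion
// is requested, runs cleanup before releasing the object. It returns true
// when no finalizers remain and the object is free to be removed.
func HandleFinalizer(o *Object, cleanup func() error) (bool, error) {
	if !o.Deleting {
		if !hasFinalizer(o, finalizerName) {
			o.Finalizers = append(o.Finalizers, finalizerName)
		}
		return false, nil
	}
	if hasFinalizer(o, finalizerName) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		removeFinalizer(o, finalizerName)
	}
	return len(o.Finalizers) == 0, nil
}

func main() {
	obj := &Object{}
	noop := func() error { return nil }
	failing := func() error { return errCleanupFailed }

	// Normal reconcile: the finalizer is registered.
	done, err := HandleFinalizer(obj, noop)
	fmt.Println(done, err, obj.Finalizers)

	// Deletion requested but cleanup fails: the finalizer stays in place.
	obj.Deleting = true
	done, err = HandleFinalizer(obj, failing)
	fmt.Println(done, err, obj.Finalizers)

	// Cleanup succeeds on the retry: the finalizer is removed.
	done, err = HandleFinalizer(obj, noop)
	fmt.Println(done, err, obj.Finalizers)
}
```

The important detail is the failure path: the finalizer is only removed after cleanup succeeds, so returning the error lets the reconciler be requeued and try again rather than orphaning external resources.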
Kubernetes Operators are one of the most powerful patterns in the cloud-native ecosystem. They let you automate complex operational tasks, enforce best practices, and build self-healing infrastructure. With Go and the Operator SDK, you have everything you need to start building production-grade operators today.