Kubernetes Operators Guide

Kubernetes has become the de facto standard for container orchestration, but managing complex stateful applications on Kubernetes often requires more than just Deployments and Services. That's where Kubernetes Operators come in — they encode human operational knowledge into software that extends the Kubernetes API itself.

What is a Kubernetes Operator?

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application using custom resources (CRs) and custom controllers. Think of it as a robot SRE that watches your cluster and takes actions to reconcile the actual state with the desired state you've declared.

The Operator pattern was introduced by CoreOS in 2016 and has since become a common way to manage complex, stateful workloads. Popular examples include the Prometheus Operator, cert-manager, and several PostgreSQL operators.

Kubernetes Operator Architecture

Kubernetes API Server: Receives CRD definitions and custom resources
Custom Resource Definition (CRD): Extends the API with your own resource types
Controller / Reconciler: Watches for changes, reconciles desired vs actual state
Managed Resources: Deployments, Services, ConfigMaps created by the operator
Running Workloads: Pods, containers, your actual application

Core Concepts

Before building an operator, you need to understand these key concepts:

Custom Resource Definition (CRD): Extends the Kubernetes API with your own resource types. For example, you might define a Database resource with fields like engine, version, and replicas.
Controller: A loop that watches for changes to resources and takes action to move the current state toward the desired state. This is the brain of your operator.
Reconciliation Loop: The core logic of a controller. Every time a resource changes, the reconciler is called to ensure reality matches the spec.
Finalizers: Special metadata that prevent a resource from being deleted until cleanup logic has completed.
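
To make the controller and reconciliation concepts above concrete, here is a minimal sketch in plain Go with no Kubernetes libraries. The Cluster and Desired types and the Reconcile function are hypothetical stand-ins invented for illustration; a real controller reads and writes state through the API server instead of an in-memory map.

```go
package main

import "fmt"

// Cluster is a toy stand-in for the actual state the API server would report.
type Cluster struct {
	Replicas map[string]int32 // workload name -> running replicas
}

// Desired mirrors the spec of a custom resource (hypothetical fields).
type Desired struct {
	Name     string
	Replicas int32
}

// Reconcile compares desired and actual state and applies at most one fix.
// It returns a short description of what it did.
func Reconcile(c *Cluster, d Desired) string {
	actual, exists := c.Replicas[d.Name]
	switch {
	case !exists:
		c.Replicas[d.Name] = d.Replicas
		return "created"
	case actual != d.Replicas:
		c.Replicas[d.Name] = d.Replicas
		return "scaled"
	default:
		return "in sync"
	}
}

func main() {
	c := &Cluster{Replicas: map[string]int32{}}
	d := Desired{Name: "db", Replicas: 3}

	fmt.Println(Reconcile(c, d)) // first pass creates the workload
	fmt.Println(Reconcile(c, d)) // second pass finds nothing to do

	c.Replicas["db"] = 1         // simulate drift, e.g. a manual scale-down
	fmt.Println(Reconcile(c, d)) // the next pass repairs the drift
}
```

Note that Reconcile never looks at what event triggered it. It only compares the two states, which is why calling it repeatedly is harmless.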

Setting Up Your Environment

To build an operator in Go, you'll use the Operator SDK, which provides scaffolding, code generation, and testing utilities.

# Install the Operator SDK CLI
brew install operator-sdk

# Or download the binary directly
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
curl -LO https://github.com/operator-framework/operator-sdk/releases/latest/download/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk

# Verify installation
operator-sdk version

You'll also need Go 1.21+, Docker, kubectl, and access to a Kubernetes cluster (minikube or kind works great for development).

Scaffolding Your Operator Project

Let's build an operator that manages a custom AppService resource — a simplified application deployment manager.

# Create a new project
mkdir appservice-operator && cd appservice-operator
operator-sdk init --domain example.com --repo github.com/yourname/appservice-operator

# Create an API (CRD + Controller)
operator-sdk create api --group apps --version v1alpha1 --kind AppService --resource --controller

This generates the project structure with boilerplate code, including the CRD types, controller skeleton, and Makefile targets.

Defining Your Custom Resource

Edit the generated types file at api/v1alpha1/appservice_types.go:

type AppServiceSpec struct {
    // Size is the number of replicas for the deployment
    Size int32 `json:"size"`

    // Image is the container image to deploy
    Image string `json:"image"`

    // Port is the port the application listens on
    Port int32 `json:"port,omitempty"`
}

type AppServiceStatus struct {
    // Conditions represent the latest available observations
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // AvailableReplicas is the number of pods ready
    AvailableReplicas int32 `json:"availableReplicas,omitempty"`
}
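
The JSON tags on these fields decide how the spec appears in a manifest. The sketch below repeats AppServiceSpec in a standalone program and encodes it with the standard library to show the effect of omitempty. The kubebuilder validation markers and their bounds are assumptions added for illustration: they are plain comments here, and in a scaffolded project controller-gen reads them when generating the CRD schema. The encodeSpec helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppServiceSpec repeats the spec above so this file compiles on its own.
// The marker comments are inert in this program (illustrative bounds only).
type AppServiceSpec struct {
	// +kubebuilder:validation:Minimum=0
	Size int32 `json:"size"`

	// +kubebuilder:validation:MinLength=1
	Image string `json:"image"`

	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=65535
	Port int32 `json:"port,omitempty"`
}

// encodeSpec returns the JSON form of a spec, as it would appear under "spec".
func encodeSpec(s AppServiceSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeSpec(AppServiceSpec{Size: 3, Image: "nginx:1.25"})
	if err != nil {
		panic(err)
	}
	// Port is left out because it is zero and tagged omitempty;
	// size has no omitempty, so it is always present.
	fmt.Println(out)
}
```

Fields without omitempty become required in the generated schema, so a manifest that leaves out size or image is rejected by the API server, while port stays optional.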

After modifying the types, regenerate the DeepCopy code (make generate) and the CRD manifests (make manifests):

make generate
make manifests

The Reconciliation Loop

Watch → Event → Reconcile → Compare → Loop → Converge

Implementing the Reconciliation Loop

The reconciler is where all the magic happens. Edit internal/controller/appservice_controller.go:

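// Imports this snippet relies on (the scaffold already provides some of them)
import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log"

    appsv1alpha1 "github.com/yourname/appservice-operator/api/v1alpha1"
)

func (r *AppServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Named logger so the variable does not shadow the log package
    logger := log.FromContext(ctx)

    // 1. Fetch the AppService instance
    appService := &appsv1alpha1.AppService{}
    if err := r.Get(ctx, req.NamespacedName, appService); err != nil {
        if apierrors.IsNotFound(err) {
            // The resource was deleted after the event was queued; nothing to do
            logger.Info("AppService resource not found, probably deleted")
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    // 2. Check if the Deployment already exists, create if not
    deployment := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      appService.Name,
        Namespace: appService.Namespace,
    }, deployment)

    if err != nil && apierrors.IsNotFound(err) {
        // createDeployment is a helper (not shown here) that builds the Deployment from the spec
        dep := r.createDeployment(appService)
        logger.Info("Creating a new Deployment", "Name", dep.Name)
        if err = r.Create(ctx, dep); err != nil {
            return ctrl.Result{}, err
        }
        // Requeue so the next pass reads the Deployment that was just created
        return ctrl.Result{Requeue: true}, nil
    } else if err != nil {
        // Any other lookup error: return it so the request is retried,
        // instead of falling through with an empty Deployment
        return ctrl.Result{}, err
    }

    // 3. Ensure the replica count matches the spec (Replicas is a pointer and may be nil)
    if deployment.Spec.Replicas == nil || *deployment.Spec.Replicas != appService.Spec.Size {
        size := appService.Spec.Size
        deployment.Spec.Replicas = &size
        if err = r.Update(ctx, deployment); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 4. Update status through the status subresource
    appService.Status.AvailableReplicas = deployment.Status.AvailableReplicas
    if err := r.Status().Update(ctx, appService); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}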
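
The status update in step 4 only copies AvailableReplicas; the Conditions field declared in AppServiceStatus is never set. The sketch below models, in plain Go, how a condition is usually recorded: one entry per type, with the transition time changing only when the status actually flips. The Condition type here is a trimmed-down stand-in for metav1.Condition, and setCondition and at are hypothetical helpers written for this example. In a real controller, meta.SetStatusCondition from k8s.io/apimachinery/pkg/api/meta provides similar behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a trimmed-down stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// setCondition adds c or updates the entry with the same Type. The transition
// time only moves when Status changes, so repeated reconciles that observe
// the same state do not churn the object.
func setCondition(conds *[]Condition, c Condition, now time.Time) {
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != c.Type {
			continue
		}
		if existing.Status != c.Status {
			existing.Status = c.Status
			existing.LastTransitionTime = now
		}
		existing.Reason = c.Reason
		return
	}
	c.LastTransitionTime = now
	*conds = append(*conds, c)
}

// at builds a fixed timestamp so the example is deterministic.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	var conds []Condition

	// Two passes observe the same state: one entry, timestamp unchanged.
	setCondition(&conds, Condition{Type: "Available", Status: "False", Reason: "Creating"}, at(100))
	setCondition(&conds, Condition{Type: "Available", Status: "False", Reason: "Creating"}, at(200))
	fmt.Println(len(conds), conds[0].Status, conds[0].LastTransitionTime.Unix())

	// The status flips, so the transition time moves.
	setCondition(&conds, Condition{Type: "Available", Status: "True", Reason: "ReplicasReady"}, at(300))
	fmt.Println(len(conds), conds[0].Status, conds[0].LastTransitionTime.Unix())
}
```

Keeping the timestamp stable across identical observations is what lets a reconciler write status on every pass without producing a new object version each time.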

Testing Your Operator

The Operator SDK generates test scaffolding using Ginkgo and envtest:

# Run unit tests
make test

# Run the operator locally against your cluster
make install  # Install CRDs
make run      # Run the controller locally

# In another terminal, create a sample resource
kubectl apply -f config/samples/apps_v1alpha1_appservice.yaml

# Watch it work
kubectl get appservice
kubectl get deployments
kubectl get pods

Operator Development Pipeline

Define CRD (types.go) → Generate (make manifests) → Implement (controller.go) → Test (make test) → Deploy (make deploy)

Deploying to Production

When you're ready to deploy the operator to a real cluster:

# Build and push the operator image
make docker-build docker-push IMG=yourregistry/appservice-operator:v0.1.0

# Deploy to the cluster
make deploy IMG=yourregistry/appservice-operator:v0.1.0

# Verify it's running
kubectl get pods -n appservice-operator-system

Best Practices

Idempotency: Your reconciler will be called many times for the same resource. Every operation must be safe to repeat: running it twice must leave the cluster in the same state as running it once.
Status Subresource: Always update status via the status subresource, not the main resource. This avoids conflicts and follows Kubernetes conventions.
Owner References: Set owner references on child resources so they're automatically garbage collected when the parent is deleted.
RBAC: Follow the principle of least privilege. Only request the permissions your operator actually needs.
Error Handling: Return errors from the reconciler to trigger automatic requeue with exponential backoff.
Finalizers: Use finalizers for cleanup logic that must run before deletion (e.g., deleting external cloud resources).
Observability: Add metrics, structured logging, and events to make your operator debuggable in production.
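
The finalizer practice in the list above can be sketched as a small state machine. This is a simplified model in plain Go, not controller-runtime code: Object, the finalizer name apps.example.com/cleanup, and all helper functions are hypothetical and exist only for this example. In a real operator the equivalent helpers are controllerutil.ContainsFinalizer, AddFinalizer, and RemoveFinalizer from sigs.k8s.io/controller-runtime/pkg/controller/controllerutil, and deletion is signalled by a non-nil deletion timestamp.

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a toy stand-in for a custom resource's metadata.
type Object struct {
	Finalizers []string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
}

// cleanupFinalizer is a hypothetical finalizer name for this sketch.
const cleanupFinalizer = "apps.example.com/cleanup"

// errCleanupFailed simulates an external cleanup that did not succeed.
var errCleanupFailed = errors.New("cleanup failed")

func hasFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// reconcileFinalizer registers the finalizer on live objects and, once the
// object is marked for deletion, runs cleanup before releasing it. It reports
// whether the object has no finalizers left and can therefore be removed.
func reconcileFinalizer(o *Object, cleanup func() error) (bool, error) {
	if !o.Deleting {
		if !hasFinalizer(o, cleanupFinalizer) {
			o.Finalizers = append(o.Finalizers, cleanupFinalizer)
		}
		return false, nil
	}
	if hasFinalizer(o, cleanupFinalizer) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		removeFinalizer(o, cleanupFinalizer)
	}
	return len(o.Finalizers) == 0, nil
}

func main() {
	obj := &Object{}
	released, err := reconcileFinalizer(obj, func() error { return nil })
	fmt.Println(released, err, obj.Finalizers) // live object: finalizer registered

	obj.Deleting = true
	released, err = reconcileFinalizer(obj, func() error { return errCleanupFailed })
	fmt.Println(released, err, obj.Finalizers) // failed cleanup: still blocked

	released, err = reconcileFinalizer(obj, func() error { return nil })
	fmt.Println(released, err, obj.Finalizers) // cleanup done: object released
}
```

Returning the cleanup error while leaving the finalizer in place ties this to the error-handling practice: the request is requeued, and deletion stays blocked until cleanup succeeds.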

Kubernetes Operators are one of the most powerful patterns in the cloud-native ecosystem. They let you automate complex operational tasks, enforce best practices, and build self-healing infrastructure. With Go and the Operator SDK, you have everything you need to start building production-grade operators today.
