Kubernetes Operators: Build Your Own Operator Using Golang

Learn what Kubernetes Operators are, why they matter, and how to build your own custom operator from scratch using Golang and the Operator SDK.


Kubernetes has become the de facto standard for container orchestration, but managing complex stateful applications on Kubernetes often requires more than just Deployments and Services. That's where Kubernetes Operators come in: they encode human operational knowledge into software that extends the Kubernetes API itself.

What is a Kubernetes Operator?

A Kubernetes Operator is a method of packaging, deploying, and managing a Kubernetes application using custom resources (CRs) and custom controllers. Think of it as a robot SRE that watches your cluster and takes actions to reconcile the actual state with the desired state you've declared.

The Operator pattern was introduced by CoreOS in 2016 and has since become a widely adopted way to manage complex workloads. Popular operators include the Prometheus Operator, cert-manager, and several PostgreSQL operators (for example, Zalando's Postgres Operator).

Kubernetes Operator Architecture
  • Kubernetes API Server: receives CRD definitions and custom resources
  • Custom Resource Definition (CRD): extends the API with your own resource types
  • Controller / Reconciler: watches for changes and reconciles desired vs. actual state
  • Managed Resources: Deployments, Services, and ConfigMaps created by the operator
  • Running Workloads: Pods and containers running your actual application

Core Concepts

Before building an operator, you need to understand these key concepts:

  • Custom Resource Definition (CRD): Extends the Kubernetes API with your own resource types. For example, you might define a Database resource with fields like engine, version, and replicas.
  • Controller: A loop that watches for changes to resources and takes action to move the current state toward the desired state. This is the brain of your operator.
  • Reconciliation Loop: The core logic of a controller. Every time a resource changes, the reconciler is called to ensure reality matches the spec.
  • Finalizers: Special metadata that prevent a resource from being deleted until cleanup logic has completed.
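
To make the controller and reconciliation-loop ideas concrete, here is a toy model in plain Go with no Kubernetes dependencies. It is a sketch, not the controller-runtime API: the Desired and Actual types, reconcile, apply, and the string "actions" are hypothetical stand-ins for a custom resource's spec, the objects observed in the cluster, and real API calls.

```go
package main

import "fmt"

// Desired is a hypothetical stand-in for what the user declared in the CR spec.
type Desired struct {
	Replicas int
	Image    string
}

// Actual is a hypothetical stand-in for what the controller observes in the cluster.
type Actual struct {
	Exists   bool
	Replicas int
	Image    string
}

// reconcile compares desired and actual state and returns the actions needed
// to converge. It looks only at current state, never at which change
// triggered it.
func reconcile(d Desired, a Actual) []string {
	if !a.Exists {
		return []string{fmt.Sprintf("create deployment (replicas=%d, image=%s)", d.Replicas, d.Image)}
	}
	var actions []string
	if a.Replicas != d.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", a.Replicas, d.Replicas))
	}
	if a.Image != d.Image {
		actions = append(actions, fmt.Sprintf("update image %s -> %s", a.Image, d.Image))
	}
	return actions
}

// apply simulates the cluster successfully carrying out every action.
func apply(d Desired) Actual {
	return Actual{Exists: true, Replicas: d.Replicas, Image: d.Image}
}

func main() {
	desired := Desired{Replicas: 3, Image: "nginx:1.25"}
	actual := Actual{} // nothing exists yet

	fmt.Println(reconcile(desired, actual))
	actual = apply(desired)
	fmt.Println(reconcile(desired, actual)) // converged: prints []
}
```

Because reconcile depends only on the current desired and actual state, calling it again after the cluster has converged returns no actions. That is the property the idempotency best practice relies on.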

Setting Up Your Environment

To build an operator in Go, you'll use the Operator SDK, which provides scaffolding, code generation, and testing utilities.

# Install the Operator SDK CLI
brew install operator-sdk

# Or download the binary directly
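# The fallback branch keeps arm64 on Apple silicon, where uname -m already reports arm64
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)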
export OS=$(uname | awk '{print tolower($0)}')
curl -LO https://github.com/operator-framework/operator-sdk/releases/latest/download/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk

# Verify installation
operator-sdk version

You'll also need Go 1.21+, Docker, kubectl, and access to a Kubernetes cluster (minikube or kind works great for development).

Scaffolding Your Operator Project

Let's build an operator that manages a custom AppService resource: a simplified application deployment manager.

# Create a new project
mkdir appservice-operator && cd appservice-operator
operator-sdk init --domain example.com --repo github.com/yourname/appservice-operator

# Create an API (CRD + Controller)
operator-sdk create api --group apps --version v1alpha1 --kind AppService --resource --controller

This generates the project structure with boilerplate code, including the CRD types, controller skeleton, and Makefile targets.

Defining Your Custom Resource

Edit the generated types file at api/v1alpha1/appservice_types.go:

type AppServiceSpec struct {
    // Size is the number of replicas for the deployment
    Size int32 `json:"size"`

    // Image is the container image to deploy
    Image string `json:"image"`

    // Port is the port the application listens on
    Port int32 `json:"port,omitempty"`
}

type AppServiceStatus struct {
    // Conditions represent the latest available observations
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // AvailableReplicas is the number of pods ready
    AvailableReplicas int32 `json:"availableReplicas,omitempty"`
}

After modifying the types, regenerate the deepcopy code (make generate) and the CRD manifests (make manifests):

make generate
make manifests
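
The json tags on the spec fields determine the field names users write in an AppService manifest (spec.size, spec.image, spec.port). The standalone sketch below copies AppServiceSpec and round-trips it through encoding/json to show that mapping, and that omitempty drops a zero Port; the encode and decode helpers are hypothetical, added only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppServiceSpec is copied from api/v1alpha1/appservice_types.go so this
// example compiles on its own.
type AppServiceSpec struct {
	Size  int32  `json:"size"`
	Image string `json:"image"`
	Port  int32  `json:"port,omitempty"`
}

// encode is a hypothetical helper that renders a spec as JSON.
func encode(spec AppServiceSpec) (string, error) {
	b, err := json.Marshal(spec)
	return string(b), err
}

// decode is a hypothetical helper that parses the spec portion of a manifest.
func decode(data string) (AppServiceSpec, error) {
	var spec AppServiceSpec
	err := json.Unmarshal([]byte(data), &spec)
	return spec, err
}

func main() {
	// Port is left at its zero value, so omitempty drops it from the output.
	out, err := encode(AppServiceSpec{Size: 3, Image: "nginx:1.25"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"size":3,"image":"nginx:1.25"}

	spec, err := decode(`{"size":2,"image":"busybox","port":8080}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Size, spec.Image, spec.Port) // 2 busybox 8080
}
```

controller-gen also treats fields tagged omitempty as optional when it generates the CRD schema, so in this spec size and image end up required while port does not.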

The Reconciliation Loop: Watch → Event → Reconcile → Compare → Loop → Converge
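
One detail behind the Watch and Event steps: a controller-runtime Reconcile call receives only the object's namespace and name (req.NamespacedName), not the event that caused it, and pending requests wait in a work queue that holds each key at most once. The toy queue below is a simplified, hypothetical stand-in for client-go's workqueue (the real one also tracks in-flight keys, rate limiting, and shutdown); it shows several events for the same object collapsing into a single reconcile.

```go
package main

import "fmt"

// queue is a toy work queue keyed by "namespace/name". A key that is
// already waiting is not added a second time.
type queue struct {
	order []string
	dirty map[string]bool
}

func newQueue() *queue { return &queue{dirty: map[string]bool{}} }

// Add enqueues a key unless it is already waiting.
func (q *queue) Add(key string) {
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	q.order = append(q.order, key)
}

// Get pops the next key; ok is false when the queue is empty. Once popped,
// the key can be queued again by a later event.
func (q *queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.dirty, key)
	return key, true
}

// Len reports how many keys are waiting.
func (q *queue) Len() int { return len(q.order) }

func main() {
	q := newQueue()
	// Watch: three events for "web" (e.g. create, spec edit, status change), one for "api".
	events := []string{"default/web", "default/web", "default/api", "default/web"}
	for _, key := range events {
		q.Add(key)
	}
	fmt.Println("queued:", q.Len()) // queued: 2

	// Reconcile: each key is processed once and reads the latest state.
	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println("reconcile", key)
	}
}
```

This deduplication is why a reconciler should always read current state rather than count on seeing every individual change.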

Implementing the Reconciliation Loop

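The reconciler holds the operator's core logic: each time it runs for an AppService, it compares the declared spec with what exists in the cluster and closes the gap. Edit internal/controller/appservice_controller.go: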

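// In addition to the scaffolded imports (context, ctrl, log, appsv1alpha1),
// this controller uses:
//     appsv1 "k8s.io/api/apps/v1"
//     apierrors "k8s.io/apimachinery/pkg/api/errors"
//     "k8s.io/apimachinery/pkg/types"

// The operator manages Deployments, so it also needs RBAC for them;
// run make manifests after adding this marker to regenerate the role.
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
func (r *AppServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 1. Fetch the AppService instance
    appService := &appsv1alpha1.AppService{}
    if err := r.Get(ctx, req.NamespacedName, appService); err != nil {
        if apierrors.IsNotFound(err) {
            // Deleted after the request was queued; nothing left to reconcile
            logger.Info("AppService resource not found, probably deleted")
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    // 2. Check if the Deployment already exists, create it if not
    deployment := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      appService.Name,
        Namespace: appService.Namespace,
    }, deployment)

    if err != nil && apierrors.IsNotFound(err) {
        // createDeployment is your own helper (not shown) that builds the
        // Deployment object from appService.Spec
        dep := r.createDeployment(appService)
        logger.Info("Creating a new Deployment", "Name", dep.Name)
        if err = r.Create(ctx, dep); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil
    } else if err != nil {
        // Any other read error: return it so the request is retried with
        // backoff instead of using an empty Deployment below
        return ctrl.Result{}, err
    }

    // 3. Ensure the replica count matches the spec
    if deployment.Spec.Replicas == nil || *deployment.Spec.Replicas != appService.Spec.Size {
        deployment.Spec.Replicas = &appService.Spec.Size
        if err = r.Update(ctx, deployment); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 4. Update status through the status subresource
    appService.Status.AvailableReplicas = deployment.Status.AvailableReplicas
    if err := r.Status().Update(ctx, appService); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}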

Testing Your Operator

The Operator SDK generates test scaffolding using Ginkgo and envtest:

# Run unit tests
make test

# Run the operator locally against your cluster
make install  # Install CRDs
make run      # Run the controller locally

# In another terminal, edit the sample to set spec.size and spec.image
# (the scaffolded sample only contains a TODO placeholder), then create it
kubectl apply -f config/samples/apps_v1alpha1_appservice.yaml

# Watch it work
kubectl get appservice
kubectl get deployments
kubectl get pods

Operator Development Pipeline: Define CRD (types.go) → Generate (make manifests) → Implement (controller.go) → Test (make test) → Deploy (make deploy)

Deploying to Production

When you're ready to deploy the operator to a real cluster:

# Build and push the operator image
make docker-build docker-push IMG=yourregistry/appservice-operator:v0.1.0

# Deploy to the cluster
make deploy IMG=yourregistry/appservice-operator:v0.1.0

# Verify it's running
kubectl get pods -n appservice-operator-system

Best Practices

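  • Idempotency: Your reconciler will be called many times for the same object, often when nothing has changed. Every operation must be safe to repeat: running it twice must leave the cluster in the same state as running it once.
  • Status Subresource: Always update status via the status subresource (r.Status().Update), not the main resource. With the subresource enabled, the API server ignores status changes on ordinary updates and ignores spec changes on status updates, so the controller's status writes cannot accidentally change the spec, and users' spec edits cannot change status.
  • Owner References: Set owner references on child resources so they're automatically garbage collected when the parent is deleted. In controller-runtime, calling ctrl.SetControllerReference(appService, dep, r.Scheme) inside a helper such as createDeployment does this.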
  • RBAC: Follow the principle of least privilege. Only request the permissions your operator actually needs.
  • Error Handling: Return errors from the reconciler to trigger automatic requeue with exponential backoff.
  • Finalizers: Use finalizers for cleanup logic that must run before deletion (e.g., deleting external cloud resources).
  • Observability: Add metrics, structured logging, and events to make your operator debuggable in production.

Kubernetes Operators are one of the most powerful patterns in the cloud-native ecosystem. They let you automate complex operational tasks, enforce best practices, and build self-healing infrastructure. With Go and the Operator SDK, you have everything you need to start building production-grade operators today.
