Part 5 - K3s Zero to Hero: Advanced K3s Management - Monitoring, Scaling, and Upgrades

If you've been following along with our K3s journey, congratulations! You've gone from absolute beginner in Part 1 to deploying real applications in Part 4. Now it's time to level up from "it works on my machine" to "it works reliably at 3 AM when something inevitably breaks." Welcome to the world of advanced K3s management, where we'll transform your cluster from a digital pet into production-grade cattle.
The Observability Trinity: Prometheus, Grafana, and Loki
Remember when you first installed K3s with that magical one-liner, curl -sfL https://get.k3s.io | sh -? Those were simpler times. Now we need to actually see what's happening inside our cluster, preferably before things catch fire.
Setting Up Prometheus: Your Cluster's Health Monitor
Prometheus is like having a very obsessive friend who remembers every detail about your cluster's performance. It collects metrics from everywhere and stores them with timestamps, creating a treasure trove of data that would make any data scientist weep with joy.
Let's install the monitoring stack using Helm, which we hopefully covered in Part 4. If you don't have Helm installed yet, you can grab it quickly:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Now, create a monitoring namespace and install the Prometheus stack:
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
This command installs what I like to call the "holy trinity" of observability: Prometheus for metrics collection, Grafana for visualization, and AlertManager for shouting at you when things go wrong.
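The chart spins up a small army of pods (operators, exporters, kube-state-metrics), so give it a minute and then confirm everything came up before moving on:

kubectl get pods -n monitoring
kubectl get svc -n monitoring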
Grafana: Making Data Beautiful Again
Once installed, you'll want to access Grafana's web interface. Remember how we set up Ingress in Part 4? This is where that knowledge pays off. But for quick access, let's use port forwarding:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Navigate to http://localhost:3000 and log in with the default credentials (admin/prom-operator). The first thing you'll notice is that Grafana comes pre-loaded with dashboards that make your cluster look like a NASA mission control center.
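For something more permanent than a port-forward, you can put Grafana behind the Traefik Ingress controller that K3s ships with, exactly the trick from Part 4. A minimal sketch, assuming grafana.example.com is a hostname you control and that the Helm release above created the prometheus-grafana service on port 80:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
    - host: grafana.example.com        # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana   # service created by the Helm release
                port:
                  number: 80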
Adding Loki for Log Aggregation
Metrics are great, but sometimes you need to see the actual logs to understand why your application decided to take a coffee break. Loki is like Prometheus's cousin who specializes in collecting and storing logs without breaking the bank.
Install Loki using the simplest method available:
kubectl create namespace logging
# Using arkade for easy installation
curl -sLS https://get.arkade.dev | sudo sh
arkade install loki -n logging --persistence
This installs Loki with 10GB of persistent storage and Promtail DaemonSets that run on every node, silently collecting logs like digital vacuum cleaners.
Verify everything is running:
kubectl get pods -n logging
You should see Promtail pods running on each node and a Loki pod storing all those precious logs.
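To prove Loki is actually swallowing logs and not just sitting there, port-forward to its API and ask which log labels it has indexed. A quick sketch, assuming the chart exposed a service called loki on port 3100 in the logging namespace (names can vary slightly between chart versions):

kubectl port-forward -n logging svc/loki 3100:3100
# In a second terminal: list the labels Loki has seen so far
curl -s "http://localhost:3100/loki/api/v1/labels"

When you add Loki as a data source in Grafana, the in-cluster URL follows the same pattern: http://loki.logging.svc.cluster.local:3100 (adjust if your service name differs).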
Scaling: When Your Cluster Needs to Grow (Or Shrink)
Scaling in Kubernetes comes in two flavors: horizontal (more pods) and vertical (bigger pods). It's like choosing between hiring more people or giving your existing team more coffee.
Horizontal Pod Autoscaling: The Smart Way to Scale
The Horizontal Pod Autoscaler (HPA) is Kubernetes' answer to the age-old question: "How many pods do I actually need?" It automatically adjusts the number of pod replicas based on CPU utilization, memory usage, or custom metrics.
Here's how to set up HPA for a deployment. First, ensure your deployment has resource requests defined:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
          ports:
            - containerPort: 80
Now create an HPA that scales based on CPU usage:
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=2 --max=10
This tells Kubernetes: "Keep CPU usage around 50% by scaling between 2 and 10 pods." It's like having a very attentive waiter who notices when you're running low on computing power.
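If you'd rather keep that decision in version control than in a one-off command, the equivalent autoscaling/v2 manifest looks roughly like this. K3s bundles metrics-server out of the box, so the CPU metrics the HPA needs should already be flowing:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50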
Cluster Autoscaling: When You Need More Nodes
Sometimes you need more than just additional pods; you need entire new nodes. Cluster autoscaling automatically adds or removes nodes based on resource demands.
For cloud providers, you can configure node autoscaling in your cluster configuration. Here's an example for a hypothetical cloud setup:
worker_node_pools:
  - name: autoscaled-workers
    instance_type: medium
    instance_count: 2
    location: datacenter-1
    autoscaling:
      enabled: true
      min_instances: 1
      max_instances: 5
This configuration allows your cluster to scale from 1 to 5 worker nodes automatically. It's like having a magical datacenter that expands and contracts based on your needs, minus the actual magic.
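If your cluster lives on bare metal or plain VMs instead of a cloud with an autoscaler, "adding a node" means joining another agent by hand, the same move we used in Part 2. As a reminder, the server URL and token below are placeholders; the token lives in /var/lib/rancher/k3s/server/node-token on your server node:

# Run on the new worker machine to join it to the existing cluster as an agent
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -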
Cluster Upgrades: Because Staying Current Matters
Upgrading Kubernetes clusters has historically been about as fun as root canal surgery. Fortunately, K3s makes this process significantly less painful with the system upgrade controller.
Automated Upgrades with System Upgrade Controller
First, install the upgrade controller:
kubectl create ns system-upgrade
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml
Now create an upgrade plan. Remember, always upgrade one minor version at a time (from 1.28 to 1.29, not 1.25 to 1.29 in one leap):
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: In
        values:
          - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.29.4+k3s1
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  prepare:
    args:
      - prepare
      - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.29.4+k3s1
Apply this plan, and the controller will orchestrate the upgrade process, cordoning nodes, upgrading them one by one, and uncordoning them when complete. It's like having a very careful and methodical robot perform surgery on your cluster.
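You can watch the robot work: the controller creates an upgrade job per node, and the VERSION column in kubectl get nodes ticks over as each node finishes:

kubectl -n system-upgrade get plans,jobs
kubectl get nodes -w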
Backup and Disaster Recovery: Your Insurance Policy
Murphy's Law states that anything that can go wrong will go wrong, usually at the worst possible moment. This is why backups exist.
etcd Snapshots: Your Cluster's Time Machine
K3s uses etcd as its datastore (if you enabled cluster mode with --cluster-init as we discussed in Part 2). etcd snapshots are like save points in a video game, except the consequences of losing your progress are slightly more serious.
Enable scheduled snapshots by modifying your K3s configuration:
# /etc/rancher/k3s/config.yaml
etcd-snapshot-dir: /var/lib/rancher/k3s/db/snapshots
etcd-snapshot-retention: 10
etcd-snapshot-schedule-cron: "0 4 * * *" # Daily at 4 AM
etcd-snapshot-compress: true
For the truly paranoid (which in production, you should be), configure S3 backups. First, create a secret with your S3 credentials:
apiVersion: v1
kind: Secret
metadata:
  name: etcd-backup-config
  namespace: kube-system
stringData:
  etcd-s3-access-key: ""
  etcd-s3-bucket: "k3s-etcd-backups"
  etcd-s3-endpoint: ""
  etcd-s3-region: "auto"
  etcd-s3-secret-key: ""
Then enable S3 backups in your K3s config:
etcd-s3-config-secret: etcd-backup-config
etcd-s3: true
Restart K3s, and your snapshots will be automatically uploaded to S3. It's like having a digital safe deposit box for your cluster's soul.
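Assuming you installed K3s with the standard script (so it runs as a systemd service), the restart is just:

sudo systemctl restart k3s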
Manual Backup Testing
Create a manual snapshot to test your backup system:
k3s etcd-snapshot save --name manual-backup-$(date +%Y%m%d-%H%M%S)
List your snapshots to verify:
k3s etcd-snapshot ls
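A backup you've never restored is just wishful thinking, so it's worth rehearsing the restore path too. The documented approach is to stop K3s on the server and start it once with a cluster reset pointed at a snapshot; the path below follows the snapshot directory we configured above, and the snapshot name is whatever the ls command showed you:

sudo systemctl stop k3s
sudo k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/db/snapshots/<snapshot-name>
sudo systemctl start k3s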
Performance Monitoring and Alerting
Now that you have monitoring set up, you need to know when things go sideways before your users start complaining (or worse, before they stop complaining and just leave).
Setting Up Alerts in Grafana
Navigate to your Grafana instance and create alert rules for critical metrics:
- High CPU usage (>80% for 5 minutes)
- Memory usage approaching limits (>90%)
- Pod restart loops
- Node unavailability
Configure notification channels to alert via email, Slack, or whatever communication method ensures you'll actually see the alert.
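If you'd rather keep alert definitions in Git than click them together in the Grafana UI, the kube-prometheus-stack also picks up PrometheusRule resources; by default it selects rules labeled with the Helm release name. A hedged sketch of the high-CPU alert from the list above, assuming the release is called prometheus as in our install command:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts
  namespace: monitoring
  labels:
    release: prometheus        # must match the chart's rule selector
spec:
  groups:
    - name: node-resources
      rules:
        - alert: NodeHighCPU
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has been above 80% CPU for 5 minutes"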
Custom Dashboards
While the default dashboards are impressive, you'll want to create custom ones for your specific applications. Import community dashboards or create your own to monitor application-specific metrics.
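The stack's Grafana also runs a sidecar that watches for ConfigMaps carrying a grafana_dashboard label, so dashboards can live in Git alongside your manifests. A sketch that assumes the chart's default sidecar settings; my-dashboard.json stands in for whatever JSON you exported from the Grafana UI:

kubectl create configmap my-app-dashboard \
  --namespace monitoring \
  --from-file=my-dashboard.json
kubectl label configmap my-app-dashboard -n monitoring grafana_dashboard=1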
Load Testing and Performance Validation
Before you declare victory, test your scaling setup with some load. Tools like Apache Bench or hey can simulate traffic:
# Install hey for load testing
go install github.com/rakyll/hey@latest
# Forward the service that fronts your nginx deployment (run this in a separate terminal)
kubectl port-forward svc/nginx-service 8080:80
# Fire 10,000 requests with 100 concurrent workers
hey -n 10000 -c 100 http://localhost:8080
Watch your HPA respond to the increased load by scaling up pods, then scale back down when the load subsides.
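Two extra terminals make the feedback loop visible: one watching the autoscaler make its decisions, the other watching per-pod usage via the metrics-server K3s ships with:

kubectl get hpa nginx-deployment --watch
kubectl top pods -l app=nginx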
Putting It All Together
You now have a K3s cluster that can observe itself, scale automatically, upgrade gracefully, and recover from disasters. It's evolved from the simple single-node setup we created in Part 1 to a sophisticated, production-ready platform.
Your cluster now has:
- Comprehensive monitoring with Prometheus and Grafana
- Log aggregation with Loki
- Automatic pod scaling with HPA
- Automated upgrades with the system upgrade controller
- Reliable backups with etcd snapshots
The journey from "curl | sh" to a fully managed K3s cluster demonstrates the power of starting simple and adding complexity only when needed. Your cluster is now ready for whatever workloads you throw at it, and more importantly, it's ready to tell you when something goes wrong.
Remember, monitoring without action is just expensive data collection. Use these tools not just to admire pretty graphs, but to understand your applications' behavior and improve their reliability. Your future self (especially the one who gets called at 2 AM) will thank you for taking the time to set this up properly.
The best part? Everything we've covered builds on the foundation from Parts 1-4. Your single-node cluster from Part 1 can run this entire monitoring stack. The multi-node setup from Part 2 gives you redundancy. The configuration skills from Part 3 let you tune everything perfectly. And the application deployment knowledge from Part 4 makes this all worthwhile.
Now go forth and monitor responsibly. Your cluster is no longer just running; it's running with style, intelligence, and a comprehensive understanding of its own health. Welcome to the world of truly advanced K3s management.