Part 5 - K3s Zero to Hero: Advanced K3s Management - Monitoring, Scaling, and Upgrades

If you've been following along with our K3s journey, congratulations! You've gone from absolute beginner in Part 1 to deploying real applications in Part 4. Now it's time to level up from "it works on my machine" to "it works reliably at 3 AM when something inevitably breaks." Welcome to the world of advanced K3s management, where we'll transform your cluster from a digital pet into production-grade cattle.
The Observability Trinity: Prometheus, Grafana, and Loki
Remember when you first installed K3s with that magical one-liner, curl -sfL https://get.k3s.io | sh -? Those were simpler times. Now we need to actually see what's happening inside our cluster, preferably before things catch fire.
Setting Up Prometheus: Your Cluster's Health Monitor
Prometheus is like having a very obsessive friend who remembers every detail about your cluster's performance. It collects metrics from everywhere and stores them with timestamps, creating a treasure trove of data that would make any data scientist weep with joy.
Let's install the monitoring stack using Helm, which we hopefully covered in Part 4. If you don't have Helm installed yet, you can grab it quickly:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Now, create a monitoring namespace and install the Prometheus stack:
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
This command installs what I like to call the "holy trinity" of observability: Prometheus for metrics collection, Grafana for visualization, and AlertManager for shouting at you when things go wrong.
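The chart spins up a small army of pods (operators, exporters, kube-state-metrics), so give it a minute and then confirm everything came up before moving on:

kubectl get pods -n monitoring
kubectl get svc -n monitoring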
Grafana: Making Data Beautiful Again
Once installed, you'll want to access Grafana's web interface. Remember how we set up Ingress in Part 4? This is where that knowledge pays off. But for quick access, let's use port forwarding:
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Navigate to http://localhost:3000 and log in with the default credentials (admin/prom-operator). The first thing you'll notice is that Grafana comes pre-loaded with dashboards that make your cluster look like a NASA mission control center.
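For something more permanent than a port-forward, you can put Grafana behind the Traefik Ingress controller that K3s ships with, exactly the trick from Part 4. A minimal sketch, assuming grafana.example.com is a hostname you control and that the Helm release above created the prometheus-grafana service on port 80:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
    - host: grafana.example.com        # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-grafana   # service created by the Helm release
                port:
                  number: 80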
Adding Loki for Log Aggregation
Metrics are great, but sometimes you need to see the actual logs to understand why your application decided to take a coffee break. Loki is like Prometheus's cousin who specializes in collecting and storing logs without breaking the bank.
Install Loki using the simplest method available:
kubectl create namespace logging
# Using arkade for easy installation
curl -sLS https://get.arkade.dev | sudo sh
arkade install loki -n logging --persistence
This installs Loki with 10GB of persistent storage and Promtail DaemonSets that run on every node, silently collecting logs like digital vacuum cleaners.
Verify everything is running:
kubectl get pods -n logging
You should see Promtail pods running on each node and a Loki pod storing all those precious logs.
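To prove Loki is actually swallowing logs and not just sitting there, port-forward to its API and ask which log labels it has indexed. A quick sketch, assuming the chart exposed a service called loki on port 3100 in the logging namespace (names can vary slightly between chart versions):

kubectl port-forward -n logging svc/loki 3100:3100
# In a second terminal: list the labels Loki has seen so far
curl -s "http://localhost:3100/loki/api/v1/labels"

When you add Loki as a data source in Grafana, the in-cluster URL follows the same pattern: http://loki.logging.svc.cluster.local:3100 (adjust if your service name differs).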
Scaling: When Your Cluster Needs to Grow (Or Shrink)
Scaling in Kubernetes comes in two flavors: horizontal (more pods) and vertical (bigger pods). It's like choosing between hiring more people or giving your existing team more coffee.
Horizontal Pod Autoscaling: The Smart Way to Scale
The Horizontal Pod Autoscaler (HPA) is Kubernetes' answer to the age-old question: "How many pods do I actually need?" It automatically adjusts the number of pod replicas based on CPU utilization, memory usage, or custom metrics.
Here's how to set up HPA for a deployment. First, ensure your deployment has resource requests defined:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
          ports:
            - containerPort: 80
Now create an HPA that scales based on CPU usage:
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=2 --max=10
This tells Kubernetes: "Keep CPU usage around 50% by scaling between 2 and 10 pods." It's like having a very attentive waiter who notices when you're running low on computing power.
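If you'd rather keep that decision in version control than in a one-off command, the equivalent autoscaling/v2 manifest looks roughly like this. K3s bundles metrics-server out of the box, so the CPU metrics the HPA needs should already be flowing:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50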
Cluster Autoscaling: When You Need More Nodes
Sometimes you need more than just additional pods; you need entire new nodes. Cluster autoscaling automatically adds or removes nodes based on resource demands.
For cloud providers, you can configure node autoscaling in your cluster configuration. Here's an example for a hypothetical cloud setup:
worker_node_pools:
  - name: autoscaled-workers
    instance_type: medium
    instance_count: 2
    location: datacenter-1
    autoscaling:
      enabled: true
      min_instances: 1
      max_instances: 5
This configuration allows your cluster to scale from 1 to 5 worker nodes automatically. It's like having a magical datacenter that expands and contracts based on your needs, minus the actual magic.
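If your cluster lives on bare metal or plain VMs instead of a cloud with an autoscaler, "adding a node" means joining another agent by hand, the same move we used in Part 2. As a reminder, the server URL and token below are placeholders; the token lives in /var/lib/rancher/k3s/server/node-token on your server node:

# Run on the new worker machine to join it to the existing cluster as an agent
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -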
Cluster Upgrades: Because Staying Current Matters
Upgrading Kubernetes clusters has historically been about as fun as root canal surgery. Fortunately, K3s makes this process significantly less painful with the system upgrade controller.
Automated Upgrades with System Upgrade Controller
First, install the upgrade controller:
kubectl create ns system-upgrade
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml
Now create an upgrade plan. Remember, always upgrade one minor version at a time (from 1.28 to 1.29, not 1.25 to 1.29 in one leap):
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: In
        values:
          - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.29.4+k3s1
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  prepare:
    args:
      - prepare
      - server-plan
    image: rancher/k3s-upgrade
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.29.4+k3s1
Apply this plan, and the controller will orchestrate the upgrade process, cordoning nodes, upgrading them one by one, and uncordoning them when complete. It's like having a very careful and methodical robot perform surgery on your cluster.
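You can watch the robot work: the controller creates an upgrade job per node, and the VERSION column in kubectl get nodes ticks over as each node finishes:

kubectl -n system-upgrade get plans,jobs
kubectl get nodes -w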
Backup and Disaster Recovery: Your Insurance Policy
Murphy's Law states that anything that can go wrong will go wrong, usually at the worst possible moment. This is why backups exist.
etcd Snapshots: Your Cluster's Time Machine
K3s uses etcd as its datastore (if you enabled cluster mode with --cluster-init as we discussed in Part 2). etcd snapshots are like save points in a video game, except the consequences of losing your progress are slightly more serious.
Enable scheduled snapshots by modifying your K3s configuration:
# /etc/rancher/k3s/config.yaml
etcd-snapshot-dir: /var/lib/rancher/k3s/db/snapshots
etcd-snapshot-retention: 10
etcd-snapshot-schedule-cron: "0 4 * * *" # Daily at 4 AM
etcd-snapshot-compress: true
For the truly paranoid (which in production, you should be), configure S3 backups. First, create a secret with your S3 credentials:
apiVersion: v1
kind: Secret
metadata:
  name: etcd-backup-config
  namespace: kube-system
stringData:
  etcd-s3-access-key: ""
  etcd-s3-bucket: "k3s-etcd-backups"
  etcd-s3-endpoint: ""
  etcd-s3-region: "auto"
  etcd-s3-secret-key: ""
Then enable S3 backups in your K3s config:
etcd-s3-config-secret: etcd-backup-config
etcd-s3: true
Restart K3s, and your snapshots will be automatically uploaded to S3. It's like having a digital safe deposit box for your cluster's soul.
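Assuming you installed K3s with the standard script (so it runs as a systemd service), the restart is just:

sudo systemctl restart k3s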
Manual Backup Testing
Create a manual snapshot to test your backup system:
k3s etcd-snapshot save --name manual-backup-$(date +%Y%m%d-%H%M%S)
List your snapshots to verify:
k3s etcd-snapshot ls
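A backup you've never restored is just wishful thinking, so it's worth rehearsing the restore path too. The documented approach is to stop K3s on the server and start it once with a cluster reset pointed at a snapshot; the path below follows the snapshot directory we configured above, and the snapshot name is whatever the ls command showed you:

sudo systemctl stop k3s
sudo k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/db/snapshots/<snapshot-name>
sudo systemctl start k3s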
Performance Monitoring and Alerting
Now that you have monitoring set up, you need to know when things go sideways before your users start complaining (or worse, before they stop complaining and just leave).
Setting Up Alerts in Grafana
Navigate to your Grafana instance and create alert rules for critical metrics:
- High CPU usage (>80% for 5 minutes)
- Memory usage approaching limits (>90%)
- Pod restart loops
- Node unavailability
Configure notification channels to alert via email, Slack, or whatever communication method ensures you'll actually see the alert.
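If you'd rather keep alert definitions in Git than click them together in the Grafana UI, the kube-prometheus-stack also picks up PrometheusRule resources; by default it selects rules labeled with the Helm release name. A hedged sketch of the high-CPU alert from the list above, assuming the release is called prometheus as in our install command:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts
  namespace: monitoring
  labels:
    release: prometheus        # must match the chart's rule selector
spec:
  groups:
    - name: node-resources
      rules:
        - alert: NodeHighCPU
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has been above 80% CPU for 5 minutes"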
Custom Dashboards
While the default dashboards are impressive, you'll want to create custom ones for your specific applications. Import community dashboards or create your own to monitor application-specific metrics.
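The stack's Grafana also runs a sidecar that watches for ConfigMaps carrying a grafana_dashboard label, so dashboards can live in Git alongside your manifests. A sketch that assumes the chart's default sidecar settings; my-dashboard.json stands in for whatever JSON you exported from the Grafana UI:

kubectl create configmap my-app-dashboard \
  --namespace monitoring \
  --from-file=my-dashboard.json
kubectl label configmap my-app-dashboard -n monitoring grafana_dashboard=1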
Load Testing and Performance Validation
Before you declare victory, test your scaling setup with some load. Tools like Apache Bench or hey can simulate traffic:
# Install hey for load testing
go install github.com/rakyll/hey@latest
# Forward the service that fronts your nginx deployment (run this in a separate terminal)
kubectl port-forward svc/nginx-service 8080:80
# Fire 10,000 requests with 100 concurrent workers
hey -n 10000 -c 100 http://localhost:8080
Watch your HPA respond to the increased load by scaling up pods, then scale back down when the load subsides.
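Two extra terminals make the feedback loop visible: one watching the autoscaler make its decisions, the other watching per-pod usage via the metrics-server K3s ships with:

kubectl get hpa nginx-deployment --watch
kubectl top pods -l app=nginx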
Putting It All Together
You now have a K3s cluster that can observe itself, scale automatically, upgrade gracefully, and recover from disasters. It's evolved from the simple single-node setup we created in Part 1 to a sophisticated, production-ready platform.
Your cluster now has:
- Comprehensive monitoring with Prometheus and Grafana
- Log aggregation with Loki
- Automatic pod scaling with HPA
- Automated upgrades with the system upgrade controller
- Reliable backups with etcd snapshots
The journey from "curl | sh" to a fully managed K3s cluster demonstrates the power of starting simple and adding complexity only when needed. Your cluster is now ready for whatever workloads you throw at it, and more importantly, it's ready to tell you when something goes wrong.
Remember, monitoring without action is just expensive data collection. Use these tools not just to admire pretty graphs, but to understand your applications' behavior and improve their reliability. Your future self (especially the one who gets called at 2 AM) will thank you for taking the time to set this up properly.
The best part? Everything we've covered builds on the foundation from Parts 1-4. Your single-node cluster from Part 1 can run this entire monitoring stack. The multi-node setup from Part 2 gives you redundancy. The configuration skills from Part 3 let you tune everything perfectly. And the application deployment knowledge from Part 4 makes this all worthwhile.
Now go forth and monitor responsibly. Your cluster is no longer just running; it's running with style, intelligence, and a comprehensive understanding of its own health. Welcome to the world of truly advanced K3s management.