Part 5 - RKE2 Zero To Hero: Advanced RKE2 Management - Monitoring, Scaling, and Upgrades

Welcome to the final chapter of our RKE2: Zero to Hero journey! If you've been following along faithfully, you've built your first RKE2 cluster in Part 1, scaled it to a multi-node architecture in Part 2, mastered configuration wizardry in Part 3, and successfully deployed production applications in Part 4. Now comes the moment when you transform from someone who can deploy and configure RKE2 clusters to a true operations professional who can maintain, monitor, and evolve these systems in production environments.
This is where the rubber truly meets the road, where your cluster transitions from a carefully crafted digital creation to a living, breathing system that requires constant attention, optimization, and care. Think of it as the difference between building a race car and actually driving it in Formula 1 races. The technical challenges we'll tackle in this post will prepare you to handle the complexities of production Kubernetes environments, from comprehensive monitoring and logging to automated scaling and disaster recovery strategies.
By the end of this post, you'll have transformed from an RKE2 enthusiast into a seasoned cluster administrator capable of maintaining robust, scalable, and highly available Kubernetes environments that can withstand the demands of real-world production workloads.
Implementing Comprehensive Monitoring with Prometheus and Grafana
Modern Kubernetes operations are impossible without sophisticated monitoring capabilities. RKE2 provides excellent integration with industry-standard monitoring solutions, with Prometheus and Grafana forming the backbone of most production monitoring stacks. The combination of these tools provides real-time metrics collection, storage, and visualization that enables proactive cluster management.
Understanding the Monitoring Architecture
RKE2's monitoring architecture leverages the Prometheus Operator pattern, which simplifies the deployment and management of Prometheus instances through Kubernetes-native custom resources. The monitoring stack typically consists of Prometheus for metrics collection and storage, Grafana for visualization and dashboards, Alertmanager for alert routing and notification, and various exporters that collect metrics from different system components.
The rancher-monitoring application can be deployed quickly onto RKE2 clusters, providing a comprehensive monitoring solution powered by Prometheus, Grafana, Alertmanager, the Prometheus Operator, and the Prometheus adapter. This integrated approach ensures that all monitoring components work together seamlessly while following Kubernetes best practices.
Deploying the Monitoring Stack via Rancher
The most straightforward way to implement monitoring in RKE2 is through Rancher's built-in monitoring application. This approach leverages Helm charts and provides a curated set of monitoring components that are pre-configured for Kubernetes environments. The deployment process creates ServiceMonitors and PodMonitors that automatically discover and scrape metrics from cluster components.
To deploy monitoring through Rancher, navigate to your cluster in the Rancher UI and install the rancher-monitoring application from the Apps marketplace. This deployment includes default exporters such as node-exporter for hardware and OS metrics, windows-exporter for Windows hosts, and kube-state-metrics for Kubernetes API object metrics.
Manual Monitoring Stack Deployment
For environments that require more control over the monitoring stack configuration, manual deployment using Helm provides greater flexibility. The process begins with installing the Prometheus Operator, which can be accomplished using the official Helm chart from the prometheus-community repository.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-operator prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
This deployment creates a comprehensive monitoring stack with persistent storage for metrics retention and Grafana dashboards pre-configured for Kubernetes monitoring. The installation includes default ServiceMonitors that automatically discover and scrape metrics from cluster components.
Configuring Advanced Monitoring Features
Production environments often require customized monitoring configurations to meet specific operational requirements. RKE2's monitoring stack supports advanced features such as custom metrics collection, specialized dashboards, and integration with external monitoring systems.
For etcd monitoring specifically, which is crucial for cluster health, you can configure dedicated ServiceMonitors that scrape etcd metrics endpoints. This requires creating custom ServiceMonitor resources that target the etcd pods and expose critical metrics such as leader elections, data synchronization, and storage usage patterns.
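As a sketch of what such a resource can look like, the following ServiceMonitor targets etcd metrics. It assumes etcd-expose-metrics: true is set in the RKE2 configuration (so etcd serves plain-HTTP metrics on port 2381) and that a Service named etcd-metrics selecting the etcd pods exists in kube-system — both the Service and its labels are hypothetical and must match your environment:

```yaml
# Hypothetical ServiceMonitor for RKE2 etcd metrics.
# Assumes: etcd-expose-metrics: true in /etc/rancher/rke2/config.yaml,
# and a Service "etcd-metrics" in kube-system exposing a port named "metrics".
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rke2-etcd
  namespace: monitoring
  labels:
    release: prometheus-operator   # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app: etcd-metrics            # label on the hypothetical etcd metrics Service
  endpoints:
    - port: metrics
      interval: 30s
```

The release label is what ties the ServiceMonitor to a Prometheus instance deployed by kube-prometheus-stack; without it, the metrics are silently never scraped.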
Custom Grafana dashboards can be deployed as ConfigMaps, allowing teams to create specialized visualizations for their specific workloads and infrastructure components. These dashboards can include business-specific metrics alongside infrastructure metrics, providing comprehensive visibility into both system health and application performance.
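A minimal sketch of this pattern: the Grafana sidecar shipped with kube-prometheus-stack watches for ConfigMaps carrying the grafana_dashboard label (the default label key; your chart values may override it) and loads the embedded JSON automatically. The dashboard name and JSON below are placeholders:

```yaml
# Sketch: deliver a custom dashboard as a ConfigMap picked up by the
# Grafana dashboard sidecar. The JSON body here is a trivial placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp.json: |
    {
      "title": "WebApp Overview",
      "panels": []
    }
```

In practice you would export a dashboard's JSON model from the Grafana UI and paste it into the data field rather than writing it by hand.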
Setting Up Centralized Logging with Modern Tools
Centralized logging is essential for troubleshooting, security analysis, and compliance in production Kubernetes environments. RKE2 supports multiple logging solutions, with Loki and Fluentd representing modern, cloud-native approaches to log aggregation and analysis.
Understanding RKE2 Log Architecture
RKE2 generates logs at multiple levels: container logs from running applications, kubelet logs from node operations, and system-level logs from RKE2 components themselves. By default, RKE2 stores container logs locally on each node, with kubelet managing log rotation based on configurable size and retention parameters.
The default kubelet configuration in RKE2 sets container log files to a maximum size of 10Mi with 5 files retained per container. These settings can be customized through the RKE2 configuration file by adding kubelet arguments for container-log-max-files and container-log-max-size.
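For example, a config file overriding both defaults might look like this (the specific values are illustrative — size them to your nodes' disk capacity):

```yaml
# /etc/rancher/rke2/config.yaml — raising the kubelet's log rotation limits
kubelet-arg:
  - "container-log-max-files=10"
  - "container-log-max-size=50Mi"
```

A restart of the rke2-server or rke2-agent service is required for the kubelet to pick up the new arguments.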
Implementing Loki for Log Aggregation
Loki provides a modern approach to log aggregation that complements Prometheus-based monitoring architectures. Unlike traditional logging systems that index log content, Loki indexes only metadata labels, making it more efficient for cloud-native environments where log volume can be substantial.
The Fluentd output plugin for Loki, fluent-plugin-grafana-loki, enables efficient log shipping from RKE2 clusters to Loki instances. This plugin supports both JSON and key-value log formats and uses protobuf compression to minimize network overhead during log transmission.
To deploy Loki in an RKE2 environment, you can use Helm charts that provide comprehensive configuration options:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=100Gi \
  --set promtail.enabled=true \
  --set grafana.enabled=true
Configuring Fluentd for Advanced Log Processing
Fluentd provides powerful log processing capabilities that can transform, filter, and route logs based on complex criteria. In RKE2 environments, Fluentd can be deployed as a DaemonSet to ensure log collection from all cluster nodes.
The Fluentd configuration for Loki integration requires specific output plugin settings that handle authentication, compression, and label management. The plugin supports multi-worker configurations but requires careful handling of timestamp ordering to avoid insertion conflicts.
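A hedged sketch of such an output section follows — the Loki URL, cluster label, and buffer values are assumptions to adapt, and the option names should be verified against the fluent-plugin-grafana-loki README for your plugin version:

```
# Fluentd match block shipping Kubernetes logs to Loki (illustrative values).
<match kubernetes.**>
  @type loki
  url "https://loki.example.internal:3100"     # assumed Loki endpoint
  extra_labels {"cluster":"rke2-prod"}         # static label attached to every stream
  line_format json
  <label>
    namespace $.kubernetes.namespace_name      # promote record fields to Loki labels
    pod $.kubernetes.pod_name
  </label>
  <buffer>
    flush_interval 10s
    retry_type exponential_backoff
    retry_max_times 5
    chunk_limit_size 1m
  </buffer>
</match>
```

Keeping the label set small (namespace, pod, cluster) is deliberate: every distinct label combination creates a new Loki stream, and high-cardinality labels degrade both ingestion and query performance.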
For production deployments, Fluentd configurations should include buffer management, retry logic, and monitoring integration to ensure reliable log delivery. The Docker image grafana/fluent-plugin-loki:main provides a pre-configured Fluentd setup that can be customized through environment variables.
Log Retention and Storage Management
Production logging systems require careful consideration of storage costs and retention policies. RKE2's container log rotation can be configured to balance disk usage with troubleshooting requirements, while centralized logging systems like Loki provide configurable retention periods for different log streams.
Storage management strategies should consider both local node storage for immediate troubleshooting and long-term centralized storage for compliance and historical analysis. Loki's label-based indexing allows for efficient queries across large time ranges while maintaining reasonable storage costs.
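As a hedged example, retention for the loki-stack deployment shown earlier can be configured through chart values like the following — note that older Loki versions (as shipped by loki-stack) enforce retention through the table manager, while newer releases use the compactor, so check your chart's Loki version before applying these keys:

```yaml
# Illustrative loki-stack values fragment for ~30-day retention
# (table-manager style; verify against your Loki version's docs).
loki:
  config:
    table_manager:
      retention_deletes_enabled: true
      retention_period: 720h
    chunk_store_config:
      max_look_back_period: 720h   # queries cannot look past the retention window
```

Aligning max_look_back_period with retention_period avoids queries that reference chunks the table manager has already deleted.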
Configuring Auto-Scaling for Workloads and Infrastructure
Auto-scaling capabilities transform static Kubernetes deployments into dynamic systems that automatically adjust to changing demand patterns. RKE2 supports multiple auto-scaling approaches, from horizontal pod autoscaling for individual workloads to cluster-level node scaling for infrastructure capacity management.
Horizontal Pod Autoscaling (HPA)
HPA automatically adjusts the number of pod replicas based on observed CPU utilization, memory consumption, or custom metrics. RKE2's integration with the Prometheus adapter enables HPA to scale based on any metric collected by Prometheus, providing flexibility beyond basic resource utilization.
Implementing HPA requires proper resource requests and limits in pod specifications, as the autoscaler uses these values to calculate scaling decisions. The metrics server, included by default in RKE2, provides the necessary CPU and memory metrics for basic HPA functionality.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Vertical Pod Autoscaling (VPA)
VPA automatically adjusts resource requests and limits for running containers based on historical usage patterns. This approach optimizes resource allocation without changing the number of replicas, making it particularly useful for workloads with predictable but varying resource requirements.
While VPA is not included by default in RKE2, it can be installed as an additional component. VPA works by analyzing resource usage patterns and recommending or automatically applying resource limit adjustments to improve cluster efficiency.
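A minimal VPA manifest under those assumptions might look like this — it targets the same webapp Deployment used in the HPA example, and it presumes the VPA components (recommender, updater, admission controller) have already been installed from the autoscaler project, since RKE2 does not ship them:

```yaml
# Sketch of a VerticalPodAutoscaler in recommendation-only mode.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Off"          # "Off" only records recommendations; "Auto" applies them
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```

Starting in "Off" mode is a common first step: you can inspect the recommendations with kubectl describe vpa webapp-vpa before letting the updater evict and resize pods automatically. Note also that VPA and HPA should not both act on CPU/memory for the same workload.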
Cluster-Level Auto-Scaling
Cluster auto-scaling automatically adds or removes nodes based on pod scheduling requirements and resource availability. RKE2 supports cluster auto-scaling through integration with cloud provider APIs and the Kubernetes cluster-autoscaler component.
The cluster-autoscaler for RKE2 environments can be configured to work with various cloud providers, including AWS EC2 Auto Scaling Groups, which provide the underlying infrastructure scaling capabilities. The autoscaler monitors for unschedulable pods and scales the cluster accordingly while respecting configured minimum and maximum node counts.
For VMware vSphere environments, cluster auto-scaling requires additional configuration and may not be officially supported by SUSE, so teams often need custom implementations and their own operational tooling. The complexity of vSphere integration usually warrants careful planning and testing before production deployment.
Event-Driven and Scheduled Scaling
Advanced scaling strategies include event-driven autoscaling using KEDA (Kubernetes Event Driven Autoscaler) and scheduled scaling for predictable demand patterns. KEDA enables scaling based on external metrics such as queue lengths, database connections, or custom application metrics.
Scheduled scaling using KEDA's Cron scaler allows workloads to scale proactively based on time patterns, such as reducing resource consumption during off-peak hours. This approach is particularly valuable for development and testing environments where usage patterns are predictable.
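The following ScaledObject sketches that pattern: it scales the webapp Deployment up during weekday business hours and lets it fall back to the minimum overnight. KEDA must already be installed in the cluster, and the timezone, schedule, and replica counts are illustrative:

```yaml
# Sketch: KEDA Cron scaler holding 10 replicas during business hours.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: webapp-business-hours
spec:
  scaleTargetRef:
    name: webapp               # Deployment in the same namespace
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Berlin
        start: 0 8 * * 1-5     # scale up at 08:00, Monday-Friday
        end: 0 18 * * 1-5      # scale down at 18:00
        desiredReplicas: "10"
```

Outside the cron window, KEDA hands control back to the configured minimum, so the workload idles at two replicas overnight and on weekends.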
Performing Cluster Upgrades and Maintenance
Cluster upgrades represent one of the most critical operational tasks in Kubernetes environments, requiring careful planning and execution to maintain system availability. RKE2 provides multiple upgrade strategies ranging from manual procedures to fully automated approaches that minimize operational complexity.
Understanding RKE2 Upgrade Architecture
RKE2's upgrade process is designed to safely transition cluster components to new versions while maintaining cluster operation. During upgrades, RKE2 handles the proper shutdown and replacement of critical components including the Kubernetes control plane, etcd, and other system components.
The upgrade process uses a force-restart mechanism to ensure clean transitions between versions: marker files on disk signal that existing components need cleanup, and RKE2 removes the old static pods for disabled or replaced components before starting their successors.
Manual Upgrade Procedures
Manual upgrades provide maximum control over the upgrade process and serve as the foundation for automated approaches. The manual process involves updating RKE2 binaries on each node and restarting services in a controlled sequence.
For server nodes, which host the Kubernetes control plane components, the upgrade process requires special handling to maintain etcd consistency and API server availability. The upgrade sequence typically begins with a single server node, followed by remaining server nodes, and concludes with agent node upgrades.
Best practices for manual upgrades include taking etcd snapshots before beginning the upgrade process, verifying cluster health between each node upgrade, and maintaining communication with all stakeholders during the maintenance window. The ability to rollback to previous versions provides an important safety mechanism when unexpected issues arise.
Automated Upgrades with System Upgrade Controller
RKE2 supports Kubernetes-native automated upgrades using Rancher's system-upgrade-controller. This approach leverages custom resource definitions called "plans" that define upgrade policies and requirements for different node types.
The system-upgrade-controller runs as a Deployment in the cluster and watches for upgrade plans, automatically scheduling upgrade jobs on nodes that match the specified label selectors. Administrators typically define separate plans for server nodes and agent nodes, ensuring proper sequencing and coordination.
To deploy the system-upgrade-controller:
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
Upgrade plans are defined as custom resources that specify the target RKE2 version, node selection criteria, and upgrade concurrency limits. The controller ensures that upgrades follow Kubernetes version skew policies and provides status reporting throughout the upgrade process.
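A plan for server nodes, following the pattern in the RKE2 automated-upgrades documentation, might look like the sketch below. The version string is a placeholder to be replaced with a real release tag, and the namespace and service account assume the controller's default deployment layout:

```yaml
# Sketch of an upgrade Plan for control plane (server) nodes.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1                     # upgrade one server node at a time
  cordon: true                       # cordon each node before upgrading it
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: In
        values: ["true"]
  serviceAccountName: system-upgrade
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.32.4+rke2r1            # placeholder — use an actual RKE2 release
```

A companion agent plan usually selects the inverse node set and gates on the server plan having completed, so control plane nodes always finish first.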
Upgrade Planning Best Practices
Successful cluster upgrades require comprehensive planning that includes compatibility testing, capacity planning, and rollback procedures. Teams should maintain dedicated testing environments that mirror production configurations for upgrade validation.
Pre-upgrade checks should include reviewing changelogs for Kubernetes, RKE2, and deployed applications to identify potential compatibility issues. Tools like kubent (kube-no-trouble) help identify deprecated APIs that might cause issues after upgrades. Resource capacity planning ensures sufficient cluster resources during rolling upgrade processes.
Upgrade strategies should consider pod disruption budgets, application-specific requirements, and business continuity needs. Staged rollouts can minimize risk by upgrading non-critical clusters first, while rollback procedures provide rapid recovery options when issues arise.
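As a concrete example of the pod disruption budget piece, the manifest below keeps at least two webapp replicas available while nodes are drained during a rolling upgrade; it assumes the pods carry an app: webapp label:

```yaml
# Minimal PodDisruptionBudget protecting the webapp during node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2          # drains block rather than drop below two replicas
  selector:
    matchLabels:
      app: webapp
```

Be careful that minAvailable stays below the deployment's replica count; a PDB that can never be satisfied will stall node drains, and therefore upgrades, indefinitely.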
Implementing Backup and Disaster Recovery Strategies
Robust backup and disaster recovery capabilities are fundamental requirements for production Kubernetes environments. RKE2 provides comprehensive backup features focused on etcd snapshots, which contain the complete cluster state and configuration information necessary for disaster recovery scenarios.
etcd Backup Configuration and Automation
RKE2 includes built-in etcd snapshot capabilities that can be configured for both local and remote storage. By default, RKE2 stores snapshots in the local filesystem at /var/lib/rancher/rke2/server/db/snapshots/, but production environments typically require remote storage for disaster recovery purposes.
S3-compatible storage integration provides reliable remote backup capabilities. The configuration requires adding specific parameters to the RKE2 configuration file:
etcd-s3: true
etcd-s3-endpoint: "s3.amazonaws.com"
etcd-s3-bucket: "rke2-etcd-backups"
etcd-s3-region: "us-west-2"
etcd-s3-folder: "cluster-production"
etcd-snapshot-schedule-cron: "0 2 * * *"
etcd-snapshot-retention: 14
Automated snapshot scheduling ensures regular backups without manual intervention. The cron-based scheduling supports flexible backup frequencies, while retention policies manage storage costs by automatically removing older snapshots.
Monitoring Backup Health and Status
Backup monitoring ensures that snapshot processes execute successfully and backup data remains accessible. RKE2 stores backup metadata in Kubernetes ConfigMaps, allowing administrators to query backup status programmatically.
kubectl get configmap -n kube-system rke2-etcd-snapshots -o yaml
The ConfigMap contains detailed information about each snapshot, including creation timestamps, file sizes, storage locations, and completion status. This metadata enables monitoring systems to alert on backup failures or missing snapshots.
Integration with monitoring stacks allows backup health to be included in overall cluster health dashboards. Custom Prometheus metrics can be created to track backup age, size trends, and success/failure rates.
Disaster Recovery Procedures
Disaster recovery from etcd snapshots involves stopping all RKE2 services, restoring from a known-good snapshot, and restarting the cluster. The process requires careful coordination, especially in multi-node clusters where etcd data must be synchronized across all control plane nodes.
The recovery process begins on the first control plane node with a cluster reset operation that restores from a specified snapshot:
# On the first control plane node: stop the service, restore, then restart
systemctl stop rke2-server
rke2 server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot
systemctl start rke2-server
Remaining control plane nodes require etcd data directory cleanup before rejoining the restored cluster. This process removes inconsistent etcd data and allows nodes to sync from the restored cluster state.
High Availability and Multi-Region Strategies
Production environments often require high availability configurations that span multiple availability zones or regions. RKE2 supports HA deployments with odd numbers of control plane nodes distributed across failure domains.
Multi-region disaster recovery strategies involve maintaining synchronized backup copies across geographic regions and implementing automated failover procedures. These approaches require careful consideration of network latency, data consistency, and regulatory compliance requirements.
Recovery time objectives (RTO) and recovery point objectives (RPO) should guide disaster recovery planning, with backup frequency and restoration procedures designed to meet business requirements. Regular disaster recovery testing validates procedures and identifies potential issues before actual disasters occur.
Advanced Operational Monitoring and Health Checks
Beyond basic monitoring, production RKE2 environments require sophisticated health checking and observability capabilities that provide deep insights into cluster behavior and performance characteristics. These advanced monitoring approaches enable proactive issue detection and resolution before problems impact application availability.
etcd Health and Performance Monitoring
etcd health monitoring is crucial for cluster stability, as etcd serves as the primary data store for all Kubernetes state information. RKE2 provides tools for monitoring etcd health, including endpoint status checks and cluster-wide health validation.
Health check procedures can be executed directly on cluster nodes or through kubectl commands that access etcd containers. These checks provide insights into leader election status, database consistency, and performance metrics that indicate cluster health.
# Direct node access method
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
etcdcontainer=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
/var/lib/rancher/rke2/bin/crictl exec $etcdcontainer sh -c "ETCDCTL_API=3 etcdctl endpoint health --cluster"
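Once etcd metrics are flowing into Prometheus, health checks like these can be codified as alerts. The sketch below uses stock etcd metric names (etcd_server_has_leader, etcd_server_leader_changes_seen_total); the release label and thresholds are assumptions that must match your Prometheus ruleSelector and operational tolerances:

```yaml
# Hedged sketch of PrometheusRule alerts for etcd health.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-health
  namespace: monitoring
  labels:
    release: prometheus-operator   # must match the Prometheus ruleSelector
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd member {{ $labels.instance }} has no leader"
        - alert: EtcdFrequentLeaderElections
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "etcd leader elections are occurring frequently"
```

Frequent leader elections usually point at slow disks or network instability between control plane nodes, so this alert often surfaces infrastructure problems before they become API server outages.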
Custom Metrics and Application-Level Monitoring
Application-specific monitoring extends beyond infrastructure metrics to include business-relevant indicators such as transaction rates, error ratios, and user experience metrics. RKE2's Prometheus integration supports custom metrics collection through application instrumentation and specialized exporters.
ServiceMonitor and PodMonitor resources enable automatic discovery and scraping of application metrics endpoints. This approach scales naturally as new applications are deployed, ensuring comprehensive monitoring coverage without manual configuration updates.
Custom Grafana dashboards can combine infrastructure and application metrics to provide holistic views of system health. These dashboards should include both technical metrics for operations teams and business metrics for stakeholders.
Ingress and Network Monitoring
Network-level monitoring provides insights into traffic patterns, latency characteristics, and potential security issues. RKE2's NGINX Ingress controller can be monitored through specialized ServiceMonitor configurations that expose detailed request metrics.
Enabling ingress monitoring requires updating the rancher-monitoring chart configuration to include ingress-nginx monitoring. This configuration automatically deploys Grafana dashboards specifically designed for NGINX Ingress controller analysis.
The monitoring configuration should account for the ingress controller's deployment namespace, which in RKE2 is typically kube-system rather than a dedicated ingress namespace. This detail is important for proper metric collection and dashboard functionality.
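A values fragment for the rancher-monitoring chart along these lines is sketched below; the ingressNginx keys follow the pattern documented by Rancher, but verify them against your chart version before applying:

```yaml
# rancher-monitoring values fragment enabling NGINX Ingress scraping.
# RKE2 runs its ingress controller in kube-system, not a dedicated namespace.
ingressNginx:
  enabled: true
  namespace: kube-system
```

After upgrading the chart with these values, the bundled NGINX Ingress dashboards in Grafana should begin populating with request-level metrics.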
Preparing for the Future: Your RKE2 Operations Journey
You've now completed an intensive journey through the world of RKE2, transforming from a newcomer taking first steps with single-node installations to a seasoned administrator capable of managing complex, production-ready Kubernetes environments. The skills you've developed throughout this series represent a comprehensive foundation for enterprise Kubernetes operations.
From Part 1's initial cluster deployment through Part 4's application management, you've built the knowledge necessary to deploy, configure, and operate RKE2 clusters with confidence. This final installment has equipped you with the advanced operational capabilities required to maintain these systems in production environments, including comprehensive monitoring, automated scaling, reliable backup strategies, and disaster recovery procedures.
The monitoring and observability capabilities you've implemented provide the visibility necessary to maintain healthy clusters proactively. Your understanding of auto-scaling mechanisms enables dynamic response to changing demand patterns, while your mastery of upgrade procedures ensures that clusters remain secure and up-to-date. The backup and disaster recovery strategies you've learned provide the safety net that production environments require.
These skills position you to tackle the ongoing challenges of Kubernetes operations, from capacity planning and performance optimization to security management and compliance requirements. The foundation you've built through this series provides the basis for continued learning and specialization in areas such as service mesh implementation, advanced security hardening, multi-cluster management, and cloud-native application development.
Your journey from RKE2 zero to hero is complete, but the world of Kubernetes continues to evolve rapidly. The operational excellence you've achieved through this series provides the stable foundation necessary to adapt to new technologies, embrace emerging best practices, and continue growing as a Kubernetes professional. Whether you're managing development environments or enterprise-scale production deployments, the comprehensive knowledge you've gained will serve as your guide through the complexities of modern container orchestration.