Systemd: Zero to Hero – Part 4: Diagnosing Failures and Debugging Like a Pro

Welcome back to our systemd journey! If you've been following along from our foundational introduction in Part 1, mastered service management in Part 2, and conquered unit file creation in Part 3, you're now ready to tackle the inevitable: when things go wrong. As any seasoned system administrator will tell you, systemd services don't always behave as expected, and when they don't, you need the right tools and techniques to diagnose and fix the problems quickly.

In this fourth installment, we'll dive deep into systemd's diagnostic and troubleshooting capabilities. We'll explore the powerful journalctl command for log analysis, master the art of interpreting systemctl status output, and learn to use systemd-analyze for performance profiling and dependency analysis. By the end of this post, you'll be equipped to handle everything from simple service failures to complex boot issues and dependency loops.

The Anatomy of a systemd Failure

Before we dive into the tools, it's important to understand what constitutes a "failure" in systemd terms. Services can fail in numerous ways: they might exit with non-zero status codes, crash due to signals, time out during startup or shutdown, or fail to meet their dependencies. Understanding these failure modes is crucial because each requires a different diagnostic approach.

Systemd tracks service states meticulously, and a failed service will remain in the "failed" state until explicitly reset. This persistent failure state is actually a feature, not a bug, as it allows administrators to investigate what went wrong even after the fact.
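
You can check and clear this state directly; myapp.service below is just a placeholder for whichever unit you're investigating:

# Check whether a unit is currently in the failed state
systemctl is-failed myapp.service

# Clear the failed state once you've finished investigating
systemctl reset-failed myapp.service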

Mastering journalctl: Your Log Analysis Swiss Army Knife

The journalctl command is systemd's primary interface to the journal, a centralized logging system that collects everything from kernel messages to application output. Unlike traditional log files scattered across /var/log, the systemd journal stores data in a structured, binary format that enables powerful querying and filtering capabilities.

Basic journalctl Usage

Let's start with the fundamentals. Running journalctl without arguments displays all journal entries, starting with the oldest:

# View all journal entries
journalctl

# View logs from the current boot session
journalctl -b

The -b option is particularly useful because it filters logs to only show entries from the current boot session. You can also specify previous boots with -b -1 (previous boot), -b -2 (two boots ago), and so on.
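
If you're not sure which boot you need, the journal can list every boot it has recorded:

# List recorded boot sessions with their indices and time ranges
journalctl --list-boots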

Service-Specific Log Analysis

When troubleshooting a specific service, the -u flag is your best friend:

# View logs for a specific service
journalctl -u nginx.service

# Follow logs in real-time (like tail -f)
journalctl -u nginx.service -f

# Show only the last 50 lines
journalctl -u nginx.service -n 50

Here's a real-world example of analyzing a failed web server:

$ journalctl -u httpd.service --since "1 hour ago"
Jun 24 10:30:15 webserver systemd: Starting The Apache HTTP Server...
Jun 24 10:30:15 webserver httpd[1234]: (98)Address already in use: AH00072: make_sock: could not bind to address [::]:80
Jun 24 10:30:15 webserver httpd[1234]: (98)Address already in use: AH00072: make_sock: could not bind to address 0.0.0.0:80
Jun 24 10:30:15 webserver httpd[1234]: no listening sockets available, shutting down
Jun 24 10:30:15 webserver systemd: httpd.service: Main process exited, code=exited, status=1/FAILURE
Jun 24 10:30:15 webserver systemd: httpd.service: Failed with result 'exit-code'.
Jun 24 10:30:15 webserver systemd: Failed to start The Apache HTTP Server.

This log clearly shows that Apache failed to start because port 80 was already in use, a common issue that would be much harder to diagnose without centralized logging.
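
From there, a quick look at what is already bound to port 80 usually identifies the culprit (ss ships with iproute2; lsof works just as well):

# Find the process already listening on port 80
ss -tlnp | grep ':80'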

Advanced Filtering Techniques

The journal's structured nature allows for sophisticated filtering. Each log entry contains metadata fields that you can query directly:

# Show all error-level messages from the current boot
journalctl -b -p err

# Filter by process ID
journalctl _PID=1234

# Filter by executable
journalctl /usr/bin/sshd

# Show all logs from a specific time range
journalctl --since "2024-06-24 09:00:00" --until "2024-06-24 10:00:00"

For debugging complex issues, you can combine multiple filters:

# Show error messages from systemd itself during boot
journalctl -b -p err _COMM=systemd

# Show error-level messages from any .service unit
journalctl -b -p err -u '*.service'

One of the most powerful features is the ability to grep within journal messages using the --grep or -g option (available since systemd 237):

# Search for specific text in log messages
journalctl -u myapp.service --grep "database connection failed"

# Case-insensitive search with regex
journalctl --grep "(?i)memory.*error"

Understanding Log Priority Levels

The journal uses standard syslog priority levels:

  • 0 (emerg): System is unusable
  • 1 (alert): Action must be taken immediately
  • 2 (crit): Critical conditions
  • 3 (err): Error conditions
  • 4 (warning): Warning conditions
  • 5 (notice): Normal but significant condition
  • 6 (info): Informational messages
  • 7 (debug): Debug-level messages

When you specify -p err, you get all messages at error level and above (critical, alert, emergency). This is invaluable for quickly identifying serious problems.
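
Priority ranges are also accepted, which is handy when you want, say, warnings and errors but not the rarer, more severe levels; a numeric range is the least ambiguous way to write it:

# Show only error (3) and warning (4) level messages from the current boot
journalctl -b -p 3..4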

Decoding systemctl status: Reading Service Health at a Glance

While journalctl gives you detailed logs, systemctl status provides a comprehensive overview of a service's current state. Let's dissect what this command tells us:

$ systemctl status postgresql.service
● postgresql.service - PostgreSQL database server
   Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2024-06-24 09:15:23 UTC; 2h 45min ago
     Docs: man:postgres(1)
 Main PID: 1847 (postgres)
    Tasks: 8 (limit: 4915)
   Memory: 23.5M
   CGroup: /system.slice/postgresql.service
           ├─1847 /usr/bin/postgres -D /var/lib/pgsql/data
           ├─1849 postgres: checkpointer
           ├─1850 postgres: background writer
           ├─1851 postgres: walwriter
           ├─1852 postgres: autovacuum launcher
           ├─1853 postgres: archiver
           ├─1854 postgres: stats collector
           └─1855 postgres: logical replication launcher

Jun 24 09:15:23 dbserver systemd: Starting PostgreSQL database server...
Jun 24 09:15:23 dbserver postgres[1847]: LOG:  redirecting log output to logging collector process
Jun 24 09:15:23 dbserver systemd: Started PostgreSQL database server.

Let's break down each section:

Loaded line: Shows where the unit file is located, whether it's enabled to start at boot, and the vendor preset. The "enabled" status means this service will start automatically on boot.

Active line: The most critical information! This shows the current state (active/inactive/failed), the sub-state (running/dead/failed), when it started, and how long it's been running.

Main PID: The process ID of the main service process. This is crucial for tracking down issues and connecting log entries to the right process.

Tasks and Memory: Resource usage information, helpful for identifying resource exhaustion issues.

CGroup: Shows all processes belonging to this service, invaluable for understanding what's actually running.

Recent logs: The last few log entries, giving you immediate context without needing to run journalctl separately.
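
If you need one of these values in a script rather than on screen, systemctl show can print individual properties in key=value form:

# Print selected properties for scripting
systemctl show postgresql.service -p ActiveState,SubState,MainPID,MemoryCurrent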

Interpreting Failure States

When a service fails, systemctl status becomes your first diagnostic tool:

$ systemctl status myapp.service
● myapp.service - My Custom Application
   Loaded: loaded (/etc/systemd/system/myapp.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2024-06-24 11:30:45 UTC; 5min ago
  Process: 2847 (code=exited, status=203/EXEC)
 Main PID: 2847 (code=exited, status=203/EXEC)

Jun 24 11:30:45 server systemd: Starting My Custom Application...
Jun 24 11:30:45 server systemd: myapp.service: Main process exited, code=exited, status=203/EXEC
Jun 24 11:30:45 server systemd: myapp.service: Failed with result 'exit-code'.
Jun 24 11:30:45 server systemd: Failed to start My Custom Application.

The exit code 203/EXEC indicates that systemd couldn't execute the specified command, often because the executable doesn't exist or isn't marked as executable. Exit codes in the 200-242 range are reserved by systemd for its own errors; other values come from the application itself.
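
A quick way to confirm a 203/EXEC diagnosis is to compare what the unit actually tries to run with what exists on disk (the path below is illustrative; take it from the ExecStart= line of your own unit):

# Show the unit file exactly as systemd sees it, including drop-ins
systemctl cat myapp.service

# Verify the executable exists and has the execute bit set
ls -l /usr/local/bin/myapp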

Using systemctl for Dependency Analysis

Services often fail due to dependency issues. The systemctl command provides several ways to analyze dependencies:

# Show the units this service depends on
systemctl list-dependencies myapp.service

# Show reverse dependencies (what depends on this service)
systemctl list-dependencies --reverse myapp.service

# Show only failed units
systemctl --failed

Here's an example of a dependency failure:

$ systemctl status myapp.service
● myapp.service - My Database Application
   Loaded: loaded (/etc/systemd/system/myapp.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit-hit) since Wed 2024-06-24 12:00:00 UTC; 2min ago
  Process: 3456 (code=exited, status=1/FAILURE)
 Main PID: 3456 (code=exited, status=1/FAILURE)

Jun 24 12:00:00 server systemd: myapp.service: Scheduled restart job, restart counter is at 5.
Jun 24 12:00:00 server systemd: Stopped My Database Application.
Jun 24 12:00:00 server systemd: myapp.service: Start request repeated too quickly.
Jun 24 12:00:00 server systemd: myapp.service: Failed with result 'start-limit-hit'.
Jun 24 12:00:00 server systemd: Failed to start My Database Application.

The "start-limit-hit" result indicates that the service failed repeatedly and systemd stopped trying to restart it . This often happens when a service has a dependency that isn't available.

systemd-analyze: Performance Profiling and Boot Analysis

The systemd-analyze command is your go-to tool for understanding system performance and boot behavior. It provides insights that are crucial for optimization and troubleshooting slow boots or service startup issues.

Boot Time Analysis

The most basic use of systemd-analyze shows overall boot performance:

$ systemd-analyze
Startup finished in 2.347s (firmware) + 1.892s (loader) + 1.456s (kernel) + 3.201s (initrd) + 8.394s (userspace) = 17.290s
graphical.target reached after 8.394s in userspace

This breakdown helps identify where boot time is being spent. If userspace takes unusually long, you likely have a service startup issue.

Identifying Slow Services

The blame command shows which services took the longest to start:

$ systemd-analyze blame
     2.234s postgresql.service
     1.892s NetworkManager.service
     1.456s systemd-udev-settle.service
     0.987s firewalld.service
     0.654s httpd.service
     0.432s sshd.service

While this output is useful, remember that services start in parallel, so the slowest service isn't necessarily the bottleneck. For that, you need the critical-chain analysis.

Critical Path Analysis

The critical-chain command shows the dependency chain that determines total boot time:

$ systemd-analyze critical-chain
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

graphical.target @8.394s
└─multi-user.target @8.393s
  └─postgresql.service @6.159s +2.234s
    └─network-online.target @6.158s
      └─NetworkManager-wait-online.service @4.266s +1.892s
        └─NetworkManager.service @2.374s +1.892s
          └─dbus.service @2.373s
            └─basic.target @2.372s

This shows that PostgreSQL was the bottleneck, taking 2.234 seconds to start and blocking the critical path to the graphical target.

Visualizing Boot Process

For complex boot issues, the plot command generates an SVG timeline:

# Generate boot timeline visualization
systemd-analyze plot > boot.svg

This creates a graphical timeline showing all services, their dependencies, and startup times. It's particularly useful for understanding complex dependency relationships and identifying unexpected delays.

Verifying Unit Files

The verify command checks unit files for syntax errors and common misconfigurations:

# Verify a specific unit file
systemd-analyze verify /etc/systemd/system/myapp.service

# Verify several unit files at once
systemd-analyze verify /etc/systemd/system/*.service

This is invaluable during development and can catch issues before they cause runtime failures.

Debugging Dependency Loops and Ordering Cycles

One of the most frustrating systemd issues is dependency loops, where services have circular dependencies. When this happens, systemd will break the loop arbitrarily and continue booting, but the results are unpredictable.

Identifying Dependency Loops

Dependency loops typically manifest as warnings in the journal:

$ journalctl -b | grep -i "breaking.*cycle"
Jun 24 10:15:23 server systemd: Breaking ordering cycle by deleting job myapp.service/start
Jun 24 10:15:23 server systemd: Job myapp.service/start deleted to break ordering cycle starting with myapp.service/start

When you see these messages, you have a dependency loop that needs investigation.

Analyzing Service Dependencies

To debug dependency loops, start by examining the problematic service's dependencies:

# Show all dependencies
systemctl show myapp.service | grep -E "(Requires|Wants|After|Before)"

# Visualize dependencies
systemd-analyze dot myapp.service | dot -Tsvg > myapp-deps.svg

The dot command creates a dependency graph that can reveal circular relationships that aren't obvious from the unit files alone.

Resolving Common Dependency Issues

Most dependency loops occur due to incorrect use of After and Before directives. Here's a common problematic pattern:

# Service A
[Unit]
After=serviceB.service

# Service B
[Unit]  
After=serviceA.service

The fix is usually to remove one of the ordering dependencies or restructure the services to eliminate the circular relationship.
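
A minimal corrected version keeps the ordering in one direction only (serviceA and serviceB stand in for your own units):

# Service A: starts after Service B and pulls it in if needed
[Unit]
Wants=serviceB.service
After=serviceB.service

# Service B: no After= or Before= pointing back at Service A
[Unit]
Description=Service B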

Recovery Techniques and Emergency Debugging

Sometimes systemd issues are so severe that the system won't boot properly. Knowing how to recover from these situations is crucial for any system administrator.

Emergency and Rescue Targets

Systemd provides special targets for emergency situations:

Emergency Target: Provides a minimal environment with the root filesystem mounted read-only. Access it by adding systemd.unit=emergency.target to the kernel command line.

Rescue Target: A more complete environment with local filesystems mounted and basic services started. Access it with systemd.unit=rescue.target.

Both targets require the root password and provide a shell for recovery operations.
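
On a system that is already up, you can switch into these targets without rebooting; be aware that this stops most running services:

# Drop the running system into rescue mode
systemctl isolate rescue.target

# Return to the normal multi-user state afterwards
systemctl isolate multi-user.target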

Boot Debugging with Early Debug Shell

For boot issues, enable the debug shell to get early access:

# Enable debug shell (requires reboot)
systemctl enable debug-shell.service

# Access debug shell on TTY9 (Ctrl+Alt+F9)
# Remember to disable after debugging!
systemctl disable debug-shell.service

The debug shell provides an unauthenticated root shell early in the boot process, so disable it immediately after use.

Capturing Boot State Information

When services fail mysteriously during boot but work fine after login, you need to capture the boot-time environment. Create a debug service that runs before your problematic service:

[Unit]
Description=Boot State Capture
Before=myapp.service
DefaultDependencies=no

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'date > /tmp/boot-debug.log; mount >> /tmp/boot-debug.log; ps aux >> /tmp/boot-debug.log'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

This captures critical system state information that you can analyze after boot.
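
Assuming you saved the unit above as /etc/systemd/system/boot-debug.service (the name is up to you), register and enable it so it runs on the next boot:

# Make systemd aware of the new unit and enable it
systemctl daemon-reload
systemctl enable boot-debug.service

# Disable it again once you've collected what you need
systemctl disable boot-debug.service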

Core Dump Analysis with systemd-coredump

When services crash due to segmentation faults or other fatal signals, systemd-coredump automatically captures core dumps for analysis. This is invaluable for debugging application crashes.

Viewing Available Core Dumps

Use coredumpctl to manage core dumps:

# List all core dumps
coredumpctl list

# Show details of a specific dump
coredumpctl info PID

# Start GDB session with a core dump
coredumpctl debug PID

Core Dump Storage and Cleanup

Core dumps are stored in /var/lib/systemd/coredump/ and automatically cleaned up after 3 days. For sensitive applications, you can configure systemd to log crashes without storing the actual core dump.
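
This behavior is controlled by /etc/systemd/coredump.conf; a minimal sketch of the "log but don't store" configuration mentioned above looks like this:

# /etc/systemd/coredump.conf
[Coredump]
# Record crash metadata in the journal but discard the core image itself
Storage=none

# Alternatively, keep cores on disk but cap the space they may use
#Storage=external
#MaxUse=1G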

Advanced Log Analysis Techniques

Structured Logging and Metadata

The systemd journal stores more than just text messages. Each entry includes extensive metadata that enables powerful analysis:

# Show all available metadata fields
journalctl -N

# Show all values for a specific field
journalctl -F _SYSTEMD_UNIT

# Query by multiple metadata fields
journalctl _SYSTEMD_UNIT=sshd.service _PID=1234
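
To see every metadata field attached to the entries themselves, switch the output format:

# Show entries together with all of their metadata fields
journalctl -u sshd.service -o verbose

# Machine-readable output for further processing
journalctl -u sshd.service -o json-pretty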

Time-Based Analysis

For performance analysis and incident response, precise time filtering is crucial:

# Show logs from the last hour
journalctl --since "1 hour ago"

# Show logs from a specific time window  
journalctl --since "2024-06-24 10:00:00" --until "2024-06-24 11:00:00"

# Show logs from yesterday
journalctl --since yesterday --until today

Monitoring Logs in Real-Time

For troubleshooting active issues, real-time log monitoring is essential:

# Follow all logs in real-time
journalctl -f

# Follow specific service with context
journalctl -u myapp.service -f --lines=50

# Follow only error messages
journalctl -f -p err

Performance Monitoring and Resource Analysis

Memory and Resource Tracking

Systemd tracks resource usage for each service, which is invaluable for debugging performance issues:

# Show current resource usage
systemctl status myapp.service

# Show detailed resource information
systemctl show myapp.service | grep -E "(Memory|Tasks|CPU)"
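
For a live, top-like view of resource consumption per unit, systemd also ships systemd-cgtop:

# Continuously display CPU, memory, and task usage per control group
systemd-cgtop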

Watchdog and Restart Policies

For services that occasionally hang or crash, proper restart configuration is crucial:

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5
WatchdogSec=30

This configuration restarts the service on failure, but only up to 5 times within 300 seconds, and enables a 30-second watchdog timer. (On current systemd versions, StartLimitIntervalSec= and StartLimitBurst= live in the [Unit] section, while the restart and watchdog settings belong in [Service].) The watchdog only makes sense if your application sends periodic keep-alive notifications; when it does, systemd can distinguish a genuinely hung process from one that is simply busy with a legitimate long-running operation, avoiding unnecessary restarts.
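
If your service is a shell script, a rough sketch of watchdog support using systemd-notify might look like the following (unit and script names are hypothetical). Real daemons would normally call sd_notify() from libsystemd instead; NotifyAccess=all is needed here because systemd-notify runs as a short-lived child process, which also makes this approach somewhat racy:

# Unit fragment for a hypothetical myapp.service
[Service]
Type=notify
NotifyAccess=all
WatchdogSec=30
ExecStart=/usr/local/bin/myapp-loop.sh

# /usr/local/bin/myapp-loop.sh
#!/bin/bash
systemd-notify --ready                # tell systemd that startup is complete
while true; do
    perform_work_chunk                # placeholder for the real work
    systemd-notify WATCHDOG=1         # keep-alive; must arrive within WatchdogSec
    sleep 10
done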

Troubleshooting Socket Activation

Socket activation is a powerful systemd feature, but it can be tricky to debug when it doesn't work as expected. Here's how to test socket activation manually:

# Test socket activation for a service
systemd-socket-activate -l /tmp/test.sock --inetd /usr/bin/myapp

# Check socket status
systemctl status myapp.socket

# Monitor socket activation
journalctl -u myapp.socket -f

Socket activation failures often occur due to permission issues, incorrect socket configuration, or applications that don't properly handle socket activation.
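
As a reference point, a minimal socket-activated pair looks roughly like this (names, port, and path are illustrative). By default the .socket and .service must share the same base name unless Service= is set explicitly:

# /etc/systemd/system/myapp.socket
[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target

# /etc/systemd/system/myapp.service
[Service]
# The application must accept the listening socket systemd passes in (sd_listen_fds)
ExecStart=/usr/local/bin/myapp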

Mount Point and Filesystem Debugging

Failed mounts can cause cascade failures in dependent services. Use these techniques to identify and resolve mount issues:

# List failed mount units
systemctl --failed --type=mount

# Check mount dependencies
systemctl list-dependencies local-fs.target

# Retry failed mounts
systemctl daemon-reload
systemctl restart local-fs.target

For encrypted filesystems that weren't available during boot, failed mount units often need manual restart after the underlying filesystem becomes available.
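
Mount units are named after their mount point using systemd's escaping rules, which is a common stumbling block; systemd-escape tells you the exact unit name to restart (the path below is just an example):

# Translate a mount point into its mount unit name
systemd-escape -p --suffix=mount /mnt/encrypted-data
# -> mnt-encrypted\x2ddata.mount

# Then retry just that mount
systemctl restart 'mnt-encrypted\x2ddata.mount'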

Tips for Effective Debugging

Environment and Configuration Issues

Many service failures are due to environment differences between interactive and systemd execution. Common issues include:

  • Missing environment variables
  • Different PATH settings
  • Permission problems
  • Working directory assumptions

Use systemctl show to examine the actual runtime environment:

systemctl show myapp.service | grep -E "(Environment|ExecStart|User|Group|WorkingDirectory)"
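
When the environment really is the culprit, a drop-in override is usually the cleanest fix; systemctl edit creates one for you (the values below are examples):

# Opens an editor on /etc/systemd/system/myapp.service.d/override.conf
systemctl edit myapp.service

# Example override contents
[Service]
Environment=PATH=/usr/local/bin:/usr/bin:/bin
WorkingDirectory=/opt/myapp

# Apply the change
systemctl restart myapp.service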

Exit Codes and Signal Handling

Understanding exit codes is crucial for debugging. Systemd uses exit codes 200-242 for its own errors:

  • 200: EXIT_CHDIR (couldn't change directory)
  • 201: EXIT_NICE (couldn't set nice level)
  • 202: EXIT_FDS (couldn't set up file descriptors)
  • 203: EXIT_EXEC (couldn't execute command)

Application exit codes (0-255) follow standard conventions, but check your application's documentation for specific meanings.

Logging Best Practices

For your own services, implement structured logging that works well with the systemd journal:

# Use logger for shell scripts
logger -t myapp "Database connection successful"

# Use systemd-specific log levels in applications
echo "Critical error occurred" | systemd-cat -t myapp -p crit

This ensures your logs integrate seamlessly with systemd's logging infrastructure and are easily filtered and analyzed.
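
Logs sent this way can be pulled back out later with the same identifier:

# Read back everything tagged with the myapp identifier
journalctl -t myapp

# Only the critical entries
journalctl -t myapp -p crit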

Looking Ahead

We've covered the essential debugging and diagnostic techniques that will serve you well in most systemd troubleshooting scenarios. The combination of journalctl for detailed log analysis, systemctl status for quick health checks, and systemd-analyze for performance insights gives you a powerful toolkit for maintaining healthy systems.

In our final part of this series, Part 5, we'll explore systemd's advanced features including socket activation, transient units, security sandboxing, and resource management with cgroups. These advanced capabilities can help prevent many of the issues we've discussed here by building more robust and secure services from the ground up.

Remember, effective systemd debugging is as much about prevention as it is about problem-solving. Well-designed unit files with appropriate dependencies, timeouts, and restart policies will save you countless hours of troubleshooting. And when problems do occur, the systematic approach we've outlined here will help you identify and resolve issues quickly and efficiently.

The journey from systemd novice to expert requires practice, but with these diagnostic tools in your arsenal, you're well-equipped to handle whatever systemd throws at you. Whether it's a simple service startup failure or a complex boot-time dependency issue, you now have the knowledge and techniques to debug like a pro.