Alok Sinha | DevOps Engineer

Debugging Linux Processes Like an SRE Pro

  • Alok Sinha
  • February 6, 2025

Linux Process Stopped Suddenly? Here’s How to Debug It Like an SRE Pro! 🚨

Have you ever faced a situation where a critical Linux process performing computations and writing to disk just… stopped? As an AWS DevOps & SRE expert, I’ve encountered this in production systems, where troubleshooting quickly is crucial. Here’s my step-by-step approach to diagnosing and resolving such incidents:

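🔍 Step 1: Is the Process Still Running?

  • Check if it crashed:
      • ps aux | grep process_name – The grep command itself also appears in this output, so ignore that line.
      • pgrep -fl process_name – Confirm whether a matching process still exists; -f matches against the full command line.
  • Look for system messages:
      • dmesg -T | tail -50 – Check for segmentation faults or OOM kills.

💡Ask: If the process is missing, why did it stop?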
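When you already know the PID (for example from a PID file), a quick liveness check can be scripted. The sketch below is only an illustration: the function names pid_alive and check_pid are hypothetical, and it relies on the standard behaviour of signal 0, which tests for a process without delivering anything to it.

```shell
# Return success if the PID exists and we are allowed to signal it.
# Signal 0 delivers nothing; it only performs the existence/permission check.
# Caveat: this also fails for a live process owned by another user,
# so run it as the process owner or as root.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Print a human-readable verdict for one PID.
check_pid() {
    if pid_alive "$1"; then
        echo "PID $1 is running"
    else
        echo "PID $1 is not running"
    fi
}
```

Usage: check_pid 4321 (the PID here is a placeholder for your own process).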

🔍 Step 2: Check System Logs for Clues

  • Inspect recent logs:
      • journalctl -xe --no-pager -n 50
      • tail -f /var/log/syslog – On RHEL-family systems the equivalent file is usually /var/log/messages.

💡Check for: SIGKILL or SIGTERM. Was the process stopped manually or killed by the system?
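If the process was launched from a shell or a wrapper script, its exit status is another clue: shells report death by signal N as status 128 + N. Here is a minimal sketch of decoding that; describe_exit is a hypothetical helper name, and it assumes the status really came from a signal (a program can also choose to exit with a value above 128 on its own).

```shell
# Translate a shell exit status into a short description.
# Assumption: statuses above 128 mean "killed by signal (status - 128)".
describe_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "killed by SIG$(kill -l "$sig") (signal $sig)"
    elif [ "$code" -eq 0 ]; then
        echo "exited cleanly"
    else
        echo "exited with error code $code"
    fi
}
```

For example, describe_exit 137 reports SIGKILL (signal 9) and describe_exit 143 reports SIGTERM (signal 15).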

🔍 Step 3: CPU & Memory – Was It Overloaded?

  • Analyze resource usage:
      • top -o %CPU – Sort by CPU usage.
      • top -o %MEM – Sort by memory usage.
      • dmesg | grep -i "oom" – Was it terminated by the OOM killer?

💡Consider: Scaling up or optimizing the workload if it was killed due to resource exhaustion.
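To go from "there was an OOM event" to "which process was the victim", the kernel log can be filtered further. The sketch below assumes the common kernel message form "Killed process <pid> (<name>)"; the exact wording varies between kernel versions, and oom_victims is a hypothetical function name.

```shell
# Read kernel log text on stdin and print "<pid> <name>" for each OOM victim.
# Assumes the kernel message contains: Killed process <pid> (<name>)
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Usage: dmesg -T | oom_victims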

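🔍 Step 4: Disk Issues – Is It Full or Too Slow?

  • Examine disk status:
      • df -h – Free space per filesystem.
      • iostat -xm 1 – Per-device I/O statistics every second (from the sysstat package).
      • dmesg | grep -i "error" – Kernel-reported errors, including I/O and filesystem errors.

💡Decide on: Cleaning up, expanding storage, or optimizing writes.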
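Eyeballing df output works for one host, but a small filter makes the "is it full?" question scriptable. This is a sketch under assumptions: it parses the POSIX layout produced by df -P, the 90% default threshold is arbitrary, full_filesystems is a hypothetical name, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print "<mount point> <use%>" for every
# filesystem at or above the given usage threshold (default: 90 percent).
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5                  # Capacity column, e.g. "95%"
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, $5
        }'
}
```

Usage: df -P | full_filesystems 85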

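🔍 Step 5: Are There Locked or Deleted Files?

  • Identify file issues:
      • lsof -p <PID> – Files the process currently has open.
      • lsof | grep -i "deleted" – Files that were deleted but are still held open, and so still consume disk space.
      • lslocks – Currently held file locks. lsof does not label files as "locked"; it shows lock status only as a suffix in its FD column.

💡Determine if: Locks need releasing or dependent processes need restarting.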
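Deleted-but-open files are a classic reason for a "full" disk that du cannot explain, so it helps to see who holds them and how large they are. The following is a rough sketch that assumes lsof's default nine-column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) and file names without spaces; deleted_open_files is a hypothetical name.

```shell
# Read lsof output on stdin and print "<pid> <size> <path>" for every
# open file that lsof marks as "(deleted)".
deleted_open_files() {
    awk '$NF == "(deleted)" { print $2, $7, $(NF - 1) }'
}
```

Usage: lsof | deleted_open_files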

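🔍 Step 6: Was It Killed by an External Source?

  • Check for external kills:
      • journalctl -u <unit_name> --no-pager -n 50 – -u takes a systemd unit name (the service that runs the process), not a process name.
      • lastcomm | grep process_name – Requires process accounting (the acct or psacct package) to be enabled.

💡Assess: Was it an intentional stop or a monitoring tool misfire?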
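For services managed by systemd, the journal records how the main process ended, in lines of the form "Main process exited, code=killed, status=9/KILL". Here is a minimal sketch for pulling those out; killed_units is a hypothetical name, and it only recognises that specific systemd message.

```shell
# Read journal text on stdin and print "<unit> SIG<name>" for every unit
# whose main process systemd reports as killed by a signal.
killed_units() {
    sed -n 's|.* \([^ ]*\): Main process exited, code=killed, status=[0-9]*/\([A-Z]*\).*|\1 SIG\2|p'
}
```

Usage: journalctl --no-pager -n 500 | killed_units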

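🔍 Step 7: Real-Time Debugging – What’s It Doing?

  • Inspect live process state:
      • strace -p <PID> – Shows the system calls the process is making, or the one it is blocked in.
      • gdb -p <PID> – Attaches a debugger; the process is paused while gdb is attached.
  • Both tools need permission to trace the process, so run them as its owner or as root.

💡Decide: Whether to restart, reconfigure, or investigate further if the process is hung.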
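Before attaching a tracer, the process state code reported by ps often tells you which kind of "stopped" you are dealing with. The sketch below maps the first letter of the STAT column to the meanings documented in the ps manual; describe_state is a hypothetical helper name.

```shell
# Explain the first letter of a ps STAT value such as "Ss", "D+" or "T".
describe_state() {
    case "$1" in
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep (waiting for an event)" ;;
        D*) echo "uninterruptible sleep (usually blocked on I/O)" ;;
        T*) echo "stopped by a job-control signal" ;;
        t*) echo "stopped by a debugger (being traced)" ;;
        Z*) echo "zombie (exited, not yet reaped by its parent)" ;;
        *)  echo "unknown state: $1" ;;
    esac
}
```

Usage: describe_state "$(ps -o stat= -p <PID>)". A process stuck in D usually points back to the disk checks in Step 4, while T suggests someone or something sent it SIGSTOP.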

🔥 Final Thoughts – Why This Matters!

As an SRE & Cloud Expert, ensuring high availability, reliability, and observability is critical. Efficiently debugging failures is key, whether on AWS, Kubernetes, or high-performance computing workloads.

Alok Sinha

I am a DevOps Engineer with over 5 years of experience. I am passionate about helping digital organizations deliver better software, faster. With a strong background in various technology roles, I focus on automating processes and fostering collaboration between development and IT teams.
