Incident: OOM Killer
The Out-of-Memory (OOM) Killer is a Linux kernel mechanism that terminates processes when the system runs out of memory. It does not ask. It does not warn. A process that was running fine is simply gone, and if you are not watching the right logs, you will spend a long time wondering why your application disappeared.
What It Looks Like
The application process vanishes with no error in its own logs. This is the defining symptom. The application did not crash — it was killed. Its logs will be clean right up until the moment it stopped existing.
In system logs (/var/log/messages, journalctl -k) — exact wording varies by kernel version:
kernel: Out of memory: Kill process 14313 (java) score 875 or sacrifice child
kernel: Killed process 14313 (java) total-vm:8392456kB, anon-rss:7891232kB
kernel: Out of memory: Killed process 14313 (java) total-vm:8392456kB, anon-rss:7891232kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15432kB oom_score_adj:0
In CloudWatch Logs (if you are forwarding system logs):
Look for Out of memory or oom-kill in your system log group.
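A quick way to search from the CLI, assuming the system log is forwarded to a log group named /var/log/messages (substitute your own group name); the ?-prefixed terms act as an OR:
# Search the last hour of forwarded system logs for OOM kill messages
aws logs filter-log-events \
  --log-group-name /var/log/messages \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern '?"Out of memory" ?"oom-kill"'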
The process exits with signal 9 (SIGKILL):
# If your process manager logged the exit code
systemd: myapp.service: Main process exited, code=killed, status=9/KILL
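You can also see this after the fact in the service status output (service name myapp assumed here):
# The Main PID line shows code=killed, signal=KILL after an OOM kill
systemctl status myapp --no-pager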
Immediate Diagnosis
Step 1 — Confirm the OOM kill happened
# Check kernel ring buffer
dmesg | grep -i "oom\|killed process\|out of memory"
# Check system journal
journalctl -k | grep -i "oom\|killed process"
# On older systems
grep -i "oom\|killed" /var/log/messages | tail -50
Step 2 — See current memory state
free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        7.4G         42M        128M        890M        312M
Swap:            0B          0B          0B
available is the key column — not free. Available includes memory that can be reclaimed from cache. If available is near zero, the system is under memory pressure.
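The same number is available to scripts from /proc/meminfo; a minimal one-liner:
# Print MemAvailable as a percentage of MemTotal
awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "available: %.1f%%\n", a/t*100}' /proc/meminfo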
# More detail — what is consuming memory
ps aux --sort=-%mem | head -20
# Or use smem for a more accurate picture (includes shared memory correctly)
smem -rs pss | head -20
Step 3 — Understand why the OOM Killer chose its victim
The kernel assigns an oom_score to every process. Higher score = more likely to be killed. The score is based primarily on memory consumption, with adjustments via oom_score_adj.
# Check the oom_score of your processes
for pid in $(pgrep -f myapp); do
  echo "PID: $pid, OOM Score: $(cat /proc/$pid/oom_score)"
done
# Or for all processes
ps -eo pid=,comm=,rss= | while read pid comm rss; do
  score=$(cat /proc/$pid/oom_score 2>/dev/null)
  echo "$score $pid $comm $rss"
done | sort -rn | head -20
Why the OOM Killer Chose Your Application
The OOM Killer is not random. On modern kernels its score reflects, almost entirely, how much memory killing the process would free, plus a per-process bias:
- RSS (Resident Set Size), swap usage, and page-table size — the memory that killing the process would reclaim
- oom_score_adj — a tunable value from -1000 to +1000 that biases the score
Older kernels also weighted factors such as process runtime and privileges, which is why process age is still listed as a factor in some documentation.
A Java process with a 4GB heap on a 4GB instance will almost always be the OOM Killer's first target. This is expected behaviour — but it means the root cause is almost always over-allocation or a memory leak, not the OOM Killer itself.
Recovery
Immediate — restart the killed service:
sudo systemctl start myapp
# Or if systemd has restart configured, it may already be back up
systemctl status myapp
Check if systemd is configured to restart on kill:
# /etc/systemd/system/myapp.service
[Service]
Restart=always
RestartSec=5s
If Restart=always is set, systemd will restart the process automatically after an OOM kill. This is not a fix — it is a safety net. Without it, your application simply disappears until someone notices.
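If restart is not configured, a drop-in override adds it without touching the packaged unit file (again assuming the service is called myapp):
# Opens an editor for a drop-in override; add the [Service] lines shown above
sudo systemctl edit myapp
# Confirm the values systemd now has
systemctl show myapp -p Restart -p RestartUSec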
Root Cause Patterns
Pattern 1 — Memory leak
The application's memory usage grows over time without bound. Common in:
- Java applications with incorrect heap sizing or unbounded caches
- Python applications holding large data structures in memory
- Node.js applications with EventEmitter listener leaks
Identify by watching RSS over time:
# Watch a specific process's memory every 5 seconds
watch -n 5 'ps -p $(pgrep -f myapp) -o pid,rss,vsz,comm'
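watch works for minutes; for a leak that takes hours to surface, logging samples to a file is more useful. A rough sketch, where the process name and output path are placeholders:
# Append a timestamped RSS sample (kB) every 60 seconds
while true; do
  echo "$(date -Is) $(ps -o rss= -p "$(pgrep -f -o myapp)")" >> /var/tmp/myapp-rss.log
  sleep 60
done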
Pattern 2 — JVM heap misconfiguration
The JVM defaults its maximum heap to roughly 25% of physical memory, which is 1GB on a 4GB instance. Trusting that default, or sizing -Xmx by guesswork, fails in one of two ways: a heap that is too small throws java.lang.OutOfMemoryError inside the application, while a heap set too close to the instance's total memory gets the process OOM killed by the kernel.
# Risky — heap size is whatever the JVM default resolves to on this host
java -jar myapp.jar
# Better — set the heap explicitly, leaving headroom for native memory
java -Xms2g -Xmx3g -jar myapp.jar
But: do not set -Xmx to the total instance memory. The JVM also needs memory for the metaspace, thread stacks, and native code. On a 4GB instance, -Xmx3g is the practical maximum.
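On JDK 10+ (and 8u191+) the heap can also be sized as a fraction of the memory the JVM sees, which travels better across differently sized instances; 75% here is a starting point, not a rule:
# Max heap = 75% of visible memory, leaving the rest for native memory
java -XX:MaxRAMPercentage=75.0 -jar myapp.jar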
Pattern 3 — Overcommit without swap
Linux allows memory overcommit — processes can be allocated more virtual memory than exists physically. When the physical memory is actually needed and none is available, the OOM Killer fires. AWS instances typically have no swap by default.
Add swap as a safety buffer (not a solution, but it buys time):
# Create a 2GB swapfile
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
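With swap present, the vm.swappiness sysctl controls how eagerly the kernel uses it; a low value keeps swap as an emergency buffer rather than active working memory (10 is a common choice, not a universal one):
# Apply immediately and persist across reboots
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf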
Pattern 4 — Too many processes on an undersized instance
You are running Prometheus + your application + a build agent on a 2GB instance. One of them will be killed. The fix is either to upsize the instance or move workloads apart.
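If the workloads must share the instance for a while yet, a cgroup memory cap on the less critical unit keeps it from starving the application. A sketch using systemd limits; the unit name and values are illustrative, and MemoryHigh/MemoryMax require cgroup v2:
# /etc/systemd/system/build-agent.service.d/memory.conf
[Service]
MemoryHigh=384M
MemoryMax=512M
Run systemctl daemon-reload and restart the unit for the limits to take effect.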
Prevention
Protect critical processes from the OOM Killer:
# Set oom_score_adj to -1000 to make a process effectively immune
# (only root can set negative values)
echo -1000 > /proc/$(pgrep -f myapp)/oom_score_adj
# Or in a systemd unit file (persistent across restarts)
[Service]
OOMScoreAdjust=-900
Use this carefully. Making your application OOM-immune means the kernel will kill something else instead — potentially a more important system process.
Make less critical processes more likely to be killed first:
# Score of +1000 = kill this first
echo 1000 > /proc/$(pgrep -f worker)/oom_score_adj
Set memory limits on containers (ECS / Docker):
If you are running containers without memory limits, a single container can consume all available memory and trigger OOM kills on other containers.
{
  "containerDefinitions": [{
    "name": "myapp",
    "memory": 2048,
    "memoryReservation": 1024
  }]
}
memory is the hard limit — the container is killed if it exceeds this. memoryReservation is the soft reservation used for scheduling.
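Outside ECS, the plain Docker flags map directly to those two fields (image name is a placeholder):
# Hard limit 2GB, soft reservation 1GB
docker run --memory=2g --memory-reservation=1g myapp:latest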
Alert before memory is exhausted:
CloudWatch metric: mem_used_percent (requires CloudWatch agent)
Alert thresholds:
- 75% — investigate
- 85% — action required before the next traffic spike causes an OOM event
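A sketch of the 85% alarm via the CLI; the CWAgent namespace and InstanceId dimension match the agent's default configuration, and the instance ID and SNS topic ARN are placeholders to replace:
aws cloudwatch put-metric-alarm \
  --alarm-name memory-used-85pct \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall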
Review Questions
1. Your application vanishes with no error in its own logs. What is the most likely cause?
2. Which command confirms that an OOM kill occurred on the system?
3. What does the `available` column in `free -h` represent?
4. How does the OOM Killer decide which process to terminate?
5. You want to protect a critical database process from being OOM killed. What is the correct approach?
6. A JVM application is repeatedly OOM killed on a 4GB instance. What is the likely root cause and fix?