Incident: OOM Killer
The Out-of-Memory (OOM) Killer is a Linux kernel mechanism that terminates processes when the system runs out of memory. It does not ask. It does not warn. A process that was running fine is simply gone, and if you are not watching the right logs, you will spend a long time wondering why your application disappeared.
What It Looks Like
The application process vanishes with no error in its own logs. This is the defining symptom. The application did not crash — it was killed. Its logs will be clean right up until the moment it stopped existing.
In system logs (/var/log/messages, journalctl -k) — exact wording varies by kernel version:
kernel: Out of memory: Kill process 14313 (java) score 875 or sacrifice child
kernel: Killed process 14313 (java) total-vm:8392456kB, anon-rss:7891232kB
kernel: Out of memory: Killed process 14313 (java) total-vm:8392456kB, anon-rss:7891232kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:15432kB oom_score_adj:0
In CloudWatch Logs (if you are forwarding system logs):
Look for Out of memory or oom-kill in your system log group.
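A quick way to search from the CLI, assuming the system log is forwarded to a log group named /var/log/messages (substitute your own group name); the ?-prefixed terms act as an OR:
# Search the last hour of forwarded system logs for OOM kill messages
aws logs filter-log-events \
  --log-group-name /var/log/messages \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern '?"Out of memory" ?"oom-kill"'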
The process exits with signal 9 (SIGKILL):
# If your process manager logged the exit code
systemd: myapp.service: Main process exited, code=killed, status=9/KILL
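You can also see this after the fact in the service status output (service name myapp assumed here):
# The Main PID line shows code=killed, signal=KILL after an OOM kill
systemctl status myapp --no-pager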
Immediate Diagnosis
Step 1 — Confirm the OOM kill happened
# Check kernel ring buffer
dmesg | grep -i "oom\|killed process\|out of memory"
# Check system journal
journalctl -k | grep -i "oom\|killed process"
# On older systems
grep -i "oom\|killed" /var/log/messages | tail -50
Step 2 — See current memory state
free -h
              total        used        free      shared  buff/cache   available
Mem:           7.6G        7.4G         42M        128M        890M        312M
Swap:            0B          0B          0B
available is the key column — not free. Available includes memory that can be reclaimed from cache. If available is near zero, the system is under memory pressure.
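The same number is available to scripts from /proc/meminfo; a minimal one-liner:
# Print MemAvailable as a percentage of MemTotal
awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "available: %.1f%%\n", a/t*100}' /proc/meminfo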
# More detail — what is consuming memory
ps aux --sort=-%mem | head -20
# Or use smem for a more accurate picture (includes shared memory correctly)
smem -rs pss | head -20
Step 3 — Understand why the OOM Killer chose its victim
The kernel assigns an oom_score to every process. Higher score = more likely to be killed. The score is based primarily on memory consumption, with adjustments via oom_score_adj.
# Check the oom_score of your processes
for pid in $(pgrep -f myapp); do
  echo "PID: $pid, OOM Score: $(cat /proc/$pid/oom_score)"
done
# Or for all processes
ps -eo pid=,comm=,rss= | while read pid comm rss; do
  score=$(cat /proc/$pid/oom_score 2>/dev/null)
  echo "$score $pid $comm $rss"
done | sort -rn | head -20
Why the OOM Killer Chose Your Application
The OOM Killer is not random. On modern kernels its score reflects, almost entirely, how much memory killing the process would free, plus a per-process bias:
- RSS (Resident Set Size), swap usage, and page-table size — the memory that killing the process would reclaim
- oom_score_adj — a tunable value from -1000 to +1000 that biases the score
Older kernels also weighted factors such as process runtime and privileges, which is why process age is still listed as a factor in some documentation.
A Java process with a 4GB heap on a 4GB instance will almost always be the OOM Killer's first target. This is expected behaviour — but it means the root cause is almost always over-allocation or a memory leak, not the OOM Killer itself.
Recovery
Immediate — restart the killed service:
sudo systemctl start myapp
# Or if systemd has restart configured, it may already be back up
systemctl status myapp
Check if systemd is configured to restart on kill:
# /etc/systemd/system/myapp.service
[Service]
Restart=always
RestartSec=5s
If Restart=always is set, systemd will restart the process automatically after an OOM kill. This is not a fix — it is a safety net. Without it, your application simply disappears until someone notices.
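If restart is not configured, a drop-in override adds it without touching the packaged unit file (again assuming the service is called myapp):
# Opens an editor for a drop-in override; add the [Service] lines shown above
sudo systemctl edit myapp
# Confirm the values systemd now has
systemctl show myapp -p Restart -p RestartUSec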
Root Cause Patterns
Pattern 1 — Memory leak
The application's memory usage grows over time without bound. Common in:
- Java applications with incorrect heap sizing or unbounded caches
- Python applications holding large data structures in memory
- Node.js applications with EventEmitter listener leaks
Identify by watching RSS over time:
# Watch a specific process's memory every 5 seconds
watch -n 5 'ps -p $(pgrep -f myapp) -o pid,rss,vsz,comm'
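watch works for minutes; for a leak that takes hours to surface, logging samples to a file is more useful. A rough sketch, where the process name and output path are placeholders:
# Append a timestamped RSS sample (kB) every 60 seconds
while true; do
  echo "$(date -Is) $(ps -o rss= -p "$(pgrep -f -o myapp)")" >> /var/tmp/myapp-rss.log
  sleep 60
done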
Pattern 2 — JVM heap misconfiguration
The JVM defaults its maximum heap to roughly 25% of physical memory, which is 1GB on a 4GB instance. Trusting that default, or sizing -Xmx by guesswork, fails in one of two ways: a heap that is too small throws java.lang.OutOfMemoryError inside the application, while a heap set too close to the instance's total memory gets the process OOM killed by the kernel.
# Risky — heap size is whatever the JVM default resolves to on this host
java -jar myapp.jar
# Better — set the heap explicitly, leaving headroom for native memory
java -Xms2g -Xmx3g -jar myapp.jar
But: do not set -Xmx to the total instance memory. The JVM also needs memory for the metaspace, thread stacks, and native code. On a 4GB instance, -Xmx3g is the practical maximum.
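On JDK 10+ (and 8u191+) the heap can also be sized as a fraction of the memory the JVM sees, which travels better across differently sized instances; 75% here is a starting point, not a rule:
# Max heap = 75% of visible memory, leaving the rest for native memory
java -XX:MaxRAMPercentage=75.0 -jar myapp.jar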
Pattern 3 — Overcommit without swap
Linux allows memory overcommit — processes can be allocated more virtual memory than exists physically. When the physical memory is actually needed and none is available, the OOM Killer fires. AWS instances typically have no swap by default.
Add swap as a safety buffer (not a solution, but it buys time):
# Create a 2GB swapfile
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
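With swap present, the vm.swappiness sysctl controls how eagerly the kernel uses it; a low value keeps swap as an emergency buffer rather than active working memory (10 is a common choice, not a universal one):
# Apply immediately and persist across reboots
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf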
Pattern 4 — Too many processes on an undersized instance
You are running Prometheus + your application + a build agent on a 2GB instance. One of them will be killed. The fix is either to upsize the instance or move workloads apart.
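If the workloads must share the instance for a while yet, a cgroup memory cap on the less critical unit keeps it from starving the application. A sketch using systemd limits; the unit name and values are illustrative, and MemoryHigh/MemoryMax require cgroup v2:
# /etc/systemd/system/build-agent.service.d/memory.conf
[Service]
MemoryHigh=384M
MemoryMax=512M
Run systemctl daemon-reload and restart the unit for the limits to take effect.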
Prevention
Protect critical processes from the OOM Killer:
# Set oom_score_adj to -1000 to make a process effectively immune
# (only root can set negative values)
echo -1000 > /proc/$(pgrep -f myapp)/oom_score_adj
# Or in a systemd unit file (persistent across restarts)
[Service]
OOMScoreAdjust=-900
Use this carefully. Making your application OOM-immune means the kernel will kill something else instead — potentially a more important system process.
Make less critical processes more likely to be killed first:
# Score of +1000 = kill this first
echo 1000 > /proc/$(pgrep -f worker)/oom_score_adj
Set memory limits on containers (ECS / Docker):
If you are running containers without memory limits, a single container can consume all available memory and trigger OOM kills on other containers.
{
  "containerDefinitions": [{
    "name": "myapp",
    "memory": 2048,
    "memoryReservation": 1024
  }]
}
memory is the hard limit — the container is killed if it exceeds this. memoryReservation is the soft reservation used for scheduling.
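Outside ECS, the plain Docker flags map directly to those two fields (image name is a placeholder):
# Hard limit 2GB, soft reservation 1GB
docker run --memory=2g --memory-reservation=1g myapp:latest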
Alert before memory is exhausted:
CloudWatch metric: mem_used_percent (requires CloudWatch agent)
Alert thresholds:
- 75% — investigate
- 85% — action required before the next traffic spike causes an OOM event
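A sketch of the 85% alarm via the CLI; the CWAgent namespace and InstanceId dimension match the agent's default configuration, and the instance ID and SNS topic ARN are placeholders to replace:
aws cloudwatch put-metric-alarm \
  --alarm-name memory-used-85pct \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall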
Review Questions
1. Your application vanishes with no error in its own logs. What is the most likely cause?
2. Which command confirms that an OOM kill occurred on the system?
3. What does the `available` column in `free -h` represent?
4. How does the OOM Killer decide which process to terminate?
5. You want to protect a critical database process from being OOM killed. What is the correct approach?
6. A JVM application is repeatedly OOM killed on a 4GB instance. What is the likely root cause and fix?