
Incident: Disk Full

Disk full is one of the most common production incidents and one of the most predictable. It is almost never a surprise if you are watching the right metrics. This guide covers what it actually looks like when it happens, the recovery steps, and the common traps that turn a simple incident into a longer outage.

What It Looks Like

The symptoms depend on what hit full first.

Application errors:

No space left on device
ENOSPC: no space left on device
ERROR: could not write to file: No space left on device

Database won't start / crashes:

FATAL: could not write to file "pg_wal/...": No space left on device

Writes silently fail — some applications catch ENOSPC and fail silently, logging nothing. The symptom is data that should have been written never appearing, or the application reporting success while the file does not exist.

SSH still works but shell is broken — you can log in but commands that write output fail. vim won't save. Logs won't write. The system feels corrupted but it is not.


Immediate Diagnosis

Run these in order. Do not start deleting things yet.

Step 1 — Confirm the problem and identify which filesystem

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       40G   40G     0 100% /
/dev/xvdf        20G  4.2G   15G  22% /data
tmpfs           1.9G     0  1.9G   0% /dev/shm

Note which mount point is full. It is not always /. Logs often go to /var, databases to /data or /var/lib.

Step 2 — Find what is using the space

# Top-level breakdown — find the biggest directories
# (-x keeps du on this filesystem so other mounts are not counted)
sudo du -xsh /* 2>/dev/null | sort -rh | head -20

# Drill into the biggest offender
sudo du -xsh /var/* 2>/dev/null | sort -rh | head -10
sudo du -xsh /var/log/* 2>/dev/null | sort -rh | head -10

Step 3 — Check for deleted files still held open

This is the trap that catches most engineers. A process can delete a file but if another process has it open, the disk space is not freed until that process closes or is restarted. The file is gone from the directory listing but the inode is still allocated.

lsof +L1

This lists open file descriptors where the link count is 0 — deleted files still held open. If you see a 10GB log file listed here, restarting the process that holds it will free the space immediately.

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NLINK NODE NAME
nginx   1234 root  10w  REG  202,1 10737418     0 12345 /var/log/nginx/access.log (deleted)
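The mechanism is easy to reproduce safely. A hypothetical demo using a throwaway file (not a production log):

```shell
# Create a 10 MB file, hold it open, then delete it
tmp=$(mktemp /tmp/demo.XXXXXX)
dd if=/dev/zero of="$tmp" bs=1M count=10 2>/dev/null
exec 3<"$tmp"            # keep a file descriptor open on it
rm "$tmp"                # the directory entry is gone...
ls -l /proc/$$/fd/3      # ...but /proc shows the target as "(deleted)"
exec 3<&-                # closing the descriptor is what frees the space
```

Until that last line runs, df still counts the 10 MB as used — exactly the situation lsof +L1 surfaces.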

Step 4 — Check inode exhaustion

You can run out of inodes before you run out of disk space. This happens with workloads that create millions of small files (mail queues, PHP session files, build artifacts).

df -i
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda1     2621440 2621440     0  100% /

If IUse% is 100% but Use% is 70%, you have an inode problem, not a capacity problem. Find the directory with millions of files:

find / -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -20

Recovery

The safe deletions — do these first:

# systemd journal — keep only the last two days of entries
sudo journalctl --vacuum-time=2d

# ...or cap the journal by total size
sudo journalctl --vacuum-size=100M

# Package manager cache
sudo apt clean # Debian/Ubuntu
sudo dnf clean all # RHEL/Fedora

# Docker — this is often the biggest offender
docker system prune -f # removes stopped containers, dangling images, unused networks
docker system prune -af --volumes # more aggressive — removes everything unused

If the culprit is application logs:

# Truncate a log file without stopping the process (safe — preserves the file descriptor)
> /var/log/application/app.log

# Or use truncate
truncate -s 0 /var/log/application/app.log

Do NOT rm the log file if the application has it open — the space won't be freed and the application may crash when it tries to write. Truncate instead. (If the process writes without O_APPEND, the file's apparent size may jump back as it keeps writing at its old offset, but the skipped region is a sparse hole, so the disk space is still reclaimed.)

If it is a database write-ahead log (Postgres WAL):

Do not delete WAL files manually — this can corrupt your database. The correct approach:

# Connect to postgres and force a checkpoint so old WAL segments can be recycled
psql -U postgres -c "CHECKPOINT;"

# If replication slots are holding WAL — check for inactive slots
psql -U postgres -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Drop an inactive replication slot that is holding WAL
psql -U postgres -c "SELECT pg_drop_replication_slot('slot_name');"

Restart processes holding deleted files open:

Once you have identified processes with lsof +L1:

sudo systemctl restart nginx
sudo systemctl restart application-name

After Recovery — Extend the Volume

Recovery only buys you headroom. Once the immediate pressure is off, fix the root cause properly.

On AWS — extend EBS volume without downtime:

# 1. In AWS console or CLI, modify the volume size
aws ec2 modify-volume --volume-id vol-xxxxxxxx --size 80

# 2. On the instance, grow the partition (check your device name with lsblk)
sudo growpart /dev/xvda 1

# 3. Extend the filesystem — run ONE of these, depending on filesystem type
# XFS:
sudo xfs_growfs /

# ext4:
sudo resize2fs /dev/xvda1

No reboot required on modern kernels (4.x+).


Why This Keeps Happening

Disk full incidents in production usually have one of four root causes:

1. No log rotation configured. Application logs grow without bound. logrotate is installed on most Linux systems, but application logs are often not configured in it.

# Check what is being rotated
ls /etc/logrotate.d/

# Add your application
sudo tee /etc/logrotate.d/myapp << 'EOF' > /dev/null
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        systemctl reload myapp
    endscript
}
EOF

2. Docker not pruned. Docker accumulates stopped containers, dangling images, and unused volumes. On a busy build machine this can consume dozens of gigabytes per week. Schedule docker system prune as a cron job.
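A hypothetical cron entry for that cleanup (the path, schedule, and log destination are examples; the until filter keeps anything newer than a week):

```shell
# /etc/cron.d/docker-prune — every Sunday at 03:00, remove unused objects older than a week
0 3 * * 0  root  docker system prune -af --filter "until=168h" >> /var/log/docker-prune.log 2>&1
```

Leave --volumes out of the scheduled job unless you are certain no data you care about lives in unnamed volumes.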

3. No alerting on disk usage. By the time a human notices the disk is full, the application has already been failing. Alert at 75% and 85%. At 85% you should be taking action, not waiting for 100%.

4. tmpfs filling on containerised workloads. On many container and systemd-based hosts, /tmp is a tmpfs mount backed by memory, and some workloads generate large temporary files there that never get cleaned up.
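To confirm, check the mount type and the biggest entries under /tmp:

```shell
# Is /tmp a tmpfs mount, and how full is it?
df -h /tmp

# What is accumulating there?
du -sh /tmp/* 2>/dev/null | sort -rh | head -10
```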


The Monitoring Fix

On CloudWatch:

# Check disk metrics are being sent (requires CloudWatch agent)
cat /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
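If the agent is running but disk metrics are missing, the disk section of that file looks roughly like this (schema abridged; confirm against the agent documentation for your version):

```json
{
  "metrics": {
    "metrics_collected": {
      "disk": {
        "measurement": ["used_percent", "inodes_free"],
        "resources": ["*"]
      }
    }
  }
}
```

Collecting inodes_free alongside used_percent covers the inode-exhaustion case from Step 4 as well.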

Alert thresholds worth setting:

  • 75% — informational, investigate at next opportunity
  • 85% — action required, do not leave for tomorrow
  • 95% — incident, wake someone up
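Those thresholds can be checked with a short script (a sketch; wire the output into whatever pages your team):

```shell
#!/bin/sh
# Flag any filesystem that crosses the thresholds above.
# df -P gives stable, parseable output; tmpfs and devtmpfs are skipped.
df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {
    use = $5; sub(/%/, "", use)        # strip the % from the Use% column
    if (use >= 95)      printf "CRITICAL %s at %s%%\n", $6, use
    else if (use >= 85) printf "WARNING %s at %s%%\n", $6, use
    else if (use >= 75) printf "INFO %s at %s%%\n", $6, use
}'
```

Run it from cron every few minutes; it prints nothing when everything is under 75%.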
✦ Test Your Knowledge

1. You run `df -h` and see / is at 100%. What should you do BEFORE deleting anything?

A. Immediately run `rm -rf /var/log/*` to free space
B. Run `du -sh /* | sort -rh` to find what is consuming space, and `lsof +L1` to check for deleted files still held open
C. Reboot the server to clear temporary files
D. Extend the EBS volume immediately

2. A process deleted a large log file but `df -h` still shows the disk is full. What is the cause?

A. The file was not deleted correctly
B. The filesystem needs to be remounted
C. Another process still has the file open — the inode is still allocated until that process closes or restarts
D. The disk cache needs to be cleared

3. What command lists deleted files that are still held open by a running process?

A. ls -la /proc/*/fd
B. lsof +L1
C. find / -name '*.deleted'
D. df -i

4. Your application log file is open by a running process and consuming 8GB. What is the safe way to free that space without stopping the app?

A. rm /var/log/application/app.log
B. kill -9 $(pgrep myapp)
C. truncate -s 0 /var/log/application/app.log
D. mv /var/log/application/app.log /tmp/

5. `df -h` shows 70% disk usage but writes are failing with 'No space left on device'. What should you check?

A. Check if the filesystem is read-only
B. Run `df -i` to check inode exhaustion — you can run out of inodes before disk capacity
C. Restart the application
D. Check network connectivity

6. What are the recommended disk usage alert thresholds for production?

A. Alert only at 100% — earlier alerts create noise
B. Alert at 50% and 75%
C. Alert at 75% (investigate) and 85% (action required)
D. Alert at 90% and 95%