SumGuy's Ramblings

Ulimit, Cgroups, and the Art of Stopping Processes From Eating Your Server

The 3am Incident You’ve Either Had or Will Have

It goes like this: you wake up to alerts. Or you don’t wake up, and you find out in the morning. Either way, the server is unresponsive, the OOM killer has been running amok through your process table, and one service — you know which one — has consumed every available megabyte of RAM and taken everything else with it.

Node.js is a frequent offender. So is Java with no heap limit set. Elasticsearch out of the box will attempt to consume everything it can touch. A misconfigured backup job reading an unexpectedly large dataset. A memory leak in a long-running service that nobody caught in development.

The frustrating part: Linux has had the tools to prevent this for decades. ulimit for per-process limits. Cgroups for hierarchical resource control. Systemd for integrating both into service management. These tools are present, documented, and largely ignored until something goes wrong.


ulimit: Per-Process Resource Limits

ulimit is a shell builtin that sets resource limits for the current shell and all processes it spawns. These limits are enforced by the kernel.

# See all current limits
ulimit -a

# Common limits:
ulimit -n   # Open file descriptors
ulimit -u   # Max user processes
ulimit -m   # Max resident set size (KB); ignored by modern Linux kernels
ulimit -v   # Virtual memory / address space (KB); the one that actually bites
ulimit -s   # Stack size
ulimit -t   # CPU time (seconds)
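These limits are easy to see in action. A quick sketch, assuming python3 is installed: cap the shell's address space at 256MB and a large allocation fails (the exact headroom Python needs for itself varies by build, so leave a generous margin):

```shell
# Cap virtual memory at 256MB (ulimit -v takes KB), then over-allocate
(
  ulimit -v 262144
  python3 -c 'x = bytearray(300 * 1024**2)'   # fails with MemoryError
)
```

The parentheses run the experiment in a subshell, so your interactive shell keeps its original limits.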

Hard vs. Soft Limits

Every limit has two values: a soft limit, which is what the kernel actually enforces, and a hard limit, which is the ceiling the soft limit can be raised to. An unprivileged process can lower both, and can raise its soft limit anywhere up to the hard limit, but raising a hard limit requires root (CAP_SYS_RESOURCE).

# See soft limits
ulimit -S -a

# See hard limits
ulimit -H -a

# Set soft limit for file descriptors
ulimit -Sn 65536

# Set both to same value
ulimit -n 65536
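The asymmetry is easy to demonstrate in a throwaway subshell:

```shell
# Subshell so the experiment doesn't touch your login shell's limits
(
  ulimit -n 512      # set both soft and hard to 512
  ulimit -Sn 256     # lowering the soft limit: allowed
  ulimit -Sn 512     # raising it back up to the hard limit: allowed
  ulimit -Sn 1024    # above the hard limit: refused
  ulimit -Hn 1024    # raising the hard limit without root: also refused
)
```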

Why File Descriptors Matter So Much

The default open file descriptor limit (nofile) is often 1024 on older systems. For a busy web server or database, this is laughably small. Each network connection is a file descriptor. A server handling 5,000 concurrent connections needs at least 5,000 file descriptors, plus headroom for actual files.

# Current limits for running process (PID 1234)
cat /proc/1234/limits

# See current fd usage
ls -la /proc/1234/fd | wc -l

# System-wide: see open fd count
cat /proc/sys/fs/file-nr
# Returns: [allocated handles] [free handles, 0 on modern kernels] [max allowed]

/etc/security/limits.conf: Persistent Limits

Shell ulimit commands don’t persist across reboots and don’t apply to services. For persistent user/process limits:

# /etc/security/limits.conf
# Format: <domain> <type> <item> <value>

# Specific user - high file descriptor limit
myapp-user    soft    nofile    65536
myapp-user    hard    nofile    65536

# Specific user - memory limit (in KB, so 4GB)
myapp-user    soft    as        4194304
myapp-user    hard    as        4194304

# All users - max processes
*             soft    nproc     1024
*             hard    nproc     2048

# Wildcard for group
@developers   soft    nofile    32768

Drop-in files in /etc/security/limits.d/ are also processed (and won’t get overwritten by package upgrades):

# /etc/security/limits.d/90-myapp.conf
myapp-user    soft    nofile    65536
myapp-user    hard    nofile    65536
myapp-user    soft    nproc     512
myapp-user    hard    nproc     512

Important: limits.conf is processed by PAM’s pam_limits module, which means it applies to login sessions. It does not automatically apply to systemd services — those need separate configuration.


systemd Service Limits

For services managed by systemd, use directives in the unit file:

# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
User=myapp-user

# File descriptor limit
LimitNOFILE=65536

# Max processes
LimitNPROC=512

# Core dump size (0 = no core dumps)
LimitCORE=0

# Max memory locked into RAM
LimitMEMLOCK=512M

Available Limit* directives mirror ulimit options. After editing:

systemctl daemon-reload
systemctl restart myapp

# Verify the limits
cat /proc/$(systemctl show -p MainPID myapp | cut -d= -f2)/limits

Cgroups v2: The Modern Approach

While ulimit limits individual processes, cgroups (control groups) provide hierarchical resource management — you can assign groups of processes to cgroups and set collective limits on CPU, memory, I/O, and more.

Linux 4.5+ supports cgroups v2 (the unified hierarchy). Most modern distributions use it by default:

# Check if cgroups v2 is active
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

# Or check which cgroup filesystems the kernel supports (v2 is default on 5.x+)
grep cgroup /proc/filesystems

The Unified Hierarchy

In cgroups v1, different resources were in separate subsystem trees (memory controller, CPU controller, etc.) — a process could be in different groups for different resources. cgroups v2 uses a single unified hierarchy. All controllers apply to the same group.

# The cgroup filesystem
ls /sys/fs/cgroup/

# Your shell's cgroup
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-1.scope

CPU Limits

# Create a cgroup
mkdir /sys/fs/cgroup/myapp

# Set CPU weight (1-10000, default 100)
echo 50 > /sys/fs/cgroup/myapp/cpu.weight

# Set CPU quota (in microseconds per period)
# 200000 us of CPU time per 1000000 us period = 20% of one CPU
echo "200000 1000000" > /sys/fs/cgroup/myapp/cpu.max

# Move a process into the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
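A sketch of verifying that the quota actually bites, assuming root and a cgroup v2 mount (the myapp group name follows the example above, and yes is just a convenient busy loop):

```shell
# Needs root. Spin a busy loop, cap it at 20% of one CPU, and watch
# the throttle counters climb.
yes > /dev/null &
HOG=$!

echo "200000 1000000" > /sys/fs/cgroup/myapp/cpu.max
echo "$HOG" > /sys/fs/cgroup/myapp/cgroup.procs

sleep 5
# nr_throttled / throttled_usec rise every period the quota is exhausted
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/myapp/cpu.stat

kill "$HOG"
```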

Memory Limits

# Hard memory limit: kills process if exceeded
echo 512M > /sys/fs/cgroup/myapp/memory.max

# Soft limit: the kernel throttles and reclaims aggressively above this
echo 256M > /sys/fs/cgroup/myapp/memory.high

# See memory usage
cat /sys/fs/cgroup/myapp/memory.current

# Memory events (OOM kills, etc.)
cat /sys/fs/cgroup/myapp/memory.events
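Putting those files together, a sketch (needs root; the memtest group and the Python one-liner are stand-ins for a real workload):

```shell
# Create a group with a 64MB hard limit
mkdir -p /sys/fs/cgroup/memtest
echo 64M > /sys/fs/cgroup/memtest/memory.max

# Join the group, then try to allocate ~128MB: the kernel OOM-kills it
sh -c 'echo $$ > /sys/fs/cgroup/memtest/cgroup.procs
       exec python3 -c "x = bytearray(128 * 1024**2)"'

# oom_kill should now read 1
grep oom_kill /sys/fs/cgroup/memtest/memory.events
```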

systemd Slices, Scopes, and Service Resource Control

Directly manipulating cgroup files is low-level. Systemd abstracts this cleanly.

Systemd’s Cgroup Hierarchy

system.slice          ← All system services
├── ssh.service
├── nginx.service
└── myapp.service

user.slice            ← All user sessions
└── user-1000.slice
    └── session-1.scope

machine.slice         ← VMs and containers

Setting Resource Limits in Service Units

# /etc/systemd/system/myapp.service
[Service]
ExecStart=/usr/bin/myapp
User=myapp-user

# CPU: max 20% of one CPU; weight is relative (default 100)
CPUQuota=20%
CPUWeight=50

# Memory: hard limit (OOM kill if exceeded), throttle threshold, no swap
MemoryMax=512M
MemoryHigh=384M
MemorySwapMax=0

# IO: relative weight, plus per-device bandwidth caps (50MB/s read, 25MB/s write)
IOWeight=50
IOReadBandwidthMax=/dev/sda 50M
IOWriteBandwidthMax=/dev/sda 25M

# Tasks (processes + threads)
TasksMax=256

Note that systemd unit files don't support trailing comments on the same line as a directive — a # only starts a comment at the beginning of a line — so keep annotations on their own lines.

Creating Custom Slices

For grouping related services:

# /etc/systemd/system/webapp.slice
[Unit]
Description=Web Application Services
Before=slices.target

[Slice]
CPUQuota=60%
MemoryMax=2G

# /etc/systemd/system/myapp.service
[Unit]
Description=My App
After=network.target

[Service]
# Put this service in the webapp slice
Slice=webapp.slice
ExecStart=/usr/bin/myapp

Now all services in webapp.slice collectively can’t use more than 60% CPU or 2GB RAM.

# View resource usage by slice
systemd-cgtop

# Detailed cgroup information
systemctl status webapp.slice
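You don't even have to edit unit files to adjust these. systemctl set-property changes cgroup attributes on a live unit, writing a persistent drop-in by default or, with --runtime, applying the change only until the next reboot:

```shell
# Tighten a running service without touching its unit file
systemctl set-property myapp.service MemoryMax=768M CPUQuota=30%

# --runtime: the override disappears on reboot
systemctl set-property --runtime webapp.slice MemoryMax=1G

# Confirm what's in effect
systemctl show myapp.service -p MemoryMax
```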

Docker and Cgroups

Docker uses cgroups under the hood for container resource limits. When you set memory or CPU limits in Docker, it creates cgroup entries:

# Run container with resource limits
docker run \
  --memory="512m" \
  --memory-swap="512m" \
  --cpus="0.5" \
  --cpu-shares=512 \
  nginx

# See the container's cgroup
docker inspect CONTAINER_ID | jq '.[0].HostConfig | {Memory, CpuShares, NanoCpus}'

# Find the cgroup directly
CGROUP_PATH=$(docker inspect CONTAINER_ID --format '{{.Id}}')
cat /sys/fs/cgroup/system.slice/docker-${CGROUP_PATH}.scope/memory.max

In docker-compose.yml:

services:
  webapp:
    image: myapp
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

Note: with the classic docker-compose v1, deploy.resources was only honored in Swarm mode; recent Docker Compose v2 releases do apply these limits on a single host. If you need to support older tooling, the mem_limit and cpus keys still work:

services:
  webapp:
    image: myapp
    mem_limit: 512m
    cpus: 0.5

Practical Example: Taming Node.js

# /etc/systemd/system/mynode-app.service
[Unit]
Description=My Node.js Application
After=network.target

[Service]
Type=simple
User=nodejs
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node /opt/myapp/server.js

# Node.js specific: give V8 a heap target below the cgroup limit
Environment="NODE_OPTIONS=--max-old-space-size=1024"

# systemd resource limits: 1.5GB hard cap, throttling starts at 1GB
MemoryMax=1536M
MemoryHigh=1024M
CPUQuota=50%
TasksMax=256

# Restart on failure, but not if it fails too fast
Restart=on-failure
RestartSec=10s
StartLimitIntervalSec=60s
StartLimitBurst=3

# Security hardening (bonus)
NoNewPrivileges=true
ProtectSystem=strict
PrivateTmp=true

[Install]
WantedBy=multi-user.target

This ensures Node.js can’t eat more than 1.5GB of RAM. If it tries, it gets OOM-killed and restarted. The NODE_OPTIONS heap limit gives Node.js its own garbage collection target before the kernel kills it, which is friendlier.
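Before committing limits to a unit file, you can trial them on a transient unit with systemd-run (the unit name and values here are illustrative):

```shell
# Run the app under a throwaway unit with the candidate limits
systemd-run --unit=myapp-trial \
  -p MemoryMax=1536M -p MemoryHigh=1024M -p CPUQuota=50% \
  /usr/bin/node /opt/myapp/server.js

# Observe, then tear down
systemd-cgtop
systemctl stop myapp-trial
```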


Monitoring Resource Limits in Action

# Real-time cgroup resource usage
systemd-cgtop -d 2    # Update every 2 seconds

# Specific service
systemctl status myapp.service

# Memory events for a cgroup (OOM kills, throttling)
journalctl -u myapp.service | grep -i "memory\|oom\|killed"

# Kernel OOM killer messages
dmesg | grep -i "oom\|killed process"

# Current memory usage
cat /sys/fs/cgroup/system.slice/myapp.service/memory.current
cat /sys/fs/cgroup/system.slice/myapp.service/memory.events
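Those two files are enough for a crude early-warning check. A sketch (mem-headroom is a made-up name; adjust the cgroup path and threshold to taste):

```shell
#!/bin/sh
# mem-headroom: warn when a service nears its MemoryMax
cg=/sys/fs/cgroup/system.slice/myapp.service

cur=$(cat "$cg/memory.current")
max=$(cat "$cg/memory.max")

# memory.max reads the literal string "max" when no limit is set
if [ "$max" = "max" ]; then
  echo "myapp.service: no memory limit set"
  exit 0
fi

pct=$(( cur * 100 / max ))
echo "myapp.service: using ${pct}% of MemoryMax"
if [ "$pct" -ge 90 ]; then
  echo "WARNING: within 10% of the limit, an OOM kill may be close"
fi
```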

Resource limits are the infrastructure equivalent of putting a fence around something that’s caused trouble before. You don’t need to perfectly understand cgroups v2’s unified hierarchy on day one — start with systemd’s MemoryMax and CPUQuota in your service units. That covers 80% of the problem with 20% of the complexity. Add ulimit configuration in limits.conf for non-systemd processes. Revisit slices when you want to group multiple services under collective limits. Save the raw cgroup manipulation for when you’re curious or when systemd doesn’t expose what you need.

Your 3am incidents don’t need to be dramatic. They can just quietly hit a limit, restart, and get logged. That’s better than taking the whole server down.

