PostgreSQL + Linux: Kernel Tuning That Actually Matters

By SumGuy 6 min read
It was a Tuesday morning in February when an AWS engineer’s on-call alert fired. Their PostgreSQL cluster, running 1.2 million queries per day on a beefy EC2 instance, had just tanked. Throughput dropped 50%. Query latencies spiked. The database server itself looked fine—CPU wasn’t pinned, memory wasn’t exhausted, disk I/O was normal.

The commit log showed one change: they’d upgraded the host kernel to Linux 7.0 over the weekend.

Not a Postgres version bump. Not a query rewrite. The operating system changed beneath their database, and the database paid the price.

This isn’t some edge case. PostgreSQL’s entire architecture—multi-process model, heavy shared memory, constant context switching—lives and dies by Linux kernel tuning. Most Postgres performance problems aren’t Postgres problems at all. They’re OS problems wearing a database disguise.

Here’s what matters, and why your production database needs you to understand this stuff.

Why Postgres Is Sensitive to Kernel Behavior

PostgreSQL doesn’t use a thread-per-connection model the way MySQL does. Each connection spawns its own operating-system process. That means your 200-connection cluster is 200 separate processes, all sharing a block of kernel-managed memory and fighting for scheduler time.

When the kernel scheduler changes (like the EEVDF algorithm shift in Linux 7.0), Postgres’s context-switching overhead explodes. When memory management gets too aggressive, pages get swapped to disk mid-query. When the I/O scheduler batches requests wrong, SSDs suddenly behave like spinning rust.

The database can’t fix this. Only your kernel can.

Huge Pages: The Biggest Win

If you’re only going to tune one thing, tune this.

Huge pages let Postgres allocate memory in 2MB chunks instead of 4KB pages. This shrinks the kernel page table and dramatically reduces TLB (translation lookaside buffer) misses. For a 16GB shared_buffers, that’s the difference between walking 4 million page table entries and 8,192 huge page entries.

Real-world result: 15–30% throughput improvement on heavy workloads.

Calculate Your Huge Pages

Terminal window
# Check current shared_buffers (from postgresql.conf)
sudo -u postgres psql -c "SHOW shared_buffers;"
# Calculate pages needed: (shared_buffers in bytes) / (2MB = 2097152)
# Example: 16GB shared_buffers = 16384 MB
# 16384 MB / 2 MB = 8192 huge pages
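The arithmetic above is easy to script. A minimal sketch, assuming 2MB huge pages; the shared_buffers_mb value is an example you’d replace with your own setting:

```shell
# Convert shared_buffers (in MB) to a vm.nr_hugepages value.
# shared_buffers_mb is an assumption -- set it to your actual config.
shared_buffers_mb=16384
hugepage_mb=2
pages=$(( shared_buffers_mb / hugepage_mb ))
echo "vm.nr_hugepages = $pages"
```

In practice, add a few percent of headroom on top of this number: the shared segment Postgres requests is slightly larger than shared_buffers alone.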

Then set it:

/etc/sysctl.d/99-postgres.conf
vm.nr_hugepages = 8192

And in postgresql.conf:

postgresql.conf
huge_pages = try

Postgres will use huge pages if they’re available and fall back gracefully if not.
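To confirm Postgres actually grabbed the pages after a restart, check the kernel’s own counters. This just reads /proc/meminfo; a nonzero HugePages_Rsvd or a drop in HugePages_Free means they’re in use:

```shell
# Kernel-side view of huge page usage; no root required.
grep -E '^HugePages_(Total|Free|Rsvd)' /proc/meminfo
```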

Shared Memory: The Foundation

Postgres keeps its buffer pool, WAL buffers, and lock tables in shared memory. Since version 9.3 most of that is allocated with mmap rather than System V segments, but the kernel limits below still gate a small bootstrap segment—and if the kernel won’t allow the allocation, Postgres won’t even start.

Terminal window
# Check limits
sysctl kernel.shmmax kernel.shmall
# For 32GB total shared memory:
# shmmax should be at least 32GB, expressed in bytes
# shmall should be at least 32GB divided by the 4096-byte page size
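Both values can be derived from one target size. A sketch assuming the usual 4096-byte page size (check with getconf PAGE_SIZE); target_gb is an example figure:

```shell
# Derive kernel.shmmax / kernel.shmall for a target shared memory size.
# target_gb is an assumption -- size it for your box.
target_gb=32
page_size=4096                      # getconf PAGE_SIZE on most x86 systems
shmmax=$(( target_gb * 1024 * 1024 * 1024 ))   # bytes
shmall=$(( shmmax / page_size ))               # pages
echo "kernel.shmmax = $shmmax"
echo "kernel.shmall = $shmall"
```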

Set these large:

/etc/sysctl.d/99-postgres.conf
kernel.shmmax = 34359738368 # 32GB in bytes
kernel.shmall = 8388608 # 32GB / 4KB pages

PostgreSQL sizes its shared memory request from shared_buffers, WAL buffers, lock tables, and per-connection bookkeeping (work_mem is private per-backend memory and doesn’t count here). Even so, always give the kernel more headroom than you think you’ll need. Postgres is conservative in its allocation; the kernel should be generous.

Kill Transparent Huge Pages (THP)

Transparent huge pages sound great. The kernel automatically promotes 4KB pages to 2MB pages without your asking. Free performance!

Except Postgres can’t predict when THP will kick in. When it does—especially during heavy query scans—the kernel stalls Postgres processes while it compacts pages. Latency spikes of 10–100ms appear out of nowhere.

Restrict THP to madvise, so only processes that explicitly ask for huge pages get them (Postgres doesn’t):

Terminal window
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
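Those sysfs files list every mode and mark the active one in brackets, which is awkward to eyeball in scripts. A small sketch for pulling out just the active value; the sample line here is an assumption standing in for the real file:

```shell
# Extract the bracketed (active) THP mode from a sysfs-style line.
line="always [madvise] never"   # stand-in for .../transparent_hugepage/enabled
active=$(printf '%s\n' "$line" | grep -o '\[[^]]*\]' | tr -d '[]')
echo "$active"
```

Swap the sample line for `cat /sys/kernel/mm/transparent_hugepage/enabled` to check a live system.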

Make it stick:

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... transparent_hugepage=madvise"

Terminal window
sudo update-grub
sudo reboot

Memory Overcommit and Swappiness

PostgreSQL does not play well with the OOM killer, and it plays even worse with swap. If the kernel pushes Postgres pages to disk mid-query, your database is now serving from cold storage. Query time goes from 50ms to 5 seconds.

/etc/sysctl.d/99-postgres.conf
vm.overcommit_memory = 2 # Strict accounting: allocations fail cleanly instead of waking the OOM killer
vm.overcommit_ratio = 100 # Commit limit = swap + 100% of RAM
vm.swappiness = 5 # Don't swap unless desperate

On a dedicated database server, swappiness can go as low as 1. On a shared system, stick to 5–10. Never go above 20 for Postgres.
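With vm.overcommit_memory = 2, the kernel enforces CommitLimit = swap + RAM × overcommit_ratio / 100, which is why the ratio of 100 matters on a database box. A worked example with hypothetical sizes (32GB RAM, 4GB swap):

```shell
# Strict-overcommit commit limit; the sizes here are illustrative.
ram_kb=$(( 32 * 1024 * 1024 ))
swap_kb=$(( 4 * 1024 * 1024 ))
overcommit_ratio=100
commit_limit_kb=$(( swap_kb + ram_kb * overcommit_ratio / 100 ))
echo "CommitLimit: $(( commit_limit_kb / 1024 / 1024 )) GB"
```

Leave the ratio at the kernel default of 50 and half your RAM would be off-limits to allocations—exactly what you don’t want under shared_buffers.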

I/O Scheduler: SSDs Prefer None

Traditional I/O schedulers (bfq, mq-deadline, and the long-retired cfq) were designed for spinning disks: they batch and reorder requests to minimize seek time. SSDs have no seek time, so the batching just adds latency.

For SSDs, use none (the null scheduler). For NVMe, mq-deadline is fine but none is better.

Check current:

Terminal window
cat /sys/block/nvme0n1/queue/scheduler

Set per-device:

Terminal window
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

Make it permanent:

/etc/udev/rules.d/60-iosched.rules
ACTION=="add|change", KERNEL=="nvme0n1", ATTR{queue/scheduler}="none"
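The udev rule above targets one device. To audit what every block device is currently using, a small loop over sysfs works (the active scheduler is the value in brackets); this is a sketch that simply skips devices it can’t read:

```shell
# Show the active scheduler (value in [brackets]) for each block device.
for q in /sys/block/*/queue/scheduler; do
  [ -r "$q" ] || continue
  dev=${q#/sys/block/}; dev=${dev%%/*}
  printf '%s: %s\n' "$dev" "$(cat "$q")"
done
```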

Network Tuning for Long Connections

PostgreSQL connections can sit idle for minutes at a time (think web apps holding pooled connections). The kernel needs to know not to drop them.

/etc/sysctl.d/99-postgres.conf
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 2048
# TCP keepalive for idle connections
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5

And in postgresql.conf:

postgresql.conf
tcp_keepalives_idle = 600
tcp_keepalives_interval = 30
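These knobs compose into a detection deadline: a silent peer is declared dead after idle + interval × probes. Using the postgresql.conf values above and the kernel probe count of 5:

```shell
# Time to declare a silent peer dead, from the settings above.
idle=600       # tcp_keepalives_idle
interval=30    # tcp_keepalives_interval
probes=5       # kernel keepalive probe count
echo "dead peer detected after $(( idle + interval * probes )) seconds"
```

That’s 12.5 minutes—tune idle down if your app needs to notice dead clients sooner.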

The Linux 7.0 Scheduler Issue

The shift from CFS (the Completely Fair Scheduler) to EEVDF upset Postgres’s multi-process balance. CFS aimed to give every runnable process an equal share of CPU; EEVDF schedules by virtual deadlines and slice lengths, which can divide time unevenly across a pool of equal-priority backends and skew load distribution across parallel queries.

If you hit this in late 2026, apply the kernel workaround or pin to Linux 6.x LTS until your Postgres version adds EEVDF awareness (planned for 18.0).

Complete /etc/sysctl.d/99-postgres.conf

/etc/sysctl.d/99-postgres.conf
# SumGuy's PostgreSQL Kernel Tuning
# Apply with: sudo sysctl -p /etc/sysctl.d/99-postgres.conf
# Huge pages (calculate based on your shared_buffers)
vm.nr_hugepages = 8192
# Shared memory (for 32GB systems)
kernel.shmmax = 34359738368
kernel.shmall = 8388608
# Memory management
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
vm.swappiness = 5
# Network
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
# Filesystem
fs.file-max = 2097152
fs.aio-max-nr = 1048576

Apply it:

Terminal window
sudo sysctl -p /etc/sysctl.d/99-postgres.conf
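A quick way to confirm the values are live is to read them straight out of /proc/sys, where every sysctl key maps to a file with dots turned into slashes (no root needed; the keys here are a sample of the file above):

```shell
# Spot-check live sysctl values via their /proc/sys files.
for key in vm/swappiness vm/overcommit_memory net/core/somaxconn; do
  printf '%s = %s\n' "$(echo "$key" | tr / .)" "$(cat /proc/sys/$key)"
done
```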

What This Fixes

Properly tuned, you’ll see:

15–30% more throughput on heavy workloads from huge pages alone
No more 10–100ms latency spikes from THP page compaction
No queries stalling on swapped-out pages
Lower I/O latency on SSDs and NVMe

The AWS engineer who hit the Linux 7.0 wall? These settings brought them back to baseline. The kernel had changed, but the database could adapt.

Your 2 AM self will thank you for getting this right before something breaks.

