Why PostgreSQL Never Loses Your Data
Picture this: you’re running a critical batch job that updates, say, 100,000 user records. Halfway through, your server crashes. Without PostgreSQL’s Write-Ahead Log (WAL), you’d be in deep trouble. You’d have no idea which records were actually saved and which were lost. Your database would be in an inconsistent state, and you’d be left guessing what happened.
But here’s the thing… PostgreSQL actually never loses your data. Even when everything crashes. And it’s not some magic trick. It’s a dead simple system called WAL that I got fascinated with after watching Hussein Nasser’s deep dive into PostgreSQL’s internal architecture.
What WAL Solves
So WAL is Write-Ahead Logging, and it tackles this age-old problem in databases: what happens when your database crashes mid-transaction?
Without WAL, you’re looking at a scenario like this: you start a transaction, PostgreSQL writes all your changes to memory, you commit the transaction, PostgreSQL tries to flush all those pages to disk, and then a crash hits mid-flush. Some pages got written, others didn’t. Your database is inconsistent, and there’s no way to know which records were actually committed.

WAL solves this with a simple but brilliant rule: changes to data files must be written only after those changes have been logged. Think of it as keeping a receipt before you spend the money. You can always reconstruct what happened.
Here’s what that looks like in practice. When you execute INSERT INTO users VALUES (1, 'john'), PostgreSQL doesn’t immediately write the tuple to the data file. Instead, it creates a WAL record with a 24-byte header containing the record length, the transaction ID, and the LSN (Log Sequence Number) of the previous record, followed by the actual data. This record gets written to the WAL buffer, and only then does PostgreSQL update the data page in shared buffers. The WAL buffer is flushed to disk via fsync() before the transaction commits, ensuring durability. The data pages can stay dirty in memory because PostgreSQL only cares that the WAL record is safely on disk. The LSN is essentially a byte offset into the WAL stream. Think of it as a precise timestamp that tells PostgreSQL exactly where each change happened, and it’s how PostgreSQL knows exactly where to resume replay from after a crash.
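You can watch this happen from psql. A minimal sketch, assuming the users table from the example above exists (the LSN value in the comment is illustrative):

```sql
-- Watch the WAL position (LSN) advance as changes get logged (PostgreSQL 10+).
SELECT pg_current_wal_lsn();                   -- e.g. 0/16B3D80: a byte offset into the WAL stream

INSERT INTO users VALUES (2, 'jane');          -- logged to WAL before any data file is touched

SELECT pg_current_wal_lsn();                   -- the offset has moved past the new record
SELECT pg_walfile_name(pg_current_wal_lsn());  -- which 16MB segment file holds that offset
```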
How PostgreSQL Recovers
PostgreSQL doesn’t just write WAL records forever. Instead, it periodically creates checkpoints: recovery points that say “at this point, all data pages are consistent and up-to-date.” When a checkpoint runs, all dirty pages get flushed to disk, the WAL is flushed, and a checkpoint record containing the checkpoint LSN is written to the WAL. That position is saved in the pg_control file, creating a recovery point.
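You don’t have to take this on faith. PostgreSQL exposes the contents of pg_control through a built-in function (9.6+), so you can see your current recovery point:

```sql
-- Where would crash recovery start right now?
SELECT checkpoint_lsn,   -- LSN of the latest checkpoint record
       redo_lsn,         -- LSN where WAL replay would begin after a crash
       checkpoint_time   -- when that checkpoint completed
FROM pg_control_checkpoint();
```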
Checkpoints are triggered by three conditions: time (checkpoint_timeout, default every 5 minutes), WAL volume (max_wal_size, default 1GB), or a manual CHECKPOINT command. The checkpointer process spreads its writes out over the checkpoint interval (controlled by checkpoint_completion_target) to avoid blocking user queries.
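A quick way to check which trigger is actually firing on your system (before PostgreSQL 17 these counters live in pg_stat_bgwriter; from 17 on, query pg_stat_checkpointer instead):

```sql
-- How many checkpoints came from the timer vs. from WAL volume?
SELECT checkpoints_timed,  -- triggered by checkpoint_timeout
       checkpoints_req     -- requested, e.g. because max_wal_size filled up
FROM pg_stat_bgwriter;

-- Force one manually (superuser or, on 15+, the pg_checkpoint role).
CHECKPOINT;
```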
Imagine you’re playing a video game and the power goes out. When you restart, the game doesn’t make you start from the very beginning. Instead, it finds your last save point and replays everything you did since then. That’s exactly what PostgreSQL does with WAL. The checkpoint is your save point, and WAL records are your actions since that save point.
Why Doesn’t PostgreSQL Need Undo Logs?
The cool part of WAL is how it handles tricky problems that other databases struggle with. Most databases use undo logs to record old states for MVCC (Multi-Version Concurrency Control). PostgreSQL doesn’t need them because of its tuple-based MVCC design.
PostgreSQL stores multiple versions of the same row in the same table, each stamped with a transaction ID range (the hidden xmin and xmax columns) that controls who can see it. Old tuples just hang around until vacuum cleans them up. Even if pages containing tuples from an uncommitted transaction get flushed to disk, it’s fine. Other transactions will simply ignore those tuples when they look them up, because their transaction IDs fall outside the visibility window.
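You can see these version stamps yourself. A quick sketch, assuming the users table from earlier has columns id and name:

```sql
-- Every table carries hidden MVCC columns: xmin is the transaction that
-- created this row version, and a non-zero xmax marks the one that
-- deleted or superseded it.
SELECT xmin, xmax, * FROM users;

UPDATE users SET name = 'johnny' WHERE id = 1;  -- writes a NEW row version
SELECT xmin, xmax, * FROM users;                -- fresh xmin; the old version is dead, awaiting vacuum
```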
This is what makes PostgreSQL’s WAL different from other databases. In MySQL’s InnoDB, for example, you have both redo logs (for durability) and undo logs (for MVCC). PostgreSQL’s tuple-based MVCC eliminates the need for undo logs entirely. The WAL only needs to handle durability, not concurrency control.
WAL Actually Makes Things Faster
You’d think all this logging would slow things down, but it actually makes PostgreSQL faster.
The PostgreSQL documentation explains that WAL results in “a significantly reduced number of disk writes, because only the WAL file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.” Modern SSDs can handle sequential writes at 500MB/s+ but random writes at maybe 50MB/s. A transaction touching 100 different pages only needs one fsync() call to the WAL, not 100 calls to different data files. This is especially powerful for systems handling many small transactions across different parts of the database.
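If you want numbers from your own workload rather than ballpark figures, PostgreSQL 14+ tracks WAL activity in a dedicated view:

```sql
-- Cumulative WAL activity since statistics were last reset (PostgreSQL 14+).
SELECT wal_records,  -- total WAL records generated
       wal_fpi,      -- full-page images (these spike right after checkpoints)
       wal_bytes,    -- total bytes of WAL generated
       wal_sync      -- how many times WAL was synced to disk
FROM pg_stat_wal;
```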
On the flip side, WAL performance depends heavily on your hardware setup. If you’re running WAL on the same disk as your data files, you’re creating a bottleneck. The solution is simple: put WAL on a separate, fast disk, preferably an NVMe SSD with high write endurance.
Where WAL Can Bite You
WAL isn’t magic. It comes with real trade-offs that can bite you if you’re not careful. If you set checkpoint intervals too low, you’ll get “checkpoint storms” where PostgreSQL suddenly flushes tons of dirty pages, causing I/O spikes that can freeze your database for seconds. Set them too high, and recovery takes forever. It’s a delicate balance.
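One low-risk way to see which side of that balance you’re on is to log every checkpoint and read the timings in the server log (log_checkpoints is on by default since PostgreSQL 15):

```sql
-- Log each checkpoint with its trigger reason and write/sync timings.
ALTER SYSTEM SET log_checkpoints = on;  -- already the default on 15+
SELECT pg_reload_conf();                -- takes effect without a restart
```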
Understanding WAL helps you make better decisions about PostgreSQL configuration. Larger WAL segments (wal_segment_size, default 16MB, chosen at initdb time) can improve performance for write-heavy workloads but use more disk space. More frequent checkpoints mean faster recovery but more I/O overhead. WAL archiving is essential for point-in-time recovery and streaming replication. And as mentioned above, WAL belongs on its own fast disk, because you’re going to be writing to it constantly.
The key configuration parameters you need to understand: checkpoint_timeout controls how often checkpoints run (default 5 minutes), max_wal_size controls how much WAL can accumulate before forcing a checkpoint (default 1GB), wal_level controls how much information is written to WAL (minimal, replica, or logical), and synchronous_commit controls whether a transaction waits for its WAL to be flushed before reporting success (on by default; off trades a small durability window for speed). These parameters directly affect both performance and durability.
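As a concrete starting point, here’s what turning those knobs looks like. The values are illustrative assumptions, not recommendations; test them against your own workload:

```sql
ALTER SYSTEM SET checkpoint_timeout = '15min';        -- fewer checkpoints, longer crash recovery
ALTER SYSTEM SET max_wal_size = '4GB';                -- more WAL between forced checkpoints
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint I/O across the interval
ALTER SYSTEM SET wal_level = 'replica';               -- enough detail for archiving and replication
SELECT pg_reload_conf();  -- applies the first three; wal_level needs a full restart

-- Per-session escape hatch for non-critical bulk loads: don't wait for the
-- WAL flush at commit. A crash can lose the last few transactions, but it
-- can never corrupt the database.
SET synchronous_commit = off;
```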
Why This Matters
Reliability. Durability. Performance.
You can’t deny that the engineers behind PostgreSQL are amazing. At its core, WAL is a simple idea that has been refined over decades, and it’s a big part of why PostgreSQL is one of the most reliable and performant databases in the world. It’s a reminder that the best systems aren’t just functional. They’re beautiful in their simplicity and comprehensive in their problem-solving.