Random Writes & Flushes: Your mileage will vary greatly
The differences in sequential write rates are interesting to note, but for most users they won’t make for as notable a difference in overall performance as random writes.
What’s a long time for a random write? Well, an average HDD can typically move 4 KB random writes to its spinning media in 7 to 15 milliseconds, which has proven to be largely unacceptable. As a result, most HDDs come with 4, 8 or more megabytes of internal memory and attempt to cache small random writes rather than wait the full 7 to 15 milliseconds. When they do cache a write, they return success to the OS even though the bytes haven’t been moved to the spinning media. We typically see these cached writes completing in a few hundred microseconds (so 10X, 20X or faster than actually writing to spinning media). In looking at millions of disk writes from thousands of telemetry traces, we observe 92% of 4 KB or smaller IOs taking less than 1 millisecond, 80% taking less than 600 microseconds, and an impressive 48% taking less than 200 microseconds. Caching works!
On occasion, we’ll see HDDs struggle with bursts of random writes and flushes. Drives that cache too much for too long and then get caught with too much of a backlog of work to complete when a flush comes along, have proven to be problematic. These flushes and surrounding IOs can have considerably lengthened response times. We’ve seen some devices take a half second to a full second to complete individual IOs and take 10’s of seconds to return to a more consistently responsive state. For the user, this can be awful to endure as responsiveness drops to painful levels. Think of it, the response time for a single I/O can range from 200 microseconds up to a whopping 1,000,000 microseconds (1 second).
When presented with realistic workloads, we see the worst of the SSDs producing very long IO times as well, as much as one half to one full second to complete individual random write and flush requests. This is abysmal for many workloads and can make the entire system feel choppy, unresponsive and sluggish.Random Writes & Flushes: Why is this so hard?
For many, the notion that a purely electronic SSD can have more trouble with random writes than a traditional HDD seems hard to comprehend at first. After all, SSDs don’t need to seek and position a disk head above a track on a rotating disk, so why would random writes present such a daunting a challenge?
The answer to this takes quite a bit of explaining, Anand’s article
admirably covers many of the details. We highly encourage motivated folks to take the time to read it as well as this fine USENIX paper
. In an attempt to avoid covering too much of the same material, we’ll just make a handful of points.
Most SSDs are comprised of flash cells (either SLC
). It is possible to build SSDs out of DRAM. These can be extremely fast, but also very costly and power hungry. Since these are relatively rare, we’ll focus our discussion on the much more popular NAND flash based SSDs. Future SSDs may take advantage of other nonvolatile memory technologies than flash.
A flash cell is really a trap, a trap for electrons and electrons don’t like to be trapped. Consider this, if placing 100 electrons in a flash cell constitutes a bit value of 0, and fewer means the value is 1, then the controller logic may have to consider 80 to 120 as the acceptable range for a bit value of 0. A range is necessary because some electrons may escape the trap, others may fall into the trap when attempting to fill nearby cells, etc… As a result, some very sophisticated error correction logic is needed to insure data integrity.
Flash chips tend to be organized in complex arrangements, such as blocks, dies, planes and packages. The size, arrangement, parallelism, wear, interconnects and transfer speed characteristics of which can and do vary greatly.
Flash cells need to be erased before they can be written. You simply can’t trust that a flash cell has no residual electrons in it before use, so cells need to be erased before filling with electrons. Erasing is done on a large scale. You don’t erase a cell; rather you erase a large block of cells (like 128 KB worth). Erase times are typically long -- a millisecond or more.
Flash wears out. At some point, a flash cell simply stops working as a trap for electrons. If frequently updated data (e.g., a file system log file) was always stored in the same cells, those cells would wear out more quickly than cells containing read-mostly data. Wear leveling logic is employed by flash controller firmware to spread out writes across a device’s full set of cells. If done properly, most devices will last years under normal desktop/laptop workloads.
It takes some pretty clever device physicists and some solid engineering to trap electrons at high speed, to do so without errors, and to keep the devices from wearing out unevenly. Not all SSD manufacturers are as far along as others in figuring out how to do this well.