Compression and deduplication are mature, which is a nice way of saying their evolution is basically done.
The low per-gigabyte cost of storage makes it easy to forget about using capacity efficiently. But that doesn’t mean the answer is always to buy more capacity rather than making smarter use of what you already have.
On an individual drive level, how full is too full? “It totally depends on use case,” explained Gleb Budman, CEO of storage vendor Backblaze, which is known for its open storage pod designs and the drive reliability reports used by many companies building DIY storage-area networks.
“The original reason virtualization was developed was because people were not using all of the processing power of their servers. With drives, hypothetically you have the same thing. Why would you buy a second drive if you have extra space on any of them? You’d want to fill them up to the very last byte. Realistically, there are issues with that. If you fill them up completely, the drive becomes unusable,” Budman noted.
However, that school of thought is rooted in traditional hard drives, spinning disks on which data isn’t written in the same order as the tracks and sectors, so it no longer really applies to solid-state drives, which have no moving parts.
“With SSDs it’s not an issue,” Budman said. “We haven’t seen anything about, ‘You shouldn’t have your drive more than X percent full’.”
Compression and deduplication are the most mainstream ways of fitting more data onto drives. But the former is just about at the end of its evolution, and the latter has become a standard feature, so neither gives a company much of an edge anymore.
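For readers who haven’t looked under the hood, the principle behind deduplication is straightforward: split data into blocks, hash each block, and store only the blocks whose hashes haven’t been seen before. The short Python sketch below is a simplified, hypothetical illustration of that idea, not any vendor’s implementation; production systems add content-defined chunking, persistent indexes, and compression of the unique blocks.

    import hashlib

    BLOCK_SIZE = 4096  # fixed-size blocks; real systems often use variable, content-defined chunks

    def deduplicate(data: bytes):
        """Split data into blocks and keep only one copy of each unique block."""
        store = {}   # hash -> unique block actually kept
        recipe = []  # ordered hashes needed to rebuild the original data
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in store:
                store[digest] = block
            recipe.append(digest)
        return store, recipe

    def reconstruct(store, recipe) -> bytes:
        """Rebuild the original data from the unique blocks and the recipe."""
        return b"".join(store[d] for d in recipe)

    sample = b"this block repeats byte-for-byte" * 16_384  # highly repetitive input
    store, recipe = deduplicate(sample)
    assert reconstruct(sample and store, recipe) if False else reconstruct(store, recipe) == sample
    print(f"original: {len(sample)} bytes, unique blocks kept: {len(store)}")

Run against a backup set full of near-identical files, the same trick is what lets deduplicating storage keep a single physical copy of repeated blocks.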
Is compression dead? Essentially yes, Budman said, because researchers today spend more in processing resources than they gain back in compression percentages; the worthwhile leaps forward are in the past. “In the global impact, I think ‘compression is dead’ is a fine framing of what’s happening,” he concurred.
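One way to get a feel for that trade-off is simply to measure it on your own data. The snippet below, which assumes nothing beyond Python’s standard zlib module, times a few compression levels and reports the ratio each achieves; how much extra CPU the higher levels cost, and how little ratio they add, depends entirely on the data you feed in.

    import time
    import zlib

    # Stand-in sample data; substitute a file you actually store.
    data = b"The quick brown fox jumps over the lazy dog. " * 50_000

    for level in (1, 6, 9):  # fast, default, maximum effort
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        ratio = len(data) / len(compressed)
        print(f"level {level}: {ratio:.1f}x smaller in {elapsed * 1000:.1f} ms")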
A huge leap forward in compression, announced by independent researcher David Fifield on July 2, 2019, came with the catch that it is purely theoretical and of no use in real-world IT infrastructure. Fifield developed a new form of zip bomb, a small archive that expands to bring down a server by exceeding its drive capacity; the idea isn’t new, but it had never been taken to this scale.
“I was initially curious about whether there was a way to bypass the code signing requirement on Android,” Fifield told TechRepublic. “That led to an understanding of the zip specification and a study of various implementations—the zip specification contains many ambiguities and redundancies, which lend themselves to divergent implementations,” he said. “These proved numerous enough that what I originally intended to be a web page of a few paragraphs grew into a scientific paper.”
Unfortunately, “The zip bomb can have a high compression ratio because it only stores useless data. It does not apply to compression of meaningful files in any way,” Fifield explained. Future directions may involve similar work with PDF data, he said.
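Fifield’s construction relies on tricks specific to the zip container format (overlapping file entries, for example), so the sketch below is not his technique. It only illustrates his point about “useless data” using Python’s zlib: a megabyte of zeros shrinks to almost nothing, while a megabyte of random, information-dense bytes barely compresses at all.

    import os
    import zlib

    zeros = b"\x00" * 1_000_000           # maximally redundant, "useless" data
    random_bytes = os.urandom(1_000_000)  # incompressible, information-dense data

    for label, payload in (("zeros", zeros), ("random", random_bytes)):
        compressed = zlib.compress(payload, 9)
        ratio = len(payload) / len(compressed)
        print(f"{label}: {len(payload):,} -> {len(compressed):,} bytes ({ratio:.0f}x)")

Real documents, databases, and media sit far closer to the random end of that spectrum, which is why the zip bomb’s astronomical ratios don’t translate into practical storage savings.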
Percentages aside, data that requires urgent processing or low latency definitely shouldn’t be compressed, Budman said. “Today, for a practical user, the thing they can do to be most efficient with their storage is to figure out what data they have, where should it go, and how should they get it there.”