Infinite compression

Sorry to post something technical but I just have to share this with somebody, and my wife’s already heard it and people at work are sick of hearing about it.

Imagine with me, if you will, that you have 32 kilobytes of data. That is 32,768 letters, numbers, or symbols, if you want to look at it one way. How many different possible combinations is that?

Well, if you only have one byte (that’s the technical word for one of these letters, numbers, etc.), you can have 256 possible combinations. If you have two bytes, then you have 2562, or 256 × 256, or 65,536 combinations. If you have three bytes, then you have every possibility for the first two bytes, times 256 more possibilities in combination with them, or 2563, or 256 × 256 × 256, or 16,777,216 combinations.

So, if you have 32,768 bytes, you have 25632,768, or roughly 3.23170058 × 10616 different possibilities as to what’s in there. That’s 323,170,058, followed by 609 zeroes.

In our most liberal guesses, the number of atoms in the universe is 1081, last I heard. Hopefully that will give you some idea of how many different possibilities there are, in just 32K of RAM. (Your computer these days probably has at least 8,000 times as much memory - not storage space, I’m talking what you use when you turn the thing on and use it - as this.)

Now, let’s say you’re backing up this data only when it changes. How do you keep track of whether the data has changed or not?

You can do this by using what we call a “checksum.” A simple checksum would be to add up every value in your 32,728 bytes of space. Then the maximum value of that checksum that you have to keep in addition to the data is 256 × 32,768, or 8,388,608. It’s a big number to us, but a small number in comparison with all the possibilities. You can store every round number between 0 and 8,388,607 in three bytes of data. (Actually three bytes holds 16,777,216 different possibilities, or 224.)

So, if every night when you want to see if the data changed, you add up all the numbers, and the number changes, you can be sure that your data changed. Simple enough, right?

But what if the checksum didn’t change? What if you changed the first byte from 10 to 12, and the fifteenth byte from 250 to 248? Then you have the same checksum (we added two and subtracted two), but different data. Welcome to “collision space,” where two different sets of data can produce the same checksum.

It gets worse. This assumes that we are actually dealing with truly random data, where each byte could actually have anything from 0 to 255 in it. Let’s say we’re dealing with plain text: letters, numbers, anything you can type on your keyboard, and a couple of control characters. Let’s be generous and say that there are 100 different possibilities instead of 256. That means that the possible checksums are now 100 × 32,768, for a total of only 3,276,800 different checksums. Suddenly it is more than twice as likely that you will find colliding checksums, where two different blocks of text have the same checksum. And data is rarely completely random.

There are workarounds for this. Almost nobody uses plain checksums anymore; we have sophisticated mathematical transformations that produce a new kind of checksum, usually called a “hash,” and it is far better than just adding things up; a change like we had above, where we changed only two bytes, is now far more likely to produce different checksums.

But still, we only have as many checksums or hashes as we have space for them, and they have to be significantly smaller than the data size itself to be of any use to us. Let’s get really crazy, though, and say that we are dealing with a 64-bit checksum. That’s 18,446,744,073,709,551,616 different possibilities.

That’s still not much compared to the size of the data possibilities, though. That means that, for every possible hash of the data, we could have as much as 1.75191 × 10597 different datasets that produce the same checksum, and more if our hashing algorithm doesn’t result in perfect distribution for some reason.

In short, you have to make some decisions to trust your hashing algorithm specifically before you decide that you can use this as a system to determine whether a certain 32K block has changed.

Now, at work we had a vendor this spring (let’s call them Online-Backups-R-Us, or OBRU for short) who was going to use hashing to determine whether data had changed or not. Fine, at this stage of the game that’s really the only way to do backups. So, I asked as part of our own due diligence, can we get any details about the hashing? What safeguards do you have to prevent collisions?

The answer from the OBRU sales droids: the system is perfect, there are no collisions, every different block of data will always have a different checksum.

I’m sorry, what? Again, I’m not against checksums and hashes and so forth, I just want to be sure that OBRU has thought things through. But the salespeople insisted to me, my colleagues, and two levels of management above me, that any block of data of a certain size can be reduced to a smaller block of data which is unique to the first one.

Think about that a minute. That means that your data is infinitely compressible. I mean, you could just convert all your data to checksums (reducing them in size by hundreds of orders of magnitude), then if that’s too big, do it again. It’s OK, you won’t lose anything, because it can always be uniquely hashed. Of course, the trick would be to derive data back out of a hash, but if the one process is possible, then the other must be possible as well, if the hash is truly unique to the original data. Infinite compression! Eventually all the data in the world could be checksummed down to one byte, or one bit. No more MP3s, and instead of DVDs we could all go back to 5.25-inch floppy disks assuming that we had computers that were powerful enough.

Of course that’s ridiculous. But what’s more ridiculous is this: I never found a salesperson or engineer there who would admit that data was not infinitely hashable (and therefore compressible), and most of the people from my work in that meeting believed them and assumed that I was just being hard on them, or that I didn’t really understand. We ended up buying OBRU’s product.

I know that the odds of actually running into a collision in the 64 or 128-bit hash on the same block of data (with changes) is extremely low, I just wanted to be sure I was dealing with engineers and not magicians. But apparently our backups are guaranteed by magic. The only answer that I could get out of OBRU was that NASA was using them. I can only hope that they provided some real answers to NASA. But I’m not even that optimistic, really.

This was not the lowest point of my year at work. But it is one of the few that I can laugh about, albeit with a scowl.

5 Comments

  1. Patrick Chan (620 comments.)
    Posted 1/4/2007 at 7:25 pm | Permalink

    Eh. This is old news. The Japanese perfected this sort of data compression ages ago. Not only that, but they’ve extended it to people, too.

    I know, I know, it’s hard to believe. In fact, I, too, had to see for myself. So I let them try it on me on my last visit to H.K.H.Q. — Hello Kitty Head Quarters.

    (Why I was at H.K.H.Q. is another story. But suffice it to say I am a huge Keroppi fan! Ribbit, ribbit!)

    (Also, how a Japanese toy company came to be the first to develop this amazing people compression technology is beyond me. Good news for us Americans, though: it’s rumored Microsoft is in the works to develop something similar. I believe the project’s code name is “the XPox ≤360.” Their current slogan should work well for them, too: “America, prepare to be infected with the XPox!” Also: “I got a fever and the only prescription is more XPox!”)

    Anyway, long story short, after entering into an ironclad chamber with bright lights and other cool little doohickies flickering on and off, famed Japanese scientist and inventor, Dr. Ganon, by the authority of Princess Toadstool, pushed some sort of a button or pulled some sort of a lever, and… *giggle* *giggle* *google* voila!

    Presto changeo, rearrangeo –

    I, myself, Patrick, now stand before you as a compression of the original, larger Patrick. Or, actually, not so much stand as shrink. Yes, I, myself, Patrick, now shrink before you… Er, or something like that. In short, I am Mini-Me.

    Of course, this technology was not without its setbacks. Especially during the animal testing stage. I mean, how do you think those Godzilla and Mothra documentaries came to be filmed in the first place?

    Fortunately, the Japanese were able to reverse the process. Hence the people compression technology.

    Otherwise, imagine what horrific things might’ve come to pass. Such as, well, let’s not mince words: we might have a giant, hulking Hello Kitty on our hands! Would any of us want that? I’m getting goosebumps just thinking about it. Brr!

  2. Patrick Chan (620 comments.)
    Posted 1/4/2007 at 8:56 pm | Permalink

    And that does stink that some folks at your workplace gave the benefit of the doubt to OBRU rather than to their very own admin! Sheesh.

  3. Michael
    Posted 1/20/2007 at 1:55 pm | Permalink

    Pat,
    Let me shed a little light if I can on your checksum post. Checksum or Hash technology is very accurate and very useful for verifying data integrity. Incredible as it may sound at first, it’s true that a simple string of maybe 50 characters can “represent” a 2 gigabyte file. If you change just one bit of the original data, the Checksum or Hash will show this.
    So you ask, “Why can’t we just turn all data into Checksums/Hash and there by compress all data into just a few bytes?” Here are the nuts and bolts to that question in layman’s terms:
    The physical universe is composed of Mater, Energy, Space and Time. Binary or “Computer” data is “Perfect” data. It has the Exact amount of information needed to express it’s self. No more no less. It’s perfect because it’s defined by it’s self. It has Mater (the one or zero) it was produced with Energy (1 = energy/on and 0=energy/off) it is presented in Time (one bit “after” the other) and it has Space (each bit is in a different Location/Space).
    A Checksum or Hash is the binary file with Space removed. Without space there can be no time of course, so you end up with just Mater and Energy.
    So, a Checksum or Hash is simply a recording/record of the Mater and Energy of a data string. They are a perfect record of the Mater and Energy of a file (provided the Checksum/Hash formula is correct). Therefore, a Checksum/Hash can only tell you if a file is the same or different from another file (or datum).
    In a nut shell, a Checksum/Hash can tell you if something is or is not. However, it can not tell you “What” it is or is not, because it never recorded the space (and there for time) of the original.
    Data Compression is simply the process of trading Space, Energy and Mater with Time. A compressed file simply has more time and less space, energy and mater. It’s efficient/useful because a computer can process tens of millions of bits of “Time” per second.
    The state of our current technology allows us to manipulate Time better than Space, Energy and Mater. So, trading some “Excess” Time for the more rare Space, Energy and Mater is currently a useful process. This could and will change as we improve our technologies in regards to Mater, Energy and or Space.
    You use the technology of Checksums/Hash all day, and every day. Clocks, Stopwatches, calendars, timers, radar guns, Speedometers in your car and even the Sun can tell you that an event did or did not take place or that something is or is not. But they don’t tell you “What” took place. They are all Binary in nature. They simply tell you “Yes” or “No” (on/off) or it is or is not.
    It’s interesting to note that Children usually have more Space, Time and Energy and less Mater than Adults. Adults tend to have less Time, Space and Energy but more Mater. It would appear that the “Fountain of Youth” may have more to do with the proper application of the relationships of Time, Mater, Energy and Space than with water.

    Michael

  4. Posted 1/20/2007 at 2:52 pm | Permalink

    In other news, this week was next week, last week.

  5. P.M. Hawryschuk
    Posted 1/31/2007 at 12:56 am | Permalink

    I hope the second person to respond was trying to be as amusing as the first.

    I would definitely have to side with you on this issue.

    I’ve previously put some thought into the topic of compression and what I like to call “supper compression”. Ex. AB = A; and the only way this becomes viable is to have the size of the compression utility itself increase in size, by having it store multiple supper compression algorithms.
    Even though supper compression is possible infinite compression is not. To utilize multiple algorithms and the reapplication of them over and over again, the file’s amount of data for storing the history of how it was compressed grows the more times it has to change algorithms or reapply them and at a certain point the compressed data becomes small enough for it to not be worth adding more compression data to the compressed data (unless you choose to have the actual compression utility to grow in size more; the data has to go somewhere and there’s always a limit)

    Anyways I trailed off a little on the topic of compression to mention that infinite compression is not possible. As for the product from OBRU, no purchase should have been made from a person spewing lies such as “there are no collisions”. As a sales person they should have been trained to demonstrate how their product compares to another one. Ex. If not showing their algorithm, at least showing that their product results in %30 less possibility of collisions; or something, at least something, not just that “NASA was using them”.

    Sorry to hear that no one at work cared enough to spend a minute or two to sit and think, to realize that maybe you had a valid point.

    Peace out,
    Philpé

Bad Behavior has blocked 950 access attempts in the last 7 days.