Compression and Encoding

What is Encoding

We have talked about why analytics favors a column-based approach and how Vertipaq works with columns. We are going to dive a little bit deeper. We are going to show how using a columnar data store allows for really impressive compression, which improves performance dramatically.

So, to understand how compression works in the Vertipaq engine, we need to understand encoding and what it is. Encoding is basically how do you store the data on disk. It turns out that when data follows certain patterns, we can take advantage of those patterns to take up far less space than we normally would.

Types on Encoding

  1. Value Encoding
    This is gonna be pretty uncommon, but this is where you take a series of integers that are all roughly in the same narrow range and subtract the base value from all of them. Because they are all shorter, It is possible to store them in a way that requires fewer bytes per value.

  2. Dictionary Encoding
    This is where common values in a column are replaced with numerical references to a dictionary object which can be referenced later on. For column with repeated values, this can save a significant amount of space.

  3. Run-Length Encoding
    This is where you have long runs of the same value, like, say, the name of a state. Instead of repeating that value every single time, you store the value and then the length of that run. So if "Ahmed" shows up 30 times consecutively in a column, you would simply store "Ahmed" and the number 30. For columns with millions of items, this can save tremendous amounts of space.

Now that we have covered the different types of encoding, let's go into a little bit more detail for each of them.

Value Encoding

AccountAccount
850060016001
850060026002
850060036003
850060046004
850060056005
850060066006
850060076007

Have you ever gotten a medical bill where your account number was 85006016 and perhaps laughed because you know for a fact that that hospital does not have 85 million patients. I know that I have. I've wonder why those numbers are so goofy. But I think what they are doing is they are using the first four digits as the department number or something like that, or there was was some new ERP system and so they hade to renumber all the customers. I'm not Sure. But, if you have a column of account number like that, it is pretty inefficient to write them all out when maybe you only have a few thousands accounts.

Value Encoding says, let's grab the lowest possible value in the range (85000000) and then store everything else as the delta or difference from that value. Now, instead of needing eight digits to store every account number, we only need four. Thant's half the space. The same concept applies with values encoding, but instead of decimal digits, it is binary.

Dictionary Encoding

ColorKeyColorColor
Blue1Blue1
Green2Green2
Green3Red2
Red3
Red3
Red3
Red3

Value Encoding works great for integers that happen to be in narrow range, but what if we are dealing with text values? For that and nearly everything else, we use Dictionary Encoding. Dictionary Encoding is pretty simple when you get down to it. If a column consists of a few unique values, like a column of states or countries, the engine goes through and assign each unique value a number.

This creates an object called a dictionary, as it is known in the computer world. This dictionary is a mapping between the actual value and a shorter, simpler numerical key. Then in the original data, we replace all the actual values with the corresponding keys. This gives us a list of integers that takes far less space and is far quicker for the computer to read and process.

Oftentimes, you don't need to consult the dictionary for every entry. For example, if we need to filter on read, the computer can just filter the entire column on the number 3, and it never has to read the dictionary a second time.

Run-Length Encoding

This is only possible because we are thinking in columns and not rows. When you have a column, you are likely to have runs or repeated groupings of values. This is especially true when storing becomes involved. If you have a column with a million items in it, it's completely possible for one of those values, say, a state name, to show up hundreds of thousands of times. Instead of repeating that value over and over and over, taking up unnecessary space, it's possible to condense it.

In this case, what we do is we store the value, and then we store the length of the run, or essentially, the length of the number of repeated values. So, here in this example, red is being repeated four times. And so instead of writing out red, red, red, red, we write red, 4. The same concept applies for DAX.

The important thing to take away is that by thinking in columns, we can take a huge number of a few unique values and compress them into efficient storage. This wouldn't be possible with your traditional row-based date stores, because some of our encoding depends on large runs of the same value. In short, Columnar Storage turns repeated values from a waste of space to an asset and allows us to deal with larger datasets than we normally could.

Columnar Storage turns repeated values from a waste of space to an asset.