Types of Data Encoding

What is Encoding

We have talked about why analytics favors a column-based approach and how Vertipaq works with columns. We are going to dive a little bit deeper. We are going to show how using a columnar data store allows for really impressive compression, which improves performance dramatically.

So, to understand how compression works in the Vertipaq engine, we need to understand encoding and what it is. Encoding is basically how do you store the data on disk. It turns out that when data follows certain patterns, we can take advantage of those patterns to take up far less space than we normally would.

Types on Encoding

Value Encoding
This is gonna be pretty uncommon, but this is where you take a series of integers that are all roughly in the same narrow range and subtract the base value from all of them. Because they are all shorter, It is possible to store them in a way that requires fewer bytes per value.
Dictionary Encoding
This is where common values in a column are replaced with numerical references to a dictionary object which can be referenced later on. For column with repeated values, this can save a significant amount of space.
Run-Length Encoding
This is where you have long runs of the same value, like, say, the name of a state. Instead of repeating that value every single time, you store the value and then the length of that run. So if "Ahmed" shows up 30 times consecutively in a column, you would simply store "Ahmed" and the number 30. For columns with millions of items, this can save tremendous amounts of space.

Now that we have covered the different types of encoding, let's go into a little bit more detail for each of them.

Value Encoding

Account		Account
85006001		6001
85006002		6002
85006003		6003
85006004		6004
85006005		6005
85006006		6006
85006007		6007

Have you ever gotten a medical bill where your account number was 85006016 and perhaps laughed because you know for a fact that that hospital does not have 85 million patients. I know that I have. I've wonder why those numbers are so goofy. But I think what they are doing is they are using the first four digits as the department number or something like that, or there was was some new ERP system and so they hade to renumber all the customers. I'm not Sure. But, if you have a column of account number like that, it is pretty inefficient to write them all out when maybe you only have a few thousands accounts.

Value Encoding says, let's grab the lowest possible value in the range (85000000) and then store everything else as the delta or difference from that value. Now, instead of needing eight digits to store every account number, we only need four. Thant's half the space. The same concept applies with values encoding, but instead of decimal digits, it is binary.

Dictionary Encoding

Color	Key	Color	Color
Blue	1	Blue	1
Green	2	Green	2
Green	3	Red	2
Red			3
Red			3
Red			3
Red			3

Value Encoding works great for integers that happen to be in narrow range, but what if we are dealing with text values? For that and nearly everything else, we use Dictionary Encoding. Dictionary Encoding is pretty simple when you get down to it. If a column consists of a few unique values, like a column of states or countries, the engine goes through and assign each unique value a number.

This creates an object called a dictionary, as it is known in the computer world. This dictionary is a mapping between the actual value and a shorter, simpler numerical key. Then in the original data, we replace all the actual values with the corresponding keys. This gives us a list of integers that takes far less space and is far quicker for the computer to read and process.

Oftentimes, you don't need to consult the dictionary for every entry. For example, if we need to filter on read, the computer can just filter the entire column on the number 3, and it never has to read the dictionary a second time.

Run-Length Encoding

This is only possible because we are thinking in columns and not rows. When you have a column, you are likely to have runs or repeated groupings of values. This is especially true when storing becomes involved. If you have a column with a million items in it, it's completely possible for one of those values, say, a state name, to show up hundreds of thousands of times. Instead of repeating that value over and over and over, taking up unnecessary space, it's possible to condense it.

In this case, what we do is we store the value, and then we store the length of the run, or essentially, the length of the number of repeated values. So, here in this example, red is being repeated four times. And so instead of writing out red, red, red, red, we write red, 4. The same concept applies for DAX.

The important thing to take away is that by thinking in columns, we can take a huge number of a few unique values and compress them into efficient storage. This wouldn't be possible with your traditional row-based date stores, because some of our encoding depends on large runs of the same value. In short, Columnar Storage turns repeated values from a waste of space to an asset and allows us to deal with larger datasets than we normally could.

Columnar Storage turns repeated values from a waste of space to an asset.

Compression and Encoding

Table of contents

What is Encoding

Types on Encoding

Value Encoding

Dictionary Encoding

Run-Length Encoding