Columnar Databases
A relational row-based database is like a Swiss Army Knife. It is pretty good at a lot of different things, but sometimes, you have a narrowly-defined purpose. Sometimes you need a scalpel instead of a Swiss Army Knife. A columnar database is like that scalpel.
Columnar databases are highly optimized for analytical reporting.
Now that we have a better idea of the difference between normal line-of-business databases and more analytical databases, let's talk a bit about how columnar databases work.
Row-Store Databases
Let's say that you have this big table with a bunch of columns.
Order ID | Order Date | Customer ID | Amount | Type | State |
10001 | 01/01/2021 | 1 | $2.00 | Modest | PA |
10002 | 01/01/2021 | 1 | $2.00 | Sorcerous | WV |
10003 | 01/01/2021 | 2 | $2.00 | Stone | OH |
10004 | 01/01/2021 | 2 | $2.00 | Tropical | PA |
10005 | 01/01/2021 | 3 | $2.00 | Marshy | WV |
10006 | 01/01/2021 | 3 | $2.00 | Meadow | OH |
10007 | 01/01/2021 | 4 | $2.00 | Forest | PA |
Now, we have only six columns here, but imagine there are 50 of them. And let's say that the key metric you want to report on is total sales, year to date, by state. Well, in that case, you don't care about all this other information. You just need the Amount column, the Order Date column, and the State column.
Despite having all these columns, we only care about three of them. A lot of analytics is like this: we need just one numerical column and a few categorical columns to filter by. Wouldn't it be great if we could just grab what we need without having to worry about the rest?
Normal relational databases store their data as rows, which means that if you need any piece of information, you are likely to pull out the whole row.
A row-based configuration is very inefficient for analytical processing. We want three pieces of information from each row, but we have to read each and every row in full. Then we have to throw out most of what we read; with 50 columns, that is more than 90% of the data. This is slow, and this is inefficient.
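To make the cost concrete, here is a minimal Python sketch of the idea, not how any real database engine is implemented. It keeps the example table as a list of whole rows and counts how many values get read to compute sales by state. The function name, the `values_read` counter, and the simplification of "year to date" to "same calendar year" are all assumptions made for illustration.

```python
from datetime import date

# Row-store layout (illustrative only): each order is kept together as one record.
# Field order: order_id, order_date, customer_id, amount, type, state
rows = [
    (10001, date(2021, 1, 1), 1, 2.00, "Modest", "PA"),
    (10002, date(2021, 1, 1), 1, 2.00, "Sorcerous", "WV"),
    (10003, date(2021, 1, 1), 2, 2.00, "Stone", "OH"),
    (10004, date(2021, 1, 1), 2, 2.00, "Tropical", "PA"),
    (10005, date(2021, 1, 1), 3, 2.00, "Marshy", "WV"),
    (10006, date(2021, 1, 1), 3, 2.00, "Meadow", "OH"),
    (10007, date(2021, 1, 1), 4, 2.00, "Forest", "PA"),
]

ORDER_DATE, AMOUNT, STATE = 1, 3, 5  # positions of the fields we actually need


def sales_by_state_from_rows(rows, year):
    """Total sales per state for one year, reading every row in full."""
    totals = {}
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row comes back, wanted or not
        if row[ORDER_DATE].year == year:
            totals[row[STATE]] = totals.get(row[STATE], 0.0) + row[AMOUNT]
    return totals, values_read


row_totals, row_values_read = sales_by_state_from_rows(rows, 2021)
row_values_needed = 3 * len(rows)  # only date, amount, and state matter
print(row_totals, row_values_read, row_values_needed)
```

With six columns, half of the values read are thrown away. The wider the table, the worse that ratio gets.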
We only care about three columns in our example: Amount, State, and Order Date. That is all we need for our year-to-date calculations. So how can we improve this situation?
Column-Store Databases
What if, instead of storing our data as rows, we stored it as columns?
Now we are able to read just the columns that we need, which is extremely efficient for analytical processing. This is how things work with DAX in the Vertipaq engine.
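As a rough illustration of the column-oriented layout, the Python sketch below stores each column of the example table as its own list and touches only the three columns the report needs. It is a toy model under stated assumptions, not a description of how Vertipaq works internally. The dictionary layout, the function name, and the simplification of "year to date" to "same calendar year" are hypothetical choices made for this example.

```python
from datetime import date

# Column-store layout (illustrative only): one list per column,
# with position i in every list belonging to the same order.
columns = {
    "order_id": [10001, 10002, 10003, 10004, 10005, 10006, 10007],
    "order_date": [date(2021, 1, 1)] * 7,
    "customer_id": [1, 1, 2, 2, 3, 3, 4],
    "amount": [2.00] * 7,
    "type": ["Modest", "Sorcerous", "Stone", "Tropical", "Marshy", "Meadow", "Forest"],
    "state": ["PA", "WV", "OH", "PA", "WV", "OH", "PA"],
}


def sales_by_state_from_columns(columns, year):
    """Total sales per state for one year, reading only three columns."""
    dates = columns["order_date"]
    amounts = columns["amount"]
    states = columns["state"]
    values_read = len(dates) + len(amounts) + len(states)

    totals = {}
    for order_date, amount, state in zip(dates, amounts, states):
        if order_date.year == year:
            totals[state] = totals.get(state, 0.0) + amount
    return totals, values_read


col_totals, col_values_read = sales_by_state_from_columns(columns, 2021)
total_values_stored = sum(len(values) for values in columns.values())
print(col_totals, col_values_read, total_values_stored)
```

The other three columns are never touched. In a 50-column table the same query would still read only three columns.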
Vertipaq
The engine used to store data as columns, alternatively known as xVelocity.
Vertipaq is the database engine Microsoft uses to store data as columns, allowing for efficient analytical processing. This is the secret sauce that lets you process millions of rows on your personal laptop. The marketing buzzword for this and related technologies is xVelocity.
DirectQuery
The engine that translates DAX formulas into relational SQL queries.
Now, nine times out of ten, if you are dealing with DAX, you are dealing with the Vertipaq engine. The other 10% of the time, you are dealing with something called DirectQuery.
DirectQuery is a really cool technology by Microsoft that translates DAX formulas into relational SQL queries. It lets you take advantage of the conciseness of DAX while still using your existing row-store database. Unfortunately, you lose a lot of the performance benefits and some of the capabilities of DAX if you use DirectQuery.