Splitting numeric and text column data
In the VertiPaq engine, data is stored in a columnar data format. This means for a given query or table, each column is given a data type and all records for each column are stored independently from each other, allowing for optimized compression for each data type.
Numeric data type data in the data model uses value encoding to store the data in an optimized way. Value encoding is where data gets stored on disk or in memory based on a mathematical operation in order to gain compression. For example, say we have a table of numeric values, as follows:
Figure 8.1 – Example ZIP code table with data values without value encoding compression
Using value encoding, the data stored might look more like this:
Figure 8.2 – Example ZIP code table with value encoding compression
To arrive at the value-encoded data, we simply subtract the lowest value from the table: 45674. This means that the underlying numeric type can often be changed from long (64-bit integers), to integer (32-bit integers), to short (16-bit integers), which can have a dramatic impact on the amount of memory or disk storage needed to store the data. Power BI can use much more advanced mathematic algorithms to do value encoding.
Text and non-numeric data type columns use hash encoding. Hash encoding is similar in that it replaces the original value with another value that takes up fewer bytes for storage; however, the algorithm used to store the values tends to need to store an entire dictionary in order to encode and decode the values (rather than a simple mathematical formula). Hash encoding tends to be less efficient than value encoding so it is important to use tools such as Tabular Editor to provide hints to the VertiPaq engine to ensure the best encoding is used.
In some cases, you may have text data type columns that store both text and numeric data. In some cases, it can be helpful to split a column such as this into two columns, setting one as the text data type and the other as numeric. This allows the VertiPaq engine to use hash encoding on the text component and value encoding on the numeric component. Even greater efficiency can be achieved when there is high cardinality in the text component (reducing the size of the dictionary-needed hash encoding) of the column.
Splitting columns is a feature of the Power Query Editor, and it’s possible to split columns by the following:
- Delimiter
- Number of characters
- Positions
- Lower- and uppercase characters
- Digit and non-digit characters
Next, we’ll look at other ways of optimizing using measures, relationships, and visuals.