Data cleaning in DAX: Splitting text strings
This tutorial is also published on my personal website.
When working on datasets that contain text strings, it is common that we need to split a text string into two or more portions, or to extract a part of the text string. This quick tutorial will show how we can work with text strings in Power BI using DAX language.
Recently, I was working on a dataset of job postings for Data Analyst position in the US on Glassdoor. The dataset includes several parts of the job postings including the estimated salary, but when I tried to visualize the distribution of salary category I realized the salary groups are inconsistently categorized, meaning that a job has been estimated a salary between $46K-$87K and another posting has the value of $51K-$88K and so on. Another issue was that there is a constant text of “(Glassdoor est.)” in front of each value, separated by a space character.
In this case the first step is to remove the constant text from the values. The easy way is to use “Split Column” feature in Power Query Editor. There are different options available to make splits, in this case we choose “Split by Delimiter” and will select Space as the delimiter.
Now we have split the column into two different columns. We can delete the second column as we will just need the first one. In the first column, now we have different ranges of salaries. We can either make our own groupings, or calculate the mean for each row and visualize the distribution of salary means. Therefore, we will need to extract two numbers from the text string, the minimum and maximum of each row’s salary range.
One thing we should take into consideration is that not all the rows have the same length of characters (Some salaries have three digits while some have two). In below code, I defined separate variables for the minimum and the maximum range and at the end calculated the mean and assigned it to another variable:
Salary AVG =var SalMin = MID([Salary Group], 2, SEARCH("-", [Salary Group],1)-3)var SalMax = MID([Salary Group], SEARCH("-", [Salary Group],2)+2 ,IF(LEN([Salary Group]) = 9, SEARCH("K", [Salary Group],7)-7, SEARCH("K", [Salary Group],7)-8))var SalAVG = (SalMax + SalMin)/2return SalAVG
Variable SalMin
takes characters starting from the second character of column Salary Group
and goes further, then searching for the “-” character and specifies the character number that contains the “-” character. As it can be seen we need to subtract 3 characters to reach the minimum range number.
If we use the same formula for the maximum range number, we will receive undesired output for the ranges that contain three digit numbers for both minimum and the maximum range. To avoid this, I let an IF statement decides how much the MID formula should go further in extracting the range number.
In order to ensure the right outcome generated, I have added two other columns for the minimum and the maximum range. It is suggested that to use as few calculated columns as possible in our dataset to avoid performance issues. So once we ascertained about the correctness of the output, we can delete those columns.
When working with datasets, it is common that we need to clean the dataset or customize it based on the data we require for executing analytics. Above mentioned techniques can be used when we need to quickly organize text strings or extract values from a text.