Power BI – Remove Duplicates From Multiple Columns
BY: David Rohlfs
For many reports, duplicates will need to be removed from data because it can cause your calculations to be incorrect. But there are also a lot of instances where it can be hard to determine what fields you need to remove the duplicate by. To do this you may need to use multiple columns of data to determine if a row is a duplicate or if it is still valid.
This blog will be a continuation of a previous EPM Strategy blog on Removing Duplicates. It contains some more complex ideas and processes. If you are only wanting to learn the basics, I suggest reading through the Removing Duplicates blog linked below.
In this blog we are going to go over why you would want to remove duplicates by multiple columns and a walkthrough showing you a couple of methods to do so.
When to use it
The reason that you want to use multiple columns to remove duplicates is because there is not one row specific column to remove duplicates on. In these cases, you need to find multiple columns that you are worried about and check for duplicates on those rows.
There are two methods that you can use to remove duplicates in the Power Query on a query. The first method is very much like the normal way of removing duplicates. All you need to do is control select all the columns that you are worried about duplicates and use the remove duplicate’s function. The second way is a little different and for more complex situations where you may want to filter to see the duplicates or remove them.
When using the second method of removing duplicates, you are allowing yourself to see why there are duplicates in your data, whether the duplicates will keep appearing, and if there are any fields that you should add to mitigate the number of duplicates you have.
One thing to note before learning how to remove duplicates is that removing a duplicated row will always keep the first row then remove all the duplicates after that first occurrence. To see an example of this please look at the EPM Strategy blog on Removing Duplicates linked below.
Before we go into the two methods to remove duplicates in this blog, it is important to know how to find if there are duplicates in your data and count your rows. For finding if there are duplicates, I suggest learning how to use the Group By function (Blog link below). And to count rows you just make sure that your query is selected, go to the Transform ribbon, and click Count Rows.
In this example we are looking at three fields. But you will notice that not all the rows are duplicates and may only contain duplicates in a singular column. Our goal is not to remove any duplicates in a single column, but in the group of three columns. I like to think of this as concatenating the values on a row, then removing the duplicates.
For the first method, all you need to do is select the columns you want using the Control key. I am selecting the Product Name, Date of Sale, and Sales Order Number columns.
Then you can go to Remove Rows and select Remove Duplicates.
This will remove all but the first duplicates for whichever entire set of values are duplicated.
For the second method we are adding a new column to give us the option of whether to remove the duplicates or just to find them.
To start we need to make a new column that contains all the columns that we are going to run the remove duplicates on. I am using the Product Name, Date of Sale, and Sales Order Number columns. To learn how to do this read through the Adding a Custom Column blog linked below. When adding in the column, separate each column with a comma. Your code will look like what is below.
Now you should have a column that contains each of the separate fields. In this you can do one of two things. You can either select the column and use the remove duplicates function as normal, or you can do a group by to find the duplicates and do a merge to move the results to the original query.
If you choose to do a group by and merge your results, I suggest reading through EPM Strategy’s blog on Utilizing the Group by Function.
Although removing duplicates is very easy from a simple idea, there are reasons to make it more complex and find why your rows are duplicates. If there are any areas in this blog that you don’t understand or want to find help with, I suggest starting with the blogs linked below.
Links Related to this Article: