De-Duplicating Data Sets with Pyramid


Following on from my previous blog on de-duplicating records using Python, I was curious to see if I could achieve the same result with some creative use of the standard Pyramid Model data flow control blocks.

After some initial false starts, I built the following data flow to de-duplicate the same record set using the same data fields.



The principle relies on the use of two Preparation data flow blocks, “Add Sequence” and “Summarize” and the “Join” block.

Add Sequence adds a unique identifier to each record in the flow: either an alphanumeric UUID or a simple rising integer value.
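For comparison with the earlier Python approach, the Add Sequence step can be sketched in pandas terms. The column names and sample data here are hypothetical, not taken from the attached model:

```python
import pandas as pd

# Hypothetical sample data containing duplicate Date / Category rows
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Category": ["A", "A", "B", "C"],
    "Value": [10, 10, 20, 30],
})

# Equivalent of the Add Sequence block: a simple rising integer per record
df["Seq"] = range(1, len(df) + 1)
```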



We then use the Summarize block with Group By applied to the Date and Category fields and the “Min” function applied to the added Seq field. This creates a single record for each duplicated set of Date and Category values, holding the minimum Seq value for that combination. It does not really matter whether you use Min or Max; the idea is to keep a single, unique Seq value for each Date / Category combination. However, to replicate the Python de-duplicate function exactly, Min returns the first duplicated row and Max the last.
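The Summarize step corresponds to a grouped minimum. A minimal pandas sketch, again using hypothetical columns and data:

```python
import pandas as pd

# Hypothetical records, already carrying a Seq identifier
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Category": ["A", "A", "B"],
    "Seq": [1, 2, 3],
})

# Equivalent of the Summarize block: Group By Date and Category,
# keeping the minimum Seq for each combination
summary = df.groupby(["Date", "Category"], as_index=False)["Seq"].min()
summary = summary.rename(columns={"Seq": "Min_Seq"})
```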




We then use the Join block to join the summarized records back to the main data flow, joining on the Min_Seq and Seq fields respectively. As each Seq value is unique, this gives us, in effect, a one-to-one join that selects just the first record in each set of duplicate Date / Category combinations.
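The Join step amounts to an inner join of the summary back onto the full record set. A sketch, assuming the same hypothetical columns as above:

```python
import pandas as pd

# Full record set with a Seq identifier
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Category": ["A", "A", "B"],
    "Value": [10, 10, 20],
    "Seq": [1, 2, 3],
})

# Summarized records: one Min_Seq per Date / Category combination
summary = pd.DataFrame({"Min_Seq": [1, 3]})

# Equivalent of the Join block: inner join Min_Seq against Seq, so only
# the first record of each duplicate group survives
deduped = df.merge(summary, left_on="Seq", right_on="Min_Seq")
```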



We can also drop the joined Group By and Seq fields now that they have done their job.
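In pandas terms this is a simple column drop on the joined result (hypothetical data, continuing the sketch above):

```python
import pandas as pd

# Hypothetical joined result, still carrying the helper columns
deduped = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Category": ["A", "B"],
    "Value": [10, 20],
    "Seq": [1, 3],
    "Min_Seq": [1, 3],
})

# Drop the helper fields now that they have done their job
deduped = deduped.drop(columns=["Seq", "Min_Seq"])
```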



We can see the results below, where it is evident that the duplicates have been dropped and, more importantly, that the results exactly match those of the Python drop_duplicates() function.




With some lateral thinking and creative use of the Pyramid data flow blocks, the Python drop_duplicates() method can be replicated directly in the Pyramid Model data flow without recourse to external libraries and functions.

Simple testing on this small data set shows that using only the built-in Pyramid data flow blocks results in significantly faster processing. The data set above was processed in around 3 seconds using the pure Pyramid approach, whereas the external Python libraries took up to 10 times longer, at around 30 seconds. Much of this time is probably accounted for by initializing Python and passing the data to it and back, so for large data sets the timings may well be closer together. I will test on a large data set and provide an addendum to this blog once I have the results.

I hope you have found this article of interest and useful.



Attached to this article is a Zip file containing the model definition and the data used. You can download these items and reproduce the flow in your own environment.


