De-Duplicating Data Sets with Python
Introduction
There are occasions when it is necessary to de-duplicate records before further transformations are required and before loading into the data model target database server, whether Pyramid’s own in memory engine or some other RDBMS or analytic engine.
Pyramid provides a data transformation block in the “Data Preparation” section of Model (Advanced View) called “Distinct” which will de-duplicate based on identical records encountered in the data stream. However, sometimes you need to de-duplicate based on only a subset of the data fields in the record, i.e., the records are distinct only for those fields. Excel provides a de-dupe function upon which you can select fields to base the de-duplication. How can we achieve the same capability in our Pyramid data preparation flows?
Approach
The Python Package, Pandas, provides In the words of its homepage, “Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.
Pandas provides a ‘drop_duplicates()’ method that can be applied to Python DataFrames, which will de-duplicate records in the DataFrame based on specified fields.
From the Pandas website, the syntax is:
Implementation
Consider the following model, where we are loading some data downloaded in this case from the London Metals Exchange, showing the summary transactions for metals trading:
I want to de-duplicate the records based on the Date and Category. You can see there are multiple records for Date = 2013-01-02 and Category = Non Ferrous.
We can create a very simple Python script using the ‘drop_duplicates()’ method:
import pandas as pd
outputDF=inputDF.drop_duplicates(subset=['Date','Category'])
We are not specifying the values for the parameters ‘keep’ and ‘inplace’ as we are using the defaults, ‘first’ and ‘false’.
Using the ‘Add All’ and ‘Auto Detect’ buttons makes it very simple to add the data fields to the input DataFrame and create the output DataFrame for further processing. Note that the output will Create New Table.
We can see the results of the Python Script by clicking on the Python data flow block:
It is evident that we have successfully de-duplicated based on the Date and Category fields, as now there is only one instance of 2013-01-02 and Non Ferrous.
Summary
Using Python scripts in the data flow stage of building models provides an almost infinite degree of flexibility and functionality. Using the Pandas library makes many data wrangling tasks simple, easy and quick to implement and I would encourage you to read further on this topic if this is of importance to you.
I hope this article has been of interest and I’m eager to see any other novel uses of Python (or R) to access, retrieve, and process data into Pyramid.
Ian
Resources
Attached to this article is a zip file, de-dupe.zip that contains the model and the data file used in the discussion above.
Reply
Content aside
- 5 yrs agoLast active
- 100Views
- 1 Following