De-Duplicating Data Sets with Python

VP Product Management
Ian_Macdonald
updated 6 yrs ago


Introduction

There are occasions when it is necessary to de-duplicate records before applying further transformations and before loading into the data model's target database server, whether that is Pyramid’s own in-memory engine or some other RDBMS or analytic engine.

Pyramid provides a data transformation block called “Distinct” in the “Data Preparation” section of Model (Advanced View), which de-duplicates records that are identical across all fields in the data stream. However, sometimes you need to de-duplicate based on only a subset of the data fields, i.e., keep one record for each distinct combination of those fields. Excel provides a de-dupe function that lets you select the fields on which to base the de-duplication. How can we achieve the same capability in our Pyramid data preparation flows?

Approach

The Python package Pandas provides a solution. In the words of its homepage, “Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”.

Pandas provides a ‘drop_duplicates()’ method that can be applied to Pandas DataFrames, which will de-duplicate records in the DataFrame based on specified fields.
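To make the contrast with a full-record distinct concrete, here is a minimal sketch. The column names follow the example used later in this article, but the rows and the Volume column are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data: values are invented for illustration only.
df = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-02", "2013-01-02", "2013-01-03"],
    "Category": ["Non Ferrous", "Non Ferrous", "Non Ferrous", "Non Ferrous"],
    "Volume": [100, 100, 250, 75],
})

# No subset: only rows that match in every column are collapsed.
full_row = df.drop_duplicates()

# With subset: rows are compared on Date and Category only.
by_key = df.drop_duplicates(subset=["Date", "Category"])

print(len(full_row))  # 3
print(len(by_key))    # 2
```

The first call removes only the exact repeat, while the second keeps a single row per Date and Category combination.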

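From the Pandas website, the syntax is (newer Pandas releases add further optional parameters):

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)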

Implementation

Consider the following model, in which we load data downloaded from the London Metal Exchange showing summary transactions for metals trading:

I want to de-duplicate the records based on the Date and Category. You can see there are multiple records for Date = 2013-01-02 and Category = Non Ferrous.

We can create a very simple Python script using the ‘drop_duplicates()’ method:

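import pandas as pd

# inputDF is the DataFrame supplied to the Python block; keep one row per Date/Category pair
outputDF = inputDF.drop_duplicates(subset=['Date', 'Category'])

We do not specify the ‘keep’ and ‘inplace’ parameters because we are using their defaults, ‘first’ and ‘False’: the first record for each Date and Category combination is kept, and the result is returned as a new DataFrame rather than modifying inputDF.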
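If the defaults are not what you need, the sketch below shows how the ‘keep’ options differ. The data, including the "Steel" category and the Volume column, is hypothetical and is not taken from the model in this article.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-02", "2013-01-03"],
    "Category": ["Non Ferrous", "Non Ferrous", "Steel"],
    "Volume": [100, 250, 75],
})
key = ["Date", "Category"]

first = df.drop_duplicates(subset=key)              # keep='first' is the default
last = df.drop_duplicates(subset=key, keep="last")  # keep the last row per key
none = df.drop_duplicates(subset=key, keep=False)   # drop every row whose key repeats

print(first["Volume"].tolist())  # [100, 75]
print(last["Volume"].tolist())   # [250, 75]
print(none["Volume"].tolist())   # [75]
print(len(df))                   # 3: with inplace=False the original is untouched
```

Because ‘first’ and ‘last’ depend on row order, it may be worth sorting the DataFrame before de-duplicating if a particular record should win.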

Using the ‘Add All’ and ‘Auto Detect’ buttons makes it very simple to add the data fields to the input DataFrame and create the output DataFrame for further processing. Note that the output is set to Create New Table.

We can see the results of the Python Script by clicking on the Python data flow block:

It is evident that we have successfully de-duplicated based on the Date and Category fields, as now there is only one instance of 2013-01-02 and Non Ferrous.
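Rather than relying on visual inspection alone, the same result could be checked in code. This is only a sketch: the inputDF built here is a hypothetical stand-in for the DataFrame that Pyramid passes to the script, and the Volume column is invented.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame supplied by the Python block.
inputDF = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-02", "2013-01-02", "2013-01-03"],
    "Category": ["Non Ferrous", "Non Ferrous", "Non Ferrous", "Non Ferrous"],
    "Volume": [100, 120, 250, 75],
})
key = ["Date", "Category"]

outputDF = inputDF.drop_duplicates(subset=key)

# duplicated() flags rows whose key has already appeared earlier in the frame.
remaining = int(outputDF.duplicated(subset=key).sum())
removed = len(inputDF) - len(outputDF)

print(remaining)  # 0
print(removed)    # 2
```

A zero count of remaining duplicates confirms that each Date and Category combination now appears only once.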

Summary

Using Python scripts in the data flow stage of building models adds a great deal of flexibility and functionality. The Pandas library makes many data wrangling tasks quick and simple to implement, and I would encourage you to read further if this topic is important to you.

I hope this article has been of interest and I’m eager to see any other novel uses of Python (or R) to access, retrieve, and process data into Pyramid.

Ian

Resources

Attached to this article is a zip file, de-dupe.zip, that contains the model and the data file used in the discussion above.
