De-Duplicating Data Sets with Pyramid


Following on from my previous blog on de-duplicating records using Python, I was curious to see if I could achieve the same result with some creative use of the standard Pyramid Model data flow control blocks.

After some initial false starts, I built the following data flow to de-duplicate the same record set using the same data fields.



The principle relies on the use of two Preparation data flow blocks, “Add Sequence” and “Summarize” and the “Join” block.

Add Sequence adds a unique identifier to each record in the flow: either an alphanumeric UUID or a simple rising integer value.
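For comparison with the earlier Python approach, the Add Sequence step can be sketched in pandas terms. The column names and sample data here are hypothetical, not taken from the attached model:

```python
import pandas as pd

# Hypothetical sample data containing duplicate Date / Category rows
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "Category": ["A", "A", "B", "C"],
    "Value": [10, 10, 20, 30],
})

# Equivalent of the Add Sequence block: a simple rising integer per record
df["Seq"] = range(1, len(df) + 1)
```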



We then use the Summarize block with Group By applied to the Date and Category fields and the “Min” function applied to the added Seq field. This creates a single record for each duplicated set of Date and Category values, holding the minimum Seq value for that combination. It does not really matter whether you use Min or Max; the idea is to keep a single, unique Seq value for each Date / Category combination. However, to replicate the Python de-duplicate function exactly, Min returns the first duplicated row and Max the last.
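The Summarize step corresponds to a grouped minimum. A minimal pandas sketch, again using hypothetical columns and data:

```python
import pandas as pd

# Hypothetical records, already carrying a Seq identifier
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Category": ["A", "A", "B"],
    "Seq": [1, 2, 3],
})

# Equivalent of the Summarize block: Group By Date and Category,
# keeping the minimum Seq for each combination
summary = df.groupby(["Date", "Category"], as_index=False)["Seq"].min()
summary = summary.rename(columns={"Seq": "Min_Seq"})
```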




We then use the Join block to join the summarized records back to the main data flow, joining on the Min_Seq and Seq fields respectively. As each Seq value is unique, this gives us, in effect, a one-to-one join that selects just the first record in each set of duplicate Date / Category combinations.
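The Join step amounts to an inner join of the summary back onto the full record set. A sketch, assuming the same hypothetical columns as above:

```python
import pandas as pd

# Full record set with a Seq identifier
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Category": ["A", "A", "B"],
    "Value": [10, 10, 20],
    "Seq": [1, 2, 3],
})

# Summarized records: one Min_Seq per Date / Category combination
summary = pd.DataFrame({"Min_Seq": [1, 3]})

# Equivalent of the Join block: inner join Min_Seq against Seq, so only
# the first record of each duplicate group survives
deduped = df.merge(summary, left_on="Seq", right_on="Min_Seq")
```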



We can also drop the joined Group By and Seq fields now that they have done their job.
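In pandas terms this is a simple column drop on the joined result (hypothetical data, continuing the sketch above):

```python
import pandas as pd

# Hypothetical joined result, still carrying the helper columns
deduped = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Category": ["A", "B"],
    "Value": [10, 20],
    "Seq": [1, 3],
    "Min_Seq": [1, 3],
})

# Drop the helper fields now that they have done their job
deduped = deduped.drop(columns=["Seq", "Min_Seq"])
```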



We can see the results below, where it is evident that the duplicates have been dropped and, more importantly, that the results exactly match those of the Python drop_duplicates() function.




With some lateral thinking and creative use of the Pyramid data flow blocks, the Python drop_duplicates() method can be replicated directly in the Pyramid Model data flow without recourse to external libraries and functions.

Simple testing on this small data set shows that using only the built-in Pyramid data flow blocks results in significantly faster processing. The data set above was processed in around 3 seconds using the pure Pyramid approach, whereas the external Python libraries took up to 10 times longer, at around 30 seconds. Much of this time is probably accounted for by initializing Python and passing the data to it and back, so for large data sets the timings may well be closer together. I will test on a large data set and provide an addendum to this blog once I have the results.

I hope you have found this article of interest and useful.



Attached to this article is a Zip file containing the model definition and the data used. You can download these items and reproduce the flow in your own environment.


