2020.10 Feature Focus: Data Catalog – Self-Building Capabilities (Part 1)

David_Gordon
updated 5 yrs ago

2020.10 is the latest release from Pyramid Analytics. It extends the functionality introduced in our major 2020 release. This is one in a series of posts that highlights some of the major new features.

Pyramid provides users with a self-building and self-maintaining documentation mechanism for data and analytics. The “Data Catalog” helps analysts and administrators find the data or analytical assets they need. In the context of data alone, it serves as an inventory and description of data sources, ETL processes, calculations, and data targets. It can also be used to evaluate the fitness of the data artifacts for intended uses. In the context of analytics, the rich information on reports, dashboards, formulas, and business logic provides users with a catalog of analytical assets. The combination of the two presents a quantum leap in collective knowledge.

The problem

A single, central data catalog ensures a single version of the truth for all data definitions, formulas, reports, and all other data assets. Without it, chaos reigns and data governance becomes extremely difficult to achieve. A decentralized data catalog increases confusion regarding which definition should be used, encourages duplication of definitions, and undermines the concept of “a single version of the truth.”

Some BI tools (such as Power BI and Tableau) encourage users to work in desktop tools because it is easier to build and manipulate data there. This in turn promotes the use of local definitions of data and artifacts, subverting the entire concept of data catalogs. Many other BI tools (including Qlik and Sisense) only let users build analytical content in siloed projects. So even in a server deployment, the resulting catalogs are one-off, siloed definitions for data, formulae, reports, and dashboards.

The solution

Pyramid addresses data cataloging with a powerful approach based on its server-based architecture, sharable data models, centralized business logic, and reporting. A classic data catalog provides universal documentation on the data model including data sources, targets, and the ETL process. Pyramid delivers this and expands the classic data catalog to include all analytical content as well.

Pyramid creates a self-building, self-maintained data catalog that interconnects the data elements and the analytical content elements in the system, at all levels, all the time. It provides optional capabilities for users to add and adjust descriptions. Users can access the catalog in one of several ways:

When browsing the data models (when users are accessing them in the analytic and reporting tools—as well as the content explorer)
When using powerful search options
When following user recommendations
When viewing ML-driven suggestion listings

The catalog details are exposed to the right users (based on security) at the right time—solving both the headache of sharing information and governance in one step.

Dual catalogs

As explained, there are two aspects to Pyramid’s data cataloging: the classic cataloging of the data model and the cataloging of the analytical content assets.

Data Catalogs: Within data models, the data catalog carries details for each dimension (table), attribute (column), hierarchy, and measure. This applies to data sources, data modeling definition files, ML scripts (Python, R, SAS), custom visual scripts, and data targets, whether they were designed in Pyramid or generated by third-party data modeling engines and tools (Microsoft Analysis Services, SAP BW, HANA, et cetera).
Content Catalogs: Every analytical and reporting element in the platform is stored in the content management system (“CMS”), which produces a cohesive data catalog of all analytical artifacts in the system. Every item is tagged with descriptions and audit trail history, with details regarding creation and modification. This applies to: reports, dashboards, publications, illustrations, and all formulaic components (measure and member formulas, list formulas, KPI formulas, and parameter definitions).

Additional functionality

Because of Pyramid’s rich self-cataloging infrastructure, many other derived capabilities can be delivered:

Data Lineage: Users can view the interrelationships between elements in the system to understand the impact of making changes to the stack: from each content item, through the calculations, down to columns and tables.
Structure Analyzer: Users can run a structural analysis across the entire platform or on specific content elements to check that the reporting elements are consistent with the data elements. The data catalog makes it possible to quickly determine whether changes in data have broken reports or calculations, and it gives the analyzer wizard what it needs to let the user fix the problems through a point-and-click interface.
Data Source Changer: As the data catalog is a master framework held in Pyramid’s central repository, it becomes a trivial task to redirect some or all content items to different data sources through a few clicks in the wizard.
Data Catalog Viewer: Users can export the data catalog for a model for all dimensions and measures available from within a Discover report. The catalog for dimensions includes all hierarchies, descriptions, and levels for all columns. The catalog for measures includes description, format, formulation, and aggregation type used within a Discover report.
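To make the data lineage idea above more concrete, here is a minimal sketch of impact analysis over a dependency graph. It is not Pyramid's implementation or API; the item names, the DEPENDS_ON map, and the impacted_by function are all hypothetical and exist only to illustrate how "what breaks if I change this column?" can be answered once every content item records what it depends on.

```python
from collections import deque

# Hypothetical lineage edges: each key depends on the items in its list.
# All names are illustrative, not taken from a real Pyramid model.
DEPENDS_ON = {
    "dashboard:Sales Overview": ["report:Sales by Education"],
    "report:Sales by Education": ["formula:Margin", "column:Customer.Education"],
    "formula:Margin": ["column:Orders.Sales", "column:Orders.Cost"],
}


def invert(depends_on):
    """Build the reverse map: item -> items that use it directly."""
    used_by = {}
    for item, sources in depends_on.items():
        for source in sources:
            used_by.setdefault(source, []).append(item)
    return used_by


def impacted_by(changed, depends_on):
    """Return every item that directly or indirectly depends on `changed`."""
    used_by = invert(depends_on)
    seen = set()
    queue = deque([changed])
    # Breadth-first walk "upwards" from the changed element.
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


if __name__ == "__main__":
    for item in impacted_by("column:Orders.Cost", DEPENDS_ON):
        print(item)
```

In this sketch, changing the hypothetical Orders.Cost column touches the Margin formula, the report that uses it, and the dashboard that embeds that report. A central catalog is what makes such a map available in one place.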

Note: In Part 2 of this two-part data catalog blog, I will cover these items in more depth. I’ll explore how the data lineage, structure analyzer, and data source changer work, and how they tie the content catalog to the data catalog.

Administrative Catalog Functions

Administrators can also derive extended functionality through Pyramid’s cataloging framework:

Central yet personalized: Administrators can optionally tweak versions of the central data model catalog by user or role, showing subtly different descriptions of data definitions based on need and function while using the same objects. The “overlay” function solves a common problem in data cataloging: reconciling a centralized, common data catalog with the need for end-user-specific customizations.

Detailed Telemetry: Separately, telemetry data on the use of each item in the data catalog is captured, allowing administrators to see usage statistics of each content element as well as each data element (down to the column and metric). This further extends the traditional role of data catalogs into a mechanism that can better manage all enterprise analytic content.
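As a rough illustration of what per-item usage statistics make possible, the sketch below counts how often each catalog item was used and lists items that were never used. The event format, the names, and both functions are assumptions made for this example; they do not reflect how Pyramid stores or exposes its telemetry.

```python
from collections import Counter

# Hypothetical usage events: (user, catalog item) pairs.
EVENTS = [
    ("suzanne", "report:Sales by Education"),
    ("suzanne", "measure:Sales"),
    ("amir", "measure:Sales"),
    ("amir", "column:Customer.Education"),
    ("suzanne", "measure:Sales"),
]


def usage_counts(events):
    """Count how many times each catalog item appears in the events."""
    return Counter(item for _user, item in events)


def unused(catalog_items, events):
    """Return catalog items that never appear in the usage events."""
    counts = usage_counts(events)
    return sorted(item for item in catalog_items if counts[item] == 0)


if __name__ == "__main__":
    print(usage_counts(EVENTS).most_common(1))
    print(unused(["measure:Sales", "measure:Cost"], EVENTS))
```

Statistics of this kind are what let an administrator spot heavily used measures or retire content that nobody opens.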

Business case

Suzanne, a business analyst at XYZ Retail Inc., is using Pyramid to perform analytics on her ERP system housed in an Oracle data warehouse. As a new employee, she would like to better understand the relationships between the data sources and the ETL processes that have been performed. In addition, it will help her to view the current analytical assets, as well as the numerous formulas that have been created to accommodate useful calculations, so she will not have to “reinvent the wheel.”

End-user catalog access

Suzanne wants to review the descriptions and details of all hierarchies and measures that are available from the current data catalog while creating or reviewing a report. The full description appears as a tooltip when hovering over the hierarchy or measure in a report. In this example, Suzanne hovers her cursor over the column heading for the “Education” hierarchy and the description is displayed as a tooltip.

Measures also display descriptions as a tooltip when the cursor hovers over the measure heading in a report.

The tooltip is also available when hovering over the tree structure of the available hierarchies and measures. In this example, Suzanne hovers over the “Education” hierarchy to better understand its relevance in the data model.

Next, Suzanne wants to export the descriptions and details of all hierarchies and measures that are available from the current data model while creating or reviewing a report. Suzanne simply clicks on the “Data Catalog Export” button.

An exported PDF file is then automatically downloaded, listing all the dimensions, hierarchies, levels, and measures, together with their relevant descriptions.

Catalog design

The descriptions can be set by the model designer when designing a Pyramid model. In this example, from within the data model, after selecting “Gender,” Suzanne can change the description that will subsequently appear in all tooltips when hovering over the column heading or tree structure.

Catalog personalization

Finally, the accounting department wants to use the term “Revenue” instead of “Sales” on all reports. In the administrator’s source manager below, we see the default descriptions for all the measures viewed by all users. The “Sales” measure has its own description that is displayed as a tooltip when hovering over the measure.

Admins can create overlays for the role used exclusively by the accounting department. The screen below illustrates how the Sales measure’s name and description have been modified exclusively for the user group “Consumers,” which is used by the accounting department.

When a user from the accounting department accesses the report, the name and the description that appears as a tooltip are automatically modified. Note how the term “Sales” has been replaced by the term “Revenue” on the measure, the tooltip, and even in the report heading.
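The overlay behavior described here can be pictured as a lookup that starts from the central definition and applies a role-specific patch on top of it. The sketch below is only an assumption about how such a mechanism could work, not Pyramid's actual data structures: the Entry class, the CENTRAL and OVERLAYS maps, the resolve function, and the description texts are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    name: str
    description: str


# Central definitions shared by everyone (illustrative values).
CENTRAL = {
    "Sales": Entry("Sales", "Total value of goods sold."),
    "Cost": Entry("Cost", "Total cost of goods sold."),
}

# Per-role overlays hold only the fields that differ from the central entry.
OVERLAYS = {
    "Consumers": {
        "Sales": {"name": "Revenue", "description": "Revenue from goods sold."},
    },
}


def resolve(measure, roles):
    """Return the entry a user with the given roles should see."""
    base = CENTRAL[measure]
    for role in roles:
        patch = OVERLAYS.get(role, {}).get(measure)
        if patch:
            # The central entry is left untouched; a patched copy is returned.
            return replace(base, **patch)
    return base


if __name__ == "__main__":
    print(resolve("Sales", ["Consumers"]).name)
    print(resolve("Sales", ["Analysts"]).name)
```

The design point is that the overlay never copies or edits the central definition. Users in the “Consumers” role see the patched name and description, while everyone else keeps seeing the single shared entry.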

Suzanne wins on both counts: (1) a central data catalog is maintained, and (2) customization is allowed within a controlled environment, without duplicating definitions or sowing confusion.

Summary

Pyramid provides a self-building and self-maintaining data catalog engine to assist users in viewing the data catalog for their data models and analytical assets.

The Pyramid approach stands in strong contrast to desktop-based analytical tools and to tools that encourage siloed data and analytical projects, neither of which can build or use centralized data cataloging functions that are accessible to all authorized users in a common, consistent, and cohesive manner.

The server-based architecture of Pyramid provides an ideal platform for sharing data model details. It expedites the use of powerful tools to make changes to data sources, while quickly and easily rectifying resultant errors. Users can access the catalog by browsing the data models or the content explorer; exporting the data catalog; using extensive search options; following user recommendations; or viewing ML-driven suggestion listings. This provides a feature-rich, user-friendly interface to help users manage their data assets. Administratively, the data catalog can be “overlaid” by user or role, providing customized alternative views while retaining a centralized, single data definition.

This post originally appeared at https://www.pyramidanalytics.com/blog/details/blog-2020.10-feature---data-catalog---self-building-capabilities---part-1