Enforcing data quality in data pipelines with PyDeequ

In today’s data-driven world, the quality of data plays a critical role in decision-making, business operations, and overall success. Data quality refers to the accuracy, completeness, consistency, and reliability of data. Organizations rely on high-quality data to gain insights, make informed decisions, improve customer experiences, and drive innovation. This blog post explores the significance of data quality and why it should be a top priority for businesses.

Reliable Decision-Making:

Accurate and reliable data is crucial for making informed decisions. Poor data quality can lead to incorrect insights and flawed decision-making processes. By ensuring data accuracy and consistency, organizations can have confidence in their decision-making, enabling them to respond effectively to market trends, identify growth opportunities, and mitigate risks.

Enhanced Customer Experiences:

Data quality directly impacts customer experiences. Inaccurate or incomplete customer data can lead to ineffective marketing campaigns, poor customer service, and lost business opportunities. By maintaining high-quality data, organizations can personalize customer interactions, deliver relevant offers, and provide seamless experiences, resulting in increased customer satisfaction and loyalty.

Trust and Credibility:

Data quality is fundamental in establishing trust and credibility with stakeholders, including customers, partners, and regulatory bodies. Inaccurate or unreliable data can damage an organization’s reputation and lead to legal and compliance issues. By prioritizing data quality, businesses can maintain trust, demonstrate transparency, and ensure compliance with data protection regulations.

Efficient Operations:

Data quality is essential for efficient business operations. Inaccurate or inconsistent data can cause inefficiencies, delays, and errors across various processes, such as supply chain management, inventory control, and financial reporting. By investing in data quality measures, organizations can streamline operations, improve productivity, and reduce costs.

Effective Analytics and Insights:

Data analytics and insights drive innovation and competitive advantage. However, poor data quality can undermine the accuracy and reliability of analytical outcomes. By ensuring data quality, organizations can derive meaningful insights, identify patterns, and make data-driven decisions that positively impact business growth and performance.

A piece of code can be tested by writing unit or integration tests, but how can a data file with millions of rows be tested? In this blog, we introduce PyDeequ, an open-source Python wrapper over Deequ (an open-source tool developed and used at Amazon). Deequ is written in Scala, whereas PyDeequ allows you to use its data quality and testing capabilities from Python and PySpark.

PYDEEQU:

PyDeequ helps improve data quality by providing automated data quality checks, metrics, and profiling capabilities. It integrates with Apache Spark, a powerful distributed data processing framework, to perform scalable data quality analysis on large datasets. Here’s how PyDeequ helps improve data quality:

Automated Data Quality Checks: PyDeequ allows you to define data quality checks using a declarative syntax. You can specify a set of constraints that the data should adhere to, such as column completeness, uniqueness, or range validation. PyDeequ automates the process of running these checks on your data, identifying data quality issues, and reporting the results.

Metrics and Profiling: PyDeequ provides a wide range of built-in metrics that help you understand the characteristics of your data. These metrics include descriptive statistics, distribution analysis, uniqueness, and correlation between columns. By leveraging these metrics and profiling capabilities, you gain insights into the data distribution, identify outliers, detect missing values, and uncover potential data anomalies.

Anomaly Detection: PyDeequ supports anomaly detection by comparing newly computed metrics against the history of those metrics held in a metrics repository. It helps identify values that deviate significantly from the expected pattern, allowing you to detect and investigate potential data quality issues in your dataset.

Data Quality Monitoring: PyDeequ can serve as the basis of a data quality monitoring pipeline. Checks can be run on a schedule by your orchestrator, metrics can be tracked over time in a metrics repository, and the surrounding pipeline can send notifications or alerts when a check fails. This proactive monitoring helps ensure ongoing data quality and provides early detection of issues that could impact downstream processes and decision-making.

Integration with Apache Spark: PyDeequ leverages the distributed computing capabilities of Apache Spark, enabling scalable data quality analysis on large datasets. It can process data in parallel across a cluster, making it suitable for big data environments and data-intensive workloads.

By using PyDeequ, data engineers, data scientists, and data analysts can automate data quality checks, gain insights into data characteristics, detect anomalies, and establish data quality monitoring processes. This helps organizations ensure the accuracy, completeness, consistency, and reliability of their data, leading to improved decision-making, enhanced customer experiences, and efficient business operations.
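The anomaly detection capability described above can be sketched roughly as follows. This is a minimal illustration, not a definitive setup: the function name, the two DataFrame arguments, and the threshold of 2.0 are assumptions made for the example, and the row count (Size) is just one metric that could be watched.

```python
def check_size_anomaly(spark, df_yesterday, df_today, max_rate_increase=2.0):
    """Flag today's row count if it grew by more than the allowed factor."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.analyzers import Size
    from pydeequ.anomaly_detection import RelativeRateOfChangeStrategy
    from pydeequ.repository import InMemoryMetricsRepository, ResultKey
    from pydeequ.verification import VerificationResult, VerificationSuite

    repository = InMemoryMetricsRepository(spark)  # holds the metric history
    day_ms = 24 * 60 * 60 * 1000
    now_ms = ResultKey.current_milli_time()

    def run(df, timestamp_ms):
        # Each run stores its Size metric under a timestamped key and
        # compares it with the values already in the repository.
        key = ResultKey(spark, timestamp_ms)
        strategy = RelativeRateOfChangeStrategy(maxRateIncrease=max_rate_increase)
        return (VerificationSuite(spark)
                .onData(df)
                .useRepository(repository)
                .saveOrAppendResult(key)
                .addAnomalyCheck(strategy, Size())
                .run())

    run(df_yesterday, now_ms - day_ms)    # seeds the history
    today_result = run(df_today, now_ms)  # evaluated against yesterday
    return VerificationResult.checkResultsAsDataFrame(spark, today_result)
```

The returned DataFrame has one row per constraint, so a failed anomaly check for today's data can be picked up by whatever alerting the pipeline already uses.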

Example

This example illustrates the usage and functionality of PyDeequ on sample food sales data in a SageMaker notebook.

Install PyDeequ and PySpark, and import the necessary packages.

Start a PySpark session in a SageMaker notebook.
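A minimal sketch of these two steps is given here. The function name and the application name are assumptions for illustration; the two `spark.jars.*` settings use the Maven coordinates that PyDeequ exposes so that Spark can fetch the matching Deequ JAR.

```python
# Install the packages first, for example from a notebook cell:
#   pip install pydeequ pyspark
# Some PyDeequ releases also expect the SPARK_VERSION environment variable
# to be set before `import pydeequ` is executed.

def create_spark_session(app_name="pydeequ-food-sales"):
    """Create a SparkSession with the Deequ JAR on the classpath."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    import pydeequ
    from pyspark.sql import SparkSession

    return (SparkSession.builder
            .appName(app_name)  # hypothetical application name
            .config("spark.jars.packages", pydeequ.deequ_maven_coord)
            .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
            .getOrCreate())
```

In a notebook this would typically be called once, as `spark = create_spark_session()`, and the session reused in all later steps.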

We have sample data in a CSV file representing food sales. Load the data as a Spark DataFrame.
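Loading the file might look like the sketch below. The file name is hypothetical, and the sketch assumes the CSV has a header row and that letting Spark infer column types is acceptable for a sample dataset.

```python
FOOD_SALES_CSV = "food_sales.csv"  # hypothetical path to the sample file

def load_food_sales(spark, path=FOOD_SALES_CSV):
    """Read the food sales CSV into a Spark DataFrame."""
    # header=True takes column names from the first row;
    # inferSchema=True asks Spark to guess numeric and date types.
    return spark.read.csv(path, header=True, inferSchema=True)
```

For production pipelines an explicit schema is usually preferable to inference, since a type that silently changes is itself a data quality problem.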

The data contains 11 columns.

ANALYSIS

PyDeequ supports a rich set of data quality metrics.

Analyzers compute metrics over a dataset and can be used to extract metadata for data profiling purposes. Here they are applied to the sample dataset.
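A minimal sketch of such an analysis run is shown here. The function name is an assumption, and using a Histogram analyzer for the per-region counts is a guess at how that metric was obtained; the column names are the ones mentioned in this post.

```python
def profile_food_sales(spark, df):
    """Compute a handful of profiling metrics on the food sales DataFrame."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext,
                                   Distinctness, Histogram, Mean, Size, Sum)

    result = (AnalysisRunner(spark)
              .onData(df)
              .addAnalyzer(Size())                # total number of rows
              .addAnalyzer(Sum("Quantity"))       # sum of Quantity
              .addAnalyzer(Histogram("Region"))   # count per region (assumed)
              .addAnalyzer(Distinctness(["ID"]))  # fraction of distinct IDs
              .addAnalyzer(Mean("TotalPrice"))    # mean of TotalPrice
              .run())

    # One row per metric: entity, instance, name, value.
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

Calling `profile_food_sales(spark, df).show()` in the notebook would print the metrics as a small table.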

AnalysisRunner is used to extract the total number of rows in the dataset, the sum of Quantity, the count of each Region, the distinctness of ID, and the mean of TotalPrice.

Testing Data

Verifying various properties of the data is part of ensuring data quality. Checks let you define multiple assertions on the data distribution as part of the data pipeline, so that every dataset is of high quality and reliable for any data-critical application.
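A set of checks for the food sales data could be sketched as follows. The function name, the check description, and the Error level are assumptions for illustration. The non-negativity check is applied to Quantity here, which is also an assumption, chosen because that check is only meaningful on a numeric column.

```python
ALLOWED_REGIONS = ["East", "West", "North", "South"]

def verify_food_sales(spark, df):
    """Run data quality checks and return one result row per constraint."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    check = (Check(spark, CheckLevel.Error, "Food sales checks")
             .isUnique("ID")
             .isNonNegative("Quantity")  # assumed column, see note above
             .isGreaterThan("TotalPrice", "UnitPrice")
             .isContainedIn("Region", ALLOWED_REGIONS)
             .containsEmail("CustomerEmail")
             .containsURL("WebsiteURL"))

    # Require every column to be free of nulls.
    for column in df.columns:
        check = check.isComplete(column)

    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

Each row of the returned DataFrame names a constraint and reports whether it succeeded, which makes it straightforward to fail a pipeline run when any constraint is violated.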

Checks are used here to make sure of the following:

ID column is unique.
Date column does not contain any negative values.
TotalPrice is greater than UnitPrice in every instance.
Region column only contains one of East, West, North or South.
No column contains null values.
CustomerEmail column contains an email address in every instance.
WebsiteURL column contains a URL in every instance.

PROFILING

PyDeequ can also be used to profile data and observe trends at the data ingestion phase. Let’s consider food sales data over two days. The repository feature of PyDeequ can be used to catalog the metadata of the source data, and its filtering functions can be used to filter that metadata by constraints such as date and time or tags.

Let’s consider two source data files of food sales, one for each day.

Initialise the metrics repository. The metrics repository allows us to store the metrics in JSON format on the local disk (note that it also supports HDFS and S3).
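A file-backed repository could be created along these lines. The function name and the default file name are assumptions; `helper_metrics_file` is a PyDeequ convenience that places the file in a temporary directory, so a real pipeline would more likely pass a durable path of its own.

```python
def create_metrics_repository(spark, file_name="metrics.json"):
    """Create a metrics repository backed by a JSON file."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.repository import FileSystemMetricsRepository

    # Builds a path for the metrics file inside a temporary directory.
    metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, file_name)
    return FileSystemMetricsRepository(spark, metrics_file)
```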

Each set of metrics that we compute needs to be indexed by a so-called ResultKey, which contains a timestamp and supports arbitrary tags in the form of key-value pairs.
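Building such a key might look like this sketch. The helper name and the tag names and values are hypothetical; any key-value pairs that help you find the metrics later would do.

```python
# Hypothetical tags that distinguish the two daily files.
DAY_ONE_TAGS = {"dataset": "food_sales", "day": "1"}
DAY_TWO_TAGS = {"dataset": "food_sales", "day": "2"}

def make_result_key(spark, tags):
    """Create a ResultKey stamped with the current time and the given tags."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.repository import ResultKey

    return ResultKey(spark, ResultKey.current_milli_time(), tags)
```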

Run the analyser with the needed column metrics and save the results to the repository.
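One way to write this step is sketched below. The function name is an assumption, and the three metrics are only examples of "needed column metrics", since the post does not list which ones were stored.

```python
def profile_and_store(spark, df, repository, result_key):
    """Compute metrics for one day's data and save them under result_key."""
    # Imported inside the function so this sketch can be loaded on a machine
    # that has no Spark/PyDeequ installation.
    from pydeequ.analyzers import (AnalysisRunner, AnalyzerContext,
                                   Completeness, Mean, Size)

    result = (AnalysisRunner(spark)
              .onData(df)
              .addAnalyzer(Size())                      # example metric
              .addAnalyzer(Completeness("TotalPrice"))  # example metric
              .addAnalyzer(Mean("TotalPrice"))          # example metric
              .useRepository(repository)
              .saveOrAppendResult(result_key)  # persists the metrics
              .run())
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

Because the repository and the key are parameters, the same function can be called again with the second day's DataFrame and a new ResultKey, which appends that day's metrics to the same repository.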

Apply the same analyser to the second day's data.

The metrics inside the repository can then be loaded and inspected.
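Reading the stored metrics back could be sketched as follows. The function name and the optional tag filter are assumptions for illustration; the timestamp is computed with the standard library in the same millisecond unit that ResultKey uses.

```python
import time

def load_stored_metrics(repository, tag_values=None):
    """Load all metrics saved up to now, optionally filtered by tags."""
    now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    loader = repository.load().before(now_ms)
    if tag_values:
        # Keep only results whose ResultKey carries these tags.
        loader = loader.withTagValues(tag_values)
    return loader.getSuccessMetricsAsDataFrame()
```

With both days stored, the result contains one row per metric per day, each carrying its dataset timestamp, which is what makes day-over-day comparison possible.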

These metrics can then be converted into a histogram.

PyDeequ contains more functions that extract various inter- and intra-column metrics from data distributions to ensure the integrity of data pipelines. More information can be found at https://pydeequ.readthedocs.io/en/latest/README.html.

At Daimlinc, we prioritize the integration of quality attributes in all our data pipeline projects. Whether we are building new pipelines, implementing data quality checks for existing pipelines, modernizing your current pipelines, or developing customized solutions for our valued customers, we ensure that quality is at the forefront of our work.

We have incredibly skilled Data & Analytics specialized consultants who can help you implement data and AI platforms from start to finish.

If you need assistance throughout the entire process of creating data pipelines that incorporate built-in data quality features or adding data quality checks to your existing pipelines, or if you want to modernize your pipelines with data quality checks tailored to your organization, our team of experts is here to help. Reach out to us for a comprehensive solution from start to finish, and talk to one of our experts.

Published On: June 8th, 2023 / Categories: Analytics, Data /