Automated data lineage tracking using Spline for Databricks Spark ETL Jobs

In today’s data-driven world, understanding the journey of data from its origin to its final destination is vital for ensuring data quality, regulatory compliance, and optimizing data workflows. Data lineage, the process of tracking and documenting data flow, plays a pivotal role in achieving these goals. In this blog, we will explore Spline, an open-source data lineage tool, and its integration with Databricks, a leading data analytics platform, to enable seamless data tracking and governance.

Understanding Data Lineage:

Data lineage provides a comprehensive view of data movements, transformations, and dependencies throughout the data ecosystem. It allows data engineers, data scientists, and business analysts to visualize how data is ingested, processed, and delivered across the organization’s data pipelines.

Introducing Spline and Databricks:

Spline:

Spline is an open-source data lineage tool developed by Absa Group. It focuses on providing an automated and real-time data lineage solution for Apache Spark-based data processing workflows.

Databricks:

Databricks is a cloud-based unified analytics platform that simplifies big data analytics and AI tasks. It enables data engineers and data scientists to collaborate effectively and run data workflows at scale.

Benefits of Using Spline with Databricks:

Comprehensive Data Lineage: The integration of Spline with Databricks empowers users to visualize end-to-end data lineage, providing insights into data sources, transformations, and destinations within Databricks-based data pipelines.

Real-Time Insights: Spline captures data lineage in real-time, enabling users to monitor data transformations as they happen. This feature facilitates timely debugging and troubleshooting.

Enhanced Data Governance: By combining Databricks’ data processing capabilities with Spline’s data lineage tracking, organizations can establish robust data governance frameworks, ensuring data quality, compliance, and security.

Optimized Data Workflows: Understanding data lineage in Databricks allows teams to identify bottlenecks and performance issues in data pipelines, leading to optimized data workflows.

Collaborative Data Operations: Spline’s integration with Databricks fosters better collaboration between data teams, data scientists, and business analysts, promoting a data-driven culture.

Getting Started with Spline on an EC2 Instance and Databricks:

Daimlinc follows industry-standard security practices in every solution deployment. Below is the architecture diagram for a rudimentary deployment of Spline and Databricks. Organizations that want to access the deployment privately can optionally launch it in a private subnet instead.

Open the AWS Console and select EC2 to launch an instance. The steps for launching the EC2 instance are as follows:

Select an Amazon Machine Image of your choice (we’ll use a default Amazon Linux 2 AMI)

When choosing an instance type, consider the amount of RAM and disk space. Three Docker containers need to run in total:

ArangoDB – this is where the lineage data will be stored.
Spline REST Gateway – a Java application that exposes an API for Spline agents and the Spline UI. It runs on a Tomcat server and can be memory intensive.
Spline UI – a lightweight HTTP server that is only used for serving static resources required by the Spline UI. Spline UI is implemented as a Single Page Application (SPA) that runs entirely within the browser and communicates directly with the Spline Gateway via the REST API. It does not route any additional HTTP traffic through its own server.

For demonstration purposes, all three containers run on the same EC2 instance, so a t2.medium instance with 4 GiB of RAM and 2 vCPUs is suitable.

This EC2 instance is launched in a public subnet so that the Spline UI is accessible. The security group associated with this instance must have the following inbound and outbound rules:

These inbound rules make sure that the instance is reachable from the internet on the ports shown in the image above.

The outbound rules above make sure that traffic from the instance can reach the internet.

EC2 Instance Connect is a secure way to connect to your Amazon EC2 instances using Secure Shell (SSH). It simplifies the process of connecting to your instances while enhancing security by using AWS Identity and Access Management (IAM) policies to control who can connect.

This launches an in-browser terminal for communicating with the EC2 instance, as shown below.

Then install and start the Docker service by running the code below:
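The original commands for this step are not reproduced here, so the following is only a minimal sketch of what installing Docker typically looks like on the Amazon Linux 2 AMI chosen earlier. The function name `install_docker` is hypothetical, and the `ec2-user` account and the `amazon-linux-extras` tool are assumptions that hold for a default Amazon Linux 2 image but not necessarily for other AMIs.

```shell
# Sketch for a default Amazon Linux 2 instance (assumption: user is ec2-user).
# install_docker is a hypothetical helper name, not part of any tool.
install_docker() {
  sudo yum update -y                           # refresh installed packages
  sudo amazon-linux-extras install docker -y   # install Docker from the AL2 extras repository
  sudo service docker start                    # start the Docker daemon
  sudo usermod -a -G docker ec2-user           # allow ec2-user to run docker without sudo
}
```

Defining the steps as a function keeps the snippet safe to paste; calling `install_docker` once on the instance performs the installation.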

Re-open the EC2 Instance Connect browser session so that the newly added docker group membership takes effect.

Install Docker Compose by running the commands below:
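As a hedged illustration of this step, the sketch below downloads the standalone Docker Compose binary from the project's GitHub releases page. The helper names `compose_download_url` and `install_docker_compose` are hypothetical, and the install path `/usr/local/bin` is an assumption; the original post may have used a pinned version instead of the latest release.

```shell
# Builds the download URL for the standalone Docker Compose binary on Linux.
# uname -m yields the CPU architecture, for example x86_64 on a t2.medium.
compose_download_url() {
  echo "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-$(uname -m)"
}

# Hypothetical helper: downloads the binary, makes it executable, and checks it.
install_docker_compose() {
  sudo curl -SL "$(compose_download_url)" -o /usr/local/bin/docker-compose
  sudo chmod +x /usr/local/bin/docker-compose
  docker-compose version   # confirm that the binary runs
}
```

Calling `install_docker_compose` on the instance performs the download; `compose_download_url` alone only prints the URL, which is handy for checking it before anything is installed.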

Download the Spline demo Docker Compose config files:
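One way to obtain the config files is to clone the spline-getting-started repository that this post links to. This is a sketch under assumptions: the helper name `fetch_spline_demo` is hypothetical, and the `docker` subdirectory is assumed to be where the compose file and its `.env` file live, so check the repository layout before relying on it.

```shell
# Repository referenced in this post (with .git appended for cloning).
SPLINE_REPO_URL="https://github.com/AbsaOSS/spline-getting-started.git"

# Hypothetical helper: installs git, clones the repository, and enters the
# directory that is assumed to contain the Docker Compose config files.
fetch_spline_demo() {
  sudo yum install -y git
  git clone "$SPLINE_REPO_URL"
  cd spline-getting-started/docker   # assumed location; verify in the repository
}
```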

Run Docker Compose as shown below. DOCKER_HOST_EXTERNAL is the external IP of this EC2 instance. This IP will be passed to the Spline UI and used by the client browser to connect to the Spline REST API.
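A minimal sketch of this step follows. The address 203.0.113.10 is a documentation placeholder, not a real instance, and `start_spline` is a hypothetical helper name. The commands are assumed to be run from the directory that holds the Spline demo compose file.

```shell
# Replace the placeholder with the public IPv4 address of your EC2 instance.
export DOCKER_HOST_EXTERNAL="203.0.113.10"

# Hypothetical helper: starts the containers defined in the compose file.
start_spline() {
  docker-compose up -d   # start the containers in the background
  docker-compose ps      # list the containers and their current state
}
```

Exporting the variable before calling `start_spline` lets Docker Compose substitute it wherever the compose file references DOCKER_HOST_EXTERNAL.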

When the containers are up we can verify that the Spline Gateway and Spline UI servers are running by visiting the following URLs:

http://<your-ec2-instance-public-ipv4-address>:8080/
http://<your-ec2-instance-public-ipv4-address>:9090/
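The same check can be made from a terminal instead of a browser. The sketch below is illustrative only: `spline_urls` and `check_spline` are hypothetical helper names, and the ports are the two listed above.

```shell
# Prints the two URLs listed above for a given public IPv4 address ($1).
spline_urls() {
  echo "http://$1:8080/"
  echo "http://$1:9090/"
}

# Requests each URL and prints the HTTP status code that comes back.
check_spline() {
  for url in $(spline_urls "$1"); do
    # -s hides progress output, -o /dev/null discards the body,
    # -w prints only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    echo "$url -> HTTP $code"
  done
}
```

For example, `check_spline 203.0.113.10` (with the placeholder replaced by your instance's address) prints one line per URL. A status of 000 usually means the port is not reachable, which points to the security group rules or to containers that are still starting.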

Follow the steps at the link below to run Spline on a Databricks instance:

https://github.com/AbsaOSS/spline-getting-started/tree/main/spline-on-databricks

After configuring Spline on the Databricks instance, run the code below in Scala. It creates and saves two sample files, reads them back, joins them, and writes the result to another file using append mode:

The code above produces the result below in the Spline UI. The three executions correspond to three writes: two that create the data sources named ‘addresses’ and ‘students’, and one that joins ‘students’ and ‘addresses’ on ID and writes the result in append mode.

The image below shows the high-level lineage of the data produced by the current execution event. It shows how the data flowed between the data sources and which jobs were involved in the process.

The image below shows the transformations that were applied to the data, the operation details, the input and output data types, and so on.

Conclusion:

Data lineage is an indispensable aspect of modern data management, ensuring data reliability, regulatory compliance, and optimal data processing. By integrating Spline, an open-source data lineage tool, with Databricks, a powerful data analytics platform, organizations can unlock the full potential of their data assets.

The combination of Spline’s automated data lineage tracking and Databricks’ scalable data processing capabilities empowers data teams to gain real-time insights into their data pipelines, collaborate efficiently, and make data-driven decisions. Together, Spline and Databricks pave the way for organizations to achieve data excellence and drive innovation in their data-driven initiatives.

Follow the link below for more information on Spline:

https://absaoss.github.io/spline/

At Daimlinc, we excel at delivering data solutions that are driven by a deep commitment to quality and transparency. Our expertise lies in seamlessly integrating cutting-edge technical and data lineage tools into all our data pipeline projects. Whether it’s designing and building new pipelines, enhancing the data lineage extraction process for existing systems, modernizing your current data infrastructure, or creating bespoke solutions tailored to your unique needs, we go above and beyond to ensure that your specific requirements around technical lineage extraction are not just met, but exceeded.

Our dedicated team employs a variety of techniques to ensure that your data lineage needs are comprehensively addressed. With us, you can trust that your data’s journey is transparent, reliable, and fully aligned with your expectations. We’re not just data experts; we’re your trusted partners in achieving data excellence. Let’s work together to make your data pipelines robust, efficient, and fully accountable.

We have highly skilled consultants specializing in Data & Analytics who can help you implement data and AI platforms from start to finish.

If you’re seeking assistance at any stage of the data pipeline creation process, whether it’s building pipelines from the ground up with comprehensive integration of data lineage tools or modernizing your existing pipelines with seamless data lineage integration, our team of seasoned experts is here to provide the support you need. Reach out to us for a holistic solution that covers every aspect from inception to implementation. Talk to one of our experts.

Published On: September 17th, 2023 / Categories: Analytics, Data /