In this blog post, let’s look at what to consider when evaluating source systems and storage systems, planning the ingestion phase, choosing between batch and streaming ingestion, and designing transformations.
Data Engineering Lifecycle: Factors to consider during the data engineering lifecycle
The data engineering lifecycle consists of stages that transform raw data into a valuable end result that analysts, data scientists, ML engineers, and others may consume.
We divide the data engineering lifecycle into five stages:
- Data generation
- Storage
- Ingestion
- Transformation
- Serving
There are a few key things we need to consider during each stage.
Considerations
Evaluating source systems: Data Generation
- What are the data source’s essential characteristics? Is it an application, a swarm of Internet of Things (IoT) devices, or something else?
- How is data persisted in the source system? Is data retained long term, or is it temporary and quickly deleted?
- What is the rate of data generation? How many events per second? How many megabytes per hour?
- What level of consistency can data engineers expect from the output data? When you run data-quality checks against it, how often do inconsistencies occur, such as nulls where they aren’t expected or bad formatting?
- How frequently do errors occur?
- Will there be duplication in the data?
- Will certain data values arrive late, perhaps much later than other messages generated concurrently?
- What is the schema of the data that was ingested? Will data engineers have to connect data from multiple tables or perhaps other systems to gain a complete picture of the data?
- If the schema changes (for example, a new column is added), how is this handled and communicated to downstream stakeholders?
- How often should data be retrieved from the source system?
- For stateful systems (e.g., a database tracking customer account information), is data provided as periodic snapshots or as update events from change data capture (CDC)?
- What is the logic behind how changes are made, and how are they tracked in the source database?
- Will reading from a data source have an effect on its performance?
- Does the source system have upstream data dependencies?
- What are the characteristics of these upstream systems?
- Are there any data-quality checks in place to detect late or missing data?
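Several of the questions above (unexpected nulls, duplication, late-arriving data) can be answered empirically by profiling a sample of source records. The following is a minimal sketch, not a production check. The field names `event_id`, `event_time`, and `ingest_time`, and the function `profile_events`, are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def profile_events(events, required_fields, max_lateness):
    """Count basic data-quality issues in a batch of event dicts."""
    seen_ids = set()
    report = {"null_fields": 0, "duplicates": 0, "late": 0}
    for event in events:
        # Unexpected nulls: a required field is missing or None.
        if any(event.get(field) is None for field in required_fields):
            report["null_fields"] += 1
        # Duplicates: the same (hypothetical) event_id appears more than once.
        event_id = event.get("event_id")
        if event_id is not None:
            if event_id in seen_ids:
                report["duplicates"] += 1
            seen_ids.add(event_id)
        # Late arrival: ingested long after the event was generated.
        event_time = event.get("event_time")
        ingest_time = event.get("ingest_time")
        if event_time and ingest_time and ingest_time - event_time > max_lateness:
            report["late"] += 1
    return report


t0 = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    {"event_id": 1, "user": "a", "event_time": t0, "ingest_time": t0 + timedelta(seconds=5)},
    {"event_id": 1, "user": "a", "event_time": t0, "ingest_time": t0 + timedelta(seconds=6)},
    {"event_id": 2, "user": None, "event_time": t0, "ingest_time": t0 + timedelta(seconds=5)},
    {"event_id": 3, "user": "b", "event_time": t0, "ingest_time": t0 + timedelta(hours=2)},
]
report = profile_events(sample, ["event_id", "user", "event_time"], timedelta(minutes=10))
print(report)
```

Running a profile like this on a regular sample gives you a rough baseline for how often each issue occurs, which is the number you need when agreeing on expectations with the source system’s owners.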
Evaluating storage systems
When selecting a storage system for a data warehouse, data lakehouse, database, or object storage, here are a few key engineering questions to consider:
- Is this storage solution compatible with the write and read speeds required by the architecture?
- Is storage going to become a bottleneck for downstream processes?
- Do you fully understand how this storage technology operates?
- Are you making the best use of your storage system or committing unnatural acts? For example, are you applying a high rate of random-access updates in an object storage system? This is an antipattern with a severe performance overhead.
- Can this storage system handle expected future growth? Consider all capacity limits on the storage system, including total available storage, read operation rate, and write volume.
- Will downstream users and processes be able to retrieve data within the service-level agreement (SLA) timeframe?
- Are you capturing metadata about schema evolution, data flows, data lineage, and so on? Metadata has a tremendous impact on data utility. It is an investment in the future, significantly improving discoverability and institutional knowledge to expedite future initiatives and architectural changes.
- Is this purely a storage solution (object storage), or does it support advanced query patterns (a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage), or does it enforce a schema (a cloud data warehouse)?
- For data governance, how are you tracking master data, golden records data quality, and data lineage?
- How do you manage data sovereignty and regulatory compliance? For example, can you store your data in particular geographical locations but not others?
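To see why frequent random-access updates are a poor fit for object storage, it helps to remember that objects are replaced whole rather than edited in place. The toy model below is a sketch under that assumption. `ToyObjectStore`, `update_in_place`, and `append_update` are hypothetical names, and the in-memory dictionary merely stands in for a real object store. It compares the bytes written when one large object is rewritten for every change against writing each change as a small separate object.

```python
import json


class ToyObjectStore:
    """In-memory stand-in for an object store: objects can only be replaced whole."""

    def __init__(self):
        self.objects = {}
        self.bytes_written = 0

    def put(self, key, payload):
        self.objects[key] = payload
        self.bytes_written += len(payload)

    def get(self, key):
        return self.objects[key]


def update_in_place(store, key, record_id, new_value):
    """Antipattern: read, modify, and rewrite the whole object to change one record."""
    records = json.loads(store.get(key))
    records[str(record_id)] = new_value
    store.put(key, json.dumps(records).encode())


def append_update(store, prefix, seq, record_id, new_value):
    """Alternative: write each change as its own small object, to be merged later."""
    payload = json.dumps({str(record_id): new_value}).encode()
    store.put(f"{prefix}/update-{seq:06d}.json", payload)


base = json.dumps({str(i): "x" * 100 for i in range(1000)}).encode()

in_place = ToyObjectStore()
in_place.put("table.json", base)
in_place.bytes_written = 0  # count only the bytes written by the updates
for i in range(50):
    update_in_place(in_place, "table.json", i, "y" * 100)

appended = ToyObjectStore()
appended.put("table.json", base)
appended.bytes_written = 0  # count only the bytes written by the updates
for i in range(50):
    append_update(appended, "table", i, i, "y" * 100)

print("rewrite whole object:", in_place.bytes_written, "bytes")
print("append small objects:", appended.bytes_written, "bytes")
```

The append approach shifts the cost to read time, because readers or a periodic compaction job must merge the small objects. It is shown here only to make the write amplification of the in-place pattern visible.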
Data Ingestion Phase
Here are some key questions to ask while planning to create or design a data ingestion system:
- What are the use cases for the data I’m ingesting?
- Can I reuse this data instead of producing multiple copies of the same dataset?
- Are the systems that generate and ingest this data reliable, and is the data available when I need it?
- What happens to the data after it has been ingested?
- How often will I require access to the data?
- How much data will normally be delivered?
- How is the data formatted?
- Can this format be handled by my downstream storage and transformation systems?
- Is the source data in good enough shape for immediate use in my downstream applications? If so, how long will it stay usable, and what could make it unusable?
- Is a transformation necessary before the data reaches its destination?
- If using streaming to ingest data, would a transformation that takes place while the data is in flight, within the stream itself, be appropriate?
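An in-flight transformation, as raised in the last question above, can be pictured as a chain of small steps that each record passes through on its way to the destination. The sketch below uses plain Python generators as a stand-in for a stream processor. The record fields `user_id` and `amount`, and the function names `parse` and `normalize`, are assumptions for illustration only.

```python
import json
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Decode JSON lines, skipping records that cannot be parsed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # A real pipeline would likely route these to a dead-letter location.
            continue


def normalize(records: Iterable[dict]) -> Iterator[dict]:
    """Reshape each record while it is in flight, before it reaches storage."""
    for record in records:
        yield {
            "user_id": str(record["user_id"]),
            "amount_cents": round(float(record["amount"]) * 100),
        }


incoming = [
    '{"user_id": 7, "amount": "12.50"}',
    "not json",
    '{"user_id": 8, "amount": 3}',
]
transformed = list(normalize(parse(incoming)))
print(transformed)
```

Because generators process one record at a time, nothing is buffered between the steps, which mirrors how a transformation inside the stream avoids a separate landing-and-reprocessing stage.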
Batch vs Streaming
- If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
- Do I need real-time data ingestion at millisecond latency? Or would a micro-batch approach work, accumulating and ingesting data, say, once per minute?
- What streaming ingestion use cases do I have?
- What unique advantages do I get from using streaming?
- If I receive data in real time, what actions can I take on it that would be an improvement over batch?
- Will using a streaming-first strategy be more expensive than using batch processing in terms of time, money, maintenance, downtime, and lost opportunities?
- Are my streaming system and pipeline dependable and redundant in the event of infrastructure failure?
- Which tools would be most suitable for this use case?
- Should I utilize a cloud managed service?
- Am I ingesting data from a live production system? If so, how would my ingestion process affect this source system?
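The micro-batch option mentioned above sits between pure streaming and traditional batch: events are accumulated for a short window and then ingested together. Here is a minimal sketch, assuming events arrive as `(timestamp_in_seconds, payload)` pairs already ordered by timestamp. The function name `micro_batches` and the 60-second window are illustrative choices, not a recommendation.

```python
def micro_batches(events, window_seconds=60):
    """Group time-ordered (timestamp, payload) pairs into fixed-size time windows."""
    batch = []
    current_window = None
    for timestamp, payload in events:
        window = int(timestamp // window_seconds)
        # A new window has started: emit the batch collected so far.
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, batch
            batch = []
        current_window = window
        batch.append(payload)
    # Emit whatever is left once the input ends.
    if batch:
        yield current_window * window_seconds, batch


events = [(0, "a"), (15, "b"), (59, "c"), (60, "d"), (130, "e")]
batches = list(micro_batches(events, window_seconds=60))
print(batches)
```

Each yielded batch can be written downstream in a single operation, so the storage system sees a handful of writes per minute instead of one per event. The price is up to one window of added latency.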
Transformation
It is helpful to take into account the following when thinking about data transformations within the data engineering lifecycle.
- What are the transformation’s costs and return on investment (ROI)?
- What is the business value connected with it?
- Is the transformation as straightforward and self-contained as it can be?
- What business rules are supported by the transformations?
- Am I reducing the amount of data that must be moved between the transformation and the storage system?
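One common way to reduce data movement, the concern in the last question above, is to push filtering and aggregation down into the storage system so that only the result crosses the boundary. The sketch below uses Python’s built-in `sqlite3` module as a stand-in for a real warehouse. The `orders` table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("eu", 10), ("eu", 20), ("us", 5), ("us", 7), ("us", 8)],
)

# Option 1: pull every row out of storage and aggregate in the transformation layer.
rows = conn.execute("SELECT region, amount FROM orders").fetchall()
totals_client = {}
for region, amount in rows:
    totals_client[region] = totals_client.get(region, 0) + amount

# Option 2: push the aggregation down so only one row per region is moved.
aggregated = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
).fetchall()
totals_pushed = dict(aggregated)

print("rows moved without pushdown:", len(rows))
print("rows moved with pushdown:", len(aggregated))
conn.close()
```

Both options produce the same totals. The difference is how many rows leave the storage system, which matters far more at warehouse scale than in this five-row example.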
Reference: Fundamentals of Data Engineering
At Daimlinc, we strongly consider these factors when designing data pipelines for our customers.
We have highly skilled consultants specializing in data and analytics who can help you implement data and AI platforms from start to finish.
If you’re looking to implement data pipelines for your organization and need someone to help you from start to finish, talk to one of our experts.