In traditional analytics architectures, onboarding new data or building new analytics pipelines typically necessitates extensive coordination across multiple teams, such as business, data engineering, and data science and analytics. Before they can begin, these teams must first agree on requirements, schemas, infrastructure capacity, and workload management.
It is becoming increasingly difficult and inefficient to pre-define constantly changing schemas, and negotiating capacity slots on shared infrastructure is time consuming. For these reasons, business users, data scientists, and analysts prefer simple, frictionless, self-service options for constructing end-to-end data pipelines. Because machine learning (ML) and many analytics tasks are exploratory in nature, you must be able to rapidly ingest new datasets and clean, normalise, and feature engineer them without worrying about the operational overhead of the infrastructure that runs the data pipelines.
A serverless data lake architecture enables agile and self-service data onboarding and analytics for all data consumer roles throughout an organisation. You can rapidly and interactively build data lakes and data processing pipelines using AWS serverless technologies to ingest, store, transform, and analyse petabytes of structured and unstructured data from batch and streaming sources, without needing to manage any storage or compute infrastructure.
The ingestion layer is in charge of getting data into the data lake. It allows you to connect to internal and external data sources using a variety of protocols. It is capable of ingesting both batch and streaming data into the storage layer. In addition, the ingestion layer is in charge of delivering ingested data to a variety of targets in the data storage layer, such as the object store, databases, and warehouses.
The storage layer is in charge of providing long-lasting, scalable, secure, and cost-effective components for storing massive amounts of data. It can store unstructured data as well as datasets in a variety of structures and formats. It allows you to store source data without first structuring it to conform to a target schema or format.
Components from all other layers integrate easily and natively with the storage layer. The storage layer is organised into the following zones to store data based on its consumption readiness for different personas across the organisation:
- Raw zone:
The raw zone is where the ingestion layer lands data. This is a temporary area where data is ingested as-is from sources.
Data engineering personas typically interact with the data stored in this zone.
- Cleaned zone:
Following the preliminary quality checks, the raw zone data is moved to the cleaned zone for permanent storage, where it is stored in its original format. The ability to replay downstream data processing in the event of errors or data loss in downstream storage zones is provided by permanently storing all data from all sources in the cleaned zone. Personas from data engineering and data science typically interact with the data stored in this zone.
- Curated zone:
This zone contains data that is ready for consumption and adheres to organisational standards and data models. Datasets in the curated zone are typically partitioned, catalogued, and stored in formats that allow the consumption layer to access them in a performant and cost-effective manner. After cleaning, normalising, standardising, and enriching data from the cleaned zone, the processing layer creates datasets in the curated zone. The data stored in this zone is used by all personas across the organisation to drive business decisions.
Cataloguing and search layer
The cataloguing and search layer stores business and technical metadata about datasets hosted in the storage layer. It allows for the tracking of schema and the granular partitioning of dataset information in the lake. It also includes mechanisms for tracking versions in order to keep track of changes to the metadata. As the number of datasets in the data lake increases, this layer makes them discoverable by providing search capabilities.
The processing layer is in charge of converting data into usable form via data validation, cleanup, normalisation, transformation, and enrichment. It is in charge of advancing dataset consumption readiness along the raw, cleaned, and curated zones, as well as registering metadata for cleaned and transformed data in the cataloguing layer. The processing layer is made up of purpose-built data processing components that are tailored to the specific dataset characteristics and processing task at hand. The processing layer can handle large amounts of data and supports schema-on-read, partitioned data, and a variety of data formats. The processing layer also allows you to create and manage multi-step data processing pipelines that use purpose-built components for each step.
The consumption layer is in charge of providing scalable and performant tools for extracting insights from the data lake’s massive amount of data. It supports analysis methods such as SQL, batch analytics, BI dashboards, reporting, and machine learning for all personas across the organisation, using a variety of purpose-built analytics tools. The consumption layer is natively integrated with the storage, cataloguing, and security layers of the data lake. The consumption layer’s components support schema-on-read, a wide range of data structures and formats, and data partitioning for cost and performance optimisation.
Security and governance layer
The security and governance layer is in charge of safeguarding the data in the storage layer as well as the processing resources in all other layers. It includes mechanisms for access control, encryption, network security, monitoring, and auditing. The security layer also monitors the activities of all other layers’ components and generates a detailed audit trail. All other layers’ components provide native integration with the security and governance layers.
In the presented serverless architecture, the ingestion layer is made up of a collection of purpose-built AWS services that enable data ingestion from a variety of sources. Each of these services enables simple self-service data ingestion into the data lake landing zone and integrates with other AWS storage and security services. Individual AWS services are designed to meet the specific connectivity, data format, data structure, and data velocity requirements of operational database sources, streaming data sources, and file sources.
Operational Database Sources
Organizations typically store operational data in relational and NoSQL databases. AWS Database Migration Service (AWS DMS) can connect to various operational RDBMS and NoSQL databases and ingest data into Amazon Simple Storage Service (Amazon S3) buckets in the data lake landing zone. You can use AWS DMS to perform a one-time import of the source data into the data lake before replicating ongoing changes in the source database. AWS DMS encrypts S3 objects as they are stored in the data lake using AWS Key Management Service (AWS KMS) keys. AWS DMS is a fully managed, resilient service that offers a variety of instance sizes for hosting database replication tasks.
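As a sketch, the snippet below builds the kind of request payload you might pass to the AWS DMS `create_endpoint` API (via boto3) to define an S3 target endpoint in the landing zone. The bucket name, role ARN, KMS key ARN, and endpoint identifier are hypothetical; check the current `S3Settings` options against the DMS documentation before relying on them.

```python
def dms_s3_endpoint(bucket: str, role_arn: str, kms_key_arn: str) -> dict:
    """Build a request payload for a DMS target endpoint that lands data
    in the S3 landing zone as KMS-encrypted Parquet files."""
    return {
        "EndpointIdentifier": "datalake-landing-target",  # hypothetical name
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "BucketName": bucket,
            "ServiceAccessRoleArn": role_arn,
            "DataFormat": "parquet",       # columnar format for the lake
            "CompressionType": "gzip",
            "EncryptionMode": "sse-kms",
            "ServerSideEncryptionKmsKeyId": kms_key_arn,
        },
    }

payload = dms_s3_endpoint(
    "my-datalake-landing",
    "arn:aws:iam::111122223333:role/dms-s3-access",
    "arn:aws:kms:eu-west-1:111122223333:key/example",
)
# With credentials configured, the actual call would be:
# boto3.client("dms").create_endpoint(**payload)
```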
AWS Lake Formation offers a scalable, serverless alternative, known as blueprints, for ingesting data from AWS native or on-premises database sources into the data lake’s landing zone. A Lake Formation blueprint is a predefined template that generates an AWS Glue data ingestion workflow based on input parameters such as the source database, the target Amazon S3 location, the target dataset format, the target dataset partitioning columns, and the schedule. AWS Glue workflows generated from blueprints implement an optimised and parallelized data ingestion pipeline comprised of crawlers, multiple parallel jobs, and triggers that connect them based on conditions.
Streaming Data Sources
The ingestion layer receives streaming data from internal and external sources via Amazon Kinesis Data Firehose. You can create a Kinesis Data Firehose API endpoint where sources can send streaming data with a few clicks. Clickstreams, application and infrastructure logs, and monitoring metrics are examples of streaming data, as are IoT data such as device telemetry and sensor readings. Kinesis Data Firehose performs the following functions:
• Buffers incoming streams, then batches, compresses, transforms, and encrypts them.
• Saves the streams as S3 objects in the data lake’s landing zone.
For real-time analytics use cases, Kinesis Data Firehose natively integrates with the security and storage layers and can deliver data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES). Kinesis Data Firehose is serverless, requires no administration, and has a pricing model that charges you only for the amount of data you transmit and process through the service. Kinesis Data Firehose scales automatically to match the volume and throughput of incoming data.
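A producer typically encodes events as newline-delimited JSON and sends them in batches, since Firehose's `PutRecordBatch` accepts at most 500 records per call. The helper below is a minimal sketch of that batching logic; the delivery stream name in the commented-out call is hypothetical.

```python
import json

def to_firehose_batches(events, max_batch=500):
    """Encode events as newline-delimited JSON records and group them into
    batches no larger than the 500-record PutRecordBatch limit."""
    records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

# 1,200 sample telemetry events -> 3 batches (500, 500, 200)
batches = to_firehose_batches([{"device_id": i, "temp_c": 21.5} for i in range(1200)])
# for batch in batches:
#     boto3.client("firehose").put_record_batch(
#         DeliveryStreamName="datalake-stream", Records=batch)
print(len(batches), len(batches[-1]))  # 3 200
```

The trailing newline on each record keeps the resulting S3 objects line-delimited, which Athena and Glue can read directly.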
File Sources
Many applications store structured and unstructured data in files on Network Attached Storage (NAS) arrays. Organisations also receive data files from partners and third-party vendors. Analysing data from these file sources can provide useful business insights.
Internal File Shares
AWS DataSync can load hundreds of terabytes and millions of files into the data lake landing zone from NFS and SMB enabled NAS devices. DataSync handles copy job scripting, scheduling and monitoring transfers, validating data integrity, and optimising network utilisation automatically. DataSync is capable of performing one-time file transfers as well as monitoring and syncing changed files into the data lake. DataSync is completely managed and can be set up in a matter of minutes.
Partner Data Files
FTP is the most commonly used method for exchanging data files with partners. The AWS Transfer Family is a serverless, highly available, scalable service that supports secure FTP endpoints and integrates natively with Amazon S3. Partners and vendors send files via the SFTP protocol, and the AWS Transfer Family stores them as S3 objects in the data lake’s landing zone. The AWS Transfer Family supports AWS KMS encryption as well as common authentication methods such as AWS Identity and Access Management (IAM) and Active Directory.
Data APIs
Organizations today use SaaS and partner applications to support their business operations, such as Salesforce, Marketo, and Google Analytics. Analyzing SaaS and partner data alongside internal operational application data is critical for gaining a 360-degree view of the business. API endpoints for sharing data are frequently provided by partner and SaaS applications.
SaaS APIs
Amazon AppFlow is used by the ingestion layer to easily ingest data from SaaS applications into the data lake. In Amazon AppFlow, you can create serverless data ingestion flows with a few clicks. Connecting to SaaS applications (such as Salesforce, Marketo, and Google Analytics) allows your flows to ingest data and store it in the data lake. You can schedule Amazon AppFlow data ingestion flows or trigger them by events in the SaaS application. Ingested data can be validated, filtered, mapped, and masked before storing in the data lake. Amazon AppFlow natively integrates with authentication, authorization, and encryption services in the security and governance layer.
Partner APIs
Organizations build or purchase custom applications that connect to APIs, fetch data, and create S3 objects in the landing zone using AWS SDKs to ingest data from partner and third-party APIs. These applications, as well as their dependencies, can be packaged into Docker containers and hosted on AWS Fargate. Fargate is a serverless compute engine for hosting Docker containers without the need for server provisioning, management, and scaling. Fargate natively integrates with AWS security and monitoring services to provide application containers with encryption, authorisation, network isolation, logging, and monitoring.
Amazon S3 serves as the foundation for our architecture’s storage layer. Amazon S3 provides virtually unlimited scalability at a low cost. Data is stored as S3 objects organised into raw, cleaned, and curated zone buckets and prefixes. AWS KMS keys are used to encrypt data in Amazon S3. Granular zone-level and dataset-level access for various users and roles is controlled by IAM policies.
Data of any structure (including unstructured data) and format can be stored as S3 objects without the need for any schema to be predefined. This enables services in the ingestion layer to quickly land a variety of source data formats into the data lake. Following data ingestion into the data lake, processing layer components can define schema on top of Amazon S3 datasets and register them in the cataloguing layer. Schema-on-read can then be used by services in the processing and consumption layers to apply the required structure to data read from S3 objects. Amazon S3 datasets are frequently partitioned to allow efficient filtering by services in the processing and consumption layers.
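A common convention for such partitioning is Hive-style key prefixes (`year=`/`month=`/`day=`), which let Athena, AWS Glue, and Redshift Spectrum prune partitions at read time. The helper below is a minimal sketch of that layout; the dataset and bucket names are hypothetical.

```python
from datetime import datetime, timezone

def curated_key(dataset: str, ts: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key so consumption-layer services
    can filter by partition columns instead of scanning the whole dataset."""
    return (f"curated/{dataset}/"
            f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/{filename}")

key = curated_key("orders", datetime(2021, 3, 7, tzinfo=timezone.utc), "part-0000.parquet")
print(key)  # curated/orders/year=2021/month=03/day=07/part-0000.parquet
```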
Cataloguing and search layer
A data lake typically houses a large number of datasets with changing schema and new data partitions. To enable self-service data discovery in the data lake, a central data catalogue that manages metadata for all datasets in the data lake is critical. Separating metadata from data into a centralised schema also enables schema-on-read for the processing and consumption layer components.
Lake Formation serves as the central catalogue in the presented architecture, storing and managing metadata for all datasets hosted in the data lake. In Lake Formation, organisations manage all of their datasets’ technical metadata (such as versioned table schemas, partitioning information, physical data location, and update timestamps) as well as business attributes (such as data owner, data steward, column business definition, and column information sensitivity). AWS Glue, Amazon EMR, and Amazon Athena are natively integrated with Lake Formation and automate the discovery and registration of dataset metadata in the Lake Formation catalogue. Lake Formation also provides APIs for metadata registration and management via custom scripts and third-party products.
Lake Formation provides a centralised location for the data lake administrator to configure granular table and column level permissions for databases and tables hosted in the data lake. After granting Lake Formation permissions, users and groups can access only authorised tables and columns through a variety of processing and consumption layer services such as Athena, Amazon EMR, AWS Glue, and Amazon Redshift Spectrum.
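As an illustrative sketch, the payload below is the shape you might pass to the Lake Formation `grant_permissions` API (via boto3) to restrict an analyst role to two columns of one table. The role ARN, database, table, and column names are hypothetical.

```python
def lf_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Build a grant_permissions request limiting a principal to specific
    columns of a single catalogue table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }

grant = lf_grant(
    "arn:aws:iam::111122223333:role/analyst",  # hypothetical role
    "curated_db", "orders", ["order_id", "order_total"],
)
# boto3.client("lakeformation").grant_permissions(**grant)
```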
The processing layer is made up of two types of components:
• Components that are used to build multi-step data processing pipelines.
• Components for orchestrating data processing pipelines on a predetermined schedule or in response to event triggers (such as ingestion of new data into the landing zone).
AWS Glue and AWS Step Functions are serverless components that allow you to build, orchestrate, and run pipelines that can easily scale to handle large data volumes. Multi-step workflows built with AWS Glue and Step Functions can catalogue, validate, clean, transform, and enrich individual datasets in the storage layer, progressing them from raw to cleaned and cleaned to curated zones.
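Such a pipeline can be expressed as an Amazon States Language definition. The sketch below chains two hypothetical Glue jobs (raw-to-cleaned, then cleaned-to-curated) using the Step Functions synchronous Glue integration; the job names are assumptions for illustration.

```python
import json

# Hypothetical state machine: run two Glue jobs in sequence, waiting for
# each to finish (the .sync suffix) before moving to the next state.
definition = {
    "StartAt": "CleanRawData",
    "States": {
        "CleanRawData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "raw-to-cleaned"},
            "Next": "CurateCleanedData",
        },
        "CurateCleanedData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "cleaned-to-curated"},
            "End": True,
        },
    },
}
definition_json = json.dumps(definition)
# boto3.client("stepfunctions").create_state_machine(
#     name="datalake-pipeline", definition=definition_json, roleArn="...")
```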
AWS Glue is a serverless, pay-per-use ETL service that allows you to create and run Apache Spark jobs (written in Scala or Python) without having to deploy or manage clusters. AWS Glue generates code automatically to speed up your data transformation and loading processes. AWS Glue ETL is built on Apache Spark and provides out-of-the-box data source connectors, data structures, and ETL transformations to validate, clean, transform, and flatten data stored in many open-source formats such as CSV, JSON, Parquet, and Avro. AWS Glue ETL also has the ability to process partitioned data incrementally.
AWS Glue can also be used to define and run crawlers that can crawl data lake folders, discover datasets and their partitions, infer schema, and define tables in the Lake Formation catalogue. AWS Glue includes over a dozen built-in classifiers capable of parsing a wide range of data structures stored in open-source formats. AWS Glue also includes triggers and workflow capabilities that can be used to create multi-step end-to-end data processing pipelines with job dependencies and parallel steps. AWS Glue jobs and workflows can be scheduled or run on demand. AWS Glue integrates natively with AWS services in the storage, catalogue, and security layers.
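The payload below sketches the shape of a Glue `create_crawler` request that crawls a cleaned-zone prefix nightly and registers discovered tables and partitions in the catalogue. The crawler name, role ARN, database, path, and schedule are hypothetical.

```python
def crawler_request(name, role_arn, database, s3_path,
                    schedule="cron(0 1 * * ? *)"):
    """Build a create_crawler request: crawl an S3 prefix on a schedule,
    update changed schemas in place, and log (rather than delete) tables
    whose underlying data has disappeared."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": schedule,
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

req = crawler_request(
    "cleaned-orders-crawler",
    "arn:aws:iam::111122223333:role/glue-crawler",  # hypothetical role
    "cleaned_db",
    "s3://my-datalake/cleaned/orders/",
)
# boto3.client("glue").create_crawler(**req)
```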
The consumption layer is made up of fully managed, purpose-built analytics services that enable interactive SQL, BI dashboarding, batch processing, and machine learning.
Amazon Athena is an interactive query service that allows you to run complex ANSI SQL against terabytes of Amazon S3 data without first loading it into a database. Structured, semi-structured, and columnar data stored in open-source formats such as CSV, JSON, XML, Avro, Parquet, and ORC can be analysed using Athena queries.
Athena applies schema-on-read to data read from Amazon S3 using table definitions from Lake Formation.
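For example, a schema-on-read table is just DDL over existing S3 objects: no data is loaded or copied. The sketch below shows hypothetical Athena DDL over a curated-zone Parquet dataset partitioned by date columns; the database, table, columns, and bucket are assumptions.

```python
# Hypothetical DDL: define a table over curated-zone Parquet files,
# partitioned so queries can prune by year/month/day.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS curated_db.orders (
    order_id     string,
    order_total  double
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my-datalake/curated/orders/'
"""
# With credentials configured, Athena could run it like this:
# boto3.client("athena").start_query_execution(
#     QueryString=ddl,
#     ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"})
```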
Batch analytics and data warehousing
Amazon Redshift is a fully managed data warehouse service that can host and process petabytes of data while simultaneously running thousands of high-performance queries. Amazon Redshift employs a cluster of compute nodes to power interactive dashboards and high-throughput batch analytics to drive business decisions. Amazon Redshift queries can be run directly on the Amazon Redshift console or submitted via the JDBC/ODBC endpoints provided by Amazon Redshift.
Amazon QuickSight offers serverless BI capabilities for quickly creating and publishing rich, interactive dashboards. QuickSight adds out-of-the-box, automatically generated ML insights like forecasting, anomaly detection, and narrative highlights to dashboards and visuals. QuickSight natively integrates with Amazon SageMaker to add custom ML model-based insights to your BI dashboards. QuickSight dashboards can be accessed from any device via a QuickSight app, or they can be embedded in web applications, portals, and websites.
Predictive analytics and machine learning
Amazon SageMaker is a fully managed service that provides components for building, training, and deploying machine learning (ML) models through the use of an interactive development environment (IDE) called Amazon SageMaker Studio. In Amazon SageMaker Studio, you can upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production, all in one place by using a unified visual interface. Amazon SageMaker also offers managed Jupyter notebooks that can be launched with a few clicks. Amazon SageMaker notebooks provide elastic compute resources, git integration, easy sharing, pre-configured ML algorithms, dozens of out-of-the-box ML examples, and AWS Marketplace integration, which enables easy deployment of hundreds of pre-trained algorithms.
Security and governance layer
Components across all layers of the presented architecture protect data, identities, and processing resources by natively utilising the security and governance layer’s following capabilities.
Authentication and authorisation
AWS Identity and Access Management (IAM) provides user-, group-, and role-level identities, as well as the ability to configure fine-grained access control for resources managed by AWS services across all layers of our architecture. IAM supports multi-factor authentication and single sign-on through integrations with corporate directories and open identity providers such as Google, Facebook, and Amazon.
Lake Formation provides a simple and centralised authorisation model for data lake tables. After being implemented in Lake Formation, database and table authorisation policies are enforced by other AWS services such as Athena, Amazon EMR, QuickSight, and Amazon Redshift Spectrum. You can grant or revoke database-, table-, or column-level access for IAM users, groups, or roles defined in Lake Formation in the same AWS account that hosts the Lake Formation catalogue or in a different AWS account. Lake Formation’s simple grant/revoke-based authorisation model greatly simplifies the previous IAM-based authorisation model, which relied on securing S3 data objects and metadata objects in the AWS Glue Data Catalog separately.
Encryption
AWS KMS enables the creation and management of symmetric and asymmetric customer-managed encryption keys. AWS services are natively integrated with AWS KMS in all layers of our architecture to encrypt data in the data lake. It allows you to create new keys as well as import existing customer keys. IAM is used to control access to the encryption keys, which is monitored through detailed audit trails in CloudTrail.
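For instance, when writing an object to the lake, a producer can request server-side encryption with a customer-managed KMS key on the S3 `put_object` call. The sketch below builds those parameters; the bucket, key, and KMS key ARN are hypothetical.

```python
def encrypted_put(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Build put_object parameters requesting SSE-KMS with a
    customer-managed key instead of the default S3-managed encryption."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }

params = encrypted_put(
    "my-datalake-landing",
    "raw/orders/part-0000.json",
    b"{}",
    "arn:aws:kms:eu-west-1:111122223333:key/example",
)
# boto3.client("s3").put_object(**params)
```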
Network protection
Amazon Virtual Private Cloud (Amazon VPC) is used in our architecture to provision a logically isolated section of the AWS Cloud (called a VPC) that is isolated from the internet and other AWS customers. You can choose your own IP address range, create subnets, and configure route tables and network gateways with Amazon VPC. AWS services in the other layers of our architecture launch resources in this private VPC to protect all traffic to and from these resources.
Monitoring and logging
AWS services in all layers of our architecture are monitored and logged, and detailed logs and monitoring metrics are stored in Amazon CloudWatch. CloudWatch allows you to analyse logs, visualise monitored metrics, set monitoring thresholds, and receive alerts when those thresholds are exceeded.
CloudTrail stores extensive audit trails of user and service actions for all AWS services in our architecture. CloudTrail keeps track of your AWS account activity, including actions taken via the AWS Management Console, AWS SDKs, command line tools, and other AWS services. This event history simplifies security analysis, resource change tracking, and troubleshooting, and helps you detect unusual activity in your AWS accounts.
You can build a modern, low-cost data lake analytics architecture in days using AWS serverless and managed services. To address new requirements and data sources, a decoupled, component-driven architecture allows you to start small and quickly add new purpose-built components to one of six architecture layers.
How DAIMLINC can help
If you’re interested in building a POC around an AWS serverless data analytics pipeline or want to develop a production-ready pipeline, talk to one of our experts.