Building Batch Data Analytics Solutions on AWS – BBDA001
- Course Code: BBDA001
- Duration: 1 Day
- Price: 708 GBP
- Level: Intermediate
- Language: English
Course Content
This Building Batch Data Analytics Solutions on AWS course equips you with the skills to design and implement batch data analytics solutions using Amazon EMR, AWS’s managed service for Apache Spark and Apache Hadoop. You’ll explore Amazon EMR’s integration with open-source tools like Apache Hive, Hue, and HBase, as well as AWS services such as AWS Glue and AWS Lake Formation. The course covers essential components of data pipelines, including collection, ingestion, cataloging, storage, and processing, with a focus on Spark and Hadoop. Additionally, you’ll use EMR Notebooks for analytics and machine learning workloads and apply best practices for security, performance, and cost management.
Delivery Method
- Online
Goals
By the end of this course, you will be able to:
- Compare the features and benefits of data warehouses, data lakes, and modern data architectures.
- Design and implement effective batch data analytics solutions.
- Apply data storage optimization techniques, including compression.
- Select and deploy the appropriate tools for data ingestion, transformation, and storage.
- Choose the right instance types, clusters, auto-scaling options, and network topologies for various business scenarios.
- Understand the relationship between data storage, processing, and analytics for actionable business insights.
- Implement security measures for data at rest and in transit.
- Monitor and troubleshoot analytics workloads to ensure reliability.
- Use cost management best practices for efficient operations.
Prerequisites
Participants should have:
- Completed the AWS Technical Essentials course.
- At least one year of experience building data analytics pipelines, or completed the Data Analytics Fundamentals digital course.
Course Outline
Module A: Overview of Data Analytics and the Data Pipeline
- Explore data analytics use cases.
- Understand the role of data pipelines in analytics.
Module 1: Introduction to Amazon EMR
- Role of Amazon EMR in analytics solutions.
- Amazon EMR cluster architecture.
- Interactive Demo: Launching an Amazon EMR cluster.
- Cost management strategies for Amazon EMR.
Module 2: Data Analytics Pipeline Using Amazon EMR: Ingestion and Storage
- Techniques for optimizing data storage with Amazon EMR.
- Methods for data ingestion.
Module 3: High-Performance Batch Data Analytics Using Apache Spark on Amazon EMR
- Key use cases for Apache Spark on Amazon EMR.
- Apache Spark concepts and benefits in EMR.
- Interactive Demo: Connecting to an EMR cluster and using the Spark shell with Scala commands.
- Data transformation, processing, and analytics.
- Using EMR Notebooks for analytics workloads.
- Practice Lab: Conduct low-latency data analytics with Apache Spark on EMR.
Module 4: Processing and Analyzing Batch Data with Amazon EMR and Apache Hive
- Batch data processing with Hive on Amazon EMR.
- Transformation, processing, and analytics using Hive.
- Practice Lab: Batch data processing with Amazon EMR and Hive.
- Introduction to Apache HBase on Amazon EMR.
Module 5: Serverless Data Processing
- Serverless solutions for data processing, transformation, and analytics.
- Leveraging AWS Glue with Amazon EMR workloads.
- Practice Lab: Orchestrate Spark data processing with AWS Step Functions.
Module 6: Security and Monitoring of Amazon EMR Clusters
- Securing Amazon EMR clusters with best practices.
- Interactive Demo: Implementing client-side encryption with EMRFS.
- Monitoring and troubleshooting EMR clusters.
- Demo: Reviewing Apache Spark cluster history for performance insights.
Module 7: Designing Batch Data Analytics Solutions
- Explore batch data analytics use cases and best practices.