Cloud Devops Data Engineering

    •  Master the art of data engineering on AWS and build powerful data pipelines.

Created by Arvind Agrawal

  • English

About the course

Hadoop Bigdata Cloud devops training Syllabus

Module1: Hadoop Eco System components

(HDFS,Mapreduce,hive,sqoop, hbase ,flume)

===========================================================

HDFS

  • 1.Architecture
  • 2.hdfs features
  • 3.read and write operations in hdfs
  • 4.hdfs developer commands,hdfs admin commands
  • 5. hdfs data blocks
  • 6. rack awareness
  • 7. high availability
  • 8. Fault Tolerance
  • 9.Name Node high Availability
  • 10. hdfs Federation

MapReduce

  • 1.Introduction
  • 2.Architecture
  • 3.Mapper ,shuffle and sort ,Reducer
  • 4.Key-value pairs
  • 5.input format,input split, Record reader,output format
  • 6.Partitioner,Combiner
  • 7.Map Side join ,Reduce side join,distributed Cache
  • 8.Counter
  • 9.performance tuning in mapreduce

Hive

  • 1.Introduction
  • 2.Architecture
  • 3.Built-In-Function
  • 4.UDFs(UDF,UDAF,UDTF)
  • 5.DDL commands(CREATE,SHOW,DESCRIBE,USEDROP,ALTER,TRUNCATE)
  • 6.DML Commands(LOAD,SELECT,INSERT,DELETE,UPDATE,EXPORT,IMPORT)7.Apache Hive View and Hive Index
  • 8.Hive Metastore – Different Ways to Configure Hive Metastore
  • 9.Hive Data Model – Table,Partition,Bucket
  • 10.Hive Data Types – Primitive and Complex Data Types in Hive - Complex data types - Array ,Struct,Map
  • 11.Hive Operators –Relational Operators,Arithmetic Operators,Logical Operators,String Operators,Operators on Complex Types
  • 12.Hive SerDe – Custom & Built-in SerDe in Hive
  • JsonSerde,OpenCSVSerde,ParquetSerde,OrcSerde,XmlSerde,RegexSerde
  • 13.Hive Partitions, Types of Hive Partitioning with Examples
  • Static Partitioning,Dynamic Partitioning
  • 13.Bucketing in Hive – Creation of Bucketed Table in Hive
  • 14.Hive Join | HiveQL Select Joins Query | Types of Join in Hive
  • Inner,Left Outer Join,Right Outer Join,Full Outer Join,Self join,Cross Join,Map Join,Bucket Map Join,Skew Join,Sort Merge Bucket Join
  • 15. Internal vs External Table
  • 16. configure Mysql metastore
  • 17. HQL select statements,Group By,Having,Grouping Sets ,Rollup and Cube,Order By Query,Sort by ,clustered by
  • 18. Windows function- Row_number,rank,dense_rank(),lead(),lag(),first_value(),last_value()
  • 19.Hive Optimization Techniques – Hive Performance
  • 20.hive Security - Authentication,Authorization,Encryption
  • 21 Hive Transaction management

Sqoop

  • 1.sqoop architecture
  • 2.sqoop features
  • 3.sqoop eval
  • 4.sqoop import
  • 5.sqoop import-all tables
  • 6.sqoop validation
  • 7.sqoop export
  • 8.sqoop incremental jobs
  • 9.sqoop jobs
  • 10.sqoop codegen
  • 11.sqoop merge
  • 12.sqoop metastore
  • 13.sqoop list-databases
  • 14.sqoop list-tables
  • 15.sqoop connectors & drivers
  • 16.Import Mainframe
  • 17.Hcatalog Integration
  • 18.Troubleshooting issue in sqoop
  • 19.sqoop performance tuning
  • Hbase
  • 1.Hbase Architecture
  • 2.Hbase features
  • 3.Hbase Use Cases
  • 4.Hbase operations
  • 5.Hbase commands
  • 6.Table Management Commands in HBase
  • 7.Data Manipulation HBase Command – Create, Truncate, Scan
  • 8.HBase Admin API – Class Descriptor & Class HBaseAdmin
  • 9.HBase Client API – HTable, Put, Get, Delete, Result
  • 10.HBase MemStore – Uses, Benefits & Configuration
  • 11.HBase Security: Kerberos Authentication & Authorization
  • 12.HBase vs RDBMS: Feature Wise Comparison
  • 13.HBase vs Impala: Compare Which is Better
  • 14.HBase Troubleshooting – Problem, Cause & Solution
  • 15.HBase Performance Tuning | Ways For HBase Optimization

 

Module 2 - Spark Core Components:

spark ->(spark core ,sql,Streaming,mysql integration,Mongodb,Cassandra,snowflakes,ElasticSearch,Sparkkafka streaming,Hbase integration)

Spark-Apache Spark Core

  • 1.Spark Introduction
  • 2.Apache Spark Ecosystem – Complete Spark Component
  • 3.Features of Apache Spark – Learn the benefits of using Spark
  • 4.Apache Spark Use Cases in Real Time
  • 5.Spark Shell Commands to Interact with Spark-Scala
  • 6.Spark Shell Commands to Interact with Spark-python
  • 7.Learn SparkContext,SparkSession – Introduction and Functions
  • 8.Spark Stage,tasks- An Introduction to Physical Execution plan
  • 9.Spark RDD – Introduction, Features & Operations of RDD
  • 10.RDD Persistence and Caching Mechanism in Apache Spark
  • 11.Shining Features of Spark RDD You Must Know
  • 12.Introduction to Apache Spark Paired RDD
  • 13.How to Overcome the Limitations of RDD in Apache Spark?
  • 14.Spark RDD Operations-Transformation & Action with Example
  • 15.RDD lineage in Spark: ToDebugString Method
  • 16.Apache Spark Map vs FlatMap Operation
  • 17.Spark In-Memory Computing – A Beginners Guide
  • 18.Lazy Evaluation,Fault tolerance ,Directed Acyclic Graph DAG in Apache Spark
  • 19.Apache Spark Cluster Managers – YARN, Mesos & Standalone , how it works
  • 20.Spark Performance Tuning-Learn to Tune Apache Spark Job
  • Spark- Apache Spark SQL
  • 1.Apache Spark SQL Tutorial – Quick Introduction Guide
  • 2.Spark SQL Features
  • 3.Spark SQL DataFrame
  • 4.Spark Dataset
  • 5.Spark SQL Optimization – Understanding the Catalyst Optimizer
  • 6.Apache Spark RDD vs DataFrame vs DataSet
  • 7.Spark Mysql integration
  • 8.spark Hive Integration
  • 9.Spark MongoDB integration(including MongoDB Hands on)
  • 10.Spark Cassandra Integration(including Cassandra Hands On)
  • 12.Spark Hbase Integration
  • 13.Spark ElasticSearch Integration(including Elasticsearch Hands On)
  • 14.Spark joining strategies,Spark joins(inner join,left outer ,right outer join,self join ,cross join,full outer join),skew join,broadcast join
  • 15.Spark storage format(Parquet,avro and ORC)
  • 16.Spark dataframe api for window function(row_number,rank,dense_rank,lead,lag,first_value and last_value)
  • 17.spark sql api

Spark-Apache Spark Streaming

  • 1.Spark streaming introduction
  • 2.Apache Spark DStream (Discretized Streams)
  • 3.Apache Spark Streaming Transformation Operations
  • 4.Spark Streaming Checkpoint in Apache Spark
  • 5.Spark Watermarking Checkpoint in Apache Spark
  • 6.Spark Kafka Integration

Module 3 - Big data Cloud Technologies:

AWS Services:-

aws data engineer -> (lambda,glue,emr, kinesis,dynamodb,RDS,EC2,S3,Redshift)

======================================================

1.Big Data on AWS Introduction

  • Cloud Computing Introduction, Advantages, and Types
  • Cloud Deployment Models
  • Cloud Service Categories
  • AWS Cloud Platform
  • AWS Cloud Architecture Design Principles - Part I
  • AWS Cloud Architecture Design Principles - Part II
  • Why AWS for Big Data - Reasons and Challenges

2.Databases in AWS

  • Data Warehousing in AWS
  • Redshift, Kinesis, and EMR
  • DynamoDB, Machine Learning, and Lambda
  • ElasticSearch Services and EC2

3.Big Data on AWS - Collection

  • Amazon Kinesis and Kinesis Stream
  • Kinesis Data Stream Architecture and Core Components
  • Data Producer
  • Data Consumer
  • Kinesis Stream Emitting Data to AWS Services and Kinesis Connector Library
  • Kinesis Firehose
  • Demo - Put and Get Records from Kinesis Data Stream
  • Transferring Data Using Lambda
  • Amazon SQS Lifecycle and Architecture
  • IoT and Big Data
  • IoT Framework
  • AWS Data Pipelines and Data Nodes
  • Activity, Pre-Condition, and Schedule
  • Demo - Importing Data from S3 into DynamoDB Using Data Pipeline

4.Big Data on AWS - Storage

  • Amazon Glacier and Big Data
  • DynamoDB Introduction
  • DynamoDB and EMR
  • DynamoDB Partitions and Distributions
  • DynamoDB GSI LSI
  • DynamoDB Stream and Cross-Region Replication
  • DynamoDB Performance and Partition Key Selection
  • Snowball and AWS Big Data
  • AWS DMS
  • AWS Aurora in Big Data
  • Demo - Amazon Athena Interactive SQL Queries for Data in Amazon S3 Part I
  • Demo - Amazon Athena Interactive SQL Queries for Data in Amazon S3 Part II

5.Big Data on AWS - Processing

  • Amazon EMR
  • Demo - Analyzing Big Data with Amazon EMR
  • Apache Hadoop
  • EMR Architecture
  • EMR Operations - Releases and Cluster
  • EMR Operations - Choosing Instance and Monitoring
  • Demo - Advanced EMR Setting Options
  • Hive on EMR
  • HBase with EMR
  • Presto with EMR
  • Spark with EMR
  • EMR File Storage
  • Demo - Analyzing Large Datasets Using Hive and Spark
  • AWS Lambda

6.Big Data on AWS - Analysis

  • Redshift Intro and Use Cases
  • Redshift Architecture
  • MPP and Redshift in AWS Ecosystem
  • Columnar Databases
  • Redshift Table Design - Part I
  • Redshift Table Design - Part II
  • Demo - Generating Random Dataset in EC2 and Loading it in S3
  • Demo - Redshift Maintenance and Operations
  • Machine Learning Introduction
  • Machine Learning Algorithm
  • Amazon SageMaker
  • Amazon Elasticsearch
  • Amazon Elasticsearch Services
  • Demo - Loading Datasets into Elasticsearch
  • Logstash and RStudio
  • Demo - Fetching the File and Analyzing it using RStudio
  • Athena
  • Demo - Running Query on S3 using the Serverless Athena
  • Demo - Creating a Redshift Cluster and Loading the Datasets into it from S3 - Part I
  • Demo - Creating a Redshift Cluster and Loading the Datasets into it from S3 - Part II

7.Big Data on AWS - Visualization

  • Amazon QuickSight
  • Demo - Creating an Analysis with a Single Visual using Sample Data
  • Demo - Creating an Analysis using Your Own Amazon S3 Data
  • Big Data Visualization

8.Big Data on AWS - Security

  • EMR Security and Security Group
  • Roles and Private Subnet
  • Encryption at Rest and In-Transit
  • Redshift Security
  • Encryption at Rest using CloudHSM
  • Cloud HSM versus AWS KMS
  • Limit Data Access

Azure Services:-

azure data engineer  -> (azure functions, azure blob storage, azure data factory ,azure databricks and azure synapse ,azure event hub)

======================

1.Working with Azure Blob Storage

2.Working with Relational Databases in Azure

  • Provisioning and connecting to an Azure SQL database using PowerShell
  • Provisioning and connecting to an Azure PostgreSQL database using the Azure CLI
  • Provisioning and connecting to an Azure MySQL database using the Azure CLI
  • Implementing active geo-replication for an Azure SQL database using PowerShell
  • Implementing an auto-failover group for an Azure SQL database using PowerShell
  • Implementing vertical scaling for an Azure SQL database using PowerShell
  • Implementing an Azure SQL database elastic pool using PowerShell
  • Monitoring an Azure SQL database using the Azure portal

3.Analyzing Data with Azure Synapse Analytics

  • Provisioning and connecting to an Azure Synapse SQL pool using PowerShell
  • Pausing or resuming a Synapse SQL pool using PowerShell
  • Scaling an Azure Synapse SQL pool instance using PowerShell
  • Loading data into a SQL pool using PolyBase with T-SQL
  • Loading data into a SQL pool using the COPY INTO statement
  • Implementing workload management in an Azure Synapse SQL pool
  • Optimizing queries using materialized views in Azure Synapse Analytics

4.Control Flow Activities in Azure Data Factory

  • Implementing control flow activities
  • Implementing control flow activities – Lookup and If activities
  • Triggering a pipeline in Azure Data Factory

5.Control Flow Transformation and the Copy Data Activity in Azure Data Factory

  • Implementing HDInsight Hive and Pig activities
  • Implementing an Azure Functions activity
  • Implementing a Data Lake Analytics U-SQL activity
  • Copying data from Azure Data Lake Gen2 to an Azure Synapse SQL pool copy activity
  • Copying data from Azure Data Lake Gen2 to Azure Cosmos DB using the copy activity

6.Data Flows in Azure Data Factory

  • Implementing incremental data loading with a mapping data flow
  • Implementing a wrangling data flow

7.Azure Data Factory Integration Runtime

  • Configuring a self-hosted IR
  • Configuring a shared self-hosted IR
  • Migrating an SSIS package to Azure Data Factory
  • Executing an SSIS package with an on-premises data store

8.Deploying Azure Data Factory Pipelines

  • Configuring the development, test, and production environments
  • Deploying Azure Data Factory pipelines using the Azure portal and ARM templates
  • Automating Azure Data Factory pipeline deployment using Azure DevOps

9.Batch and Streaming Data Processing with Azure Databricks

  • Configuring the Azure Databricks environment
  • Transforming data using Python
  • Transforming data using Scala
  • Working with Delta Lake

10. Processing structured streaming data with Azure Databricks

GCP Services:-

GCP data engineer  -> (Gcp data proc, pub sub, apache beam , composer,Gcp sql data storages, Big query and nosql database)

GCP Big Data engineer

======================================================

1.Getting Started with Data Engineering with GCP

2.Big Data Capabilities on GCP

  • Understanding what the cloud is
  • Getting started with Google Cloud Platform
  • A quick overview of GCP services for data engineering

3.Building Solutions with GCP Components

4. Building a Data Warehouse in BigQuery

  • Introduction to Google Cloud Storage and BigQuery
  • Introduction to the BigQuery console
  • Preparing the prerequisites before developing our data warehouse
  • Practicing developing a data warehouse

5.Building Orchestration for Batch Data Loading Using Cloud Composer

  • Introduction to Cloud Composer
  • Understanding the working of Airflow
  • Exercise: Build data pipeline orchestration using Cloud Composer

6.Building a Data Lake Using Dataproc

  • Introduction to Dataproc
  • Exercise – Building a data lake on a Dataproc cluster
  • Exercise: Creating and running jobs on a Dataproc cluster
  • Understanding the concept of the ephemeral cluster
  • Building an ephemeral cluster using Dataproc and Cloud Composer

7.Processing Streaming Data with Pub/Sub and Dataflow

  • Processing streaming data
  • Exercise – Publishing event streams to cloud Pub/Sub
  • Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS

8.Visualizing Data for Making Data-Driven Decisions with Data Studio

  • Unlocking the power of your data with Data Studio
  • From data to metrics in minutes with an illustrative use case
  • Understanding how Data Studio can impact the cost of BigQuery
  • How to create materialized views and understanding how BI Engine works

9.Key Strategies for Architecting Top-Notch Data Pipelines

10.User and Project Management in GCP

  • Understanding IAM in GCP
  • Planning a GCP project structure
  • Controlling user access to our data warehouse
  • Practicing the concept of IaC using Terraform

11.Cost Strategy in GCP

  • Estimating the cost of your end-to-end data solution in GCP
  • Tips for optimizing BigQuery using partitioned and clustered tables

12.CI/CD on Google Cloud Platform for Data Engineers

  • Introduction to CI/CD
  • Understanding CI/CD components with GCP services
  • Exercise – implementing continuous integration using Cloud Build
  • Exercise – deploying Cloud Composer jobs using Cloud Build

Module 4 - Snowflakes,Airflow,NIFI

Snowflakes

=================================================

1. Getting Started with Snowflake

  • Creating a new Snowflake instance
  • Creating a tailored multi-cluster virtual warehouse
  • Using the Snowflake WebUI and executing a query
  • Using SnowSQL to connect to Snowflake
  • Connecting to Snowflake with JDBC
  • Creating a new account admin user and understanding built-in roles

2.Managing the Data Life Cycle

  • Managing a database
  • Managing a schema
  • Managing tables
  • Managing external tables and stages
  • Managing views in Snowflake

3.Loading and Extracting Data into and out of Snowflake

  • Configuring Snowflake access to private S3 buckets
  • Loading delimited bulk data into Snowflake from cloud storage
  • Loading delimited bulk data into Snowflake from your local machine
  • Loading Parquet files into Snowflake
  • Making sense of JSON semi-structured data and transforming to a relational view
  • Processing newline-delimited JSON (or NDJSON) into a Snowflake table
  • Processing near real-time data into a Snowflake table using Snowpipe
  • Extracting data from Snowflake

4.Building Data Pipelines in Snowflake

  • Creating and scheduling a task
  • Conjugating pipelines through a task tree
  • Querying and viewing the task history
  • Exploring the concept of streams to capture table-level changes
  • Streams and Tasks to build pipelines that process changed data on a schedule
  • Converting data types and Snowflake's failure management
  • Managing context using different utility functions

5.Data Protection and Security in Snowflake

  • Setting up custom roles and completing the role hierarchy
  • Configuring and assigning a default role to a user
  • Delineating user management from security and role management
  • Configuring custom roles for managing access to highly secure data
  • Setting up development, testing, pre-production, and production database hierarchies and roles
  • Safeguarding the ACCOUNTADMIN role and users in the ACCOUNTADMIN role

6.Performance and Cost Optimization

  • Examining table schemas and deriving an optimal structure for a table
  • Identifying query plans and bottlenecks
  • Weeding out inefficient queries through analysis
  • Identifying and reducing unnecessary Fail-safe and Time Travel storage usage
  • Projections in Snowflake for performance
  • Reviewing query plans to modify table clustering
  • Optimizing virtual warehouse scale

7.Secure Data Sharing

  • Sharing a table with another Snowflake account
  • Sharing data through a view with another Snowflake account
  • Sharing a complete database with another Snowflake account and setting up future objects to be shareable
  • Creating reader accounts and configuring them for non-Snowflake sharing
  • Keeping costs in check when sharing data with non-Snowflake users

8.Back to the Future with Time Travel

  • Using Time Travel to return to the state of data at a particular time
  • Using Time Travel to recover from the accidental loss of table data
  • Identifying dropped databases, tables, and other objects and restoring them using Time Travel
  • Using Time Travel in conjunction with cloning to improve debugging
  • Using cloning to set up new environments based on the production environment rapidly

9.Advanced SQL Techniques

  • Managing timestamp data
  • Shredding date data to extract Calendar information
  • Unique counts and Snowflake
  • Managing transactions in Snowflake
  • Ordered analytics over window frames
  • Generating sequences in Snowflake

10.Extending Snowflake Capabilities

  • Creating a Scalar user-defined function using SQL
  • Creating a Table user-defined function using SQL
  • Creating a Scalar user-defined function using JavaScript
  • Creating a Table user-defined function using JavaScript
  • Connecting Snowflake with Apache Spark
  • Using Apache Spark to prepare data for storage on Snowflake

NIFI & Airflow

===================================================

1.Building Data Pipelines – Extract Transform, and Load

2.Building Our Data Engineering Infrastructure

  • Installing and configuring Apache NiFi
  • Installing and configuring Apache Airflow
  • Installing and configuring Elasticsearch
  • Installing and configuring Kibana
  • Installing and configuring PostgreSQL
  • Installing pgAdmin 4

3.Reading and Writing Files

  • Writing and reading files in Python
  • Building data pipelines in Apache Airflow
  • Handling files using NiFi processors

4.Working with Databases

  • Inserting and extracting relational data in Python
  • Inserting and extracting NoSQL database data in Python
  • Building data pipelines in Apache Airflow
  • Handling databases with NiFi processors

5.Cleaning, Transforming, and Enriching Data

  • Performing exploratory data analysis in Python
  • Handling common data issues using pandas
  • Cleaning data using Airflow

6.Deploying Data Pipelines in Production

7.Features of a Production Pipeline

  • Staging and validating data
  • Building idempotent data pipelines
  • Building atomic data pipelines

8.Version Control with the NiFi Registry

  • Installing and configuring the NiFi Registry
  • Using the Registry in NiFi
  • Versioning your data pipelines
  • Using git-persistence with the NiFi Registry

9.Monitoring Data Pipelines

  • Monitoring NiFi using the GUI
  • Monitoring NiFi with processors
  • Using Python with the NiFi REST API

10.Deploying Data Pipelines

  • Finalizing your data pipelines for production
  • Using the NiFi variable registry
  • Deploying your data pipelines

11.Building a Production Data Pipeline

  • Creating a test and production environment
  • Building a production data pipeline
  • Deploying a data pipeline in production

12.Beyond Batch – Building Real-Time Data Pipelines

13.Streaming Data with Apache Kafka

  • Understanding logs
  • Understanding how Kafka uses logs
  • Building data pipelines with Kafka and NiFi
  • Differentiating stream processing from batch processing
  • Producing and consuming with Python

14.Data Processing with Apache Spark

  • Installing and running Spark
  • Installing and configuring PySpark
  • Processing data with PySpark

Module 5 - Devops

devops ->(git,jenkins,docker and kubernetes)(spark ,kafka ,airflow ,hadoop pipeline using docker and kubernetes ,helm chart, Terraform)

GIT

  •     GIT Features
  •     3-Tree Architecture
  •     GIT – Clone /Commit / Push
  •     GIT revert and reset
  •     GIT Branching strategies
  •     GIT Rebase & Merge
  •     GIT Stash, Reset, Checkout
  •     GIT Clone, Fetch, Pull

Jenkins

  • Introduction to Jenkins
  • Continuous Integration with Jenkins
  • Configure Jenkins
  • Jenkins Management
  • Scheduling build Jobs
  • POLL SCM
  • Build Periodically
  • Maven Build Scripts
  • Support for the GIT version control System
  • Different types of Jenkins Jobs
  • Jenkins Build PipeLine
  • Parent and Child Builds
  • Sequential Builds
  • Jenkins Master & Slave Node Configuration

Docker

  •     How to get Docker Image?
  •     What is Docker Image
  •     Docker Installation
  •     Working with Docker Containers
  •         What is Container
  •         Docker Engine
  •         Crating Containers with an Image
  •         Working with Images
  •     Docker Command Line Interphase
  •     Docker Compose
  •     Docker Hub
  •     Docker Trusted Registry
  •     Docker swarm
  •     Docker attach
  •     Docker File & Commands
  •    Docker containers for kafka ,spark,cassandra etl pipeline

Kubernetes:

  • Kubernetes Introduction
  • Kubernetes Architecture
  • Kubernetes Setup (Self Managed,AWS managed)
  • Kubernetes Pods
  • Kubernetes Services
  • Kubernetes Namespaces
  • Replication Controller & ReplicaSet
  • Kubernetes Deployments
  • Kubernetes ConfigMap
  • Kubernetes Secrets
  • HELM Charts
  • EKS Cluster
  • Monitoring
  • Projects Setup using helm chart for big data pipeline (kafka,spark ,cassandra,airflow ,nifi)

Course Curriculum

What do we offer

Live learning

Learn live with top educators, chat with teachers and other attendees, and get your doubts cleared.

Structured learning

Our curriculum is designed by experts to make sure you get the best learning experience.

Community & Networking

Interact and network with like-minded folks from various backgrounds in exclusive chat groups.

Learn with the best

Stuck on something? Discuss it with your peers and the instructors in the inbuilt chat groups.

Practice tests

With the quizzes and live tests practice what you learned, and track your class performance.

Get certified

Flaunt your skills with course certificates. You can showcase the certificates on LinkedIn with a click.

Reviews

Enroll Now