Cloud Devops Data Engineering

Created by Arvind Agrawal

Language: English

Instructors: Arvind Agrawal

About the course

Hadoop, Big Data, Cloud, and DevOps Training Syllabus

Module 1: Hadoop Ecosystem Components

(HDFS, MapReduce, Hive, Sqoop, HBase, Flume)

===========================================================

HDFS

1. Architecture
2. HDFS features
3. Read and write operations in HDFS
4. HDFS developer commands, HDFS admin commands
5. HDFS data blocks
6. Rack awareness
7. High availability
8. Fault tolerance
9. NameNode high availability
10. HDFS Federation

MapReduce

1. Introduction
2. Architecture
3. Mapper, shuffle and sort, Reducer
4. Key-value pairs
5. Input format, input split, record reader, output format
6. Partitioner, Combiner
7. Map-side join, reduce-side join, distributed cache
8. Counters
9. Performance tuning in MapReduce

Hive

1. Introduction
2. Architecture
3. Built-in functions
4. UDFs (UDF, UDAF, UDTF)
5. DDL commands (CREATE, SHOW, DESCRIBE, USE, DROP, ALTER, TRUNCATE)
6. DML commands (LOAD, SELECT, INSERT, DELETE, UPDATE, EXPORT, IMPORT)
7. Apache Hive View and Hive Index
8. Hive Metastore – Different Ways to Configure Hive Metastore
9. Hive Data Model – Table, Partition, Bucket
10. Hive Data Types – Primitive and Complex Data Types in Hive (complex types: Array, Struct, Map)
11. Hive Operators – Relational Operators, Arithmetic Operators, Logical Operators, String Operators, Operators on Complex Types
12. Hive SerDe – Custom & Built-in SerDe in Hive
JsonSerde, OpenCSVSerde, ParquetSerde, OrcSerde, XmlSerde, RegexSerde
13. Hive Partitions, Types of Hive Partitioning with Examples
Static Partitioning, Dynamic Partitioning
14. Bucketing in Hive – Creation of Bucketed Table in Hive
15. Hive Join | HiveQL Select Joins Query | Types of Join in Hive
Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Self Join, Cross Join, Map Join, Bucket Map Join, Skew Join, Sort Merge Bucket Join
16. Internal vs External Table
17. Configuring a MySQL metastore
18. HQL SELECT statements, Group By, Having, Grouping Sets, Rollup and Cube, Order By, Sort By, Cluster By
19. Window functions – row_number(), rank(), dense_rank(), lead(), lag(), first_value(), last_value()
20. Hive Optimization Techniques – Hive Performance
21. Hive Security – Authentication, Authorization, Encryption
22. Hive transaction management

Sqoop

1. Sqoop architecture
2. Sqoop features
3. Sqoop eval
4. Sqoop import
5. Sqoop import-all-tables
6. Sqoop validation
7. Sqoop export
8. Sqoop incremental jobs
9. Sqoop jobs
10. Sqoop codegen
11. Sqoop merge
12. Sqoop metastore
13. Sqoop list-databases
14. Sqoop list-tables
15. Sqoop connectors & drivers
16. Import mainframe
17. HCatalog integration
18. Troubleshooting issues in Sqoop
19. Sqoop performance tuning

HBase

1. HBase Architecture
2. HBase features
3. HBase Use Cases
4. HBase operations
5. HBase commands
6. Table Management Commands in HBase
7. Data Manipulation HBase Command – Create, Truncate, Scan
8. HBase Admin API – Class Descriptor & Class HBaseAdmin
9. HBase Client API – HTable, Put, Get, Delete, Result
10. HBase MemStore – Uses, Benefits & Configuration
11. HBase Security: Kerberos Authentication & Authorization
12. HBase vs RDBMS: Feature Wise Comparison
13. HBase vs Impala: Compare Which is Better
14. HBase Troubleshooting – Problem, Cause & Solution
15. HBase Performance Tuning | Ways For HBase Optimization

Module 2 - Spark Core Components:

Spark -> (Spark Core, SQL, Streaming, MySQL integration, MongoDB, Cassandra, Snowflake, Elasticsearch, Spark-Kafka streaming, HBase integration)

Spark - Apache Spark Core

1. Spark Introduction
2. Apache Spark Ecosystem – Complete Spark Component
3. Features of Apache Spark – Learn the benefits of using Spark
4. Apache Spark Use Cases in Real Time
5. Spark Shell Commands to Interact with Spark (Scala)
6. Spark Shell Commands to Interact with Spark (Python)
7. Learn SparkContext, SparkSession – Introduction and Functions
8. Spark Stages and Tasks – An Introduction to the Physical Execution Plan
9. Spark RDD – Introduction, Features & Operations of RDD
10. RDD Persistence and Caching Mechanism in Apache Spark
11. Shining Features of Spark RDD You Must Know
12. Introduction to Apache Spark Paired RDD
13. How to Overcome the Limitations of RDD in Apache Spark?
14. Spark RDD Operations – Transformation & Action with Example
15. RDD Lineage in Spark: toDebugString Method
16. Apache Spark Map vs FlatMap Operation
17. Spark In-Memory Computing – A Beginner's Guide
18. Lazy Evaluation, Fault Tolerance, and the Directed Acyclic Graph (DAG) in Apache Spark
19. Apache Spark Cluster Managers – YARN, Mesos & Standalone, and how they work
20. Spark Performance Tuning – Learn to Tune Apache Spark Jobs

Spark - Apache Spark SQL

1. Apache Spark SQL Tutorial – Quick Introduction Guide
2. Spark SQL Features
3. Spark SQL DataFrame
4. Spark Dataset
5. Spark SQL Optimization – Understanding the Catalyst Optimizer
6. Apache Spark RDD vs DataFrame vs Dataset
7. Spark MySQL Integration
8. Spark Hive Integration
9. Spark MongoDB Integration (including MongoDB Hands-On)
10. Spark Cassandra Integration (including Cassandra Hands-On)
11. Spark HBase Integration
12. Spark Elasticsearch Integration (including Elasticsearch Hands-On)
13. Spark joining strategies, Spark joins (inner join, left outer join, right outer join, self join, cross join, full outer join), skew join, broadcast join
14. Spark storage formats (Parquet, Avro, and ORC)
15. Spark DataFrame API for window functions (row_number, rank, dense_rank, lead, lag, first_value, and last_value)
16. Spark SQL API

Spark - Apache Spark Streaming

1. Spark Streaming Introduction
2. Apache Spark DStream (Discretized Streams)
3. Apache Spark Streaming Transformation Operations
4. Spark Streaming Checkpoint in Apache Spark
5. Watermarking in Apache Spark
6. Spark Kafka Integration

Module 3 - Big data Cloud Technologies:

AWS Services:-

AWS data engineer -> (Lambda, Glue, EMR, Kinesis, DynamoDB, RDS, EC2, S3, Redshift)

======================================================

1.Big Data on AWS Introduction

Cloud Computing Introduction, Advantages, and Types
Cloud Deployment Models
Cloud Service Categories
AWS Cloud Platform
AWS Cloud Architecture Design Principles - Part I
AWS Cloud Architecture Design Principles - Part II
Why AWS for Big Data - Reasons and Challenges

2.Databases in AWS

Data Warehousing in AWS
Redshift, Kinesis, and EMR
DynamoDB, Machine Learning, and Lambda
ElasticSearch Services and EC2

3.Big Data on AWS - Collection

Amazon Kinesis and Kinesis Stream
Kinesis Data Stream Architecture and Core Components
Data Producer
Data Consumer
Kinesis Stream Emitting Data to AWS Services and Kinesis Connector Library
Kinesis Firehose
Demo - Put and Get Records from Kinesis Data Stream
Transferring Data Using Lambda
Amazon SQS Lifecycle and Architecture
IoT and Big Data
IoT Framework
AWS Data Pipelines and Data Nodes
Activity, Pre-Condition, and Schedule
Demo - Importing Data from S3 into DynamoDB Using Data Pipeline

4.Big Data on AWS - Storage

Amazon Glacier and Big Data
DynamoDB Introduction
DynamoDB and EMR
DynamoDB Partitions and Distributions
DynamoDB GSI and LSI
DynamoDB Stream and Cross-Region Replication
DynamoDB Performance and Partition Key Selection
Snowball and AWS Big Data
AWS DMS
AWS Aurora in Big Data
Demo - Amazon Athena Interactive SQL Queries for Data in Amazon S3 Part I
Demo - Amazon Athena Interactive SQL Queries for Data in Amazon S3 Part II

5.Big Data on AWS - Processing

Amazon EMR
Demo - Analyzing Big Data with Amazon EMR
Apache Hadoop
EMR Architecture
EMR Operations - Releases and Cluster
EMR Operations - Choosing Instance and Monitoring
Demo - Advanced EMR Setting Options
Hive on EMR
HBase with EMR
Presto with EMR
Spark with EMR
EMR File Storage
Demo - Analyzing Large Datasets Using Hive and Spark
AWS Lambda

6.Big Data on AWS - Analysis

Redshift Intro and Use Cases
Redshift Architecture
MPP and Redshift in AWS Ecosystem
Columnar Databases
Redshift Table Design - Part I
Redshift Table Design - Part II
Demo - Generating Random Dataset in EC2 and Loading it in S3
Demo - Redshift Maintenance and Operations
Machine Learning Introduction
Machine Learning Algorithm
Amazon SageMaker
Amazon Elasticsearch
Amazon Elasticsearch Services
Demo - Loading Datasets into Elasticsearch
Logstash and RStudio
Demo - Fetching the File and Analyzing it using RStudio
Athena
Demo - Running Query on S3 using the Serverless Athena
Demo - Creating a Redshift Cluster and Loading the Datasets into it from S3 - Part I
Demo - Creating a Redshift Cluster and Loading the Datasets into it from S3 - Part II

7.Big Data on AWS - Visualization

Amazon QuickSight
Demo - Creating an Analysis with a Single Visual using Sample Data
Demo - Creating an Analysis using Your Own Amazon S3 Data
Big Data Visualization

8.Big Data on AWS - Security

EMR Security and Security Group
Roles and Private Subnet
Encryption at Rest and In-Transit
Redshift Security
Encryption at Rest using CloudHSM
Cloud HSM versus AWS KMS
Limit Data Access

Azure Services:-

Azure data engineer -> (Azure Functions, Azure Blob Storage, Azure Data Factory, Azure Databricks, Azure Synapse, and Azure Event Hubs)

======================

1.Working with Azure Blob Storage

2.Working with Relational Databases in Azure

Provisioning and connecting to an Azure SQL database using PowerShell
Provisioning and connecting to an Azure PostgreSQL database using the Azure CLI
Provisioning and connecting to an Azure MySQL database using the Azure CLI
Implementing active geo-replication for an Azure SQL database using PowerShell
Implementing an auto-failover group for an Azure SQL database using PowerShell
Implementing vertical scaling for an Azure SQL database using PowerShell
Implementing an Azure SQL database elastic pool using PowerShell
Monitoring an Azure SQL database using the Azure portal

3.Analyzing Data with Azure Synapse Analytics

Provisioning and connecting to an Azure Synapse SQL pool using PowerShell
Pausing or resuming a Synapse SQL pool using PowerShell
Scaling an Azure Synapse SQL pool instance using PowerShell
Loading data into a SQL pool using PolyBase with T-SQL
Loading data into a SQL pool using the COPY INTO statement
Implementing workload management in an Azure Synapse SQL pool
Optimizing queries using materialized views in Azure Synapse Analytics

4.Control Flow Activities in Azure Data Factory

Implementing control flow activities
Implementing control flow activities – Lookup and If activities
Triggering a pipeline in Azure Data Factory

5.Control Flow Transformation and the Copy Data Activity in Azure Data Factory

Implementing HDInsight Hive and Pig activities
Implementing an Azure Functions activity
Implementing a Data Lake Analytics U-SQL activity
Copying data from Azure Data Lake Gen2 to an Azure Synapse SQL pool using the copy activity
Copying data from Azure Data Lake Gen2 to Azure Cosmos DB using the copy activity

6.Data Flows in Azure Data Factory

Implementing incremental data loading with a mapping data flow
Implementing a wrangling data flow

7.Azure Data Factory Integration Runtime

Configuring a self-hosted IR
Configuring a shared self-hosted IR
Migrating an SSIS package to Azure Data Factory
Executing an SSIS package with an on-premises data store

8.Deploying Azure Data Factory Pipelines

Configuring the development, test, and production environments
Deploying Azure Data Factory pipelines using the Azure portal and ARM templates
Automating Azure Data Factory pipeline deployment using Azure DevOps

9.Batch and Streaming Data Processing with Azure Databricks

Configuring the Azure Databricks environment
Transforming data using Python
Transforming data using Scala
Working with Delta Lake

10. Processing structured streaming data with Azure Databricks

GCP Services:-

GCP data engineer -> (Dataproc, Pub/Sub, Apache Beam, Cloud Composer, GCP SQL data storage, BigQuery, and NoSQL databases)

GCP Big Data engineer

======================================================

1.Getting Started with Data Engineering with GCP

2.Big Data Capabilities on GCP

Understanding what the cloud is
Getting started with Google Cloud Platform
A quick overview of GCP services for data engineering

3.Building Solutions with GCP Components

4. Building a Data Warehouse in BigQuery

Introduction to Google Cloud Storage and BigQuery
Introduction to the BigQuery console
Preparing the prerequisites before developing our data warehouse
Practicing developing a data warehouse

5.Building Orchestration for Batch Data Loading Using Cloud Composer

Introduction to Cloud Composer
Understanding the working of Airflow
Exercise: Build data pipeline orchestration using Cloud Composer

6.Building a Data Lake Using Dataproc

Introduction to Dataproc
Exercise – Building a data lake on a Dataproc cluster
Exercise: Creating and running jobs on a Dataproc cluster
Understanding the concept of the ephemeral cluster
Building an ephemeral cluster using Dataproc and Cloud Composer

7.Processing Streaming Data with Pub/Sub and Dataflow

Processing streaming data
Exercise – Publishing event streams to Cloud Pub/Sub
Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS

8.Visualizing Data for Making Data-Driven Decisions with Data Studio

Unlocking the power of your data with Data Studio
From data to metrics in minutes with an illustrative use case
Understanding how Data Studio can impact the cost of BigQuery
How to create materialized views and understanding how BI Engine works

9.Key Strategies for Architecting Top-Notch Data Pipelines

10.User and Project Management in GCP

Understanding IAM in GCP
Planning a GCP project structure
Controlling user access to our data warehouse
Practicing the concept of IaC using Terraform

11.Cost Strategy in GCP

Estimating the cost of your end-to-end data solution in GCP
Tips for optimizing BigQuery using partitioned and clustered tables

12.CI/CD on Google Cloud Platform for Data Engineers

Introduction to CI/CD
Understanding CI/CD components with GCP services
Exercise – implementing continuous integration using Cloud Build
Exercise – deploying Cloud Composer jobs using Cloud Build

Module 4 - Snowflake, Airflow, NiFi

Snowflake

=================================================

1. Getting Started with Snowflake

Creating a new Snowflake instance
Creating a tailored multi-cluster virtual warehouse
Using the Snowflake WebUI and executing a query
Using SnowSQL to connect to Snowflake
Connecting to Snowflake with JDBC
Creating a new account admin user and understanding built-in roles

2.Managing the Data Life Cycle

Managing a database
Managing a schema
Managing tables
Managing external tables and stages
Managing views in Snowflake

3.Loading and Extracting Data into and out of Snowflake

Configuring Snowflake access to private S3 buckets
Loading delimited bulk data into Snowflake from cloud storage
Loading delimited bulk data into Snowflake from your local machine
Loading Parquet files into Snowflake
Making sense of JSON semi-structured data and transforming to a relational view
Processing newline-delimited JSON (or NDJSON) into a Snowflake table
Processing near real-time data into a Snowflake table using Snowpipe
Extracting data from Snowflake

4.Building Data Pipelines in Snowflake

Creating and scheduling a task
Conjugating pipelines through a task tree
Querying and viewing the task history
Exploring the concept of streams to capture table-level changes
Streams and Tasks to build pipelines that process changed data on a schedule
Converting data types and Snowflake's failure management
Managing context using different utility functions

5.Data Protection and Security in Snowflake

Setting up custom roles and completing the role hierarchy
Configuring and assigning a default role to a user
Delineating user management from security and role management
Configuring custom roles for managing access to highly secure data
Setting up development, testing, pre-production, and production database hierarchies and roles
Safeguarding the ACCOUNTADMIN role and users in the ACCOUNTADMIN role

6.Performance and Cost Optimization

Examining table schemas and deriving an optimal structure for a table
Identifying query plans and bottlenecks
Weeding out inefficient queries through analysis
Identifying and reducing unnecessary Fail-safe and Time Travel storage usage
Projections in Snowflake for performance
Reviewing query plans to modify table clustering
Optimizing virtual warehouse scale

7.Secure Data Sharing

Sharing a table with another Snowflake account
Sharing data through a view with another Snowflake account
Sharing a complete database with another Snowflake account and setting up future objects to be shareable
Creating reader accounts and configuring them for non-Snowflake sharing
Keeping costs in check when sharing data with non-Snowflake users

8.Back to the Future with Time Travel

Using Time Travel to return to the state of data at a particular time
Using Time Travel to recover from the accidental loss of table data
Identifying dropped databases, tables, and other objects and restoring them using Time Travel
Using Time Travel in conjunction with cloning to improve debugging
Using cloning to set up new environments based on the production environment rapidly

9.Advanced SQL Techniques

Managing timestamp data
Shredding date data to extract Calendar information
Unique counts and Snowflake
Managing transactions in Snowflake
Ordered analytics over window frames
Generating sequences in Snowflake

10.Extending Snowflake Capabilities

Creating a Scalar user-defined function using SQL
Creating a Table user-defined function using SQL
Creating a Scalar user-defined function using JavaScript
Creating a Table user-defined function using JavaScript
Connecting Snowflake with Apache Spark
Using Apache Spark to prepare data for storage on Snowflake

NiFi & Airflow

===================================================

1.Building Data Pipelines – Extract, Transform, and Load

2.Building Our Data Engineering Infrastructure

Installing and configuring Apache NiFi
Installing and configuring Apache Airflow
Installing and configuring Elasticsearch
Installing and configuring Kibana
Installing and configuring PostgreSQL
Installing pgAdmin 4

3.Reading and Writing Files

Writing and reading files in Python
Building data pipelines in Apache Airflow
Handling files using NiFi processors

4.Working with Databases

Inserting and extracting relational data in Python
Inserting and extracting NoSQL database data in Python
Building data pipelines in Apache Airflow
Handling databases with NiFi processors

5.Cleaning, Transforming, and Enriching Data

Performing exploratory data analysis in Python
Handling common data issues using pandas
Cleaning data using Airflow

6.Deploying Data Pipelines in Production

7.Features of a Production Pipeline

Staging and validating data
Building idempotent data pipelines
Building atomic data pipelines

8.Version Control with the NiFi Registry

Installing and configuring the NiFi Registry
Using the Registry in NiFi
Versioning your data pipelines
Using git-persistence with the NiFi Registry

9.Monitoring Data Pipelines

Monitoring NiFi using the GUI
Monitoring NiFi with processors
Using Python with the NiFi REST API

10.Deploying Data Pipelines

Finalizing your data pipelines for production
Using the NiFi variable registry
Deploying your data pipelines

11.Building a Production Data Pipeline

Creating a test and production environment
Building a production data pipeline
Deploying a data pipeline in production

12.Beyond Batch – Building Real-Time Data Pipelines

13.Streaming Data with Apache Kafka

Understanding logs
Understanding how Kafka uses logs
Building data pipelines with Kafka and NiFi
Differentiating stream processing from batch processing
Producing and consuming with Python

14.Data Processing with Apache Spark

Installing and running Spark
Installing and configuring PySpark
Processing data with PySpark

Module 5 - DevOps

DevOps -> (Git, Jenkins, Docker, and Kubernetes) (Spark, Kafka, Airflow, and Hadoop pipelines using Docker and Kubernetes, Helm charts, Terraform)

Git

Git Features
3-Tree Architecture
Git – Clone / Commit / Push
Git Revert and Reset
Git Branching Strategies
Git Rebase & Merge
Git Stash, Reset, Checkout
Git Clone, Fetch, Pull

Jenkins

Introduction to Jenkins
Continuous Integration with Jenkins
Configure Jenkins
Jenkins Management
Scheduling build Jobs
Poll SCM
Build Periodically
Maven Build Scripts
Support for the Git Version Control System
Different Types of Jenkins Jobs
Jenkins Build Pipeline
Parent and Child Builds
Sequential Builds
Jenkins Master & Slave Node Configuration

Docker

How to Get a Docker Image
What is a Docker Image?
Docker Installation
Working with Docker Containers
What is a Container?
Docker Engine
Creating Containers with an Image
Working with Images
Docker Command Line Interface
Docker Compose
Docker Hub
Docker Trusted Registry
Docker Swarm
Docker Attach
Dockerfile & Commands
Docker containers for a Kafka, Spark, and Cassandra ETL pipeline

Kubernetes:

Kubernetes Introduction
Kubernetes Architecture
Kubernetes Setup (Self-Managed, AWS-Managed)
Kubernetes Pods
Kubernetes Services
Kubernetes Namespaces
Replication Controller & ReplicaSet
Kubernetes Deployments
Kubernetes ConfigMap
Kubernetes Secrets
Helm Charts
EKS Cluster
Monitoring
Project setup using Helm charts for a big data pipeline (Kafka, Spark, Cassandra, Airflow, NiFi)

What do we offer

Live learning

Learn live with top educators, chat with teachers and other attendees, and get your doubts cleared.

Structured learning

Our curriculum is designed by experts to make sure you get the best learning experience.

Community & Networking

Interact and network with like-minded folks from various backgrounds in exclusive chat groups.

Learn with the best

Stuck on something? Discuss it with your peers and the instructors in the inbuilt chat groups.

Practice tests

Practice what you learned with quizzes and live tests, and track your class performance.

Get certified

Flaunt your skills with course certificates. You can showcase the certificates on LinkedIn with a click.
