51. EMR

EMR = Elastic MapReduce
Amazon EMR is a managed cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. AWS claims that EMR can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over three times faster than standard Apache Spark.

The central component of EMR is the cluster. A cluster is a collection of EC2 instances. Each instance in the cluster is called a node. Each node has a role within the cluster, referred to as the node type.

Amazon EMR uses Hadoop, an open-source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances.
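As a rough illustration of the map/reduce model that Hadoop distributes across nodes, here is a single-process word count. It is only a sketch of the idea, not Hadoop or EMR API code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input, as a mapper would."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    pairs = []
    # On a real cluster, each chunk would be processed on a different node.
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["spark hive spark", "hive presto"]))
```

On a cluster, the map and reduce steps run in parallel on many nodes, and the framework handles the shuffle between them.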

EMR also installs different software components on each node type, giving each node a role in a distributed application such as Apache Hadoop.

EMR Node types:
i) Master node: A node that manages the cluster. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node.
ii) Core node: A node with software components that runs tasks and stores data in HDFS (Hadoop Distributed File System) on the cluster. Multi-node clusters have at least one core node.
iii) Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
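The three node types map onto the InstanceGroups structure that the boto3 EMR client accepts in run_job_flow, where each group carries an InstanceRole of MASTER, CORE, or TASK. The sketch below only builds that list. The helper name build_instance_groups, the default counts, and the m5.xlarge instance type are assumptions for illustration.

```python
def build_instance_groups(core_count=2, task_count=0, instance_type="m5.xlarge"):
    """Return an InstanceGroups list with one group per EMR node type in use.

    The instance type and default counts are illustrative assumptions.
    """
    if core_count < 0 or task_count < 0:
        raise ValueError("node counts must not be negative")
    if task_count > 0 and core_count == 0:
        # Per the notes above: multi-node clusters have at least one core node.
        raise ValueError("a cluster with task nodes needs at least one core node")

    # Every cluster has exactly one master node group.
    groups = [{
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if core_count > 0:
        # Core nodes run tasks and store data in HDFS.
        groups.append({
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        })
    if task_count > 0:
        # Task nodes run tasks only and are optional.
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=1):
        print(group["InstanceRole"], group["InstanceCount"])
```

The resulting list would be passed as Instances={"InstanceGroups": ...} when creating a cluster.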

We can configure a cluster to periodically archive the log files stored on the master node to S3. This ensures the log files are available after the cluster terminates, whether this is through normal shutdown or due to an error. EMR archives the log files to S3 at five-minute intervals.
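With boto3, log archiving is enabled by passing LogUri to run_job_flow when the cluster is created. The following is a minimal sketch, not a production setup. The bucket name, the emr-logs/ prefix, the release label, the instance types, and the helper names are placeholders or assumptions, and the default EMR roles are assumed to exist in the account.

```python
def build_cluster_request(name, log_bucket):
    """Build run_job_flow arguments with S3 log archiving enabled.

    All concrete values below are illustrative assumptions.
    """
    return {
        "Name": name,
        # EMR archives the master node's log files under this S3 location.
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,  # 1 master + 2 core nodes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Create the cluster; needs AWS credentials and a configured region."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request("demo-cluster", "my-example-bucket")
    print(request["LogUri"])
```

Calling launch_cluster(request) would start a real, billable cluster, so the example only prints the log location.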

AWS Glue – A fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. An AWS Glue job is intended for batch ETL data processing.

Recap:
i. EMR is used for big data processing.
ii. A cluster consists of a master node, at least one core node (in multi-node clusters), and optionally task nodes.
iii. By default, log data is stored on the master node.
iv. We can configure EMR to archive all log data from the master node to S3 at five-minute intervals. However, this can only be configured when the cluster is created.

Question 1:
Which AWS platform is designed for complex analytics on a variety of
large data sets based on custom code, with applications that include
machine learning and data transformation?
A. EC2
B. Beanstalk
C. Redshift
D. EMR
Answer (D)