46. Redshift

i) A way of doing BI or data warehousing in the cloud.
ii) Fast & powerful, fully managed, petabyte-scale data warehouse service in the cloud.
iii) Customers can start small for $0.25 (25 cents) per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, which is less than one tenth the cost of most other data warehousing solutions.
iv) Data warehousing databases use a different type of architecture, both from a DB perspective & at the infrastructure layer.
v) Amazon's data warehouse solution is called Redshift.
vi) Redshift can be configured as follows (see the sketch after this list):
a) Single node: 160 GB
b) Multi node
bi) Leader node: Manages client connections & receives queries.
bii) Compute nodes: Store data & perform queries & computations. A cluster can have up to 128 compute nodes.
vii) Advanced compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. Also, Redshift does not require indexes or materialized views, so it uses less space than traditional relational DB systems. When loading data into an empty table, Redshift automatically samples the data & selects the most appropriate compression scheme.
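A minimal sketch of creating a small multi-node cluster with boto3 (not the only way to do this); the identifier, node type, credentials and node count below are illustrative placeholders:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Create a 3-compute-node cluster; the leader node is provisioned automatically.
response = redshift.create_cluster(
    ClusterIdentifier="example-dw-cluster",    # hypothetical name
    ClusterType="multi-node",                  # use "single-node" for the 160 GB option
    NodeType="dc2.large",
    NumberOfNodes=3,
    DBName="analytics",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!",         # placeholder; store real credentials securely
)
print(response["Cluster"]["ClusterStatus"])    # e.g. "creating"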

Massively Parallel Processing (MPP):
Redshift automatically distributes data and query load across all nodes. Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as the data warehouse grows.
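As an illustration of how that distribution is controlled, here is a minimal sketch using the Redshift Data API from Python; the cluster, database, user and table names are hypothetical:

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Declare a distribution key so rows are spread across compute nodes by player_id.
ddl = """
CREATE TABLE player_events (
    player_id  BIGINT,
    event_time TIMESTAMP,
    event_type VARCHAR(32)
)
DISTSTYLE KEY
DISTKEY (player_id)
SORTKEY (event_time);
"""

rsd.execute_statement(
    ClusterIdentifier="example-dw-cluster",   # hypothetical
    Database="analytics",
    DbUser="awsuser",
    Sql=ddl,
)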

Backups:
i) Enabled by default with a 1-day retention period.
ii) Max. retention period is 35 days
iii) Redshift always attempts to maintain at least three copies of the data (the original and a replica on the compute nodes, and a backup in S3).
iv) Redshift can also asynchronously replicate snapshots to S3 in another region for DR (see the sketch after this list).
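A minimal sketch of adjusting these backup settings with boto3, assuming a hypothetical cluster identifier and target region:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Raise the automated snapshot retention from the 1-day default (maximum is 35 days).
redshift.modify_cluster(
    ClusterIdentifier="example-dw-cluster",
    AutomatedSnapshotRetentionPeriod=7,
)

# Asynchronously copy snapshots to another region for disaster recovery.
redshift.enable_snapshot_copy(
    ClusterIdentifier="example-dw-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,
)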

Redshift is priced as follows:
i) Compute node hours: The total number of hours you run across all your compute nodes for the billing period. You are billed for 1 unit per node per hour, so a 3-node data warehouse cluster running persistently for an entire month would incur 2,160 instance hours (see the arithmetic sketch after this list). You will not be charged for leader node hours; only compute nodes incur charges.
ii) Backup
iii) Data transfer (only within VPC, not outside it)
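A quick check of the compute node hours arithmetic above (a 30-day month is assumed):

# 3 compute nodes running 24 hours a day for a 30-day month
nodes = 3
hours = 24 * 30
instance_hours = nodes * hours
print(instance_hours)   # 2160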

Security considerations:
i) Encrypted in transit using SSL
ii) Encrypted at rest using AES-256 encryption
iii) By default Redshift takes care of key management, but you can manage your own keys through either of the following (see the sketch after this list):
a) A hardware security module (HSM)
b) AWS KMS
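A minimal sketch of requesting encryption at rest at cluster creation time; the KMS key ARN and other identifiers are hypothetical, and setting Encrypted=True without a KmsKeyId falls back to an AWS-managed key:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="example-encrypted-dw",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=3,
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!",   # placeholder
    Encrypted=True,                      # encrypt data at rest
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # hypothetical customer-managed key
)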

Availability:
i) Currently available only in one AZ
ii) Can restore snapshots to new AZs in the event of an outage (see the sketch below).
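A minimal sketch of restoring a snapshot into a different Availability Zone with boto3; the identifiers and AZ below are assumptions:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Restore a new cluster from an existing snapshot, placing it in another AZ.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="example-dw-restored",
    SnapshotIdentifier="example-dw-snapshot-2024-01-01",   # hypothetical snapshot
    AvailabilityZone="us-east-1b",
)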

Question 1:
Your company has developed an IoT application that sends telemetry data from
100,000 sensors. The sensors send a datapoint of 1 KB at one-minute intervals to
a DynamoDB collector for monitoring purposes. What AWS stack would enable
you to store data for real-time processing and analytics using BI tools?
A. Sensors -> Kinesis Stream -> Firehose -> DynamoDB
B. Sensors -> Kinesis Stream -> Firehose -> DynamoDB -> S3
C. Sensors -> AWS IoT -> Firehose -> RedShift
D. Sensors -> Kinesis Data Streams -> Firehose -> RDS
Answer (C)
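To make the last hop of answer C concrete, here is a minimal, hypothetical sketch of a Kinesis Data Firehose delivery stream that loads the telemetry into Redshift (Firehose stages records in S3 and then issues a COPY); all names, ARNs and credentials are illustrative:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="sensor-telemetry-to-redshift",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",  # hypothetical
        "ClusterJDBCURL": "jdbc:redshift://example-dw-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/analytics",
        "CopyCommand": {
            "DataTableName": "sensor_readings",
            "CopyOptions": "FORMAT AS JSON 'auto'",
        },
        "Username": "awsuser",
        "Password": "ChangeMe123!",  # placeholder; use a secrets store in practice
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::example-firehose-staging-bucket",
        },
    },
)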

Question 2:
Your company is a provider of online gaming that customers access with various network access devices, including mobile phones. What is a data warehousing solution for large amounts of information on player behavior, statistics, and events for analysis using SQL tools?
A. RedShift
B. DynamoDB
C. RDS
D. DynamoDB
E. Elasticsearch
Answer (A)

Question 3:
What two statements are correct when comparing Elasticsearch
and RedShift as analytical tools?
A. Elasticsearch is a text search engine and document indexing tool
B. RedShift supports complex SQL-based queries with petabyte-sized data stores
C. Elasticsearch supports SQL queries
D. RedShift provides only basic analytical services
E. Elasticsearch does not support JSON data type
Answer (A,B)
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

Question 4:
As a solutions architect, you are building a business analysis system. This system requires a highly available relational database with an initial storage capacity of 8 TB. You predict that the amount of data will increase by 10 GB daily. In addition, parallel processing is required for data processing in order to handle the expected traffic volume.
Choose the best service that meets this requirement.
Options:
A. Dynamo DB
B. RDS
C. Aurora
D. Redshift
Answer: D
Explanation
Option D is the correct answer. Redshift is the best database for business analysis systems. Redshift is capable of big data storage and parallel query processing, meeting the requirements. Redshift is a petabyte-scale, relational data warehouse service that is fully managed in the cloud. Redshift distributes table rows to compute nodes so you can process data in parallel. By choosing the appropriate distribution key for each table, you can optimize the distribution of data, distribute the workload, and minimize the movement of data between nodes.
Option A is incorrect. DynamoDB is a NoSQL database and does not meet the requirement of a highly available relational database.
Option B is incorrect. RDS is a relational database, but it is not well suited to a business analysis system. RDS can parallelize the read process by configuring read replicas; however, the data analysis itself cannot be processed in parallel.
Option C is incorrect. Aurora MySQL can parallelize part of data-intensive query processing and computational processing. However, considering the requirements of a business analysis system, Redshift is more suitable than Aurora, so it takes priority in this scenario.

Question 5:
Your company is trying to build a BI (business intelligence) system using AWS Redshift. As the Solutions Architect, you are required to use Redshift clusters in a cost-effective manner.
Which of the options will help to meet this requirement?
Options:
A. Removing unnecessary snapshot settings
B. Not using enhanced VPC routing
C. Using Spot instances in your cluster
D. Removing unnecessary CloudWatch metric settings.
Answer: A
Explanation
Redshift offers a free allowance of snapshot storage, and you are charged once your snapshot storage exceeds that free limit. To avoid this, rely on the automated snapshots (which expire according to the retention period) and delete manual snapshots that you no longer need; you can also reduce the retention period. Therefore, option A is the correct answer (a small sketch of these steps follows this explanation).
Option B is incorrect. With enhanced VPC routing, Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories to go through your Amazon VPC. The presence or absence of this setting does not affect the cost.
Option C is incorrect. If you used Spot Instances instead of On-Demand Instances, processing could be interrupted partway through a run, so this is an inappropriate setting.
Reserved instances (reserved nodes) are available on Amazon Redshift; knowing this, you can avoid the Spot interruption risk and save up to 75% (vs. on-demand pricing) by signing a one-year or three-year contract.
Option D is incorrect. The CloudWatch metrics that Redshift publishes are free to use; only custom metrics are charged. Deleting unnecessary CloudWatch metric settings is therefore not appropriate, because it would not remove any charges.
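A minimal sketch of the cost-control steps from the explanation of option A, with hypothetical identifiers:

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Reduce the automated snapshot retention period.
redshift.modify_cluster(
    ClusterIdentifier="bi-dw-cluster",
    AutomatedSnapshotRetentionPeriod=1,
)

# Delete a manual snapshot that is no longer needed.
redshift.delete_cluster_snapshot(
    SnapshotIdentifier="bi-dw-manual-snapshot-old",
)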

Question 6:
An IT company has built a solution wherein a Redshift cluster writes data to an Amazon S3 bucket belonging to a different AWS account. However, it is found that the files created in the S3 bucket using the UNLOAD command from the Redshift cluster are not even accessible to the S3 bucket owner.
What could be the reason for this denial of permission for the bucket owner?
A. When objects are uploaded to S3 bucket from a different AWS account, the S3 bucket owner will get implicit permissions to access these objects. This issue seems to be due to an upload error that can be fixed by providing manual access from AWS console
B. The owner of an S3 bucket has implicit access to all objects in his bucket. Permissions are set on objects after they are completely copied to the target location. Since the owner is unable to access the uploaded files, the write operation may be still in progress
C. When two different AWS accounts are accessing an S3 bucket, both the accounts must share the bucket policies. An erroneous policy can lead to such permission failures
D. By default, an S3 object is owned by the AWS account that uploaded it. So the S3 bucket owner will not implicitly have access to the objects written by the Redshift cluster
Answer: D
Explanation
Correct option:
By default, an S3 object is owned by the AWS account that uploaded it. So the S3 bucket owner will not implicitly have access to the objects written by the Redshift cluster – By default, an S3 object is owned by the AWS account that uploaded it. This is true even when the bucket is owned by another account. Because the Amazon Redshift data files from the UNLOAD command were put into your bucket by another account, you (the bucket owner) don’t have default permission to access those files.
To get access to the data files, an AWS Identity and Access Management (IAM) role with cross-account permissions must run the UNLOAD command again. Follow these steps to set up the Amazon Redshift cluster with cross-account permissions to the bucket:
1. From the account of the S3 bucket, create an IAM role (Bucket Role) with permissions to the bucket.
2. From the account of the Amazon Redshift cluster, create another IAM role (Cluster Role) with permissions to assume the Bucket Role.
3. Update the Bucket Role to grant bucket access and create a trust relationship with the Cluster Role.
4. From the Amazon Redshift cluster, run the UNLOAD command using the Cluster Role and Bucket Role, chaining the two role ARNs (see the sketch below).
This solution doesn’t apply to Amazon Redshift clusters or S3 buckets that use server-side encryption with AWS Key Management Service (AWS KMS).
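A minimal sketch of step 4: running UNLOAD with the two roles chained in the IAM_ROLE parameter, issued here via the Redshift Data API. The account IDs, role names, bucket and table are hypothetical.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Chain the Cluster Role and Bucket Role (comma-separated ARNs) so Redshift can
# write to the bucket owned by the other account.
unload_sql = """
UNLOAD ('SELECT * FROM player_events')
TO 's3://partner-account-bucket/exports/player_events_'
IAM_ROLE 'arn:aws:iam::111111111111:role/ClusterRole,arn:aws:iam::222222222222:role/BucketRole'
FORMAT AS PARQUET;
"""

rsd.execute_statement(
    ClusterIdentifier="example-dw-cluster",   # hypothetical
    Database="analytics",
    DbUser="awsuser",
    Sql=unload_sql,
)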
Incorrect options:
When objects are uploaded to S3 bucket from a different AWS account, the S3 bucket owner will get implicit permissions to access these objects. This issue seems to be due to an upload error that can be fixed by providing manual access from AWS console – By default, an S3 object is owned by the AWS account that uploaded it. So, the bucket owner will not have any default permissions on the objects. Therefore, this option is incorrect.
The owner of an S3 bucket has implicit access to all objects in his bucket. Permissions are set on objects after they are completely copied to the target location. Since the owner is unable to access the uploaded files, the write operation may be still in progress – This is an incorrect statement, given only as a distractor.
When two different AWS accounts are accessing an S3 bucket, both the accounts must share the bucket policies. An erroneous policy can lead to such permission failures – This is an incorrect statement, given only as a distractor.