What is Amazon Redshift?

Nadtakan Futhoem
3 min read · Jul 1, 2021


Amazon Redshift — A fast, fully managed, petabyte-scale data warehouse

What is a data warehouse?

A data warehouse consolidates data from multiple sources so that you can run business intelligence tools across all of it. This helps you identify actionable business information, which can then be used to direct and drive effective, data-driven decisions for your organization.

As a result, using a data warehouse is a very effective way to manage your reporting and data analysis at scale.

A data warehouse by its very nature needs to store huge amounts of data, and that data is typically prepared through an ETL (Extract, Transform, Load) process, along with operations such as data cleansing.

Extraction — the process of retrieving data from one or more sources: online systems, brick-and-mortar systems, legacy data, Salesforce data, and more. Once the data is retrieved, the ETL job loads it into a staging area and prepares it for the next phase.

Transformation — the process of mapping, reformatting, conforming, and adding meaning to the data so that it can be more easily consumed.

Loading — the process of inserting the transformed data into the target data store, in this case a data warehouse. All of this work runs in what business intelligence developers call an ETL job.
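As a rough illustration, the three phases can be sketched in Python. The source records, field names, and in-memory "warehouse" below are hypothetical stand-ins for real sources and targets:

```python
# A minimal ETL sketch. Records, field names, and the in-memory
# "warehouse" are illustrative, not a real pipeline.

def extract(sources):
    """Extraction: pull raw records from one or more sources into a staging list."""
    staging = []
    for source in sources:
        staging.extend(source)          # in practice: API calls, file reads, etc.
    return staging

def transform(staging):
    """Transformation: map and conform the raw records into a consistent shape."""
    return [
        {"customer": r["name"].strip().title(),   # reformat the name
         "amount_usd": round(r["amount"], 2)}     # conform precision
        for r in staging
    ]

def load(rows, warehouse):
    """Loading: insert the transformed rows into the target store."""
    warehouse.extend(rows)              # in practice: a bulk COPY into the warehouse
    return len(rows)

# One "ETL job", end to end
crm = [{"name": " alice ", "amount": 19.991}]
legacy = [{"name": "BOB", "amount": 5.0}]
warehouse = []
loaded = load(transform(extract([crm, legacy])), warehouse)
```

In a real job, each phase would talk to external systems; the point here is only the staged extract → transform → load flow.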

Components

Amazon Redshift cluster — a set of nodes consisting of a leader node and one or more compute nodes. The type and number of compute nodes you need depend on the size of your data, the number of queries you will execute, and the query execution performance you need.

Leader node — The leader node manages communications with client programs and all communication with compute nodes. It parses queries and develops execution plans to carry out database operations, in particular the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

Compute node — The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.

Each compute node has its own dedicated CPU, memory, and attached disk storage, which is determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Node slice — A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node distributes data to the slices and apportions the workload for any queries or other database operations to them. The slices then work in parallel to complete the operation.
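To make the leader/slice division of labor concrete, here is a toy sketch in Python. The hashing scheme, slice count, and data are illustrative only, not Redshift's actual distribution logic:

```python
# Toy model of the leader node and slices: the leader distributes rows
# across slices, each slice computes a partial result over its portion,
# and the leader aggregates the intermediate results.

N_SLICES = 4  # illustrative; real slice counts depend on the node type

def distribute(rows, key):
    """Leader: apportion rows to slices, here by hashing a distribution key."""
    slices = [[] for _ in range(N_SLICES)]
    for row in rows:
        slices[hash(row[key]) % N_SLICES].append(row)
    return slices

def slice_sum(rows):
    """One slice: compute a partial aggregate over its portion of the data."""
    return sum(r["amount"] for r in rows)

def leader_query(rows):
    """Leader: fan out the work, collect intermediate results, finish the aggregation."""
    partials = [slice_sum(s) for s in distribute(rows, "region")]
    return sum(partials)

orders = [{"region": "us", "amount": 10},
          {"region": "eu", "amount": 20},
          {"region": "us", "amount": 5}]
total = leader_query(orders)   # same answer as a serial sum over all rows
```

The partial-aggregate-then-merge shape is what lets the slices run in parallel while still producing the same result as a single sequential scan.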

Performance Features:

  • Massively Parallel Processing (MPP)
  • Columnar data storage
  • Result caching
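Columnar storage is easy to see in miniature: an analytic query only has to read the columns it touches, not every field of every row. The tiny table below is illustrative:

```python
# Row layout vs. columnar layout, in miniature.

rows = [
    {"id": 1, "region": "us", "amount": 10},
    {"id": 2, "region": "eu", "amount": 20},
    {"id": 3, "region": "us", "amount": 5},
]

# Columnar layout: one contiguous array per column.
columnar = {col: [r[col] for r in rows] for col in rows[0]}

# A query like SELECT SUM(amount) scans only the "amount" column —
# 3 values here, instead of all 9 fields in the row layout.
total = sum(columnar["amount"])
```

At petabyte scale this difference in bytes scanned, combined with per-column compression, is a large part of why columnar stores are fast for analytics.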

Amazon Redshift integrates with Amazon CloudWatch, allowing you to monitor the performance of your cluster's physical resources through metrics such as CPU utilization and throughput.

In addition, Redshift generates query and load performance data that enables you to track overall database performance. This query and load performance data is accessible only from within the Redshift console itself, not from Amazon CloudWatch.
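The physical-resource metrics live under CloudWatch's AWS/Redshift namespace. As a sketch, the parameters for fetching a cluster's CPU utilization via boto3's `get_metric_statistics` might look like this (the cluster identifier is hypothetical, and the final boto3 call is shown only as a comment):

```python
from datetime import datetime, timedelta, timezone

def cpu_utilization_request(cluster_id, minutes=60):
    """Build the parameter dict for CloudWatch get_metric_statistics.
    In practice you would pass these to boto3.client("cloudwatch")."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Redshift",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 300,            # 5-minute datapoints
        "Statistics": ["Average"],
    }

params = cpu_utilization_request("my-redshift-cluster")  # hypothetical cluster name
# cw = boto3.client("cloudwatch")
# datapoints = cw.get_metric_statistics(**params)["Datapoints"]
```

Query and load performance data, as noted above, would not come from this API; it is viewed in the Redshift console.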

Nadtakan Futhoem — Sr. Software Engineer
