Amazon Redshift System Overview
Amazon Redshift is an enterprise-level, petabyte scale, fully managed data warehousing service.
An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.
Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.
When you execute analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.
Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.
Data Warehouse System Architecture
This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in the following figure.
Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL.
Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers. For more information, see Amazon Redshift and PostgreSQL JDBC and ODBC.
The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL Functions Supported on the Leader Node.
The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached disk storage, which is determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
Amazon Redshift provides two node types; dense storage nodes and dense compute nodes. Each node provides two storage choices. You can start with a single 160 GB node and scale up to multiple 16 TB nodes to support a petabyte of data or more.
For a more detailed explanation of data warehouse clusters and nodes, see Internal Architecture and System Operation.
A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages to distribute data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.
The number of slices per node is determined by the node size of the cluster. For more information about the number of slices for each node size, go to About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.
When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently. For information about choosing a distribution key, see Choose the Best Distribution Style.
Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.
Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL.
Internal Architecture and System Operation
The following diagram shows a high-level view of internal components and functionality of the Amazon Redshift data warehouse.
Amazon Redshift achieves extremely fast query execution by employing these performance features.
Massively Parallel Processing
Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node. For more information, see Choose the Best Distribution Style.
Loading data from flat files takes advantage of parallel processing by spreading the workload across multiple nodes while simultaneously reading from multiple files. For more information about how to load data into tables, see Best Practices for Loading Data.
Columnar Data Storage
Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disks I/O requests and reduces the amount of data you need to load from disk. Loading fewer data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. See Columnar Storage for a more detailed explanation.
When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. For more information, see Choose the Best Sort Key.
Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory, then uncompressed during query execution. Loading fewer data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data. To learn more about using automatic data compression, see Loading Tables with Automatic Compression.
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see Tuning Query Performance.
To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query. Result caching is transparent to the user.
Result caching is enabled by default. To disable result caching for the current session, set the enable_result_cache_for_session parameter to
Amazon Redshift uses cached results for a new query when all of the following are true:
- The user submitting the query has access privilege to the objects used in the query.
- The table or views in the query haven’t been modified.
- The query doesn’t use a function that must be evaluated each time it’s run, such as GETDATE.
- Configuration parameters that might affect query results are unchanged.
- The query syntactically matches the cached query.
To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn’t cache some large query result sets. Amazon Redshift determines whether to cache query results based on a number of factors. These factors include the number of entries in the cache and the instance type of your Amazon Redshift cluster.
To determine whether a query used the result cache, query the SVL_QLOG system view. If a query using the result cache, the source_query column returns the query ID of the source query. If result caching wasn’t used, the source_query column value is NULL.
The following example shows that queries submitted by user id 104 and user id 102 use the result cache from queries run by user id 100.
select userid, query, elapsed, source_query from svl_qlog
where userid > 1
order by query desc;
userid | query | elapsed | source_query
104 | 629035 | 27 | 628919
104 | 629034 | 60 | 628900
104 | 629033 | 23 | 628891
102 | 629017 | 1229393 |
102 | 628942 | 28 | 628919
102 | 628941 | 57 | 628900
102 | 628940 | 26 | 628891
100 | 628919 | 84295686 |
100 | 628900 | 87015637 |
100 | 628891 | 58808694 |
For details about the queries used to create the results shown in the previous example, see Step 2: Test System Performance to Establish a Baseline in the Tuning Table Design tutorial.
The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.
The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql (libq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code.
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.
When you run a query, WLM assigns the query to a queue according to the user’s user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.
By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console. You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.
Using Amazon Redshift with Other Services
Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your data quickly and reliably, using data security features.
Moving Data Between Amazon Redshift and Amazon S3
Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. For more information, see Loading Data from Amazon S3.
You can also use parallel processing to export data from your Amazon Redshift data warehouse to multiple data files on Amazon S3. For more information, see Unloading Data.
Using Amazon Redshift with Amazon DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see Loading Data from an Amazon DynamoDB Table.
Importing Data from Remote Hosts over SSH
You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. The COPY command reads and loads the output from multiple host sources in parallel. For more information, see Loading Data from Remote Hosts.
Automating Data Loads Using AWS Data Pipeline
You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and execute recurring jobs without having to write your own complex data transfer or transformation logic. For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift, see Copy Data to Amazon Redshift Using AWS Data Pipeline in the AWS Data Pipeline Developer Guide.
Migrating Data Using AWS Database Migration Service (AWS DMS)
You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can migrate your data to and from most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL. For more information, see Using an Amazon Redshift Database as a Target for AWS Database Migration Service.
Source: Welcome – Amazon Redshift