SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that leverages Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data. Use SQL Data Warehouse as a key component of a big data solution. Import big data into SQL Data Warehouse with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance analytics. As you integrate and analyze, the data warehouse will become the single version of truth your business can count on for insights.
Key component of big data solution
SQL Data Warehouse is a key component of an end-to-end big data solution in the Cloud.
In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store, Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex analysis, SQL Data Warehouse uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to bring the data into SQL Data Warehouse.
SQL Data Warehouse stores data into relational tables with columnar storage. This format significantly reduces the data storage costs and improves query performance. Once data is stored in SQL Data Warehouse, you can run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds instead of minutes or hours instead of days.
The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain insights to make well-informed business decisions.
SQL Data Warehouse offers performance tiers designed for flexibility to meet your data needs, whether big or small. You can choose a data warehouse that is optimized for elasticity or for compute.
- The Optimized for Elasticity performance tier separates the compute and storage layers in the architecture. This option excels on workloads that can take full advantage of the separation between computing and storage by scaling frequently to support short periods of peak activity. This compute tier has the lowest entry price point and scales to support the majority of customer workloads.
- The Optimized for Compute performance tier uses the latest Azure hardware to introduce a new NVMe Solid State Disk cache that keeps the most frequently accessed data close to the CPUs, which is exactly where you want it. By automatically tiering the storage, this performance tier excels with complex queries since all IO is kept local to the compute layer. Furthermore, the column store is enhanced to store an unlimited amount of data in your SQL Data Warehouse. The Optimized for Compute performance tier provides the greatest level of scalability, enabling you to scale up to 30,000 compute Data Warehouse Units (cDWU). Choose this tier for workloads that require continuous, blazing fast, performance.
The following graphic shows the process of designing a data warehouse:
Queries and operations across tables
When you know in advance the primary operations and queries to be run in your data warehouse, you can prioritize your data warehouse architecture for those operations. These queries and operations might include:
- Joining one or two fact tables with dimension tables, filtering the combined table, and then appending the results into a data mart.
- Making large or small updates into your fact sales.
- Appending only data to your tables.
Knowing the types of operations in advance helps you optimize the design of your tables.
First, load your data into Azure Data Lake Store or Azure Blob storage. Next, use PolyBase to load your data into SQL Data Warehouse in a staging table. Use the following configuration:
|Resource Class||largerc or xlargerc|
Distributed or replicated tables
Use the following strategies, depending on the table properties:
|Type||Great fit for…||Watch out if…|
|Replicated||• Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression)||• Many write transactions are on table (such as insert, upsert, delete, update)
• You change Data Warehouse Units (DWU) provisioning frequently
• You only use 2-3 columns but your table has many columns
• You index a replicated table
|Round Robin (default)||• Temporary/staging table
• No obvious joining key or good candidate column
|• Performance is slow due to data movement|
|Hash||• Fact tables
• Large dimension tables
|• The distribution key cannot be updated|
- Start with Round Robin, but aspire to a hash distribution strategy to take advantage of a massively parallel architecture.
- Make sure that common hash keys have the same data format.
- Don’t distribute in varchar format.
- Dimension tables with a common hash key to a fact table with frequent join operations can be hash distributed.
- Use sys.dm_pdw_nodes_db_partition_stats to analyze any skewness in the data.
- Use sys.dm_pdw_request_steps to analyze data movements behind queries, monitor the time broadcast, and shuffle operations take. This is helpful to review your distribution strategy.
Index your table
Indexing is helpful for reading tables quickly. There is a unique set of technologies that you can use based on your needs:
|Type||Great fit for…||Watch out if…|
|Heap||• Staging/temporary table
• Small tables with small lookups
|• Any lookup scans the full table|
|Clustered index||• Tables with up to 100 million rows
• Large tables (more than 100 million rows) with only 1-2 columns heavily used
|• Used on a replicated table
• You have complex queries involving multiple join and Group By operations
• You make updates on the indexed columns: it takes memory
|Clustered columnstore index (CCI) (default)||• Large tables (more than 100 million rows)||• Used on a replicated table
• You make massive update operations on your table
• You overpartition your table: row groups do not span across different distribution nodes and partitions
You might partition your table when you have a large fact table (greater than 1 billion rows). In 99 percent of cases, the partition key should be based on date. Be careful to not overpartition, especially when you have a clustered columnstore index.
With staging tables that require ELT, you can benefit from partitioning. It facilitates data lifecycle management. Be careful not to overpartition your data, especially on a clustered columnstore index.
Learn more about partitions.
If you’re going to incrementally load your data, first make sure that you allocate larger resource classes to loading your data. We recommend using PolyBase and ADF V2 for automating your ELT pipelines into SQL Data Warehouse.
For a large batch of updates in your historical data, first, delete the concerned data. Then make a bulk insert of the new data. This two-step approach is more efficient.
Until auto-statistics are generally available, SQL Data Warehouse requires manual maintenance of statistics. It’s important to update statistics as significant changes happen to your data. This helps optimize your query plans. If you find that it takes too long to maintain all of your statistics, be more selective about which columns have statistics.
You can also define the frequency of the updates. For example, you might want to update date columns, where new values might be added, on a daily basis. You gain the most benefit by having statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.
Learn more about statistics.
SQL Data Warehouse uses resource groups as a way to allocate memory to queries. If you need more memory to improve query or loading speed, you should allocate higher resource classes. On the flip side, using larger resource classes impacts concurrency. You want to take that into consideration before moving all of your users to a large resource class.
If you notice that queries take too long, check that your users do not run in large resource classes. Large resource classes consume many concurrency slots. They can cause other queries to queue up.
Finally, by using the Compute Optimized Tier, each resource class gets 2.5 times more memory than on the Elastic Optimized Tier.
Learn more how to work with resource classes and concurrency.
Lower your cost
A key feature of SQL Data Warehouse is the ability to pause when you’re not using it, which stops the billing of computing resources. Another key feature is the ability to scale resources. Pausing and scaling can be done via the Azure portal or through PowerShell commands.
Optimize your architecture for performance
We recommend considering SQL Database and Azure Analysis Services in a hub-and-spoke architecture. This solution can provide workload isolation between different user groups while also using advanced security features from SQL Database and Azure Analysis Services. This is also a way to provide limitless concurrency to your users.
Learn more about typical architectures that take advantage of SQL Data Warehouse.+