SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that leverages Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data. Use SQL Data Warehouse as a key component of a big data solution. Import big data into SQL Data Warehouse with simple PolyBase T-SQL queries, and then use the power of MPP to run high-performance analytics. As you integrate and analyze, the data warehouse will become the single version of truth your business can count on for insights.
SQL Data Warehouse is a key component of an end-to-end big data solution in the cloud.
In a cloud data solution, data is ingested into big data stores from a variety of sources. Once in a big data store, Hadoop, Spark, and machine learning algorithms prepare and train the data. When the data is ready for complex analysis, SQL Data Warehouse uses PolyBase to query the big data stores. PolyBase uses standard T-SQL queries to bring the data into SQL Data Warehouse.
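For instance, once an external table over the big data store has been defined, reading it is an ordinary T-SQL query. A minimal sketch, assuming a hypothetical external table named `ext.ClickStream`:

```sql
-- Query a PolyBase external table with standard T-SQL.
-- ext.ClickStream is an assumed external table over blob storage.
SELECT TOP 10 *
FROM ext.ClickStream;
```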
SQL Data Warehouse stores data in relational tables with columnar storage. This format significantly reduces data storage costs and improves query performance. Once data is stored in SQL Data Warehouse, you can run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or in hours instead of days.
The analysis results can go to worldwide reporting databases or applications. Business analysts can then gain insights to make well-informed business decisions.
SQL Data Warehouse offers performance tiers designed for flexibility to meet your data needs, whether big or small. You can choose a data warehouse that is optimized for elasticity or for compute.
When you know in advance the primary operations and queries to be run in your data warehouse, you can prioritize your data warehouse architecture for those operations. Knowing the types of operations in advance helps you optimize the design of your tables.
First, load your data into Azure Data Lake Store or Azure Blob storage. Next, use PolyBase to load your data into a staging table in SQL Data Warehouse. Use the following configuration:
| Design | Recommendation |
| --- | --- |
| Distribution | Round Robin |
| Indexing | Heap |
| Partitioning | None |
| Resource Class | largerc or xlargerc |
Learn more about data migration, data loading, and the Extract, Load, and Transform (ELT) process.
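A minimal sketch of such a staging load, using CREATE TABLE AS SELECT (CTAS) with the configuration above; the `stage.FactSales` and `ext.FactSales` names are assumptions:

```sql
-- Hypothetical staging load via CTAS.
-- Run the load from a session assigned to largerc or xlargerc.
CREATE TABLE stage.FactSales
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- recommended for staging tables
    HEAP                          -- skip index maintenance during the load
)
AS
SELECT *
FROM ext.FactSales;               -- PolyBase external table over the source files
```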
Use the following strategies, depending on the table properties:
| Type | Great fit for… | Watch out if… |
| --- | --- | --- |
| Replicated | • Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression) | • Many write transactions are on the table (such as insert, upsert, delete, update) • You change Data Warehouse Units (DWU) provisioning frequently • You only use 2-3 columns but your table has many columns • You index a replicated table |
| Round Robin (default) | • Temporary/staging tables • No obvious joining key or good candidate column | • Performance is slow due to data movement |
| Hash | • Fact tables • Large dimension tables | • The distribution key cannot be updated |
Tips: Start with Round Robin, but aspire to a hash distribution strategy to take advantage of the massively parallel architecture. Make sure that common hash keys have the same data format, and don't distribute on a varchar column.
Learn more about replicated tables and distributed tables.
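A minimal sketch of the hash and replicated strategies; all table and column names are assumptions:

```sql
-- Hash-distribute a large fact table on a frequently joined key.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(CustomerKey))
AS SELECT * FROM stage.FactSales;

-- Replicate a small dimension table (under 2 GB after compression).
CREATE TABLE dbo.DimCustomer
WITH (DISTRIBUTION = REPLICATE)
AS SELECT * FROM stage.DimCustomer;
```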
Indexing is helpful for reading tables quickly. SQL Data Warehouse offers a set of index types that you can choose from based on your needs:
| Type | Great fit for… | Watch out if… |
| --- | --- | --- |
| Heap | • Staging/temporary tables • Small tables with small lookups | • Any lookup scans the full table |
| Clustered index | • Tables with up to 100 million rows • Large tables (more than 100 million rows) with only 1-2 columns heavily used | • Used on a replicated table • You have complex queries involving multiple join and GROUP BY operations • You make updates on the indexed columns: this takes memory |
| Clustered columnstore index (CCI) (default) | • Large tables (more than 100 million rows) | • Used on a replicated table • You make massive update operations on your table • You overpartition your table: row groups cannot span different distribution nodes and partitions |
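A minimal sketch of the two most common choices; table and column names are assumptions:

```sql
-- Large fact table: clustered columnstore index (the default).
CREATE TABLE dbo.FactOrders
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stage.FactOrders;

-- Mid-size table queried heavily on one column: clustered (rowstore) index.
CREATE TABLE dbo.DimProduct
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED INDEX (ProductKey))
AS SELECT * FROM stage.DimProduct;
```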
You might partition your table when you have a large fact table (greater than 1 billion rows). In 99 percent of cases, the partition key should be based on date. Staging tables that require ELT can also benefit from partitioning, because it facilitates data lifecycle management. Be careful not to overpartition, especially when you have a clustered columnstore index.
Learn more about partitions.
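A minimal sketch of a date-partitioned fact table; names and boundary values are assumptions:

```sql
-- Hypothetical date-partitioned fact table.
-- Keep boundaries coarse to avoid overpartitioning a columnstore table.
CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20180101, 20190101, 20200101) )
)
AS SELECT * FROM stage.FactSales;
```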
If you're going to incrementally load your data, first make sure that you allocate a larger resource class for loading. We recommend using PolyBase and Azure Data Factory (ADF) V2 to automate your ELT pipelines into SQL Data Warehouse.
For a large batch of updates to your historical data, first delete the affected data, then bulk-insert the new data. This two-step approach is more efficient than updating rows in place.
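A minimal sketch of the two steps; table names and the date range are assumptions:

```sql
-- Step 1: delete the affected historical slice.
DELETE FROM dbo.FactSales
WHERE OrderDateKey BETWEEN 20180101 AND 20180131;

-- Step 2: bulk-insert the corrected rows.
INSERT INTO dbo.FactSales
SELECT * FROM stage.FactSales_Corrections;
```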
Until auto-statistics are generally available, SQL Data Warehouse requires manual maintenance of statistics. It’s important to update statistics as significant changes happen to your data. This helps optimize your query plans. If you find that it takes too long to maintain all of your statistics, be more selective about which columns have statistics.
You can also define the frequency of the updates. For example, you might want to update date columns, where new values might be added, on a daily basis. You gain the most benefit by having statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.
Learn more about statistics.
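A minimal sketch of creating and refreshing a single-column statistic; table and column names are assumptions:

```sql
-- Create statistics on a column used in joins, WHERE clauses, or GROUP BY.
CREATE STATISTICS stat_FactSales_CustomerKey
ON dbo.FactSales (CustomerKey);

-- Refresh after significant data changes (for example, a daily load).
UPDATE STATISTICS dbo.FactSales (stat_FactSales_CustomerKey);
```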
SQL Data Warehouse uses resource classes as a way to allocate memory to queries. If you need more memory to improve query or loading speed, you should allocate a higher resource class. The trade-off is that larger resource classes reduce concurrency, so take that into consideration before moving all of your users to a large resource class.
If you notice that queries take too long, check that your users do not run in large resource classes. Large resource classes consume many concurrency slots. They can cause other queries to queue up.
Finally, on the Compute Optimized Tier, each resource class gets 2.5 times more memory than on the Elastic Optimized Tier.
Learn more about how to work with resource classes and concurrency.
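Resource classes are implemented as database roles, so you can move a user between them with role membership procedures. A minimal sketch, assuming a hypothetical user named LoadUser:

```sql
-- Add a loading user to a larger resource class before a heavy load.
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Return the user to the default class when the load finishes.
EXEC sp_droprolemember 'largerc', 'LoadUser';
```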
A key feature of SQL Data Warehouse is the ability to pause when you’re not using it, which stops the billing of computing resources. Another key feature is the ability to scale resources. Pausing and scaling can be done via the Azure portal or through PowerShell commands.
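Scaling can also be done with T-SQL by changing the service objective; pausing itself has no T-SQL equivalent. A minimal sketch, in which the database name and DWU target are assumptions:

```sql
-- Run against the master database to scale the warehouse to DW600.
ALTER DATABASE MySqlDw
MODIFY (SERVICE_OBJECTIVE = 'DW600');
```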
We recommend considering SQL Database and Azure Analysis Services in a hub-and-spoke architecture. This solution can provide workload isolation between different user groups while also using advanced security features from SQL Database and Azure Analysis Services. This is also a way to provide limitless concurrency to your users.
Learn more about typical architectures that take advantage of SQL Data Warehouse.