Statistical Data in SQL Server 2008
Microsoft SQL Server 2008 collects statistics about individual columns (single-column statistics) or sets of columns (multicolumn statistics). Statistics are used by the query optimizer to estimate the selectivity of expressions, and thus the size of intermediate and final query results. Good statistics allow the optimizer to accurately assess the cost of different query plans and then choose a high-quality plan. All information about a single statistics object is stored in several columns of a single row in the sysindexes table, and in a statistics binary large object (statblob) kept in an internal-only table. In addition, information about statistics can be found in the new metadata views sys.stats and sys.indexes.
Summary of Statistics Features
SQL Server 2008 has a number of features for maintaining statistics. The most important one is the ability to automatically create and update statistics. This feature, also called auto stats, is on by default starting in SQL Server 2000. Approximately 98% of SQL Server 2000 installations leave this feature enabled, which is a best practice. For the majority of databases and applications, developers and administrators can rely on the automatic creation and update of statistics to provide comprehensive and accurate statistics about their data. This allows the SQL Server 2008 query optimizer to produce good query plans consistently, while keeping development and administration costs low. If you need more control over statistics creation and update to get good query plans and manage the overhead of gathering statistics, you can use manual statistics creation and update capabilities.
An important feature for high-throughput online transaction processing application environments is the ability to asynchronously update statistics automatically. This can improve predictability of query response time in such environments.
SQL Server 2008 statistics features allow you to:
· Implicitly create and update statistics with the default sampling rate (in the SELECT, INSERT, DELETE, UPDATE and MERGE commands, use of a column in a query condition such as a WHERE or JOIN clause causes statistics to be created or updated if necessary when automatic statistics creation and update is enabled).
· Manually create and update statistics with any desired sampling rate, and drop statistics (CREATE STATISTICS, UPDATE STATISTICS, DROP STATISTICS, CREATE INDEX, DROP INDEX).
· Create and maintain filtered statistics, which are defined over a subset of the rows in the table by using a WHERE clause.
· Automatically create filtered statistics whenever a filtered index is created.
· Manually create statistics in bulk for all columns of all tables in a database (sp_createstats).
· Manually update all existing statistics in the database (sp_updatestats).
· List statistics objects that exist for a table or database (sp_helpstats, catalog views sys.stats, sys.stats_columns).
· Display descriptive information about statisticsobjects (DBCC SHOW_STATISTICS).
· Enable and disable automatic creation and update of statistics database-wide or for a specific table or statistics object (ALTER DATABASE options AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS, sp_autostats, and the NORECOMPUTE option on CREATE STATISTICS and UPDATE STATISTICS).
· Enable and disable asynchronous automatic update of statistics (ALTER DATABASE option AUTO_UPDATE_STATISTICS_ASYNC).
In addition, SQL Server Management Studio allows you to graphically browse and control statistics objects within its Object Explorer view. Statistics are listed in Object Explorer in a folder underneath each table object.
What’s New for Statistics in SQL Server 2008.
SQL Server 2008 provides the option to create filtered statistics, which are restricted to a subset of the rows in the table defined by a predicate, also called a filter. Filtered statistics are either created explicitly, or more commonly, through creating a filtered index.
* Automatic creation: As mentioned, filtered statistics are automatically created as a by-product of filtered indexes. Unlike regular statistics, filtered statistics are never automatically created by the query optimizer as a result of query and update statements.
* Automatic update: To determine when statistics need to be automatically updated, the query optimizer uses column modification counters. For filtered statistics, the counter is multiplied by the selectivity of the filter, which ranges from 0 to 1, to compensate for the fact that the column counter tracks changes to all rows in the table, not the changes to the rows in the filtered statistics object.
* Sampling: To build the statistics, SQL Server 2008 reads every nth data page, according to the sampling rate, and then it applies the filter, if any.
* If you don’t specify sampling rate, the default behavior is to sample based on the number of rows in the table and, for filtered statistics, based on the selectivity of the filter. The more selective the filter, the more rows need to be sampled to gather enough data to build the histogram.
* Cardinality estimate: As with regular statistics, the optimizer attempts to pick the most relevant statistics when performing cardinality estimate. There is an additional check that the optimizer must make for filtered statistics: whether the statistics predicate contains the query predicate (that is, whether the query predicate must be true for a row if the statistics predicate is true for that row). Out of the possible filtered statistics, the query optimizer favors those that are as close to the query predicate as possible.
Other Important Enhancements in Statistics.
The previous version, SQL Server 2005, introduced a number of important statistics features that allow the query optimizer to choose better query plans for a broader range of queries, or otherwise improve the management of statistics:
· Large object support: Columns of type ntext, text, and image, as well as the new types nvarchar(max), varchar(max), and varbinary(max) can be specified as statistics columns.
· Improved statistics loading framework: The optimizer internally improves the loading of statistics compared with SQL Server 2000. It now loads all and only the statistics it needs, improving optimization result quality and performance.
· Minimum sample size: A minimum of 8 megabytes (MB) of data, or the size of the table if smaller, are now sampled during statistics gathering.
· Increased limit on number of statistics: The number of column statistics objects allowed per table has been increased to 2,000. An additional 249 index statistics may also be present, for a maximum of 2,249 statistics per table.
· Clearer and more consistent display of histograms: DBCC SHOW_STATISTICS is improved because histograms are always scaled before they are stored in the catalogs.
· Enhanced DBCC SHOW_STATISTICS output: DBCC SHOW_STATISTICS now displays the name of the statistics object being displayed, to avoid ambiguity.
· Single rowset output for DBCC SHOW_STATISTICS: DBCC SHOW_STATISTICS supports options to output the header, density vector, and histogram individually as single rowsets. This allows easier programming when DBCC SHOW_STATISTICS output is automatically processed.
· Statistics on internal tables: Statistics are fully supported on the tables listed in sys.internal_tables, including XML and full-text indexes, Service Broker queues, and query notification tables.
· Statistics on up to 32 columns: The limit on the number of columns in a statistics object has been increased to 32 from 16.
· Statistics on partitioned tables: Statistics are fully supported on partitioned tables, which were introduced in SQL Server 2005. Histograms are maintained on a per-table basis (not per-partition).
· Parallel statistics gathering for fullscan: For statistics gathered with fullscan, creation of a single statistics object can be done in parallel for both nonpartitioned and partitioned tables.
· Improved recompiles and statistics creation in case of missing statistics: In the case where auto create of statistics failed, on a subsequent execution of a plan generated with missing statistics, auto create is performed and the plan is recompiled; the missing statistics condition does not persist.
· Improved recompilation logic and statistics update for empty tables: Changing from 0 to > 0 rows in a table results in query recompilation and update of statistics.
· Inferred date correlation constraints: By enabling the DATE_CORRELATION_OPTIMIZATION database setting, you can cause SQL Server to maintain information about the correlation between datetime fields across a pair of tables linked by a foreign key. This information is used to allow implied predicates to be determined for some queries. The information is not used directly for selectivity estimation or costing by the optimizer, so it is not statistics in the strictest sense, but it is closely related to statistics because it is auxiliary information used to help obtain a better query plan.
· sp_updatestats: In SQL Server 2005 and SQL Server 2008, sp_updatestats updates only those statistics that require updating based on the rowmodctr information in the sys.sysindexes compatibility view, thus eliminating unnecessary updates of unchanged items. For databases running under compatibility level 90 or higher, sp_updatestats preserves the automatic UPDATE STATISTICS setting for any particular index or statistics.
There are also some other minor changes in behavior for statistics. In particular, the statblob column in sys.sysindexes is now always given as NULL. The statblob itself is kept in an internal-only catalog table.
We define the following terms related to SQL Server 2008 statistics:
· statblob: Statistics binary large object. This object is stored in the system catalogs in an internal catalog view, sys.sysobjvalues.
· String summary: A form of statistics that summarizes the frequency distribution of substrings in a string column. This is used to help estimate selectivity of LIKE predicates. It is stored in the statblob for a string column.
· sysindexes: The sys.sysindexes catalog view, which contains information about tables and indexes.
· Predicate: A condition that evaluates to true or false. Predicates appear in a WHERE or JOIN clause in a database query.
· Selectivity: The fraction of rows from the input set of the predicate that satisfy the predicate. More sophisticated selectivity measures are also used to estimate the number of rows produced by joins, DISTINCT, and other operators. For example, SQL Server 2008 estimates the selectivity of the predicate “Sales.SalesOrderHeader.OrderID = 43659” in the AdventureWorks database as 1/31465 = 0.00003178.
· Cardinality estimate: An estimate of the size of a result set. For example, if a table T has 100,000 rows and a query contains a selection predicate of the form T.a=10, and a histogram shows that the selectivity of T.a=10 is 10%, then the cardinality estimate for the fraction of rows of T that must be considered by the query is 10% * 100,000 = 10,000.
· LOB: Large object (value of type text, ntext, image, varchar(max),nvarchar(max), varbinary(max)).
· Filter: a condition that is evaluated to determine whether a row must be part of the filtered statistics. The predicate appears in the WHERE clause of the CREATE STATISTICS or CREATE INDEX statements (in the case when statistics are automatically created as a side effect of creating an index).
Statistics Collected by SQL Server 2008
SQL Server 2008 maintains the following information at the table level. These are not part of a statistics object, but SQL Server 2008 uses them in some cases during query cost estimation:
· Number of rows in the table or index (rows column in sys.sysindexes)
· Number of pages occupied by the table or index (dpages column in sys.sysindexes)
SQL Server 2008 collects the following statistics about table columns and stores them in a statistics object (statblob):
· The time the statistics were collected.
· The number of rows used to produce the histogram and density information (described later in this section). (The row heading is “Rows sampled.”)
· The average key length.
· A single-column histogram, including the number of steps.
· A string summary, if the column contains character data. DBCC SHOW_STATISTICS output contains a column String Index, which has the value YES if a statistics object contains a string summary.
· The estimated number of rows matching the filter (for filtered statistics); or all rows in the table (for regular statistics). (The row heading is “Rows.”)
· All rows in the table. (The row heading is “Unfiltered rows.”)
A histogram is a set of up to 200 values of a given column. All or a sample of the values in a given column are sorted; the ordered sequence is divided into up to 199 intervals so that the most statistically significant information is captured. In general, these intervals are of nonequal size. The following values, or information sufficient to derive them, are stored with each step of the histogram.
Shows the upper boundary of a histogram step. This is a key value.
Specifies how many rows are inside the range (they are smaller than this RANGE_HI_KEY, but bigger than the previous smaller RANGE_HI_KEY).
Specifies how many rows are exactly equal to RANGE_HI_KEY.
Specifies the average number of rows per distinct value inside the range.
Specifies how many distinct key values are inside this range (not including the previous key before RANGE_HI_KEY and RANGE_HI_KEY itself).
Histograms in SQL Server 2008 are only built for a single column—the first column in the set of key columns of the statistics object.
SQL Server 2008 builds the histogram from the sorted set of column values in three steps:
1. Histogram initialization: In the first step, a sequence of values starting at the beginning of the sorted set is processed, and up to 200 values of RANGE_HI_KEY, EQ_ROWS, RANGE_ROWS, and DISTINCT_RANGE_ROWS are collected (RANGE_ROWS and DISTINCT_RANGE_ROWS are always zero during this step). The first step ends either when all input has been exhausted, or when 200 values have been found.
2. Scan with bucket merge: Each additional value from the leading column of the statistics key is processed in the second step, in sorted order; for each successive value, either the value is added to the last range, or a new range at the end is created (this is possible because the input values are sorted). If a new range is created, one pair of existing, neighboring ranges is collapsed into a single range. This pair of ranges is selected in order to minimize information loss. The number of histogram steps after the ranges are collapsed stays at 200 throughout this step. This method is based on a variation of the maxdiff histogram.
3. Histogram consolidation: In the third step, more ranges can be collapsed if a significant amount of information is not lost. Therefore, even if the column has more than 200 unique values, the histogram might have less than 200 steps.
If the histogram has been built using a sample, the values of RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS are estimated and therefore they do not need to be whole integers.
Density is information about the number of duplicates in a given column or combination of columns, and it is calculated as 1/(number of distinct values). If a column is used in an equality predicate, the number of qualifying rows is estimated by using the density derived from the histogram. Histograms are also used to estimate the selectivity of nonequality selection predicates, joins, and other operators.
In addition to the timestamp showing the time the statistics were gathered, the number of rows in the table, the number of rows estimated to match the filter, the number of rows sampled to produce the histogram, the Density information and average key length, and the histogram itself, single-column statistics information includes an All density value for each set of columns forming a prefix of the statistics column set. This is shown in the second rowset output by DBCC SHOW_STATISTICS. All density is an estimate of 1/(number of distinct values in the prefix column set). The next section gives an example of this.
Note: The Density value included in the first row returned by dbcc show_statistics is the density of all values sampled other than the RANGE_HI_KEY values. The RANGE_HI_KEY values are typically the more frequent values in the distribution. Hence, the displayed Density value gives potentially useful information about the density of nonfrequent values.
The multicolumnstatistics for one set of columns consist of one histogram for the first column in the statistics definition, one density value for the first column, and an All Density value for each prefix combination of columns (including the first column alone). Each set of multicolumn statistics (a histogram and two or more density values) is stored in one statblob together with the timestamp of the last statistics update, the number of rows in the sample used to produce the statistical information, the number of steps in the histogram, and the average length of the key. A string summary is included for only the first column, only if it contains character data. Multicolumn densities are not supported for filtered statistics.
Use sp_helpindex and sp_helpstats to display the list of all statistics available for a given table. sp_helpindex lists all indexes on the table, and sp_helpstats lists all the statistics on the table. Each index also carries the statistical information for its columns. The statistical information created by using the CREATE STATISTICS command is equivalent to the statistics built by a CREATE INDEX command on the same columns. The only difference is that the CREATE STATISTICS command uses sampling by default while the CREATE INDEX command gathers the statistics with fullscan, because it has to process all rows for the index anyway.