Just a few pointers on SQL Server Query Tuning.

Version : 0.0
Date : 25/09/2014
Level: : Very basic
By : Albert van der Sel


This short and *very* simple note is not about investigating waits, disk access, and all the "physical stuff" of performance analysis.
So, if you are interested in viewing "waits", "physical database design", using perfstats, using the Profiler, viewing IO on database files,
and that sort of thing, you might take a look at another note dedicated to those subjects.

Instead, this note tries to say something useful about "queries" themselves (design and execution). Hope it works...


1. The "Query Optimizer".

A component of the database engine is the "Query Optimizer". It uses a method called "Cost Based Optimization",
and it evaluates queries according to the algorithms available in that method.

The "costs" are in terms of CPU cycles and IO access costs. These are always estimates, mainly based on the available statistics
associated with the objects in the query (tables, columns, and indexes).

The Optimizer of course does not have detailed information on the storage subsystem (are these local disks? Or LUNs on some SAN? etc.).
Those details are "hidden", and are only known by the Operating System (IO Manager and drivers).
Also, the Optimizer cannot really "appreciate" time (like we can), and thus it gives us only "costs" in terms of CPU cycles
and abstract IO operations, needed to access all required rows from the disk blocks.

Now, it may produce different plans for the same query at different moments. This depends, among other factors, on whether
it can take advantage of indexes, and on whether the statistics are "stale" or up-to-date. So, suppose some indexes no longer seem useful:
it may then create a plan using another access path.

It's important to know that the Optimizer will always create an execution plan for any query.
The Optimizer attempts to generate the plan whose sequence of actions consumes the fewest CPU cycles and IO operations.

In the process of creating a plan, the Optimizer takes a look at the column list that needs to be in the output, the indexes that are present,
the type of "join" (if that applies), the columns on which the tables are joined, and, most importantly, the actual statistics on the objects.


So, SQL Server (first the Parser, and then the Optimizer) does these things:

- The Parser "parses" (examines) the query: is it correct, or are there (syntax) errors?
- The Optimizer analyzes the query.
- The Optimizer inspects the statistics of the relevant objects.
- The Optimizer produces the "best" query plan (the one it believes needs the fewest IO operations); see the sketch below.
- SQL Server executes the query on the basis of that plan.
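
By the way, you can ask SQL Server to show the plan it produced as text, without actually executing the query.
A minimal sketch (here EMP is one of the sample tables created further below):

-- Return the estimated plan as text, instead of executing the query:
SET SHOWPLAN_TEXT ON;
GO

SELECT * FROM EMP;
GO

SET SHOWPLAN_TEXT OFF;
GO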

Let's now see what those "statistics" are.


2. Statistics.

If a column of a table has a lot of different values, then it is said that it has a "high selectivity".
For example, "cust_id" should have a high selectivity, since each customer should be unique.
Columns with "low selectivity" are, for example, a "country" column, or "gender" (which can only be "M" or "F").

You can translate that to the "density" of a certain column (or combination of columns), since the density is defined as
Density = 1/[# of distinct values in the column (or columns)]
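
For example, a "gender" column with only the values 'M' and 'F' has a density of 1/2 = 0.5, while a unique "cust_id"
column in a table of 10000 customers has a density of 1/10000 = 0.0001. So: the lower the density, the higher the selectivity.
You can compute it yourself; a minimal sketch, assuming some table T with a column c:

-- Density = 1 / (number of distinct values in the column):
SELECT 1.0 / COUNT(DISTINCT c) AS density
FROM T;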

Now, how the Optimizer exactly works internally, is a subject for research and IT University folks. We will not dig that deep here.

But in simple words, it goes a bit like this: imagine that the Optimizer inspects a column on which statistics are defined, for example
because an index was created on that column.
It then divides the possible values of that column into "steps", or "intervals": discrete ranges of values.
Next, it calculates how many rows fall into each of these "steps", along with the "average row length" and other information.
Once the Optimizer has an idea of how many rows fall into all these intervals, it can "reasonably" estimate the IO operations and CPU cycles
which are necessary to retrieve the rows for a query.

So, "statistics" contain metrics on the data count and distribution for a column (or columns).

When does the Optimizer create statistics?
  1. The Optimizer will create stats automatically for any column on which an index is created, or when you explicitly use
    the "CREATE STATISTICS .." command for a column (or set of columns) of a table; see the sketch below.

  2. Other columns will not be inspected "automatically", unless "Auto Create Statistics" is "True", which is a database property.
    To see that property in SSMS, rightclick the database, choose "Properties", and then choose "Options".
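
From T-SQL, you can create statistics yourself, and check (or switch on) the "Auto Create Statistics" property as well.
A minimal sketch; the statistics name "st_emp_sal" and the database name "TEST" are just examples (EMP is created below):

-- Explicitly create statistics on a column:
CREATE STATISTICS st_emp_sal ON EMP (SAL);

-- Check whether "Auto Create Statistics" is on for your database:
SELECT name, is_auto_create_stats_on FROM sys.databases WHERE name = 'TEST';

-- Switch it on:
ALTER DATABASE TEST SET AUTO_CREATE_STATISTICS ON;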
Here is an extremely simple example.

In a new, empty test database, let's create the following tables. Then, let's insert a few sample records (it's all very small).

-- Create the 3 tables: LOC (locations), DEPT (departments), EMP (employees). It's a simple relational model with 3 linked tables.

create table LOC
(
LOCID int,
CITY varchar(16),
constraint pk_loc primary key (locid) -- is automatically a unique index
)

create table DEPT
(
DEPID int,
DEPTNAME varchar(16),
LOCID int,
constraint pk_dept primary key (depid), -- is automatically a unique index
constraint fk_dept_loc foreign key (locid) references loc(locid)
)

create table EMP
(
EMPID int,
EMPNAME varchar(16),
DEPID int,
SAL int,
constraint fk_emp_dept foreign key (depid) references dept(depid)
)


-- INSERT a few sample records:

INSERT INTO LOC VALUES (1,'Amsterdam');
INSERT INTO LOC VALUES (2,'Haarlem');
INSERT INTO LOC VALUES (3,null);
INSERT INTO LOC VALUES (4,'Utrecht');

INSERT INTO DEPT VALUES (1,'Sales',1);
INSERT INTO DEPT VALUES (2,'PZ',1);
INSERT INTO DEPT VALUES (3,'Management',2);
INSERT INTO DEPT VALUES (4,'RD',3);
INSERT INTO DEPT VALUES (5,'IT',4);

INSERT INTO EMP VALUES (1,'Joop',1,1000);
INSERT INTO EMP VALUES (2,'Gerrit',2,500);
INSERT INTO EMP VALUES (3,'Harry',2,2000);
INSERT INTO EMP VALUES (4,'Christa',3,900);
INSERT INTO EMP VALUES (5,null,4,3000);
INSERT INTO EMP VALUES (6,'Nina',5,5000);
INSERT INTO EMP VALUES (7,'Nadia',5,4000);

-- Now, just to be sure, let's update all statistics manually:

exec sp_updatestats
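
-- As a side note: you can also update the statistics of just one table, for example with a full scan of its rows:

UPDATE STATISTICS LOC WITH FULLSCAN;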

-- Now run this:

DBCC SHOW_STATISTICS(LOC,pk_loc)

What do we see? Take notice of figure 1.


Fig 1. DBCC SHOW_STATISTICS on an index of a table.



In figure 1, you can see that the Optimizer came up with 4 steps (intervals). Now, in this case,
there were only 4 records in the table. But even in the case of a very large table, the number of steps stays
limited: the histogram has at most 200 steps.

You can also see that it has determined (or estimated) how many records fall into each interval.
You can see information about the "density". Then there are columns like "RANGE_HIGH_KEY", "RANGE_ROWS", etc.,
which hold what the Optimizer has established about the highest value per interval, how many rows fall inside it, and so on.

The fact that this sort of "histogram" tells the Optimizer how many rows are to be expected per interval,
means that the Optimizer is able to "estimate" the costs of retrieving certain rows.

All in all, I suppose this tells us a bit about how the Optimizer "thinks".
But we are not done yet. Up to now, we have only dealt with statistics on an index of a certain table (pk_loc on table LOC).

So, what about a large join of multiple tables? It works in more or less the same way. The Optimizer will create a "tree" of nodes,
where each node of that tree is a subresult of the whole query, for which a subcost is estimated.

Below you see such a tree. Here we have used a query which joins all 3 of our tables. However, before we really execute it,
we have "requested" SQL Server to show us the execution plan. From Management Studio, you can do that by first writing your query in
the Query window, then clicking "Query" in the Management Studio main menu, and then choosing "Display Estimated Execution Plan".
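
The exact query behind figure 2 is not essential; a three-table join along the following lines will do
(a sketch, the column list is just an example):

-- Which employee works at which department, in which city?
SELECT e.EMPNAME, d.DEPTNAME, l.CITY
FROM EMP e
JOIN DEPT d ON e.DEPID = d.DEPID
JOIN LOC l ON d.LOCID = l.LOCID;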


Fig 2. "Complex" join query, using EMP, DEPT, and LOC.



The costs shown are relative: the whole query counts as 100%, and per node you can see its share of the cost, which, when added up,
should again result in (about) 100%.

Did you notice that at two nodes a "clustered index seek" is used? And one "table scan", of the EMP table.

Of course, in the create statements of LOC, DEPT, and EMP, you probably noticed that a Primary Key was defined on LOC and DEPT only.
By default, a Primary Key definition corresponds to a unique clustered index.
And we did not define a primary key (or any other index) on the EMP table.

Well, since our sample tables are so small, in this case it's even less work to just do a table scan, instead of accessing an index first.

But in general, you would say that "table scans" should be avoided as much as possible, and hopefully the Optimizer
finds usable indexes to use instead; see the sketch below.
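
If you want to see the plan change, you could give EMP a primary key too, and then display the estimated plan again.
A minimal sketch; the constraint name "pk_emp" is just an example:

-- EMPID was created as nullable, so first make it NOT NULL:
ALTER TABLE EMP ALTER COLUMN EMPID int NOT NULL;

-- Then add a primary key, which by default is backed by a unique clustered index:
ALTER TABLE EMP ADD CONSTRAINT pk_emp PRIMARY KEY (EMPID);

Mind you: on such a tiny table, the Optimizer may still prefer a scan, simply because that is the cheapest plan.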

We come back to this sort of analysis later.

Next, let's first see how to manage statistics on objects.


3. Managing Statistics.