Iíve noticed that of late Iíve become a bit more critical of a well known publication that I suspect many DBAís read. I have subscriptions to a number of publications and for the ones that I pay for Iím generally quite content.

I like paper / hard copy because I can make use of time where itís just not practical to have a laptop or PC Ė and I spend my working day in front of a screen so sometimes itís nice to escape to the garden or the pub or on the train, say travelling to a SSUG meeting without a PC.


So whatís set GrumpyOldDba off now you may ask? Well thereís an article been published about removing duplicate indexes Ė the Microsoft guys blogged this, with code, ages ago and I spent some time looking at this type of analysis as part of my index analysis series.

However itís not quite so easy as it sounds and leaving aside duplicated single column non clustered indexes, which are obviously bad, the actual matter of where indexes have the same columns but in different orders or contained within is quite a complex matter and worth a whole lot of time on its own.

( Hereís the original blog from June 2007† http://blogs.msdn.com/mssqlisv/archive/2007/06/29/detecting-overlapping-indexes-in-sql-server-2005.aspx)


Idx_anytable_index1

†SalesID

PropertyID

SaleStartDate

SaleEndDate

UserID

IsCurrent

Idx_anytable_index2

†SalesID

UserID

PropertyID

SaleStartDate

SaleEndDate

IsCurrent

Idx_anytable_index3

†SalesID

PropertyID

SaleEndDate

IsCurrent

 

 

Idx_anytable_index4

†PropertyID

SalesID

SaleStartDate

SaleEndDate

UserID

IsCurrent


Taking the above indexes to be all standard non clustered indexes we can say that thereís a fair bit of duplication and in fact index 3 is already contained within index 1

Sadly with anything like this it just depends, and generally Iíd say that index 3 is probably unwanted Ė however it may cover a query exactly, in which case index 1 might be ignored, and in certain joins one index may actually be preferred over another.

So what Iím saying is that other than basic duplicates itís actually quite tricky, obviously you can check the dmvís ( sql 2005 onwards ) to actually see if the index is being used, if youíre using 2000 or 7.0 then you canít.


So what really got me worked up was the statement that creating a secondary index over a clustered index was a bad idea, now to me that such an article appears in this magazine is very disappointing and Iím also disappointed with the author Ė to me it shows a lack of editorial knowledge and like the review of the backup utilities last month isnít what Iíd expect from such a respected magazine. Iím also frankly becoming less and less impressed by the fact that many articles are only partly published and you have to go to the website for the full article. Maybe itís a sign of the times that we donít actually want paper copy any more.


Anyway hereís a very simple example of secondary indexes over the top of clustered indexes. The table I used has 2 million rows and is just under 16gb in size ( Letís do real world examples here <grin> ) Itís a variation of the table in used in† ďAnalysing Index Part 4 Ė Size does matterĒ


http://www.grumpyolddba.co.uk/sql2005/working%20with%20indexes%204.mht


This is the test table and the code used to populate it. You might want to find other ways of populating the table if youíre not running on a server as I figure this will run for a few hours on a laptop! There were distinct reasons why I wanted to generate data this way for some previous tests, the only important factor for this tests is that each row should exceed half a page so as to force one row per page.


create table dbo.TestTable

(

NumKey int identity(1,1)not null,

cGuid uniqueidentifier default newid()null,

cSguid uniqueidentifier default NEWSEQUENTIALID() not null,

cBig bigint null,

cInt int null,

cDate1 datetime not null,

cDate2 datetime null,

cVchar1 varchar(50) null,

cChar1 char(20) null,

cVchar2 varchar(500) null,

cChar2 char(100),

Pad1 char(4000) not null

);

go

--

--

set nocount on

--

declare @num bigint,@count int,@count2 int,@count3 int

set @count = 1

set @count2 = 12

set @count3 = 1

while @count<2000001

begin

IF (@count%2)=0

††††††††††† begin

††††††††††††††††††††††† insert into dbo.TestTable(cBig,cInt,cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1)

††††††††††††††††††††††† values( 36854775807+(36854775807/@count),@count2*3,dateadd(dd,@count3,'25 MAY 2007'),dateadd(mi,@count,getdate()),

††††††††††††††††††††††† reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3),

††††††††††† ††††††††††† replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate()+@count)+datename(month,getdate()+@count)) +'blahblahblahblahblahblahblahblahblahblahblahblah',

††††††††††††††††††††††† replicate(convert(char(36),newid()),110))

††††††††††† end

else

††††††††††† begin

††††††††††††††††††††††† insert into dbo.TestTable(cBig,cInt, cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1)

††††††††††††††††††††††† values( 854775807-(36854775807/@count),@count2*2,dateadd(dd,@count3,'5 dec 2006'),dateadd(mi,@count,getdate()),

††††††††††††††††††††††† reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3),

††††††††††††††††††††††† replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate())+datename(month,getdate())) +'blahblahblahblahblahblahblahblahblahblahblahblah',

††††††††††††††††††††††† replicate(convert(char(36),newid()),110))

††††††††††† end

set @count = @count+1;

set @count2 = @count2+2;

IF @count2>50

††††††††††† set @count2 = 12;

IF (@count%3000)=0

††††††††††† set @count3 = @count3+1;

 

end  

 


ō  Hereís the output from sp_spaceused



ō  and from sys.dm_db_index_physical_stats



ō  Hereís my table after adding a clustered primary key on the integer identity column, NumKey



ō  Extracting index information from sys.dm_db_index_physical_stats


†select index_id,index_type_desc,index_depth,index_level,page_count,record_count,min_record_size_in_bytes, max_record_size_in_bytes,avg_record_size_in_bytes

††††††††††† from sys.dm_db_index_physical_stats (db_id(), object_id('testtable'), null ,null ,'detailed' )

 


And the results


index_id

index_type_desc

index_depth

index_level

page_count

record_count

min_record_size_in_bytes

max_record_size_in_bytes

avg_record_size_in_bytes

1

CLUSTERED

4

0

2000000

2000000

4577

4577

4577

1

CLUSTERED

4

1

3218

2000000

11

11

11

1

CLUSTERED

4

2

8

3218

11

11

11

1

CLUSTERED

4

3

1

8

11

11

11


Select 50 rows from the table by PK


select * from dbo.TestTable where NumKey between 50 and 100


(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


This is fair enough because at 4,577 bytes per row it requires 51 page requests to return the data rows


 

The entire row

 


Just the Numkey column

 


The in statement

 


ō  If we just select the integer PK column it still requires the same io to return the result set


select numkey from dbo.TestTable where NumKey between 50 and 100

 

(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


ō  Just out of pure interest and as part of another test I was running I dropped the 51 values into an IN statement


select numkey from dbo.testtable where numkey

in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,

71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,

92,93,94,95,96,97,98,99,100)


(51 row(s) affected)


Table 'TestTable'. Scan count 51, logical reads 204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


ō  The same for two values?


select numkey from dbo.testtable where numkey in (50,51)

 

(2 row(s) affected)

Table 'TestTable'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


So thatís making use of the clustered PK on the integer column, returning just the integer value.


Now create a non clustered secondary index over the PK


create index idx_TestTable_Numkey on dbo.TestTable(Numkey);

go


index_id

index_type_desc

index_depth

index_level

page_count

record_count

min_record_size_in_bytes

max_record_size_in_bytes

avg_record_size_in_bytes

1

CLUSTERED INDEX

4

0

2000000

2000000

4577

4577

4577

1

CLUSTERED INDEX

4

1

3218

2000000

11

11

11

1

CLUSTERED INDEX

4

2

8

3218

11

11

11

1

CLUSTERED INDEX

4

3

1

8

11

11

11

2

NONCLUSTERED INDEX

3

0

1732

2000000

5

5

5

2

NONCLUSTERED INDEX

3

1

4

1732

11

11

11

2

NONCLUSTERED INDEX

3

2

1

4

11

11

11


Now the original query


select numkey from dbo.TestTable where NumKey between 50 and 100


(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Now in the overall scheme of things 56 to 3 io may not seem very much, but if this query runs maybe several times a second then there could be some big savings.

 

select numkey from dbo.testtable where numkey

in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,

71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,

92,93,94,95,96,97,98,99,100)

 

(51 row(s) affected)Table 'TestTable'. Scan count 51, logical reads 153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


And for the IN Down 25%.


 

 

 

 

Conclusion:



To assume that the Primary Key will only be used for singleton selects is ďstrangeĒ Ė† and to assume that the PK will always be the clustered index is also unhelpful.†

In real world applications it is not uncommon to find wide Primary Keys and wide clustered indexes, all on wide tables.

Which ever way you cut it the clustered index is always the width of the entire row and io will reflect that.

Where the Clustered Primary Key is several columns wide itís more than likely youíll find range selects against it and here a covering non clustered index can save you considerable io.


Not all databases are the same and what holds true for one may certainly not hold true for another, but sweeping generalisations are a problem, especially when theyíre in a respected publication.


Itís also difficult for the DBA who has to try to explain to a Manager why whatís in such an article is wrong, I remember unsuccessfully spending over a year trying to explain what the lock waits counters actually represented to senior management because someone had read they actually meant something else and they had convinced themselves this was true Ė it wasnít !


© www.grumpyolddba.co.uk