One hotly disputed topic in SAN performance is the matter of contention.

This may materialise in the fabric due to oversubscription or excessive fan-in. A fibre channel fabric is a network which connects servers and storage through switches, just as an Ethernet network connects us to the internet and to each other in the office.

Ultimately there will be fewer ports on the actual storage than there are servers, and this is where problems may arise.

 

[ Drawing1.jpg - simplified core-edge fabric diagram ]

 

This is a highly simplified drawing illustrating the principles of a core-edge network. There are 10 servers, each with, let us say, 2 x 4GB HBAs. Network speeds always take the lowest common denominator, so the two connections into the storage will also be at 4GB. ( This is an oversimplification. )

Essentially there's more potential bandwidth than supported by the storage; the best example I can think of might be the security scans at an airport: 20 check-in desks but only 2 body scanners.
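Putting rough numbers on that ( my own back-of-an-envelope sum, using the figures in the drawing ): 10 servers x 2 HBAs x 4GB = 80GB of potential host-side bandwidth, against 2 x 4GB = 8GB into the storage, so a fan-in of roughly 10:1 before any other traffic is even considered.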

 

An area of contention can be the actual disks or LUNs. There are different ways of configuring SAN storage: LUNs may be mapped to physical sets of disks or mapped virtually to an unspecified number of spindles. ( We won't go into this here. ) If you share disks/LUNs there is always the possibility of contention. You can try this at home if you've created more than one partition on your laptop or PC hard drive: just run a job or task, then run it again whilst copying a large file (or files) to the other partition.

 

It is possible that the backplane bandwidth may not be sufficient. Most storage has a number of ports; these might be allocated as 8GB, 4GB and 1GB and may relate to types of disks or certain trays. 8 x 8GB ports suggests 64GB of bandwidth, however this may not be the case - check with your friendly vendor.

 

Finally, the HBAs and switches may suffer due to buffering. SQL Server tends to push out lots of small IO, which is different from a file server, and there may be latency within the HBA and/or switches.
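As a rough way to see that latency from the SQL Server side ( not part of my test harness, just a sketch using the standard virtual file stats DMV ), something like this gives the average read and write latency per database file since the instance last started:

-- Approximate IO latency per database file, derived from SQL Server's
-- cumulative virtual file stats (figures are totals since instance start-up).
SELECT  DB_NAME(vfs.database_id)                              AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;

Latencies that climb while the application load stays flat are a reasonable hint that something else is competing for the SAN.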

 

So how can you tell? Well, I've been testing by running a job at regular intervals. It's up to you what you choose, but ideally it should provide consistent run times and results, should be repeatable and ideally portable. It should, in my view, be a SQL Server test using SQL code.
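To make that concrete, here is a minimal sketch of the sort of thing I mean. The object names ( dbo.SANTestLog, dbo.usp_SANTest, dbo.SANTestData ) are hypothetical and the workload is deliberately crude; the point is simply a fixed piece of work whose duration gets logged on every run:

-- Log table for the timed runs (hypothetical names throughout).
CREATE TABLE dbo.SANTestLog
(
    run_start     datetime NOT NULL,
    run_end       datetime NOT NULL,
    duration_sec  int      NOT NULL
);
GO

-- The test itself: a fixed, repeatable chunk of work, timed and logged.
-- dbo.SANTestData is assumed to be a static, reasonably large table so that
-- each run does the same amount of IO.
CREATE PROCEDURE dbo.usp_SANTest
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @start datetime;
    SET @start = GETDATE();

    DECLARE @rows bigint;
    SELECT @rows = COUNT_BIG(*)
    FROM   dbo.SANTestData WITH (INDEX(0));   -- force a scan of the base table

    INSERT dbo.SANTestLog ( run_start, run_end, duration_sec )
    VALUES ( @start, GETDATE(), DATEDIFF(second, @start, GETDATE()) );
END;
GO

Schedule the procedure hourly with SQL Agent and the log table builds up a picture like the one below. Bear in mind that if the test data sits entirely in the buffer cache the run will barely touch the SAN, so the table needs to be larger than memory ( or the cache cleared, with care, on a test box ).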

 

Here's what you might see as results from an hourly test; red designates failure due to an overrun of time.

The results are hypothetical, but one might deduce that 18:00 shows users going home, the red areas may indicate when backups occur, and the 13:00 run is better; lunchtime maybe?

If your application is 7 x 24 then this type of result is bad news.

Bear in mind this is from a test, so the results themselves are not affected by lunch breaks or people going home, but most likely the SAN is.

Some typical contention I have encountered over the years has come from Exchange; this manifested as increased IO latency with no change in application load. Yet another subject area!

 

Date - start time     Duration (h:m:s)
12/01/2009 08:05      00:34:35
12/01/2009 07:05      00:36:32
12/01/2009 06:05      00:33:42
12/01/2009 05:05      00:44:58
12/01/2009 04:05      00:37:19
12/01/2009 03:05      00:43:32
12/01/2009 02:05      00:54:59
12/01/2009 01:05      00:55:00
12/01/2009 00:05      00:54:59
11/01/2009 23:05      00:54:59
11/01/2009 22:05      00:54:59
11/01/2009 21:05      00:54:59
11/01/2009 20:05      00:46:09
11/01/2009 19:05      00:34:32
11/01/2009 18:05      00:23:43
11/01/2009 17:05      00:27:58
11/01/2009 16:05      00:28:39
11/01/2009 15:05      00:34:25
11/01/2009 14:05      00:32:45
11/01/2009 13:05      00:29:47
11/01/2009 12:05      00:33:36
11/01/2009 11:05      00:29:41
11/01/2009 10:05      00:33:51
11/01/2009 09:05      00:36:48
11/01/2009 08:05      00:31:22
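If the results are being written to a log table ( such as the hypothetical dbo.SANTestLog sketched above ), a simple aggregate by hour of day makes the good-hours / bad-hours pattern easier to spot over a longer period:

-- Average and worst run time per hour of day, across however many days
-- of test history have built up in the (hypothetical) log table.
SELECT  DATEPART(hour, run_start) AS run_hour,
        COUNT(*)                  AS runs,
        AVG(duration_sec)         AS avg_duration_sec,
        MAX(duration_sec)         AS worst_duration_sec
FROM    dbo.SANTestLog
GROUP BY DATEPART(hour, run_start)
ORDER BY run_hour;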