Wednesday, September 13, 2017

Cascade Blockage

Sometimes the answer to why a SQL process gets blocked relates to overall network dataflow. In these cases you may not be able to tease out the issue until some event stresses your server past a critical threshold. A real-world example will help to elucidate this.

We fulfill our eMail touchpoints through a service hosted by Rackspace. They communicate clicks, bounces, and opens back to a webservice on one of our servers; that webservice connects to a couple of databases on internal SQL servers. This is a fairly typical modern client-vendor arrangement. The webservice constantly runs around 90 transactions per second through the SQL database, a perfectly reasonable load.

One day a developer opened up a query that locked the eMail transaction table for a couple of minutes. The webservice began throwing timeout messages to our internal notifications queue. Even after the developer closed their session the timeout notifications continued from the webservice for a couple more hours.

When I pulled up sp_who2 there were hundreds of connections to SQL from the webservice, where normally during the day I might glimpse one or two. After a couple of hours the problem abated. When I awoke the next morning my subconscious had pieced together the problem, along with this nice metaphor.

Suppose you're driving along on the 101 freeway with light traffic and way up ahead you see some taillights glisten. In anticipation you lift your foot off the gas and let the car slow down on its own. Nearby traffic does the same, and so there is a brief slowing that passes quickly as you resume your normal speed. Next scenario: as you're driving along the 101 in light traffic the fog starts to thicken, so you slow down. Suddenly, taillights up ahead! You move to the brake, slow down, and then once you're past the distraction resume your slightly slower speed. A modest but persistent slow spot lingers at the original location.

Now take this scenario: on the 101 in very heavy traffic, the fog thickening. The car in front of you slows, you brake, the car behind you doesn't have much room so they brake hard, and the car behind that hits them. Accident. Now that traffic has stopped, a chain reaction of accidents keeps rippling backward. One brief constriction, and now the blockage will be there for many hours.

When I examined the queries that were executing from the webservice I noticed they did a SELECT TOP(1) * of the row in the event table matching the key for the transaction. Unfortunately one of the fields in that row was an XML copy of the actual eMail we sent, usually around 10 MB in size (you know, graphics and all). At the typical rate of 90 transactions per second that works out to roughly 900 MB, or a bit over 7 gigabits, every second: just about the right amount to squeeze through the 10 gigabit switch between the two servers.

As soon as somebody fired off another large query on the same table, though, the switch temporarily got overwhelmed, and transactions that would normally clear in real time started to queue up. Once multiple transactions were running against the same table, the extra load began slowing their execution. A constriction in heavy traffic.

The solution was to change the webservice to read from a view that excluded the XML column. But frankly, if the problem had never happened, I would never have specifically looked for it, as it was flying just below the radar.
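For illustration only, the fix amounted to something like the view below; the table and column names here are hypothetical stand-ins rather than our actual schema. The webservice then selects from the view instead of the base table, so the giant XML payload never crosses the wire.

CREATE VIEW dbo.vw_EmailEvent_Slim
AS
-- Everything the webservice actually needs, minus the ~10 MB XML payload column
SELECT EventID, TransactionKey, EventType, EventDateTime, UserLogin
FROM dbo.EmailEvent;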

Tuesday, August 15, 2017

All Them Logs

Oh yes, those dear transaction logs: can't live with them, can't live without them. Understanding how transaction logs work and how to manage them is the key to a happy DBA life. The transaction log is basically a running record of every change made to your database (except for minimally logged operations such as bulk inserts in Bulk-Logged mode, or inserts made with trace flag 610 on). In a disaster, on a database set to Full Recovery, you can get back to a fully functioning database by first restoring the last complete backup and then restoring any subsequent log backups. You may of course still have to manually re-run any bulk imports executed after the last database backup, but that's seldom an issue.
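As a refresher, the restore chain looks something like this (the file names and paths are just placeholders):

RESTORE DATABASE SalesDb FROM DISK = N'D:\Backups\SalesDb_full.bak' WITH NORECOVERY;
RESTORE LOG SalesDb FROM DISK = N'D:\Backups\SalesDb_log_01.trn' WITH NORECOVERY;
-- the final log restore brings the database back online
RESTORE LOG SalesDb FROM DISK = N'D:\Backups\SalesDb_log_02.trn' WITH RECOVERY;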

The nice thing about logs is that they back up really fast (the backup is, after all, just a sequential copy of the log records). I can back up a log covering a day's worth of activity on a heavy production database in about 15 minutes. The nasty thing about logs is that they restore rather slowly, since the engine has to replay all that activity: my 15-minute backup takes a good hour and a half to restore. The other nasty thing about log files is that if you simply let them autogrow, you need to manage their size by shrinking them. If the device hosting the log files for a database runs out of space, that database won't execute any more transactions. No bueno.

Hence it's a good idea to add a customized alert to RedGate monitor for when your log file expands beyond a certain size. I also use a little trick to "flip" between different copies of the log file: right-click your database, choose Properties, then Files. Add a couple of files of type "log" but leave their size small and turn off autogrowth. Now when your primary log file gets too large, turn autogrowth on for one of the small files and off for your primary. After the next log backup you should be able to shrink the primary log file down to nearly nothing (well, as long as there aren't long-running transactions still pending). In this manner you keep occasionally swapping between active log files as their size dictates.
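Here is a rough sketch of the flip in T-SQL rather than through the properties dialog; the database and file names are hypothetical:

-- One-time setup: add a small spare log file with autogrowth turned off
ALTER DATABASE SalesDb ADD LOG FILE
    (NAME = N'SalesDb_log2', FILENAME = N'L:\Logs\SalesDb_log2.ldf',
     SIZE = 64MB, FILEGROWTH = 0);

-- When the primary log gets too large, swap which file is allowed to grow
ALTER DATABASE SalesDb MODIFY FILE (NAME = N'SalesDb_log2', FILEGROWTH = 512MB);
ALTER DATABASE SalesDb MODIFY FILE (NAME = N'SalesDb_log', FILEGROWTH = 0);

-- After the next log backup, shrink the now-idle primary log file (target size in MB)
DBCC SHRINKFILE (N'SalesDb_log', 64);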

Another nifty trick is to allocate an extra spare log file on a different physical device (but leave it small, with no growth). That way, in an emergency where the disk drive hosting your logs fills up, you can just click into Properties and turn on autogrowth for the spare log file. Once that's done, back up and shrink the primary logs.

Of course if you don't want to hassle with the logs, don't need point-in-time recovery, and don't need them for log-shipping to a synchronized database, just leave your database on Simple recovery. You can then shrink the logs at your leisure (usually after a full database backup) and you won't have to worry about backing up the logs themselves.
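Switching is a one-liner (the database name is a placeholder), run once you've made peace with giving up point-in-time recovery:

ALTER DATABASE SalesDb SET RECOVERY SIMPLE;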

Thursday, July 20, 2017

Fixing a Slow Query

I've written a full category of posts on how to optimize your database instance, indices, and storage in general. Quite frequently though the DBA finds themselves responding to a specific performance concern: this one query is running especially slowly now. It used to run fine before, but now it is timing out. I can't even begin to tell you how frequently this happens. The usual cause is one of two things: blocking, or a stale cached plan.

Blocking is quite easy to ferret out; this filtered wrapper around sp_who2 will rapidly show you any issues:

-- Capture the output of sp_who2 into a temp table so it can be filtered
CREATE TABLE #sp_who2 (SPID INT, Status VARCHAR(255),
      Login VARCHAR(255), HostName VARCHAR(255),
      BlkBy VARCHAR(255), DBName VARCHAR(255),
      Command VARCHAR(255), CPUTime INT,
      DiskIO INT, LastBatch VARCHAR(255),
      ProgramName VARCHAR(255), SPID2 INT,
      REQUESTID INT)

INSERT INTO #sp_who2 EXEC sp_who2

-- Show only interesting sessions; a non-blank BlkBy column points at the blocking SPID
SELECT DISTINCT SPID, Status, Login, HostName, BlkBy, DBName, Command, LastBatch, ProgramName
FROM #sp_who2
WHERE Command NOT IN ('AWAITING COMMAND', 'TASK MANAGER', 'BRKR TASK')
  AND DBName IS NOT NULL
  AND SPID <> @@SPID

DROP TABLE #sp_who2

Naturally your favorite instance monitor (such as RedGate) will also show blocking. Why would blocking suddenly start happening? Well, typically a developer writes a new long-running stored proc (or sometimes a query that pulls data across several large tables) and doesn't use the NOLOCK hint. NOLOCK isn't always the appropriate prescription, but it is a reasonable choice for reporting queries that stay clear of the production update cycles (NOLOCK is essentially the READ UNCOMMITTED isolation level applied per table).
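Just to make the two flavors concrete, here's a sketch using the transaction table from the example later in this post:

-- Per-table hint
SELECT tv.VIN_ID, tv.Destination
FROM dbo.TXN3_VIN tv WITH (NOLOCK)
WHERE tv.Campaign = 1

-- Or ask for dirty reads for the whole session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT tv.VIN_ID, tv.Destination
FROM dbo.TXN3_VIN tv
WHERE tv.Campaign = 1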

When a long-running stored proc is the blocker you will likely need to get a developer involved for troubleshooting. Frequently it makes sense to break the work into batches (of, say, 10,000 records each) and loop over them, committing in smaller parcels, as sketched below.
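A minimal sketch of that batching pattern, against a purely hypothetical archive purge:

-- Work in 10,000-row batches so locks stay short and each batch commits on its own
DECLARE @rows INT = 1

WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.EmailEventArchive
    WHERE EventDateTime < DATEADD(YEAR, -2, GETDATE())

    SET @rows = @@ROWCOUNT   -- the loop ends when a batch touches nothing
END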

By far the most frequent cause of long-running timeouts, however, relates to how developers create code. When they start on a project the source tables are small, so when they run the query the first few times the plan engine says, hey, I can full scan this; many of the joins are small enough to be cached completely in memory. Once the tables have been up and running a few months and have accumulated several hundred thousand rows, the old way of filtering and joining is naturally too slow for the engine.

I do read of DBAs who take the simple step of clearing the plan cache, but I prefer a more nuanced approach. Most of the time what works best is some actual inspection of the SQL statement, pulling it apart to deduce what new indices would help. The objective is to create an environment where the engine only has to scan or seek indices without going to the base tables. Here's a perfect example... the developer said it used to run in a couple of seconds originally but now exceeds the 30-second default .NET timeout:

   SELECT DISTINCT tv.VIN_ID AS 'VIN', d.DealerCode AS 'DealerCode',
        mc.[Segment] AS 'Segment', tv.UserLogin AS 'UserLogin', tv.Destination
    FROM [TXN3_VIN] tv (NOLOCK)
    INNER JOIN [MarketingList] mc ON mc.VIN = tv.VIN_ID AND mc.MonthID = tv.MonthID
    INNER JOIN dealer d on d.DealerID = tv.DealerID
    WHERE tv.Destination = @dest  AND tv.SubmitDateTime = @timeStamp
        and tv.Campaign = 1
    ORDER BY tv.[VIN_ID], mc.[Segment]

When I looked at the query plan the execution count estimates on the index seeks were rather small, in the low hundreds. When I checked the base tables, though, they had hundreds of thousands of rows. The query designer didn't alert me to any missing indices, but clearly something was awry here. Time for a deeper inspection.

First, you have to think like you're the SQL engine. The most restrictive part of the selection process on this query is the where clause, so if you can create an index using the columns of the where clause that will certainly help.

CREATE NONCLUSTERED INDEX [ix_TXN3_subDestCamp] ON [dbo].[TXN3_VIN]
    ([SubmitDateTime] ASC,
    [Destination] ASC,
    [Campaign] ASC)
INCLUDE ( [UserLogin], [VIN_ID], [MonthID])
WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 85) ON [PRIMARY]

Notice I included the columns UserLogin, VIN_ID, and MonthID: VIN_ID and MonthID are there because the engine will need them when it joins to the MarketingList table, and UserLogin is there because the output results will require it. Now that that part is fixed, consider the join. It will need these columns from the MarketingList table:

CREATE NONCLUSTERED INDEX [ix_MktgList_VinMonth] ON [dbo].[MarketingList]
    ([VIN] ASC,
    [MonthID] ASC)
INCLUDE ( [Segment]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 85) ON [PRIMARY]

On this index I included the Segment column, because the output results will require it.

After building these two indices, the query executed in less than half a second. Problem solved.

Thursday, June 15, 2017

Reorgs, Rebuilds, and Fill

In an earlier post I chatted about the joy of creating indices; now we get a bit deeper into the topic to explore how you maintain the things. Then we'll circle back to discuss how you should have built them in the first place, so as to minimize your maintenance (there's nothing like 20-20 hindsight, eh?).

Let me preface the post with a disclaimer: some DBAs feel the best way to handle index maintenance is to use the wizard to create a maintenance plan for overnight or weekend defragmentation. I call this the lazy man's way of index maintenance, but it does have its place under many circumstances. If your employees use your database mainly during weekday working hours, it is quiescent at night (or on the weekend), and it's not gigantic (say 100 gigabytes or less), this automatic maintenance is a smart idea and works well enough. Otherwise, though, I have concerns that it is a bit too easy and hides the sins of bad index design in the first place.

For large databases that never sleep you should plan on spending some time reorganizing indices online. If you are running Enterprise Edition you can even rebuild indices online. Certainly not all of them, however. First, learn to target the indices that need the most love and attention; I have several ways to do this. In the Object Explorer Details window of SSMS, after clicking on Tables, include the column for index size and sort the tables descending. Double-click the top table, double-click the Indexes folder, and again include the column for index size. Sort descending. Fortunately SSMS remembers all this column setup, so now that you've done it, it will always be at the ready.

Every week or so, page through the topmost indices, right-click for Properties, then click Fragmentation. Anything in the 30 to 40 percent range or higher should be defragmented. How do you do this? Well, you have two options, reorganize or rebuild, but more about that shortly.
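If you'd rather not click through the UI, this little sketch against sys.dm_db_index_physical_stats surfaces the same fragmentation numbers (the 30 percent cutoff just mirrors the rule of thumb above):

-- Indices in the current database at or above roughly 30 percent fragmentation
SELECT OBJECT_NAME(ips.object_id) AS table_name, i.[name] AS index_name,
       ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent >= 30
  AND ips.page_count > 1000   -- skip tiny indices that aren't worth the trouble
ORDER BY ips.avg_fragmentation_in_percent DESC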
Another good way to find indices to target is by running this query (the same one as in my earlier post on indices). The top rows in the resultset are the heaviest used, so they need a bit tighter maintenance planning:
SELECT sys.objects.name AS object_name, sys.indexes.name AS index_name,
Case when (is_unique = 1 and is_primary_key = 1) then 'PK UNIQUE '
when is_unique = 1  then 'UNIQUE ' else '' end + sys.indexes.type_desc type_desc,
c.index_columns AS index_columns_key, s.used_page_count * 8   AS IndexSizeKB,
sys.dm_db_index_usage_stats.user_seeks, sys.dm_db_index_usage_stats.user_scans,
sys.dm_db_index_usage_stats.user_lookups, sys.dm_db_index_usage_stats.user_updates,
sys.dm_db_index_usage_stats.last_user_seek, sys.dm_db_index_usage_stats.last_user_scan,
sys.dm_db_index_usage_stats.last_user_lookup, sys.dm_db_index_usage_stats.last_user_update
FROM sys.objects JOIN sys.indexes ON sys.indexes.object_id=sys.objects.object_id
JOIN (SELECT distinct
object_id, index_id, stuff((SELECT ','+col_name(object_id,column_id ) as 'data()'
FROM sys.index_columns t2
WHERE t1.object_id =t2.object_id and t1.index_id = t2.index_id
FOR XML PATH ('')),1,1,'') as 'index_columns'
FROM sys.index_columns t1
) c ON c.index_id = sys.indexes.index_id AND c.object_id = sys.indexes.object_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats
ON sys.indexes.index_id=sys.dm_db_index_usage_stats.index_id
AND sys.indexes.object_id=sys.dm_db_index_usage_stats.object_id
JOIN sys.dm_db_partition_stats  AS s
ON s.[object_id] = sys.indexes.object_id AND s.index_id = sys.indexes.index_id
WHERE sys.objects.type='u'                 --Only UserTables will be selected
AND sys.indexes.type_desc  <> 'HEAP'       --Only indexes will appear, not HEAP tables
AND sys.indexes.type_desc like '%NONCLUSTER%'
ORDER BY 6 desc
Now that you're ready to fix these puppies, you can reorganize or rebuild. A reorganize unshuffles the b-tree by using the free space you left by allowing a fill factor (ahhh, you didn't allow a fill factor? More on that in a moment). The nice thing about a reorganize is that you can interrupt it at any time without incurring a rollback, and it doesn't interfere much with existing SQL activity. Option two is a rebuild. Unless you have Enterprise Edition, though, a rebuild will block or deadlock other queries and may escalate to a table lock. You do however have to do a rebuild if you are changing the fill factor (so, back to that again).
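In T-SQL the two options look roughly like this; the index and table names are placeholders:

-- Online-friendly tidy-up; can be stopped at any time without a rollback
ALTER INDEX ix_Orders_OrderDate ON dbo.Orders REORGANIZE

-- Full rebuild; required when changing the fill factor
-- (ONLINE = ON needs Enterprise Edition, so drop it on Standard)
ALTER INDEX ix_Orders_OrderDate ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 90, ONLINE = ON)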

Now you see the value of a fill factor: it leaves extra space inside the b-tree pages for growth, so your indices don't fragment as quickly. The extra allocated space also provides the headroom for running the online reorganization. How much fill should you allow? I start by setting the fill factor to 90 (in other words, 10 percent vacant space). Then when I'm checking for fragmentation I may adjust the fill lower, depending on the page-fullness figure. Usually I split the difference: for example, if my present fill factor is 90 and the fragmented index shows 70 percent page fullness, I'll schedule a rebuild that sets the fill factor down to 80.

Keeping your indices happy will make your developers and your managers happy, and it's a pleasant enough diversion for a daytime job.

Tuesday, May 16, 2017

Mid-Sized Databases

Most of my problem indices happen on mid-size databases (MSDs), in the range of 100 GB or so. Small databases rebuild their indices quickly; giant databases tend either to get reloaded and reindexed offline or to have limited exposure to random query plans. The MSDs, though, tend to combine heavy random access, high availability requirements, and a large number of fairly large indices.

Several factors contribute to MSD index maintenance woes. For one, MSDs often run on the Standard Edition of SQL Server, so you can't rebuild the indices with the ONLINE option. Second, MSDs are large enough that rebuilding an index takes a fair amount of time. Finally, MSD indices often get built by developers who forget to include a fill factor.

Obviously at your first opportunity you will schedule some jobs to rebuild the fragmented indices with an adequate fill factor. More importantly, though, you should find some time to meet with the developers and encourage them to use a fill factor (when appropriate) on indices where the referenced fields clearly load with out-of-sequence data. They'll be clueless about what factor to use, so it's easiest to specify a safe fixed fill, say 85 or 90 percent... then you can adjust it lower if you find the index fragmenting exceedingly fast.
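For instance, something as simple as this is all you're asking them for (the table and column are made up):

-- Key column loads out of sequence, so leave 10 percent free space per page
CREATE NONCLUSTERED INDEX ix_Orders_CustomerRef ON dbo.Orders (CustomerRef)
WITH (FILLFACTOR = 90)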

Once you've got your fill allocated, the next challenge is finding the time to run index reorganizations on a regular basis. Although a reorg won't lock tables or block other spids, it does take a fair amount of resources. If you've got a nice chunk of quiet time on the weekend you can schedule a big reorg job then, naturally, but if your system is always active then you might like this little trick.

First build a table within your MSD to hold the reorganization SQL statements; I use something like this:
CREATE TABLE [IndexReorgScripts](
[ReorgSQL] [varchar](350) NOT NULL,
[RunDayNum1] [smallint] NULL,
[RunDayNum2] [smallint] NULL) 
Now from a connection to your MSD, run this query:
INSERT INTO IndexReorgScripts
        (ReorgSQL)
-- Build one REORGANIZE statement per nonclustered index (assumes tables live in the dbo schema)
SELECT 'ALTER INDEX [' + [name] + '] ON [dbo].[' + OBJECT_NAME(object_id) +
'] REORGANIZE WITH ( LOB_COMPACTION = ON )' AS ReorgSQL
FROM sys.indexes
WHERE type_desc = 'NONCLUSTERED'
AND fill_factor > 0
AND is_disabled = 0
This will build all the statements needed to reorganize the indices. The two RunDayNum columns are for you to set to a day of the month you want each index reorganized -- I made two columns because I reorganize my indices on average twice a month. Try to assign a broad range of days across the various tables; one way to do that is sketched below.
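One way to spread the days out, purely as a sketch: deal the scripts round-robin across the first 14 days of the month, with the second run landing two weeks later.

;WITH numbered AS (
    SELECT ReorgSQL, ROW_NUMBER() OVER (ORDER BY ReorgSQL) AS rn
    FROM IndexReorgScripts
)
UPDATE s
   SET RunDayNum1 = (n.rn % 14) + 1,    -- days 1 through 14
       RunDayNum2 = (n.rn % 14) + 15    -- same slot two weeks later
FROM IndexReorgScripts s
JOIN numbered n ON n.ReorgSQL = s.ReorgSQL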

Finally, schedule a SQL job to run this script every evening:
DECLARE @ReorgStatement varchar(350)
DECLARE @ThisDayOfMonth AS smallint
SET @ThisDayOfMonth = DAY(GETDATE())

-- Pick up only the reorganize scripts scheduled for today
DECLARE IndicesToReorg CURSOR FOR
SELECT [ReorgSQL]
  FROM [IndexReorgScripts]
  WHERE [ReorgSQL] LIKE 'ALTER INDEX%'
         AND ( [RunDayNum1] = @ThisDayOfMonth
               OR [RunDayNum2] = @ThisDayOfMonth )
OPEN IndicesToReorg;
FETCH NEXT FROM IndicesToReorg INTO @ReorgStatement
WHILE @@FETCH_STATUS = 0
BEGIN
   EXEC(@ReorgStatement)   -- run each reorganize statement in turn
   FETCH NEXT FROM IndicesToReorg INTO @ReorgStatement
END
CLOSE IndicesToReorg;
DEALLOCATE IndicesToReorg;
Now you're all set to check in every morning and find that another handful of indices got reorganized! The method isn't foolproof; you will need to occasionally update the table with additional indices that folks have added, or remove those that got disabled. Still, it's quite a lifesaver.