Tuesday, August 15, 2017

All Them Logs

Oh yes, those dear transaction logs: can't live with them, can't live without them. Understanding how transaction logs work and how to manage them is the key to a happy DBA life. The transaction log is essentially a running record of every change made to your database (minimally logged operations, like bulk inserts under the Bulk Logged recovery model or inserts with trace flag 610 enabled, are the exception). In a disaster, on a database in Full Recovery mode, you can obtain a fully functioning database by first restoring the last full backup and then restoring each subsequent log backup. Under Bulk Logged, log backups still capture the extents touched by bulk operations, but you lose point-in-time restore within any log interval that contains one; that's seldom an issue in practice.

The nice thing about logs is they back up really fast. I can back up a day's worth of activity on a heavy production database in about 15 minutes. The nasty thing about logs is they restore rather slowly, as the engine has to redo all of that recorded activity; my 15-minute backup takes a good hour and a half to restore. The other nasty thing about log files is that if you rely on autogrow, you need to manage their size by shrinking them (and note that Instant File Initialization never applies to log files, so every growth is zero-initialized and slow). If the device hosting the log file for a database runs out of space, that database won't execute any more transactions. No bueno.

Hence it's a good idea to add a customized alert to Redgate's monitor when your log file expands beyond a certain size. I also use a little trick to "flip" between different copies of the log file: right-click your database, choose Properties, then Files. Add a couple of files of type "log", but leave their size small and turn off autogrowth. Now when your primary log file gets too large, turn autogrowth on for one of the small files and off for your primary. After the next log backup you should be able to shrink the primary log file down to nearly nothing (well, if there aren't long-pending transactions still out there). In this manner you occasionally swap between active log files as their size dictates.
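The flip can be scripted too. A minimal sketch, assuming a database named SalesDB with a primary log file SalesDB_log and a small spare SalesDB_log2 (all names and paths here are hypothetical):

```sql
-- Let the spare grow; freeze the primary (FILEGROWTH = 0 disables autogrow).
ALTER DATABASE [SalesDB] MODIFY FILE (NAME = N'SalesDB_log2', FILEGROWTH = 512MB);
ALTER DATABASE [SalesDB] MODIFY FILE (NAME = N'SalesDB_log',  FILEGROWTH = 0);

-- After the next log backup, reclaim the primary's space.
BACKUP LOG [SalesDB] TO DISK = N'X:\Backups\SalesDB_log.trn';
USE [SalesDB];
DBCC SHRINKFILE (N'SalesDB_log', 128);  -- target size in MB
```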

Another nifty trick is to allocate an extra spare log file to a different physical device (but leave it small without growth). That way in an emergency where your disk drive hosting the logs fills up, you can just click into properties and turn autogrowth on your spare log file. Once that's done, backup and shrink the primary logs.
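Provisioning that emergency spare might look like this; again a sketch with hypothetical names and paths. Keep it tiny with autogrowth off until the day you need it:

```sql
ALTER DATABASE [SalesDB]
ADD LOG FILE (
    NAME = N'SalesDB_log_spare',
    FILENAME = N'E:\SQLLogs\SalesDB_log_spare.ldf',  -- a different physical drive
    SIZE = 64MB,
    FILEGROWTH = 0);

-- In an emergency, just let it grow:
-- ALTER DATABASE [SalesDB] MODIFY FILE (NAME = N'SalesDB_log_spare', FILEGROWTH = 512MB);
```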

Of course, if you don't want to hassle with the logs, don't need point-in-time recovery, and don't need them for log-shipping to a synchronized database, just leave your database on Simple recovery. You can then shrink the logs at your leisure (usually after a full database backup) and you won't have to worry about backing up the logs themselves.

Thursday, July 20, 2017

Fixing a Slow Query

I've written a full category of posts on how to generally optimize your database instance, indices, and storage. Quite frequently though the DBA finds themselves responding to a specific performance concern: this one query is running especially slowly now. It used to run fine, but now it's timing out. I can't even begin to tell you how frequently this happens. The usual cause is one of two things: blocking, or a stale cached plan.

Blocking is quite easy to ferret out; this wrapper around sp_who2 will rapidly show you any issues:

CREATE TABLE #sp_who2 (SPID INT, Status VARCHAR(255),
      Login VARCHAR(255), HostName VARCHAR(255),
      BlkBy VARCHAR(255), DBName VARCHAR(255),
      Command VARCHAR(255), CPUTime INT,
      DiskIO INT, LastBatch VARCHAR(255),
      ProgramName VARCHAR(255), SPID2 INT,
      REQUESTID INT)

INSERT INTO #sp_who2 EXEC sp_who2

SELECT DISTINCT SPID, Status, Login, HostName, BlkBy, DBName, Command, LastBatch, ProgramName
FROM #sp_who2
WHERE Command NOT IN ('AWAITING COMMAND', 'TASK MANAGER', 'BRKR TASK')
AND DBName IS NOT NULL
AND SPID <> @@SPID

DROP TABLE #sp_who2

Naturally your favorite instance monitor (such as Redgate's) will also show blocking. Why would blocking suddenly start happening? Typically a developer writes a new long-running stored proc (or sometimes a query that pulls data across several large tables) and doesn't use the NOLOCK hint. NOLOCK isn't always the appropriate prescription, but it's a reasonable choice for reporting queries that run clear of the production update cycle and can tolerate dirty reads (NOLOCK is equivalent to the READ UNCOMMITTED isolation level).
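As a reminder of what the hint looks like, a minimal sketch with hypothetical table and column names:

```sql
-- NOLOCK reads uncommitted data, so use it only where dirty reads are acceptable.
SELECT o.OrderID, o.Status
FROM dbo.Orders AS o WITH (NOLOCK)
WHERE o.OrderDate >= '2017-08-01';
```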

When a long-running stored proc is blocking, you'll likely need to get a developer involved for troubleshooting. Frequently it makes sense to create batches (of, say, 10,000 records each) and commit in smaller parcels.
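The batching pattern can be sketched like this (the table name and cutoff date are hypothetical):

```sql
-- Deleting in parcels of 10,000 keeps locks short and the log manageable.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.AuditLog
    WHERE LoggedAt < '2017-01-01';
    SET @rows = @@ROWCOUNT;   -- loop ends when a batch deletes nothing
END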

By far the most frequent cause of long-running timeouts, however, relates to how developers create code. When they start on a project the source tables are small, so when they run the query the first few times the plan engine says, hey, I can full scan this; many of the joins are small enough to be cached completely in memory. Once the tables have been up and running a few months and have accumulated several hundred thousand rows, the old way of filtering and joining is naturally too slow for the engine.

I do read of DBAs who take the simple step of clearing the plan cache, but I prefer a more nuanced approach. Most of the time what works best is some actual inspection of the SQL statement, pulling it apart to deduce what new indices would help. The objective is to create an environment where the engine only has to scan or seek indices, without going to the base tables. Here's a perfect example: the developer said it used to run in a couple of seconds originally but now exceeds the 30-second default .NET timeout:

   SELECT DISTINCT tv.VIN_ID AS 'VIN', d.DealerCode AS 'DealerCode',
        mc.[Segment] AS 'Segment', tv.UserLogin AS 'UserLogin', tv.Destination
    FROM [TargetedVehicle] tv  -- table name is illustrative; the original listing omitted the FROM clause
    INNER JOIN [MarketingList] mc ON mc.VIN = tv.VIN_ID AND mc.MonthID = tv.MonthID
    INNER JOIN dealer d ON d.DealerID = tv.DealerID
    WHERE tv.Destination = @dest AND tv.SubmitDateTime = @timeStamp
        AND tv.Campaign = 1
    ORDER BY tv.[VIN_ID], mc.[Segment]

When I looked at the query plan, the execution count estimates on the index seeks were rather small, in the low hundreds. When I checked the base tables, though, they had hundreds of thousands of rows. The query designer didn't alert to any missing indices, but clearly something was awry here. Time for a deeper inspection.

First, you have to think like you're the SQL engine. The most restrictive part of the selection process on this query is the where clause, so if you can create an index using the columns of the where clause that will certainly help.

CREATE NONCLUSTERED INDEX [ix_TargetedVehicle_Dest] ON [dbo].[TargetedVehicle]  -- index and table names are illustrative
    ([SubmitDateTime] ASC,
    [Destination] ASC,
    [Campaign] ASC)
INCLUDE ([UserLogin], [VIN_ID], [MonthID])

Notice I included the columns UserLogin, VIN_ID, and MonthID: VIN_ID and MonthID are there because the engine will need them when it joins to the MarketingList table, and UserLogin because the output results require it. Now that that part is fixed, consider the join. The engine will need these columns from the MarketingList table to join:

CREATE NONCLUSTERED INDEX [ix_MktgList_VinMonth] ON [dbo].[MarketingList]
    ([VIN] ASC,
    [MonthID] ASC)
INCLUDE ([Segment])

On this index I included the Segment column, because the output results will require it.

After building these two indices, the query executed in less than half a second. Problem solved.

Thursday, June 15, 2017

Reorgs, Rebuilds, and Fill

In an earlier post I chatted about the joy of creating indices; now we get a bit deeper into the topic to explore how you maintain the things. Then we'll circle back to discuss how you should have built them in the first place, so as to minimize your maintenance (there's nothing like 20-20 hindsight, eh?).

Let me preface the post with a disclaimer: some DBAs feel the best way to handle index maintenance is to use the wizard to create a maintenance plan for overnight or weekend defragmenting. I call this the lazy man's way of index maintenance, but it does have its place under many circumstances. If your employees use your database mainly during weekday working hours, it's quiescent at night or on weekends, and it's not gigantic (say 100 gigabytes or less), this automatic maintenance is a smart idea and works well enough. Otherwise, though, I have concerns that it's a bit too easy and hides the sins of bad index design in the first place.

For large databases that never sleep, you should take care to spend some time reorganizing indices online. If you're running Enterprise Edition you can even rebuild indices online. Certainly not all of them, however. First, learn to target the indices that need the most love and attention; I have several ways to do this. In SSMS's Object Explorer Details pane, after clicking on Tables, include the column for index size and sort the tables descending. Double-click the top table, double-click the index icon, and again include the column for index size. Sort descending. Fortunately SSMS remembers all this column setup, so now that you've done it, it will always be at the ready.

Every week or so, page through the topmost indices, right-click Properties, then click Fragmentation. Anything over 30 percent or so should be defragmented. How do you do this? Well, you have two options, reorganize or rebuild; more about that shortly.
Another good way to find indices to target is by running this query (the same one as in my earlier post on indices). The top rows in the resultset are the heaviest used so they need a bit tighter maintenance planning.
SELECT sys.objects.name AS object_name, sys.indexes.name AS index_name,
Case when (is_unique = 1 and is_primary_key = 1) then 'PK UNIQUE '
when is_unique = 1  then 'UNIQUE ' else '' end + sys.indexes.type_desc type_desc,
c.index_columns AS index_columns_key, s.used_page_count * 8   AS IndexSizeKB,
sys.dm_db_index_usage_stats.user_seeks, sys.dm_db_index_usage_stats.user_scans,
sys.dm_db_index_usage_stats.user_lookups, sys.dm_db_index_usage_stats.user_updates,
sys.dm_db_index_usage_stats.last_user_seek, sys.dm_db_index_usage_stats.last_user_scan,
sys.dm_db_index_usage_stats.last_user_lookup, sys.dm_db_index_usage_stats.last_user_update
FROM sys.objects JOIN sys.indexes ON sys.indexes.object_id=sys.objects.object_id
JOIN (SELECT distinct
object_id, index_id, stuff((SELECT ','+col_name(object_id,column_id ) as 'data()'
FROM sys.index_columns t2
WHERE t1.object_id =t2.object_id and t1.index_id = t2.index_id
FOR XML PATH ('')),1,1,'') as 'index_columns'
FROM sys.index_columns t1
) c ON c.index_id = sys.indexes.index_id AND c.object_id = sys.indexes.object_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats
ON sys.indexes.index_id=sys.dm_db_index_usage_stats.index_id
AND sys.indexes.object_id=sys.dm_db_index_usage_stats.object_id
JOIN sys.dm_db_partition_stats  AS s
ON s.[object_id] = sys.indexes.object_id AND s.index_id = sys.indexes.index_id
WHERE sys.objects.type='u'                 --Only UserTables will be selected
AND sys.indexes.type_desc  <> 'HEAP'       --Only indexes will appear, not HEAP tables
AND sys.indexes.type_desc like '%NONCLUSTER%'
ORDER BY 6 desc
Now that you're ready to fix these puppies, you can reorganize or rebuild. A reorganize unshuffles the b-tree using the free space you left by allowing a fill factor (ahhh, you didn't allow a fill factor? More on that in a moment). The nice thing about a reorganize is you can interrupt it at any time without incurring a rollback, and it doesn't interfere much with existing SQL activity. Option two is a rebuild. Unless you have Enterprise Edition (which allows the ONLINE option), a rebuild will block or deadlock other queries against the table for its duration. You do, however, have to do a rebuild if you're changing the fill factor (so, back to that again).

Now you see the value of a fill factor: it leaves extra vacant space in the index pages for growth, so your indices don't fragment as quickly. That allocated slack is also what makes the online reorganization effective. How much fill should you allow? I start by setting the fill factor to 90 (in other words, 10 percent vacant space). Then when I'm checking for fragmentation I may adjust the fill lower, depending on the page-fullness figure. Usually I split the difference: for example, if my present fill factor is 90 and the fragmented index shows 70 percent page fullness, I'll schedule a rebuild to set the fill factor down to 80.
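Both maintenance options in sketch form, with hypothetical index and table names:

```sql
-- Reorganize: online, interruptible, works within the existing fill-factor slack.
ALTER INDEX [ix_TargetedVehicle_Dest] ON [dbo].[TargetedVehicle] REORGANIZE;

-- Rebuild: required when changing the fill factor; add ONLINE = ON
-- only if you're on Enterprise Edition.
ALTER INDEX [ix_TargetedVehicle_Dest] ON [dbo].[TargetedVehicle]
    REBUILD WITH (FILLFACTOR = 80);
```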

Keeping your indices happy will make your developers and your managers happy, and it's a pleasant enough diversion for a daytime job.

Tuesday, May 16, 2017

Mid-Sized Databases

Most of my problem indices happen on mid-size databases (MSDs), in the range of 100 GB or so. Small databases rebuild their indices quickly; giant databases tend either to get reloaded and reindexed offline or to have limited exposure to random query plans. The MSDs, though, combine high random access, high required availability, and a large quantity of fairly large indices.

Several factors contribute to MSD index-maintenance woes. For one, MSDs often run on the Standard Edition of SQL Server, so you can't rebuild indices with the ONLINE option. Second, MSDs are large enough that rebuilding an index takes a fair amount of time. Finally, MSD indices often get built by developers who forget to include a fill factor.

Obviously, at your first opportunity you'll schedule some jobs to rebuild the fragmented indices with an adequate fill factor. More importantly, though, you should find time to meet with the developers and encourage them to use a fill factor (when appropriate) on indices where the referenced fields are clearly loading with out-of-sequence data. They'll often have no idea what factor to use, so it's easiest to specify a safe fixed fill, say 85 or 90 percent; you can adjust it lower if you find an index fragmenting exceedingly fast.

Once you've got your fill allocated, the next challenge is finding the time to run index reorganization on a regular basis. Although a reorg won't lock tables or block other spids, it does take a fair amount of resources. If you've got a nice chunk of quiet time on the weekend you can schedule a big reorg job then, naturally, but if your system is always active you might like this little trick.

First build a table within your MSD to hold the reorganization SQL statements; I use something like this:
CREATE TABLE [IndexReorgScripts](
[ReorgSQL] [varchar](350) NOT NULL,
[RunDayNum1] [smallint] NULL,
[RunDayNum2] [smallint] NULL) 
Now from a connection to your MSD, run this query:
INSERT INTO IndexReorgScripts (ReorgSQL)
SELECT 'ALTER INDEX [' + [name] + '] ON [dbo].[' + OBJECT_NAME(object_id) + '] REORGANIZE'
FROM sys.indexes
WHERE type_desc = 'NONCLUSTERED'
AND fill_factor > 0
AND is_disabled = 0
This builds all the statements to reorganize the indices. The two RunDayNum columns are for you to set to the days of the month you want each index reorganized -- I made two columns because I reorganize my indices on average twice a month. Try to spread the days broadly across the various tables.
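One way to spread those days out, as a sketch: hash each statement to a day in the first half of the month, with the second run two weeks later.

```sql
-- CHECKSUM gives a cheap, stable hash of the statement text;
-- days land in 1..14 and 15..28 so every index gets two runs a month.
UPDATE IndexReorgScripts
SET RunDayNum1 = ABS(CHECKSUM(ReorgSQL)) % 14 + 1,
    RunDayNum2 = ABS(CHECKSUM(ReorgSQL)) % 14 + 15;
```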

Finally, schedule a SQL job to run this script every evening:
DECLARE @ReorgStatement varchar(350)
DECLARE @ThisDayOfMonth AS smallint
SET @ThisDayOfMonth = DAY(GETDATE())

DECLARE IndicesToReorg CURSOR FOR
  SELECT [ReorgSQL]
  FROM [IndexReorgScripts]
  WHERE [RunDayNum1] = @ThisDayOfMonth
     OR [RunDayNum2] = @ThisDayOfMonth

OPEN IndicesToReorg;
FETCH NEXT FROM IndicesToReorg INTO @ReorgStatement
WHILE @@FETCH_STATUS = 0
BEGIN
   EXEC (@ReorgStatement)
   FETCH NEXT FROM IndicesToReorg INTO @ReorgStatement
END
CLOSE IndicesToReorg;
DEALLOCATE IndicesToReorg;
Now you're all set to check in every morning and find another handful of indices got reorganized! The method isn't foolproof; you'll occasionally need to update the table with indices folks have added, or remove those that got disabled. Still, though, quite a lifesaver.

Monday, April 17, 2017

Grokking Query Plans

In an earlier post I gave you some good clues for identifying the problematic queries on your instance. After you've finished the initial review in the Designer to resolve bad joins and poorly formed filters, click the "Display Estimated Execution Plan" button. Right off the bat you may see a green missing-index warning. If the projected savings are high, stop and build that index first, before you continue examining the plan.

There are a few things to know when reviewing an execution plan. First, check the width (the thickness) of the arrows joining the elements: thicker arrows mean more rows flowing through. Next, examine the percentage shown underneath each element; these are estimates, but they indicate how much of the total cost each step carries. Work from the bottom right corner of the graph to the upper left corner (for some reason the engine displays the actual dataflow sequence right to left, bottom to top).

Generally "scans" are far worse than "seeks," and operations on tables are slower than the same operations on an index. The one to watch for is a clustered index scan, which is in actuality a full table scan (the clustered index is the table itself, stored in key order).

To look at it strictly from the engine's viewpoint: the elapsed time for access depends on the number of rows it has to read, multiplied by the number of times it needs to read them, multiplied by the number of pages involved (driven by the combined width of the columns it must read, and perhaps write to tempdb). From that viewpoint, you want to minimize read and write lengths, counts, and widths.

Hover your mouse over one of the query plan boxes and it will show you the estimated number of rows it expects to read, and how often it thinks it needs to repeat the operation (the number of executions). Table scans (and clustered index scans) of wide tables with many rows should be avoided.

Even with all its smarts, the plan engine is only about half as bright as you. The optimizer won't always suggest a helpful index even when you could easily create one. If you examine the criteria in the where clause and the way the tables are joined, you can almost always come up with an effective index encapsulating what the engine needs for its filter and joins (thus turning scans into seeks). Furthermore, if the where clause has a range of selection (for example "> getdate() - 7"), you can build a filtered index that will be even faster. One caveat: a filtered index predicate must compare against constants, so a rolling date window means periodically recreating the index with a fresh literal.
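A filtered index along those lines might look like this; the names and the date literal are hypothetical:

```sql
-- The WHERE clause keeps the index tiny by covering only recent rows.
CREATE NONCLUSTERED INDEX [ix_Orders_Recent]
ON [dbo].[Orders] ([OrderDate], [Status])
INCLUDE ([OrderID])
WHERE [OrderDate] > '2017-08-08';
```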

The primary objective is to narrow the length and width of your resultset as early in the dataflow as possible, to reduce the stress of joins further up the line. Many times you can do this with a nested subquery, or by splitting the keys of the targeted records out into a temp table as a first step. In other words, delay adding in the fat wide rows until the end of the dataflow, after you've applied the filters that determine the count of rows you'll be returning.
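The keys-first pattern can be sketched like so, with hypothetical tables: isolate the narrow key set early, then join the wide rows in at the end.

```sql
-- Step 1: capture just the keys of the rows we'll return.
SELECT o.OrderID
INTO #TargetKeys
FROM dbo.Orders AS o
WHERE o.OrderDate > '2017-08-08' AND o.Status = 'Open';

-- Step 2: join the fat columns in last, against the small key set.
SELECT o.*, c.CustomerName, c.Address
FROM #TargetKeys AS k
JOIN dbo.Orders AS o ON o.OrderID = k.OrderID
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

DROP TABLE #TargetKeys;
```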

Spend a bit of occasional love reviewing the worst queries, and if the fix requires more than adding an index, pass your recommendations back to the development staff so they can learn and improve going forward.