Tuesday, November 14, 2017

Memory Used

In an earlier post I mentioned some general guidelines for optimizing each instance of the SQL engine. Once you've got the memory set up it is quite interesting however to observe how the engine is actually using it. Run these two queries; they both show pretty much the same thing but you may find one more useful than the other.
SELECT
(CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State',
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
COUNT (*) AS 'Page Count'
FROM sys.dm_os_buffer_descriptors
GROUP BY [database_id], [is_modified]
ORDER BY [Page Count], [is_modified];
GO

SELECT count(*)AS cached_pages_count, (count(*) * 8)/1024 As Mbytes,  db_name(database_id)
FROM sys.dm_os_buffer_descriptors
GROUP BY db_name(database_id) ,database_id
ORDER BY cached_pages_count DESC;
GO
The amount of memory used naturally varies depending upon which queries are running at any particular time, but by monitoring this when you are under "memory pressure" you can easily identify the culprit database. This can then guide you whether it makes sense to reallocate that database to a different SQL instance, or perhaps allocate a deeper inspection to what queries are running against it.

To that end, you can run this script:
SELECT        session_id, text, requested_memory_kb,
    granted_memory_kb, used_memory_kb
FROM            sys.dm_exec_query_memory_grants MG
CROSS APPLY sys.dm_exec_sql_text(sql_handle) 
ORDER BY 3 DESC

that will show you the /current/ allocation for any running queries. If you need greater detail down at the table index level, you can run this script:
select
       count(*)as cached_pages_count, obj.name as objectname,
       ind.name as indexname, obj.index_id as indexid
from sys.dm_os_buffer_descriptors as bd
    inner join
    (
SELECT     p.object_id AS objectid, object_name(p.object_id) AS name, p.index_id, au.allocation_unit_id
FROM       sys.allocation_units au INNER JOIN
sys.partitions p ON au.container_id = p.hobt_id AND (au.type = 1 OR au.type = 3)      
        union all
SELECT     p.object_id AS objectid, object_name(p.object_id) AS name, p.index_id, au.allocation_unit_id
FROM       sys.allocation_units au INNER JOIN
sys.partitions p ON au.container_id = p.partition_id AND au.type = 2
    ) as obj
on bd.allocation_unit_id = obj.allocation_unit_id
left outer join sys.indexes ind on  obj.objectid = ind.object_id and  obj.index_id = ind.index_id
where bd.database_id = db_id() and bd.page_type in ('data_page', 'index_page')
group by obj.name, ind.name, obj.index_id
order by cached_pages_count desc

Monday, October 16, 2017

IO Stalls and Latency

One of the more useful optimizations you can run on a database instance if you are unfortunate enough to be on old fashioned spinning media (as opposed to SSDs) is to split up files to reduce the amount of stalls. These can be either database or log files, for either user or system databases, reads or writes. In an earlier post I gave a couple of queries you could run to get a quick gross overview, but now let's get further into the weeds and fine tune some of those slower files in detail. Run these queries:
select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_read / vfs.io_stall_read_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_reads, vfs.num_of_bytes_read, vfs.io_stall_read_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_read_ms > 0
order by 3

select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_written / vfs.io_stall_write_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_writes, vfs.num_of_bytes_written, vfs.io_stall_write_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_write_ms > 0
order by 3
This returns two resultsets, the first is problem stalls on reads, the second is problem stalls on writes. Do pay attention to the column showing number of bytes read and written; likely it's a waste of time to optimize the smaller files. Now that you know generally who your problem children are, you have a couple of matters of recourse. If the stalls are happening in a log file, then it's time to increase the number of log files or move them to a faster device. If the latency is in a table (a file of type "rows") then you need to dig a little deeper; run this query attached to the database that concerns you.
SELECT TableName = object_name(s.object_id),
       Reads = SUM(user_seeks + user_scans + user_lookups), Writes =  SUM(user_updates)
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i ON s.object_id = i.object_id
AND i.index_id = s.index_id
WHERE objectproperty(s.object_id,'IsUserTable') = 1
AND s.database_id = DB_ID()
GROUP BY object_name(s.object_id)
ORDER BY writes DESC
Of course change the Order By clause for whether you are researching the stalls on reads or on the writes. Likely the top couple of tables in this returned set should be spun off to their own filegroup with multiple files (or possibly their clustered index needs reorganizing -- check their fragmentation).

Wednesday, September 13, 2017

Cascade Blockage

Sometimes the answer to why a SQL process gets blocked relates to overall network dataflow. In these cases you may not be able to tease out the issue until some event stresses your server past a critical threshold.  A real world example will help to elucidate this.

We fulfill our eMail touchpoints through a service hosted by Rackspace. They communicate clicks, bounces, and opens back to a webservice on one of our servers; that webservice connects to a couple of databases on internal SQL servers. This is a fairly typical modern client-vendor arrangement. The webservice constantly runs around 90 transactions per second through the SQL database, a perfectly reasonable load.

One day a developer opened up a query that locked the eMail transaction table for a couple of minutes. The webservice began throwing timeout messages to our internal notifications queue. Even after the developer closed their session the timeout notifications continued from the webservice for a couple more hours.

When I pulled up sp_who2 there were hundreds of connections to SQL from the webservice, where normally during the day I might glimpse one or two. After a couple of hours the problem abated. When I awoke the next morning my subconscious had pieced together the problem, along with this nice metaphor.

Suppose you're driving along on the 101 freeway with light traffic and way up ahead you see some tailights glisten. In anticipation you remove your foot from the gas and let the car slow down on its own. Nearby traffic does the same, and so there is a brief slowing that passes quickly as you resume your normal speed. Next scenario: as you're driving along the 101 in light traffic the fog starts to thicken, so you slow down. Suddenly up ahead tailights! You switch to the brake, slow down, and then once you're past the distraction resume your slightly slower speed. There is a brief and persistent restricted slowing at the original location.

Now take this scenario: on the 101 in very heavy traffic, the fog thickening. The car in front of you slows, you brake, the car behind you doesn't have much room so they brake hard, and the car behind that hits them. Accident. Now that traffic has stopped a chain reaction of accidents keeps trickling back. One brief constriction, and now the blockage will be there for many hours.

When I examined the queries that were executing from the webservice I noticed they did a Select top(1) * of the row in the event table matching the key for the transaction. Unfortunately one of the fields in that row was an XML copy of the actual eMail we sent, usually around 10 MB in size (you know, graphics and all). At the typical rate of 90 transactions per second this was just about the right size to squeeze through the 10 gigabit switch between the two servers.

As soon as somebody fired off another large query on the same table though, the switch temporarily got overwhelmed, and transactions that would clear in real time suddenly were starting to queue up. Once multiple transactions were running against the same table the extra load began slowing their execution. A constriction in heavy traffic.

The solution was to change the webservice to use a view that excluded the XML column. But frankly, if the problem had never happened, I would never have specifically looked for it, as it was just below the radar.

Tuesday, August 15, 2017

All Them Logs

Oh yes those dear transaction logs: can't live with them, can't live without them. Understanding how transaction logs work and how to manage them is the key to having a happy DBA life. The transaction log is basically a copy of all SQL statements run against your database, (except for things like bulk inserts in Bulk Logged mode, or inserts made when traceflag 610 is on). In a disaster on a database set on Full Recovery you can obtain a fully functioning database by first restoring the last complete backup, and then restoring any subsequent log backups. You may still of course have to manually run any bulk imports executed after the last dB backup, but that's seldom an issue.

The nice thing about logs is they backup really fast (they are after all just a bunch of SQL). I can back up a log for a day's worth of activity on a heavy production database in about 15 minutes. The nasty things about logs is they restore rather slowly (as the engine has to re-execute all that SQL). My 15 minute backup takes a good hour and a half to restore. The other nasty thing about log files is, if you go with Instant File Initialization and autogrow, then you need to manage their size by shrinking them. If the device hosting your log files for a database runs out of space, then that database won't execute any more transactions. No bueno.

Hence it's a good idea to add a customized alert to RedGate monitor when your log file expands beyond a certain size. Also I use a little trick to "flip" between different copies of the log file: right click on your database, choose properties, and then files. Add a couple files of type "log" but leave their size small and turn off autogrowth. Now when your primary log file gets too large, turn autogrowth on one of the small files, and off on your primary. After the next log backup you should be able to shrink the primary log file down to nearly nothing (well if there aren't long pending transactions still out there). In this manner you keep occasionally swapping between active log files as their size dictates.

Another nifty trick is to allocate an extra spare log file to a different physical device (but leave it small without growth). That way in an emergency where your disk drive hosting the logs fills up, you can just click into properties and turn autogrowth on your spare log file. Once that's done, backup and shrink the primary logs.

Of course if you don't want to hassle with the logs, don't need point-in-time recovery, and don't need them for log-shipping to a synchronized database, just leave your database on Simple recovery. You can then shrink the logs at your own leisure (usually after a full dB backup) and you won't have to worry about backing up the logs themselves.

Thursday, July 20, 2017

Fixing a Slow Query

I've written a full category of posts on how to generally optimize your database instance, indices, and storage. Quite frequently though the DBA finds themselves responding to a specific performance concern: this one query is running especially slow now. It used to run fine before, but now it is timing out. I can't even begin to tell you how frequently this happens.  The usual cause is either one of two things: blocking, or an obsolete plan cache.

Blocking is quite easy to ferret out; this replay of a who2 will rapidly show you any issues:

CREATE TABLE #sp_who2 (SPID INT,Status VARCHAR(255),
      Login  VARCHAR(255),HostName  VARCHAR(255),
      BlkBy  VARCHAR(255),DBName  VARCHAR(255),
      Command VARCHAR(255),CPUTime INT,
      DiskIO INT,LastBatch VARCHAR(255),
      ProgramName VARCHAR(255),SPID2 INT,
      REQUESTID INT)
     
INSERT INTO #sp_who2 EXEC sp_who2

SELECT     distinct SPID, Status, Login, HostName, BlkBy, DBName, Command, LastBatch, ProgramName
FROM        #sp_who2
where Command not in ('AWAITING COMMAND', 'TASK MANAGER', 'BRKR TASK') and
DBName is not null
and SPID <> @@SPID

drop table #sp_who2

Naturally your favorite instance monitor (such as RedGate) will also show blocking. Why would blocking suddenly start happening? Well typically a developer writes a new long-running stored proc (or sometimes creates a query to pull all of the data across several large tables) and doesn't use the Nolock hint. Nolock isn't always the appropriate prescription, but it is a reasonable choice for reporting when intervening update production cycles are clear. (Nolock is like a read uncommitted).

When a long-running stored proc is blocking you likely will need to get a developer involved for troubleshooting. Frequently it makes sense to create batches (of say 10000 records each) and loop over the commits in smaller parcels.

By far the most frequent cause of long running timeouts however relates to how developers create code. When they start on a project the source tables are small, so when they run the query the first few times the plan engine says Hey, I can full scan this. Many of the joins are small enough to be cached completely in memory. Once the tables have been up and running a few months and have accumulated several hundred thousand rows the old way of filtering and joining are naturally too slow for the engine.

I do read of DBAs who take the simple step of clearing the plan cache, but I prefer a more nuanced approach. Most of the time what works best is some actual inspection of the SQL statement, pulling it apart to deduce what new indices would help. The objective is to create an environment where the engine only has to scan or seek indices without going to the base tables.  Here's a perfect example... the developer said it used to run in a couple of seconds originally but now exceeds the 30-second default .net timeout:

   SELECT DISTINCT tv.VIN_ID AS 'VIN', d.DealerCode AS 'DealerCode',
        mc.[Segment] AS 'Segment', tv.UserLogin AS 'UserLogin', tv.Destination
    FROM [TXN3_VIN] tv (NOLOCK)
    INNER JOIN [MarketingList] mc ON mc.VIN = tv.VIN_ID AND mc.MonthID = tv.MonthID
    INNER JOIN dealer d on d.DealerID = tv.DealerID
    WHERE tv.Destination = @dest  AND tv.SubmitDateTime = @timeStamp
        and tv.Campaign = 1
    ORDER BY tv.[VIN_ID], mc.[Segment]

When I looked at the query plan the execution count estimates on the index seeks where rather small, in the low hundreds.  When I checked the base tables though they had hundreds of thousands of rows. The query designer didn't alert to any missing indices, but clearly something was awry here. Time for a deeper inspection.

First, you have to think like you're the SQL engine. The most restrictive part of the selection process on this query is the where clause, so if you can create an index using the columns of the where clause that will certainly help.

CREATE NONCLUSTERED INDEX [ix_TXN3_subDestCamp] ON [dbo].[TXN3_VIN]
    ([SubmitDateTime] ASC,
    [Destination] ASC,
    [Campaign] ASC)
INCLUDE ( [UserLogin], [VIN_ID], [MonthID])
WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 85) ON [PRIMARY]

Notice I included the columns of UserLogin, VIN_ID, and MonthID; VIN_ID and MonthID are there because the engine will need it when it joins to the MarketingList table.  UserLogin is there because the output results will require it. Now that that part is fixed, consider the join. It will need these columns from the MarketingList table to join:

CREATE NONCLUSTERED INDEX [ix_MktgList_VinMonth] ON [dbo].[MarketingList]
    ([VIN] ASC,
    [MonthID] ASC)
INCLUDE ( [Segment]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 85) ON [PRIMARY]

On this index I included the Segment column, because the output results will require it.

After building these two indices, the query executed in less than half a second. Problem solved.