
Tuesday, July 17, 2018

Monitors Are Up!

I spend a good ten percent of every day watching monitoring tools. I have a handful of Dynamic Management View (DMV) queries and some stored procs that I like to run (I covered those in this previous post), but I also use three exceptionally helpful third-party tools. Two of them are free and one definitely is not; still, I don't know what I'd do without any of them.

A happy SQL Server needs lots of disk space: tempdb, log files, databases, backups... there always seems to be -more- data growing its tentacles (never -less-, oddly enough). As long as you're managing the storage allocation on the SQL boxes, frankly, you might as well be managing it across your entire shop, so you need a tool like Free Disk Management Software (FDMS). You need to be included in the Administrators group on each server; after that the setup allows you to select which disks you're actually interested in monitoring. Use the Options to set more appropriate bar-split color levels, and slow the sampling down to once every half hour. If you sort the display by free space (low to high), the problems obviously aggregate toward the top.
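If you just want a quick spot check from inside SQL Server itself (rather than the FDMS dashboard), a minimal sketch along these lines, using sys.dm_os_volume_stats, lists the free space on every volume that holds a database file:
SELECT DISTINCT vs.volume_mount_point, vs.logical_volume_name,
    vs.total_bytes / 1073741824 AS total_gb,
    vs.available_bytes / 1073741824 AS free_gb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
ORDER BY free_gb;   -- lowest free space first, same idea as the FDMS sort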

I've read several interesting posts on using the built-in Windows Performance Monitor to measure key SQL metrics, but to get serious you'll want to download this terrific free tool, Magnalis PerfPro, which wraps PerfMon in customizable MDI-mode windows with various display options. First, though, definitely do an online search for "using PerfMon to monitor SQL" so you'll know which counters to add.
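If you'd rather not leave SSMS at all, many of the same counters are also exposed through sys.dm_os_performance_counters. Here's a rough sketch pulling a few of the usual suspects (note the "/sec" counters are cumulative values, so you'd need to sample twice and take the difference to get an actual rate):
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Batch Requests/sec',
    'SQL Compilations/sec', 'SQL Re-Compilations/sec', 'Lazy writes/sec')
ORDER BY object_name, counter_name;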

Finally, the semi-expensive monitoring tool that is worth every penny you spend: Redgate Monitor. It takes time to install and tweak properly, but with its excellent historical recordkeeping, alerts that can be tuned per instance, and flexibly filtered notification-clearing, this tool will help you with both your immediate concerns and your deeper research.

Tuesday, November 14, 2017

Memory Used

In an earlier post I mentioned some general guidelines for optimizing the memory settings on each instance of the SQL engine. Once you've got the memory set up, however, it is quite interesting to observe how the engine is actually using it. Run these two queries; they both show pretty much the same thing, but you may find one more useful than the other.
SELECT
(CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State',
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
COUNT (*) AS 'Page Count'
FROM sys.dm_os_buffer_descriptors
GROUP BY [database_id], [is_modified]
ORDER BY [Page Count], [is_modified];
GO

SELECT count(*) AS cached_pages_count, (count(*) * 8) / 1024 AS MBytes, db_name(database_id) AS database_name
FROM sys.dm_os_buffer_descriptors
GROUP BY db_name(database_id) ,database_id
ORDER BY cached_pages_count DESC;
GO
The amount of memory used naturally varies depending upon which queries are running at any particular time, but by monitoring this when you are under "memory pressure" you can easily identify the culprit database. That can then guide you on whether it makes sense to relocate that database to a different SQL instance, or perhaps to take a deeper look at the queries running against it.

To that end, you can run this script:
SELECT MG.session_id, ST.text, MG.requested_memory_kb,
    MG.granted_memory_kb, MG.used_memory_kb
FROM sys.dm_exec_query_memory_grants MG
CROSS APPLY sys.dm_exec_sql_text(MG.sql_handle) ST
ORDER BY MG.requested_memory_kb DESC

This will show you the /current/ memory allocation for any running queries. If you need greater detail, down at the table and index level, you can run this script:
SELECT count(*) AS cached_pages_count,
    obj.name AS objectname, ind.name AS indexname, obj.index_id AS indexid
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN
(
    SELECT p.object_id AS objectid, object_name(p.object_id) AS name,
        p.index_id, au.allocation_unit_id
    FROM sys.allocation_units au
    INNER JOIN sys.partitions p
        ON au.container_id = p.hobt_id AND (au.type = 1 OR au.type = 3)
    UNION ALL
    SELECT p.object_id AS objectid, object_name(p.object_id) AS name,
        p.index_id, au.allocation_unit_id
    FROM sys.allocation_units au
    INNER JOIN sys.partitions p
        ON au.container_id = p.partition_id AND au.type = 2
) AS obj
    ON bd.allocation_unit_id = obj.allocation_unit_id
LEFT OUTER JOIN sys.indexes ind
    ON obj.objectid = ind.object_id AND obj.index_id = ind.index_id
WHERE bd.database_id = db_id()
    AND bd.page_type IN ('DATA_PAGE', 'INDEX_PAGE')
GROUP BY obj.name, ind.name, obj.index_id
ORDER BY cached_pages_count DESC

Monday, October 16, 2017

IO Stalls and Latency

One of the more useful optimizations you can run on a database instance, if you are unfortunate enough to be on old-fashioned spinning media (as opposed to SSDs), is to split up files to reduce the number of stalls. These can affect either database or log files, for either user or system databases, on reads or writes. In an earlier post I gave a couple of queries you could run for a quick gross overview, but now let's get further into the weeds and fine-tune some of those slower files in detail. Run these queries:
select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_read / vfs.io_stall_read_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_reads, vfs.num_of_bytes_read, vfs.io_stall_read_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_read_ms > 0
order by 3

select db_name(mf.database_id) as database_name, mf.physical_name,
(vfs.num_of_bytes_written / vfs.io_stall_write_ms) as BytesPerStall,
mf.type_desc, vfs.num_of_writes, vfs.num_of_bytes_written, vfs.io_stall_write_ms,
vfs.io_stall, vfs.size_on_disk_bytes
from sys.master_files mf
join sys.dm_io_virtual_file_stats(NULL, NULL) vfs
on mf.database_id=vfs.database_id and mf.file_id=vfs.file_id
where vfs.io_stall_write_ms > 0
order by 3
This returns two resultsets: the first shows problem stalls on reads, the second problem stalls on writes. Do pay attention to the columns showing the number of bytes read and written; it's likely a waste of time to optimize the smaller files. Now that you know generally who your problem children are, you have a couple of avenues of recourse. If the stalls are happening in a log file, it's time to move it to a faster device (adding more log files rarely helps, since the log is written sequentially). If the latency is in a table (a file of type "ROWS"), you need to dig a little deeper; run this query while attached to the database that concerns you.
SELECT TableName = object_name(s.object_id),
       Reads = SUM(user_seeks + user_scans + user_lookups), Writes =  SUM(user_updates)
FROM sys.dm_db_index_usage_stats AS s
INNER JOIN sys.indexes AS i ON s.object_id = i.object_id
AND i.index_id = s.index_id
WHERE objectproperty(s.object_id,'IsUserTable') = 1
AND s.database_id = DB_ID()
GROUP BY object_name(s.object_id)
ORDER BY writes DESC
Of course, change the Order By clause depending on whether you are researching the stalls on reads or on writes. The top couple of tables in this resultset should likely be spun off to their own filegroup with multiple files (or possibly their clustered index needs reorganizing -- check their fragmentation).
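If you do decide to spin a hot table off, the move is basically a new filegroup with a couple of files plus a clustered-index rebuild onto it. Here's a rough sketch of that, with entirely made-up database, file, table, and index names (and check fragmentation with sys.dm_db_index_physical_stats before blaming the disks):
ALTER DATABASE SalesDb ADD FILEGROUP HotTables;
ALTER DATABASE SalesDb ADD FILE
    (NAME = 'HotTables1', FILENAME = 'E:\SQLData\SalesDb_Hot1.ndf', SIZE = 10GB),
    (NAME = 'HotTables2', FILENAME = 'F:\SQLData\SalesDb_Hot2.ndf', SIZE = 10GB)
TO FILEGROUP HotTables;
GO
-- rebuilding the clustered index onto the new filegroup moves the table's data with it
CREATE CLUSTERED INDEX CIX_EmailEvent_EventDate ON dbo.EmailEvent (EventDate)
    WITH (DROP_EXISTING = ON) ON HotTables;
GO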

Wednesday, September 13, 2017

Cascade Blockage

Sometimes the answer to why a SQL process gets blocked relates to overall network dataflow. In these cases you may not be able to tease out the issue until some event stresses your server past a critical threshold. A real-world example will help elucidate this.

We fulfill our eMail touchpoints through a service hosted by Rackspace. They communicate clicks, bounces, and opens back to a webservice on one of our servers; that webservice connects to a couple of databases on internal SQL servers. This is a fairly typical modern client-vendor arrangement. The webservice constantly runs around 90 transactions per second through the SQL database, a perfectly reasonable load.

One day a developer opened up a query that locked the eMail transaction table for a couple of minutes. The webservice began throwing timeout messages to our internal notifications queue. Even after the developer closed their session the timeout notifications continued from the webservice for a couple more hours.

When I pulled up sp_who2 there were hundreds of connections to SQL from the webservice, where normally during the day I might glimpse one or two. After a couple of hours the problem abated. When I awoke the next morning my subconscious had pieced together the problem, along with this nice metaphor.

Suppose you're driving along the 101 freeway in light traffic and way up ahead you see some taillights glisten. In anticipation you take your foot off the gas and let the car slow down on its own. Nearby traffic does the same, so there is a brief slowing that passes quickly as you resume your normal speed. Next scenario: as you're driving along the 101 in light traffic the fog starts to thicken, so you slow down. Suddenly, up ahead, taillights! You move to the brake, slow down, and then once you're past the distraction resume your slightly slower speed. This time a small but persistent slowdown lingers at the original location.

Now take this scenario: on the 101 in very heavy traffic, with the fog thickening. The car in front of you slows, you brake, the car behind you doesn't have much room so they brake hard, and the car behind that one hits them. Accident. Now that traffic has stopped, a chain reaction of accidents trickles backward. One brief constriction, and the blockage will be there for many hours.

When I examined the queries that were executing from the webservice I noticed they did a Select top(1) * of the row in the event table matching the key for the transaction. Unfortunately one of the fields in that row was an XML copy of the actual eMail we sent, usually around 10 MB in size (you know, graphics and all). At the typical rate of 90 transactions per second this was just about the right size to squeeze through the 10 gigabit switch between the two servers.
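The back-of-the-envelope math: 90 transactions per second times roughly 10 MB per payload is about 900 MB/s, or a bit over 7 Gb/s -- uncomfortably close to what a 10 Gb link can actually sustain.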

As soon as somebody fired off another large query against the same table, though, the switch temporarily got overwhelmed, and transactions that would normally clear in real time suddenly started to queue up. Once multiple transactions were running against the same table, the extra load slowed their execution even further. A constriction in heavy traffic.

The solution was to change the webservice to use a view that excluded the XML column. But frankly, if the problem had never happened, I would never have specifically looked for it, as it was just below the radar.
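For the curious, the fix itself was just a thin view; something like this sketch (the table and column names here are made up) that carries everything except the bulky XML column, which the webservice then queried instead of the base table:
CREATE VIEW dbo.vw_EmailEvent_NoBody
AS
SELECT EventId, TransactionKey, EventType, EventDate, Recipient   -- every column except the 10 MB EmailBodyXml
FROM dbo.EmailEvent;
GO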

Tuesday, November 15, 2016

Improving Bad Queries

The canned reports available in SSMS are a great way to start identifying the worst of the Bad Queries. If you right-click on an instance, then go to Reports, Standard Reports, you can see the top queries by total IO and by CPU (these are two separate reports). Be aware, however, that these are based on the queries currently in the plan cache (they aren't an all-time history). One thing that's a bit frustrating is that you can't copy and paste directly from these reports. A little hidden trick, though: you can right-click on the report and choose Export to Excel. Now you've got some queries to optimize.
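If you'd rather skip the report (and the Export to Excel dance) altogether, a rough equivalent comes straight from the plan cache. This sketch ranks cached statements by total logical reads; swap in total_worker_time for the CPU flavor:
SELECT TOP (20) qs.total_logical_reads, qs.total_worker_time, qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE WHEN qs.statement_end_offset = -1 THEN DATALENGTH(st.text)
               ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;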

Another way to obtain the same information is from Redgate Monitor; if you click into the detail of an alert and scroll down a bit, you can click on the Top 10 queries. This shows essentially the same thing as the SSMS report, except it's a snapshot from the time the alert occurred. A great practice is to view the Redgate graphs of disk reads over the last 24 hours and look for periods of high activity. Find an alert triggered around the same time and look at its detail for the Top 10 queries: now you have more to fix.

Once you've identified a handful of troublemakers it's time to put each one under the microscope. One way, of course, is to examine the query plan, but first I like to paste the SQL for the query into a new query window, highlight the whole thing, right-click, and open it in the query designer. Look at the joins by moving the table boxes around. Is there a way to arrange them so that none of the lines representing the joins cross?

Once you've done that, look and see if there's a path around the joins that takes you in a circle (SQL doesn't like to resolve such things). See if any of the joins show an f(x) symbol, another troublesome indicator. Now look at the tables out at the far reaches of the diagram, linked by a single join to the mass of tables. Do these extremities have any columns or filters selected? If not, then you don't need the table in the query (but double-check the SQL before you delete it -- sometimes the table is referenced in a Case clause). If no columns are selected and the filter is a foreign key, can you instead just specify its column value in the Where clause of the table it joins to?
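As a tiny made-up example of that last point: if a table is only there so you can filter on its key, the join can usually go.
-- before: Region is joined solely to filter on its key
SELECT c.CustomerName, c.City
FROM dbo.Customer c
JOIN dbo.Region r ON r.RegionId = c.RegionId
WHERE r.RegionId = 5;

-- after: filter on the foreign key column directly and drop the join
SELECT c.CustomerName, c.City
FROM dbo.Customer c
WHERE c.RegionId = 5;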

Now examine the criteria grid toward the bottom of the designer. Are the filter criteria aligned properly in adjacent columns? Sometimes a developer will confuse the precedence of AND, OR, and parentheses (returning way more rows than they intended), and this becomes apparent when looking at how the filters are set up. After all that, it's time to look at the query plan (fodder for a whole 'nother post).
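The classic version of that mistake looks something like this (columns made up): AND binds tighter than OR, so without parentheses the second predicate quietly widens the resultset.
-- returns every 'open' row regardless of region, probably not the intent
WHERE Region = 'West' AND Status = 'closed' OR Status = 'open'

-- what was almost certainly meant
WHERE Region = 'West' AND (Status = 'closed' OR Status = 'open')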

Tuesday, August 23, 2016

Auto Trap

My employer is fairly liberal in allowing reasonably competent folks direct access to SSMS for querying our databases. This approach encourages folks to learn the structure of the data for themselves and to answer most of their ad hoc questions without needing me to investigate. The disadvantage, though, is that occasionally an employee of lesser experience will execute a couple of queries that bog the system down. As a DBA I need awareness of the clumsier executions without constantly dropping everything to review each and every medium-grade Redgate Monitor alert. The halfway elegant solution I came up with looks like this:
DECLARE @sess varchar(7)
DECLARE @emMsg varchar(270)
DECLARE @sqlRun varchar(200)

set @sess = NULL

select TOP(1) @sess = ql.session_id, @sqlRun = ql.query_text from
(
select r.session_id
   ,substring(qt.text, r.statement_start_offset/2 + 1,
   (case when r.statement_end_offset = -1
   then datalength(qt.text)
   else r.statement_end_offset end - r.statement_start_offset)/2 + 1)
   as query_text, r.cpu_time, r.total_elapsed_time
   , r.reads, r.logical_reads
 from sys.dm_exec_requests as r
 cross apply sys.dm_exec_sql_text(sql_handle) as qt
 inner join sys.dm_exec_sessions as es on r.session_id = es.session_id
 where es.is_user_process = 1
 ) as ql
 where ql.query_text not like '%where%'
 and ql.query_text not like '%top%'
 and ql.logical_reads > 200000

 IF @sess is not null

 BEGIN

 set @emMsg = 'Session id ' + @sess + ' Running ' + @sqlRun

 EXEC msdb.dbo.sp_send_dbmail
    @profile_name='SqlEntNotifications',
    @recipients = 'jdchapman@mycompany.com',
    @subject = 'Unbounded Query Alert',
    @body = @emMsg
END
I created a SQL Server Agent job that runs this every ten minutes. It sends me an eMail when something is spinning up a lot of read activity, but it limits the alerts to queries with no Where clause or Top restriction.
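If you'd rather script the job than click through the Agent UI, a sketch like this works (the job name and schedule name are placeholders, and the @command parameter would hold the detection script above):
USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'Unbounded Query Alert';
EXEC dbo.sp_add_jobstep @job_name = N'Unbounded Query Alert',
    @step_name = N'Check for unbounded queries',
    @subsystem = N'TSQL',
    @database_name = N'master',
    @command = N'/* paste the detection script from above here */';
EXEC dbo.sp_add_jobschedule @job_name = N'Unbounded Query Alert',
    @name = N'Every ten minutes',
    @freq_type = 4,                 -- daily
    @freq_interval = 1,
    @freq_subday_type = 4,          -- units of minutes
    @freq_subday_interval = 10;
EXEC dbo.sp_add_jobserver @job_name = N'Unbounded Query Alert', @server_name = N'(local)';
GO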

It's not accurate enough that I'd let it automatically kill SPIDs; sometimes it sends along an eMail about a long-running production job -- actually a good thing, as it alerts me to places I might have to trace or optimize later. It does tend to help me keep the SQL instances generally out of trouble, though.

Monday, May 23, 2016

DMVs for Current Activity

Aside from the three external tools I mentioned in this previous post, I also keep a pocketful of queries and stored procs that I run periodically to monitor performance. I'm afraid I can't take credit for any of these: some I've picked up from reading various SQL books or online forums; others were written by quite skilled folks and can be downloaded from their websites. Oftentimes I modify them slightly to give me a better understanding of what I'm actually looking for, or to hone them to show a more concentrated set of issues.

I have a separate group of DMV queries I use more for database tuning and analysis, but those in this post are the ones I'm constantly running to address alerted and critical real-time issues.

For a really fast (and low-overhead) look at what is happening right now on an instance, I use this variant of sp_who2:

CREATE TABLE #sp_who2 (SPID INT,Status VARCHAR(255),
      Login  VARCHAR(255),HostName  VARCHAR(255),
      BlkBy  VARCHAR(255),DBName  VARCHAR(255),
      Command VARCHAR(255),CPUTime INT,
      DiskIO INT,LastBatch VARCHAR(255),
      ProgramName VARCHAR(255),SPID2 INT,
      REQUESTID INT)
     
INSERT INTO #sp_who2 EXEC sp_who2

SELECT     distinct SPID, Status, Login, HostName, BlkBy, DBName, Command, LastBatch, ProgramName
FROM        #sp_who2
where Command not in ('AWAITING COMMAND', 'TASK MANAGER', 'BRKR TASK')
and DBName is not null
and SPID <> @@SPID

drop table #sp_who2
The best thing about the results of this query is that a quick glance at the BlkBy column tells me right off the bat who is blocking whom. If I get more concerned, and don't mind incurring a little extra CPU overhead, I run these two:

exec sp_whoisactive

select r.session_id
   ,r.status
   ,substring(qt.text, r.statement_start_offset/2 + 1,
   (case when r.statement_end_offset = -1
   then datalength(qt.text)
   else r.statement_end_offset end - r.statement_start_offset)/2 + 1)
   as query_text, qt.dbid, qt.objectid, r.cpu_time, r.total_elapsed_time
   , r.reads,r.writes, r.logical_reads, r.scheduler_id
 from sys.dm_exec_requests as r
 cross apply sys.dm_exec_sql_text(sql_handle) as qt
 inner join sys.dm_exec_sessions as es on r.session_id = es.session_id
 where es.is_user_process = 1
 order by r.cpu_time desc
The first of this set, WhoIsActive, is an invaluable freeware stored proc by Adam Machanic: a quick click on the XML query column pops up a new window showing you the precise SQL that is executing. By all means, if you find this utility useful, I'd recommend making a donation to this guy on his website.

If my various external monitors seem to suggest IO bottlenecks then I like to run this query:

SELECT DB_NAME(dbid) AS ConnectedToDB, hostname, program_name,
    loginame, cpu, physical_io, memusage, login_time, last_batch, status
FROM sysprocesses
WHERE (status <> 'background') AND (program_name NOT LIKE 'SQLAgent%') AND (cpu > 0)
ORDER BY physical_io DESC
GO
This shows the currently connected IO hogs. You can then see if they have any long-running queries with WhoIsActive. If instead they are running millions of little queries through an ETL tool or some other application, then you will need to run sp_sqltrace, another work of art by a serious SQL developer that you can download for free.

Finally, if the CPU is getting pegged, chances are it's either tons of recompiles or a missing index. This will show the recompiles:

SELECT cp.refcounts, cp.usecounts, cp.cacheobjtype, cp.objtype,
    left(st.text, 50) AS SQLpartial, st.text AS SQLstmt
INTO #tempCompiles
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE st.text LIKE '%SELECT %'
    AND st.text NOT LIKE '%backup%'
    AND cp.usecounts = 1

SELECT SUM(refcounts) AS howMany, SQLpartial
INTO #tempCompileTotals
FROM #tempCompiles
GROUP BY SQLpartial

SELECT SUM(CT.howMany) AS howManyCompiles, MIN(CD.SQLstmt) AS sqlExample
FROM #tempCompiles AS CD
INNER JOIN #tempCompileTotals AS CT ON CD.SQLpartial = CT.SQLpartial
GROUP BY CD.SQLpartial
ORDER BY howManyCompiles DESC

DROP table #tempCompiles
DROP table #tempCompileTotals

GO
If you have bad recompiles, the best bet is to chat with the responsible developer and have them create less dynamic queries, perhaps by storing the feeding parameters in static tables. As a last resort you can set the Forced Parameterization option on the database, but this has some drawbacks of its own.
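For the record, that last-resort switch is a one-liner at the database level (database name hypothetical), and it's just as easy to flip back:
ALTER DATABASE SalesDb SET PARAMETERIZATION FORCED;
-- and to revert:
ALTER DATABASE SalesDb SET PARAMETERIZATION SIMPLE;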

Here's the query I like to use for missing indices:

SELECT mid.statement,
    round(migs.user_seeks * migs.avg_total_user_cost * migs.avg_user_impact / 100, 0) AS imPactor,
    migs.unique_compiles, migs.user_seeks, migs.user_scans, migs.last_user_seek, migs.last_user_scan,
    migs.avg_total_user_cost, migs.avg_user_impact, mig.index_group_handle, mig.index_handle,
    mid.object_id, mid.equality_columns, mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
INNER JOIN sys.dm_db_missing_index_groups AS mig ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details AS mid ON mig.index_handle = mid.index_handle
WHERE (mid.database_id = DB_ID())
    AND migs.last_user_seek > GETDATE() - 14
ORDER BY imPactor DESC

You don't have to build all the missing indices, but after some use you'll be able to eyeball the ones that stand out above a critical threshold. Also note that if this suggests tons of "included" columns, the issue with your table is larger than just a missing index -- likely you need to split the table widthwise, or check whether most of the columns should be marked "sparse."
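When one of those suggestions does clear your threshold, translating it into an index is mechanical: equality columns first, then inequality columns, with the included columns in the INCLUDE clause. A sketch with made-up table and column names:
CREATE NONCLUSTERED INDEX IX_EmailEvent_TransactionKey
    ON dbo.EmailEvent (TransactionKey, EventDate)   -- equality column(s), then inequality column(s)
    INCLUDE (EventType, Recipient);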