Tuesday, January 17, 2017

That Old Cleanup

Once you start consolidating information into your database tracking worksheet, one thing you'll quickly discover is that your shop has muckloads of old data lying about. Some of it is completely obsolete databases; some of it is cranky old former incarnations of databases that are now active elsewhere, duplicate test or QA versions, or ghosts of playthings that former employees forgot to delete. If you examine the directories where databases are housed you will even find remnant mdf and ldf files that somebody detached (but never deleted). Inside existing active reporting databases you will find large obsolete tables, former versions of tables, and old historical data in the actively referenced tables.

Cleanup and archiving is a royal mess, but it's the first step in organizing your storage and then reallocating filegroups appropriately. Here are some general principles and guidelines to help resolve this headache.

The first principle is to err on the side of caution. Before you permanently remove any database, take it offline for a couple of weeks to verify that no unanticipated processes are affected. If you have low-cost compressible storage available (say, on an old file server), then zip the zombie databases off to it before you delete them.
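A minimal sketch of that quarantine step, assuming a suspect database named OldReporting and an archive share (both names hypothetical):

    -- Quarantine: take the suspect database offline instead of dropping it.
    -- Any unanticipated process that still depends on it will fail loudly
    -- and identify itself.
    ALTER DATABASE OldReporting SET OFFLINE WITH ROLLBACK IMMEDIATE;

    -- A couple of weeks later, if nothing has complained:
    ALTER DATABASE OldReporting SET ONLINE;   -- BACKUP requires the db online
    BACKUP DATABASE OldReporting
        TO DISK = N'\\fileserver\dbarchive\OldReporting.bak';  -- hypothetical path
    DROP DATABASE OldReporting;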

If you encounter a database with lots of old date-versioned tables, the best trick I've found is to download a tool along the lines of TOAD. With its database explorer you can easily highlight rows, ctrl-c for copy, and then paste the resulting table names into an Excel spreadsheet. Sort out the tables you've targeted for deletion and run the list past the staff to make sure they are indeed obsolete. Then you can prepend "drop table" in front of each row in Excel and copy and paste the rows back into a SQL editing window. Presto. (Keep a current backup, just in case.)
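If you'd rather skip the spreadsheet round trip, the same list can be generated straight from the catalog views. A sketch, assuming the candidates share a date-versioned suffix like _2009 (the pattern is hypothetical; adjust it to your shop's naming):

    -- Generate (but don't run) DROP statements for candidate tables.
    -- Review the output and circulate it before executing anything.
    SELECT 'DROP TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + ';'
    FROM sys.tables AS t
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    WHERE t.name LIKE '%[_]20[01][0-9]'   -- date-versioned suffix, adjust to taste
    ORDER BY t.name;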

Archiving historical data from live tables is the most work. Depending upon the edition of SQL Server you are running (Standard or Enterprise) you may tackle this with manual archiving or with partitioning, but in any case the last few steps typically require weekend or off-hours work to take the database down. In the Standard environment my general approach is first to archive, then to shrink.

You can run the first step, archiving, while everyone is blissfully plugging away at work. You may choose to keep the archive in a new table in a separate database, or you may export it to a text file for zipping to your low-cost compressible storage. To avoid having too severe an impact on production I like performing the archival copy (or export) with SSIS, running it against a view that selects perhaps a half million rows at a time.
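The batching idea looks like this in plain T-SQL, assuming a hypothetical dbo.SalesHistory table clustered on SaleID and a pre-created archive table; SSIS can read from the same kind of keyed slice:

    -- Copy historical rows out in half-million-row slices, walking up the
    -- clustered key so each pass picks up where the last one left off.
    -- Nothing is deleted here; the live table gets "cut" later, off-hours.
    DECLARE @LastID bigint = 0, @Rows int = 1;

    WHILE @Rows > 0
    BEGIN
        INSERT INTO ArchiveDB.dbo.SalesHistory
        SELECT TOP (500000) *
        FROM dbo.SalesHistory
        WHERE SaleID > @LastID
          AND SaleDate < '2014-01-01'              -- the archival cutoff
        ORDER BY SaleID;

        SET @Rows = @@ROWCOUNT;
        -- assumes the archive table started empty
        SELECT @LastID = ISNULL(MAX(SaleID), 0)
        FROM ArchiveDB.dbo.SalesHistory;
    END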

Once you've got your archive you're ready for the more disruptive task (on the weekend, with production stopped). Identify whereabouts on the clustered index you will be "cutting" the table. Select top (0) from the old table into a new working table. Make sure the new working table has the correct clustered index defined and that any auto-increment identity column carries the proper seed. Now set identity insert on for the new working table and load the rows you are keeping with an insert that carries a TABLOCK hint (the working table already exists, so this is an insert-select rather than a "select into"; with TABLOCK the load can be minimally logged under the simple or bulk-logged recovery model). Script out the original table's nonclustered indexes for a later step, remove any parent-child constraints, and then you can drop the original table. Is your heart beating in your ears yet? Now rename the working table back to the original table name and rebuild the nonclustered indexes. If you'd like you can recreate the parent-child referential integrity. There, that wasn't so bad now, was it?
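Here is the whole swap as a minimal sketch, with hypothetical names throughout (dbo.Orders, three columns, a cut at the start of 2015):

    -- 1. Empty copy of the table; a select-into from a single table carries
    --    the IDENTITY property along with the column definitions.
    SELECT TOP (0) * INTO dbo.Orders_Work FROM dbo.Orders;

    -- 2. Put the clustered index in place BEFORE the load.
    CREATE UNIQUE CLUSTERED INDEX CIX_Orders_Work ON dbo.Orders_Work (OrderID);

    -- 3. Bulk-load the keeper rows, preserving identity values. TABLOCK
    --    permits minimal logging under simple or bulk-logged recovery.
    SET IDENTITY_INSERT dbo.Orders_Work ON;
    INSERT INTO dbo.Orders_Work WITH (TABLOCK) (OrderID, OrderDate, CustomerID)
    SELECT OrderID, OrderDate, CustomerID
    FROM dbo.Orders
    WHERE OrderDate >= '2015-01-01';               -- the "cut" point
    SET IDENTITY_INSERT dbo.Orders_Work OFF;

    -- 4. The swap: constraints already scripted and removed, so drop the
    --    original, rename the work table into its place, and rebuild.
    DROP TABLE dbo.Orders;
    EXEC sp_rename 'dbo.Orders_Work', 'Orders';
    -- ...recreate nonclustered indexes and referential integrity here.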

Of course, if you're on Enterprise edition you can use partitioned tables. So much easier.
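A taste of why, all names hypothetical: once the table is partitioned by date, retiring the oldest slice is a metadata operation rather than a data copy.

    -- Switch the oldest partition out to an empty staging table with an
    -- identical structure on the same filegroup; near-instant, no row movement.
    ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.Orders_Archive;
    -- The archived rows now live in dbo.Orders_Archive, ready for export,
    -- while the live table keeps serving queries.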