DataGeek: Filtered Indexes Part 3: Data Archiving

In the past, I’ve often had to keep large amounts of transactional and/or inactive data around for historical reporting purposes (or maybe just because I’m a data pack rat by nature). Keeping this in its original table can make querying current data rather costly, so often the approach has been to move it to an archive table or even archive database.

This of course makes querying a bit more complex and requires the management of archiving the data itself. Filtered indexes give us an option to have many of the performance advantages of archiving the data while leaving it in place.

Let’s consider the following scenario. We have a customers table and orders table. It’s been determined that most of the time we will be querying on open orders, where the order table’s CompletedOn field is null. The following script will build the two tables:

--  Create Customers table
CREATE TABLE tblCustomers (
		CustomerId INT NOT NULL IDENTITY(1, 1)
		,CustomerName NVARCHAR(255) NOT NULL
	CONSTRAINT PK_tblCustomers PRIMARY KEY CLUSTERED 
	(
		CustomerId ASC
	)
)
GO

--  Create Orders table
CREATE TABLE tblOrders (
		OrderId INT NOT NULL IDENTITY(1, 1)
		,CustomerId INT NOT NULL
		,OrderDate DATETIME NOT NULL
		,OrderAmount MONEY NOT NULL
		,CompletedOn DATETIME NULL
	CONSTRAINT PK_tblOrders PRIMARY KEY NONCLUSTERED 
	(
		OrderId ASC
	)
)
GO

--  Add foreign key constraint for CustomerId
ALTER TABLE tblOrders  
	WITH CHECK ADD CONSTRAINT FK_tblOrders_tblCustomers 
		FOREIGN KEY(CustomerId)
		REFERENCES tblCustomers (CustomerId)
GO

The script below will create some test data. Depending on your machine you may want to tweak some of the parameters. As setup it will create ~100,000 customer records and randomly create between 0 and 100 order records per customer (~5,000,000 records in total).

--  Create test customers
DECLARE @CustomersTemp TABLE (
	CustomerName NVARCHAR(255))
DECLARE @RecordCount INT
SET @RecordCount = 0
WHILE @RecordCount < 100000
BEGIN
	INSERT INTO @CustomersTemp (
			CustomerName)
		VALUES (
			'Customer ' + CAST(@RecordCount AS NVARCHAR))
			
	SET @RecordCount = @RecordCount + 1
END

INSERT INTO tblCustomers (
		CustomerName)
	SELECT
			CustomerName
		FROM @CustomersTemp

--  Create test orders
DECLARE @OrdersTemp TABLE(
	CustomerId INT
	,OrderDate DATETIME
	,OrderAmount MONEY)
DECLARE @CustomerId INT;
DECLARE @OrderCount INT
DECLARE Customers CURSOR
	FOR
		SELECT
				CustomerId
			FROM tblCustomers;
			
OPEN Customers;

FETCH NEXT FROM Customers
	INTO @CustomerId;
	
WHILE @@FETCH_STATUS = 0
BEGIN
	SET @RecordCount = 0;
	SET @OrderCount = CAST((RAND() * 100.0) AS INT);
	WHILE @RecordCount < @OrderCount
	BEGIN
		INSERT INTO @OrdersTemp (
				CustomerId
				,OrderDate
				,OrderAmount)
			VALUES (
				@CustomerId
				,DATEADD(DAY, RAND() * 3650, '1/1/2000')
				,(RAND() * 1000))
				
		SET @RecordCount = @RecordCount + 1
	END

	FETCH NEXT FROM Customers
		INTO @CustomerId;
END

INSERT INTO tblOrders (
		CustomerId
		,OrderDate
		,OrderAmount)
	SELECT
			CustomerId
			,OrderDate
			,OrderAmount
		FROM @OrdersTemp

CLOSE Customers;
DEALLOCATE Customers;
GO

--  Mark orders completed.
UPDATE tblOrders SET
		CompletedOn = GETDATE()
	WHERE OrderDate < '1/1/2009'
GO

Note that the last part of the script will mark about 90% of the orders as having been completed.

Now for a query. The query below is getting the record count for all uncompleted orders greater than $500.

SELECT
		COUNT(*)
	FROM tblOrders
		INNER JOIN tblCustomers
			ON tblOrders.CustomerId = tblCustomers.CustomerId
	WHERE CompletedOn IS NULL
		AND OrderAmount > 500

At this point we haven’t defined any indexes other than the primary keys of the tables. On my development server the query at this point the query is taking about 120ms.

Adding the unfiltered index below reduces that time to ~57ms.

--  Add index on OrderAmount
--DROP INDEX IX_tblOrders_OrderAmount ON tblOrders
GO

CREATE NONCLUSTERED INDEX IX_tblOrders_OrderAmount 
	ON tblOrders (
		OrderAmount ASC
		,CompletedOn ASC
	)
GO

Finally, adding the filtered index below reduces the time to ~34ms.

--  Add filtered index on OrderAmount
--DROP INDEX IX_tblOrders_OrderAmount_Current ON tblOrders
GO

CREATE NONCLUSTERED INDEX IX_tblOrders_OrderAmount_Current 
	ON tblOrders (
		OrderAmount ASC
	)
	INCLUDE (CompletedOn)
	WHERE CompletedOn IS NULL
GO

Not only is the query execution time significantly reduced, but also the space required to store the index is reduced to about 10% in this case. This space savings could be used to add additional frequently used fields as key or included fields into the index to speed queries even more. Note that in this case I’m simply including the CompletedOn field rather than adding it as a key since it’s implicitly keyed by the WHERE clause.

If you review the execution plan for the query, you’ll notice that it automatically reverts to the unfiltered index if you remove the CompletedOn filter from the WHERE clause. Thus, you could essentially have a separate set of indexes for current and archived data.

One final note. In my experience with filtered indexes, I’ve found that filtering on a BIT field doesn’t seem to actually filter. You end up with all the rows indexed as if there were no filter on it. Has anyone else found that to be the case?

Additional Info

Click here to download ZIP file with all SQL scripts

Filtered Index Design Guidelines – Microsoft TechNet (http://technet.microsoft.com/en-us/library/cc280372.aspx)

Article detailing issues with filtered indexes and SET options - Interesting issue with Filtered indexes

DataGeek

Friday, February 4, 2011

Filtered Indexes Part 3: Data Archiving

Additional Info

No comments:

Post a Comment

Followers

My Blog List