Do you remember the first time you worked with a million-row database table? Maybe some of you haven't yet. For me, it brought up a couple of different emotions.
First was a sense of awe and accomplishment. "I'm working on a million-row table. I'm in the big leagues now." I was sure that I could prove my coding skills against the beast. Plus, it opened a whole new arena for performance tweaking, something I enjoy doing.
Later, it became annoyance. "Why does this have to be a million-row table? This sucks." There probably is a very good reason the table is the size it is and there's a good chance the developer does not have the option to change that. But what if you could?
By nature, database data changes. But the nature of that change is different and each should have a targeted approach to fixing it. What scenarios could we have in a million-row table?
- Static data that is appended to. This would be something like transactions. The data is frozen once added and can't be changed.
- Static data that is updated. This would be like statistical data like user accounts. The row is created, then is updated regularly with things like last login, last purchase, total files downloaded, etc.
- Dynamic data that is bulk managed. (I'll use the terms "static" and "dynamic" to indicate whether the data is permanent or can be deleted.) This would be like scheduled data refreshes.
- Dynamic data that is individually managed. This would be where individual rows get added, updated, or deleted.
The key to efficiency in databases is indexes. Duh. And you know that indexes take a beating on data changes. The first thought that comes to mind is "Let's split the big table into smaller tables." That helps out best in scenario 1 and 3 because the indexes would be built once and left alone. The followup thought is "What about queries that span data in multiple tables?" Maybe use Views? Indexed Views? We'll look a bit closer at this soon.
You have to factor in the types of queries you will be using against the database. Are they single row results, multi-row results based on a single field or multiple fields, are there going to be joins against the table?
Here's some interesting testing. Create a table for 9 users:
CREATE TABLE [dbo].[Users](
[ID] [int] NOT NULL,
[Name] [varchar](50)
) ON [PRIMARY]
Create a transaction table:
CREATE TABLE [dbo].[Transactions](
[TransactionID] [uniqueidentifier] NOT NULL,
[UsersID] [int] NOT NULL,
[TransactionDate] [datetime] NOT NULL,
[TransactionAmount] [float] NOT NULL
) ON [PRIMARY]
Now populate the transaction table with 1.5 million rows. Go ahead, start typing. Ok, here's a script:
set nocount on
declare @counter bigint
set @counter=0
while @counter<1500000
begin
insert transactions
select newid(),(datepart(ms,getdate()) % 9)+1,getdate(),datepart(ms,getdate())+(datepart(ms,getdate())*.01)
set @counter=@counter+1
if (@counter % 10000)=0 print 'Now inserting row: ' + cast(@counter as varchar)
end
That should give you a fairly random distribution of transactions among 9 users with random transaction values. It took me 10 minutes to run and my db is 74MB. Now we can try some tests. Don't forget, you need to clear your cache after each test otherwise subsequent queries will be unrealistically fast.
checkpoint
dbcc dropcleanbuffers
dbcc freeproccache
Just running a simple query against it with execution plan turned on takes 3 seconds:
select usersid,count(transactionamount) from transactions group by usersid
Putting a non-clustered index on UsersID changes the query performance to 1 second. But enough of that, we're talking about splitting tables. We're going to go with scenario 1 described above where new data is appended at the end. Since my population query took 10 minutes to run, let's create 5 tables of two minutes each.
select * into TX1
from transactions
where transactiondate<dateadd(n,2,(select min(transactiondate) from transactions))
select * into TX2
from transactions
where transactiondate>(select max(transactiondate) from tx1)
and transactiondate<dateadd(n,2,(select max(transactiondate) from tx1))
select * into TX3
from transactions
where transactiondate>(select max(transactiondate) from tx2)
and transactiondate<dateadd(n,2,(select max(transactiondate) from tx2))
select * into TX4
from transactions
where transactiondate>(select max(transactiondate) from tx3)
and transactiondate<dateadd(n,2,(select max(transactiondate) from tx3))
select * into TX5
from transactions
where transactiondate>(select max(transactiondate) from tx4)
and transactiondate<dateadd(n,2,(select max(transactiondate) from tx4))
And make a view unioning them all together:
CREATE VIEW [dbo].[v_TX]
AS
SELECT TransactionID, UsersID, TransactionDate, TransactionAmount
FROM dbo.TX1
UNION
SELECT TransactionID, UsersID, TransactionDate, TransactionAmount
FROM dbo.TX2
UNION
SELECT TransactionID, UsersID, TransactionDate, TransactionAmount
FROM dbo.TX3
UNION
SELECT TransactionID, UsersID, TransactionDate, TransactionAmount
FROM dbo.TX4
UNION
SELECT TransactionID, UsersID, TransactionDate, TransactionAmount
FROM dbo.TX5
And query:
select usersid,count(transactionamount) from v_tx group by usersid
Now how is the performance? AHHHHHHHH!! 1.5 minutes! Well, to be fair, that is grouping by UsersID when the partioning is by transactionDate. And there are no indexes on the TX* tables. Let's start with indexes.
create nonclustered index IX_UsersID on TX1(UsersID)
create nonclustered index IX_UsersID on TX2(UsersID)
create nonclustered index IX_UsersID on TX3(UsersID)
create nonclustered index IX_UsersID on TX4(UsersID)
create nonclustered index IX_UsersID on TX5(UsersID)
Query again, and now it's... 55 secs. Ahhhhhhh. Did anyone notice the problem? I used UNION instead of UNION ALL. What's the difference? About 55 seconds. Because without the ALL, the server has to check for duplicates. Create a new view using UNION ALL and check the differences in the execution plan.
So, right now, looking at the client statistics running the same query against the full table or against the five partioned tables, I have a total execution time of 750 for the partitioned tables and 900 for the full table. It seemed to make a positive difference. But that's just one scenario. I'll have to explore the other scenarios later.