XMLpitstop.com   |  VBnetexpert.com   |  Community Credit  
 
 
Pitstop Search:  
in
 
Sign in | Join | Help
 
 
  Blog
    Home  
 
  Entries By Date
 
<October 2007>
SunMonTueWedThuFriSat
30123456
78910111213
14151617181920
21222324252627
28293031123
45678910
 
 
  Blog Categories
   
 
  Archives
    December 2008 (1)  
    November 2008 (3)  
    September 2008 (3)  
    August 2008 (3)  
    June 2008 (4)  
    May 2008 (2)  
    April 2008 (3)  
    March 2008 (3)  
    February 2008 (5)  
    December 2007 (4)  
    November 2007 (1)  
    October 2007 (3)  
 
  Syndication
    RSS  
    Atom  
    Comments RSS  
  .NET Flea Market  
 

Simpler times

Do you remember the first time you worked with a million-row database table? Maybe some of you haven't yet. For me, it brought up a couple of different emotions.

First was a sense of awe and accomplishment. "I'm working on a million-row table. I'm in the big leagues now." I was sure that I could prove my coding skills against the beast. Plus, it opened a whole new arena for performance tweaking, something I enjoy doing.

Later, it became annoyance. "Why does this have to be a million-row table? This sucks." There probably is a very good reason the table is the size it is and there's a good chance the developer does not have the option to change that. But what if you could?

By nature, database data changes. But the nature of that change is different and each should have a targeted approach to fixing it. What scenarios could we have in a million-row table?

  • Static data that is appended to. This would be something like transactions. The data is frozen once added and can't be changed.
  • Static data that is updated. This would be like statistical data like user accounts. The row is created, then is updated regularly with things like last login, last purchase, total files downloaded, etc.
  • Dynamic data that is bulk managed. (I'll use the terms "static" and "dynamic" to indicate whether the data is permanent or can be deleted.) This would be like scheduled data refreshes.
  • Dynamic data that is individually managed. This would be where individual rows get added, updated, or deleted.

The key to efficiency in databases is indexes. Duh. And you know that indexes take a beating on data changes. The first thought that comes to mind is "Let's split the big table into smaller tables." That helps out best in scenario 1 and 3 because the indexes would be built once and left alone. The followup thought is "What about queries that span data in multiple tables?" Maybe use Views? Indexed Views? We'll look a bit closer at this soon.

You have to factor in the types of queries you will be using against the database. Are they single row results, multi-row results based on a single field or multiple fields, are there going to be joins against the table?

Here's some interesting testing. Create a table for 9 users:

CREATE TABLE [dbo].[Users](
	[ID] [int] NOT NULL,
	[Name] [varchar](50)
) ON [PRIMARY]

Create a transaction table:

CREATE TABLE [dbo].[Transactions](
	[TransactionID] [uniqueidentifier] NOT NULL,
	[UsersID] [int] NOT NULL,
	[TransactionDate] [datetime] NOT NULL,
	[TransactionAmount] [float] NOT NULL
) ON [PRIMARY]

Now populate the transaction table with 1.5 million rows. Go ahead, start typing. Ok, here's a script:

set nocount on
declare @counter bigint
set @counter=0
while @counter<1500000
	begin
	insert transactions
	select newid(),(datepart(ms,getdate()) % 9)+1,getdate(),datepart(ms,getdate())+(datepart(ms,getdate())*.01)
	set @counter=@counter+1
	if (@counter % 10000)=0 print 'Now inserting row: ' + cast(@counter as varchar)
	end

That should give you a fairly random distribution of transactions among 9 users with random transaction values. It took me 10 minutes to run and my db is 74MB. Now we can try some tests. Don't forget, you need to clear your cache after each test otherwise subsequent queries will be unrealistically fast.

checkpoint
dbcc dropcleanbuffers
dbcc freeproccache

Just running a simple query against it with execution plan turned on takes 3 seconds:

select usersid,count(transactionamount) from transactions group by usersid

Putting a non-clustered index on UsersID changes the query performance to 1 second. But enough of that, we're talking about splitting tables. We're going to go with scenario 1 described above where new data is appended at the end. Since my population query took 10 minutes to run, let's create 5 tables of two minutes each.

select * into TX1
from transactions
where transactiondate<dateadd(n,2,(select min(transactiondate) from transactions))

select * into TX2
from transactions
where transactiondate>(select max(transactiondate) from tx1)
	and transactiondate<dateadd(n,2,(select max(transactiondate) from tx1))

select * into TX3
from transactions
where transactiondate>(select max(transactiondate) from tx2)
	and transactiondate<dateadd(n,2,(select max(transactiondate) from tx2))

select * into TX4
from transactions
where transactiondate>(select max(transactiondate) from tx3)
	and transactiondate<dateadd(n,2,(select max(transactiondate) from tx3))

select * into TX5
from transactions
where transactiondate>(select max(transactiondate) from tx4)
	and transactiondate<dateadd(n,2,(select max(transactiondate) from tx4))

And make a view unioning them all together:

CREATE VIEW [dbo].[v_TX]
AS
SELECT     TransactionID, UsersID, TransactionDate, TransactionAmount
FROM         dbo.TX1
UNION
SELECT     TransactionID, UsersID, TransactionDate, TransactionAmount
FROM         dbo.TX2
UNION
SELECT     TransactionID, UsersID, TransactionDate, TransactionAmount
FROM         dbo.TX3
UNION
SELECT     TransactionID, UsersID, TransactionDate, TransactionAmount
FROM         dbo.TX4
UNION
SELECT     TransactionID, UsersID, TransactionDate, TransactionAmount
FROM         dbo.TX5

And query:

select usersid,count(transactionamount) from v_tx group by usersid

Now how is the performance? AHHHHHHHH!! 1.5 minutes! Well, to be fair, that is grouping by UsersID when the partioning is by transactionDate. And there are no indexes on the TX* tables. Let's start with indexes.

create nonclustered index IX_UsersID on TX1(UsersID) 
create nonclustered index IX_UsersID on TX2(UsersID) 
create nonclustered index IX_UsersID on TX3(UsersID) 
create nonclustered index IX_UsersID on TX4(UsersID) 
create nonclustered index IX_UsersID on TX5(UsersID) 

Query again, and now it's... 55 secs. Ahhhhhhh. Did anyone notice the problem? I used UNION instead of UNION ALL. What's the difference? About 55 seconds. Because without the ALL, the server has to check for duplicates. Create a new view using UNION ALL and check the differences in the execution plan.

So, right now, looking at the client statistics running the same query against the full table or against the five partioned tables, I have a total execution time of 750 for the partitioned tables and 900 for the full table. It seemed to make a positive difference. But that's just one scenario. I'll have to explore the other scenarios later.

Comments

No Comments

About anachostic

That's me. Seek and ye shall find.
 
 
Copyright © . All Rights Reserved.
Powered by Community Server (Commercial Edition), by Telligent Systems