Q and A

Asked and Answered

We have tables that are fairly large (more than 50M rows). We can't purge data for business reasons, but the probability of accessing a piece of data is far greater for recent data than it is for older data. Should we split the data into two partitions (old and recent) so that the partition containing the subset of rows likely to be accessed is smaller?

Robert Catterall responds:

That's a good question. If by "partition," you mean a partition or partitions in a DB2 for OS/390 partitioned tablespace, I don't know if I would recommend that approach if you never purge from the table (and a never-purge table makes me a little nervous). If you do purge, but only after a considerable period of time (some number of months or years), you might want to try partitioning on a date basis. For example, if you need to retain data in the table for three years, you could go with 156 partitions in a tablespace — one for each week in a three-year period. I would not recommend making a date column the partitioning key, because partitioning by a continuously ascending key is problematic at present. Instead, you could have a partition number column, with values from 1 through 156. A row going into a particular partition is assigned the associated partition number by way of this column. A separate table that maps dates to partition numbers would be used on insert to determine the correct partition number, and on retrieval to translate a requested date into a partition number. In this way, you can "wrap" around these 156 partitions as time progresses. If you have a brief "window" during which the data can be unavailable at the beginning of a new week, you can clear out an "old" partition very efficiently via a load utility REPLACE DD DUMMY operation. Otherwise, you'd clear out old data via SQL DELETE (following archive, if you need to archive the old data).

If you really never will purge data from this table, you might try replacing the one table with multiple tables. You could do this in a couple of ways:

1) Have a "current period" table holding whatever you consider to be the "current" data (a month's worth, three month's worth, whatever) and a history table (holding data older than that belonging to the "current" time period).

2) Have a table per time period (for example, a January 2002 table, a February 2002 table, and so on).

The nice thing about option 1 is that the table names are fixed, making coding pretty easy. The main disadvantage is the CPU expense of moving rows from the one table to the other as they age. If a row will be accessed multiple times while it's "current," the improved retrieval efficiency (better locality of reference, better buffer pool hit ratio) may out weigh the table move cost.

The second option could require more application code, in that an application may have to figure out which table to access. On top of that, you can't name a table via a host variable in an SQL statement; so, you either have to have a lot of alternative static statements in your program, and execute the right one based on the time period of interest, or you go with dynamic SQL, figuring out the table name and building the statement at execution time. I'll tell you that I'm much less opposed to dynamic SQL than I used to be. You might want to read the column I wrote about dynamic SQL last year for DB2 Magazine ( http://www.db2mag.com/db_area/archives/2001/q2/db2dba.shtml ).

Still referring to option 2, an alternative to smarter application programs is a smarter (and more high-maintenance) database. You could ensure that the current period table always has the same name by way of the DB2 table rename function. Trouble is, when you do the rename, you have to rebind programs referring to the target table. Thus, the rename approach might well require that the change be effected during a maintenance window.

Still another alternative would be to stick with your current scheme and have zero free space (or as close as you can get to that in the tablespace. That way, all new data would go to the end of the tablespace or partition, and you could get good locality of reference because current data rows are close to each other. The clusteredness of your table might deteriorate quite a bit, but would that be bad? Maybe not, depending on the nature of accesses to data in the table.


See a complete archive of reader/author Q&As .

CAREER CENTER
Ready to take that job and shove it?
SEARCH JOBS
RECENT JOB POSTINGS
CAREER NEWS
10 Search Engines You Don't Know About
Go beyond Google and get vertical. These specialized search sites will help you find the business information you need -- fast.

Subscribe to the new digital version of IBM Database Magazine
New Digital Version

Sponsored links:



Subscribe to the IBM Database Magazine Newsletter

Email Address *
First Name
Last Name
HTML Preference
HTML Text
 

Fields with * are required.

 




Visit these other IBM and TechWeb Partner Sites: :
Maximizing ROI Through Business Process Management (BPM) and Service-Oriented Architecture (SOA)
Internet Evolution – The Macrosite for News, Analysis, & Opinion About the Future of the Internet
Business Innovation – Technology Strategies and Solutions for Driving Business Success


:: IBM Database Magazine ::