Q and A
Asked and Answered
As part of our disaster recovery procedures, we try to get the (4-way) DB2 data sharing group up and certain critical applications running within the first 2 hours after a disaster has been declared. To accomplish this we have the following:
- A dedicated MVS logical partition (LPAR) running at all times with its own dedicated coupling facility. Even though this system runs all the time, the only volumes actually online are those needed to bring the system up at IPL time.
- We have all DB2 catalog and directory tables on mirrored DASD along with both bootstraps and copy 1 of the active logs. Critical application databases also reside on mirrored volumes. These mirrored volumes are copied as one consistency group every 2-3 hours to our other data center, where the disaster recovery processor lives.
Once a disaster has been declared, these mirrored volumes are brought online. We clean out the coupling facility structures for the SCA and LOCK1 for our data sharing group, and then we restart all four members of the group with single logging and perform a group restart to rebuild the coupling facility structures. This process has worked successfully for the past two years or so. However, in the last six to eight months it has been taking longer and longer to complete the restarts for all four members. This change coincides with the implementation of a new online application that's being rolled out in chunks.
Our DSNZPARM checkpoint frequency used to be based on log records written, but we were seeing very uneven checkpoints across the four members. Some were issuing a checkpoint every 30 minutes to 1 hour, some every 1-3 hours during the online primetime window, and then every minute during the evening batch window. I changed the checkpoint frequency to 10 minutes (prime online) and five minutes (batch), and this has helped some. Is there anything else I can do to improve the performance? Our coupling facility is at CF 11, which is as high as we can go due to hardware limitations.
Robert Catterall responds:
Here's kind of a grab-bag of things to consider or try:
- Last year, the database team I'm part of at CheckFree Corp. applied the PTF for APAR PQ66444 to our DB2 for z/OS subsystems and saw restart times improve pretty dramatically. (I'm not sure if we measured the impact in data sharing group restart, but it sure did help in restart of a single failed member of the group.)
- Take a look at what you have specified for PCLOSEN and PCLOSET. On a DB2 monitor statistics report or online display of statistics, what do you see for the number of data sets changed from read/write to read-only status (expressed as a per-minute rate)? Too much pseudo-closing means unwanted overhead, but too little can slow system restart.
- On group restart, of course, you have to deal with pages externalized to the group buffer pools (GBPs) but not yet "hardened" to disk at the time of the (simulated) disaster. What's your GBP checkpoint interval? How about the GBP write threshold percentages (I believe one is called the GBP castout threshold; the other may be called the class castout threshold)? Because you want a quick group restart, you might think of casting GBP pages out to disk somewhat aggressively.
- Do you have DB2 fast log apply set up? We have the ZPARM parameter LOGAPSTG set to 100M (the maximum) in production. By the way, our lead DB2 for z/OS systems programmer once told me that if he went above 99 concurrent START DATABASE commands (to clear up pages on the LPL), it appeared to him that fast log apply was turned off.
- Because the new application is online, I have to believe commit frequency is not a big problem, unless some of the transactions have long run times. The new application doesn't have a batch component, does it?
- We have three mainframes in our production parallel sysplex, but we recently went from 3-way to 6-way data sharing (we now have two DB2 members per mainframe). We did this partly to improve restart performance, figuring that spreading the same workload over more members would result in each member having fewer log records to process at restart time.
- We took some steps to boost system throughput (for example, we increased use of protected CICS-DB2 threads combined with the RELEASE(DEALLOCATE) bind option), figuring that speeding transactions through the system would result in fewer locks being held at any point in time.
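As a rough illustration of the arithmetic behind two of the points above (the per-minute pseudo-close rate, and the effect of spreading one workload over more members), here is a minimal Python sketch. The function names and every number in it are hypothetical, and it assumes the workload spreads evenly across members, which a real data sharing group will only approximate.

```python
def per_minute_rate(event_count: int, interval_seconds: float) -> float:
    """Convert a raw counter from one statistics interval into a per-minute rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return event_count * 60.0 / interval_seconds


def log_records_per_member(total_log_records: int, members: int) -> float:
    """Average log records per member, ASSUMING an even spread of the workload."""
    if members <= 0:
        raise ValueError("members must be positive")
    return total_log_records / members


if __name__ == "__main__":
    # Hypothetical figures, for illustration only (not measurements).
    pseudo_closes = 450        # data sets switched from read/write to read-only
    interval = 30 * 60         # a 30-minute statistics interval, in seconds
    rate = per_minute_rate(pseudo_closes, interval)
    print(f"pseudo-close rate: {rate:.1f} per minute")

    total = 12_000_000         # log records written by the whole group
    for n in (3, 6):
        share = log_records_per_member(total, n)
        print(f"{n}-way: {share:,.0f} log records per member")
```

The second calculation only shows why doubling the member count roughly halves each member's share of the log; it says nothing about how long restart actually takes, which depends on the other factors listed above.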