Those of you who have suffered from an unexpected cache battery failure (SRC nnnn8008) will know how much pain it can cause. Whilst there is never any danger to your data, in many cases the performance of your server can be reduced by a factor of ten!
So this month I thought I would show you a simple way to check on your cache batteries whilst avoiding the hassle of using SST (System Service Tools). Instead, we can use a simple API created by IBM to check the state of all your cache batteries and give you all the details in a single spool file.
What are Cache Batteries?
Let me start with a bit of background on disk controllers, disk caches and of course cache batteries. If you know this stuff already, then feel free to skip to the next section which explains how to use QSMBTTCC.
Back in the old days, when God was young and I had a fringe, we did not have RAID. For our crucial systems we would use the mirroring function built in to IBM i (OK, it was OS/400 back then). Now, if I am completely honest here, for many of our systems disk data protection was based on a combination of diskette backups and praying that, when a disk did die, the IBM engineer could pump the data off the failing disk to its replacement. Mercifully, these engineers succeeded with these minor miracles almost every time.
Moving back to this millennium, many of us rely on hardware-based RAID controllers; these allow us to keep our data safe without having to sacrifice 50% of our raw data capacity. Now, these hardware-based controllers are fast (that is the beauty of doing the work in the silicon on these cards) but sadly our spinning disks are still, relatively speaking, slow. Even the latest super solid state drives are snail-like in their ability to write data when you compare them to IBM i’s appetite to write it out.
So, those clever boffins at IBM decided that they would introduce a small amount of fast memory that sits on the RAID card and acts as a buffer for these data write requests, and so the Write Cache was born.
Now the problem we have here is that the only kind of memory fast enough to keep up was, by its very nature, more like RAM than disk. The problem with RAM is that it is “volatile”, which means that when you turn off the electricity the contents of the memory are lost. So IBM added batteries to keep the data in the write cache safe from unexpected power loss.
At this point you might be forgiven for thinking, “Surely it’s just a few bytes of data? Can it really be that big a deal?” Well, in a word, YES. IBM i prides itself on keeping your data safe, and for many of IBM’s engineers the thought of unplanned data loss is truly the stuff of nightmares. But being practical here, modern disk caches are usually measured in gigabytes, not megabytes or kilobytes, and a gigabyte of data in your typical DB2 file goes a long way, especially when it is likely to be the data you use the most.
Sadly, batteries don’t last forever and IBM, being the cautious creatures they are, don’t want to risk a battery failing when it’s needed, so they built in a semi-intelligent timer which typically lasts 1000 days.
ProTip: If you keep your server in a warm environment, you can expect this to adjust the counter and your batteries will be turned off sooner.
The QSMBTTCC API has shipped as standard since IBM i v7.2, but it can be used as far back as v5.4. If you find you don’t have it, simply install the appropriate PTF from the list below:
This one really could not be easier. It is a program provided by IBM and it has no parameters, so simply issue the following command: CALL QSMBTTCC
This will generate a spool file and automatically display it for you. Note that the spool file remains after you finish displaying it. The first page of this spool file shows you system information, which is useful, but the really interesting stuff comes up when you page down. Below is a sample screenshot.
It’s a busy screen but don’t be put off, the stuff you need to know is neatly listed in the bottom right.
In this example we can see we have four controller cards with cache batteries, three of which are fine and dandy (DC01, DC02, DC06) as they all have over 1000 days to error (this is when IBM i turns them off as a precaution).
But, poor old DC07 only has 2 days before you get the warning message and 80 days until its cache battery is turned off.
ProTip: When the system issues the warning it can be a little scary, and it does tend to send it to everyone who is signed on at the time. I would suggest you consider using the CHGSRVA command to send these critical messages only to users who are *SYSOPR, *SECOFR or *SECADM, and not to those who are *PGMR or *USER.
To do this, issue the CHGSRVA command and remove *PGMR and *USER from the list as shown below:
Do I need down time to replace the cache batteries?
This is where I get to give every IBMer’s favourite answer “it depends” but unlike most of those answers, this one is quite easy to find out for sure. To do this we need to look at that screenshot again but this time we need to check the Maintainable Battery Pack info up in the top right:
In this example, we can see all four cards can have their battery packs changed whilst the system is up. It is quite common to see a mix of YES and NO here, as the disk controller that runs the disks in the CEC on POWER5, 6 & 7 servers was not concurrently maintainable and so needed downtime for a battery replacement.
ProTip: Even if you have concurrently maintainable batteries like this, DO NOT replace them until you have turned off the cache and the SAFELY REPLACE entry next to your RAID cards says YES.
How can I have good performance without cache batteries?
There is light at the end of this tunnel. If you want great disk performance but you do not want cache batteries, then you do have options:
Firstly, buy a POWER8. The internal disk controller for the disks in the CEC is super-fast, has a huge cache and no cache batteries (I guess someone found a fast non-volatile cache memory or maybe even a fabulous capacitor technology?).
Secondly, use external disk. If you use a SAN to present your disks, then you can use Fibre Channel cards in your Power server to connect to the SAN, these do not have cache batteries. I admit the SAN does have batteries but these do not need changing every three years, in most cases they last the life of the SAN.
Can I automate this process?
If you are feeling like cutting a bit of code, there is no reason why you could not wrap this command in a bit of CL and get it to email you if the number of days until warning is less than 30 days on one or more adaptors. If you don’t know how to do this, just reach out to your friendly IBM Business Partner or ISV, they should be able to sort you out.
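As a rough illustration of the check itself, here is a minimal sketch in Python rather than CL. It assumes you have already pulled the days-to-warning and days-to-error figures out of the spool file, so it does not attempt to parse the QSMBTTCC report. The names BatteryStatus, needs_attention and build_alert are made up for this example, and only the DC07 figures (2 and 80 days) come from the sample above; the other values are placeholders.

```python
from dataclasses import dataclass

# Threshold suggested in the text: alert when fewer than 30 days remain.
WARNING_THRESHOLD_DAYS = 30


@dataclass
class BatteryStatus:
    """Hypothetical record for one adapter, filled in from the spool file."""
    resource: str          # adapter resource name, e.g. "DC07"
    days_to_warning: int   # days until the system issues the warning message
    days_to_error: int     # days until the cache battery is turned off


def needs_attention(batteries, threshold=WARNING_THRESHOLD_DAYS):
    """Return the adapters whose days-to-warning is below the threshold."""
    return [b for b in batteries if b.days_to_warning < threshold]


def build_alert(batteries, threshold=WARNING_THRESHOLD_DAYS):
    """Return an alert message body, or None when nothing needs attention."""
    flagged = needs_attention(batteries, threshold)
    if not flagged:
        return None
    lines = [
        f"{b.resource}: {b.days_to_warning} days to warning, "
        f"{b.days_to_error} days to error"
        for b in flagged
    ]
    return "Cache batteries needing attention:\n" + "\n".join(lines)


# Placeholder values for DC01, DC02 and DC06; DC07 matches the sample report.
sample = [
    BatteryStatus("DC01", 930, 1010),
    BatteryStatus("DC02", 930, 1010),
    BatteryStatus("DC06", 930, 1010),
    BatteryStatus("DC07", 2, 80),
]

if __name__ == "__main__":
    message = build_alert(sample)
    if message is not None:
        print(message)  # hand this text to whatever mail routine you use
```

Keeping the threshold check separate from the report parsing and the email step means each piece can be swapped out, whether you end up doing the real work in CL or elsewhere.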
Nice to see you
We will then be back in my home town of Wolverhampton on Thursday 2nd November.
More details and a booking form are available at our website www.i-ug.co.uk