IBM i 7.2: Finding the job that is killing your system


In my last article, I mentioned that it would be nice if there was a tool that you could use to find the cause of performance problems on your system. I knew that IBM had been improving the Performance Tools that come with IBM i, but I hadn’t realised just how much it had packed in.

The level of detail you can have access to as standard is literally mind-blowing! Well, it is if your mind is anything like mine. And even if you are a little more normal than me when it comes to things that blow your mind, at the very least you should find it interesting.

What’s probably most staggering of all is that IBM seems to have created an interface for this information that is easy to understand and, given the complexity of what it is showing, surprisingly easy to navigate.

This is all part of IBM Navigator for i, which is free of charge with your standard system software, runs on your IBM i server by default and is accessible through a standard web browser (tip: for best results use Firefox).

The keys to this for me were the simple Health Indicator screens that tell you where to start to look for your performance problem. Below is an example from one of my test systems, on which I placed a workload which is light on processor and memory but very disk-hungry.

Performance Tools 1

The ocean of green for the CPU, memory, response time and DB tells me that the problem does not lie there. Instead, your eye is immediately drawn to the red section on the disk, and now we know where to start looking.

Taking a step back for a moment: when most people have a performance problem, they immediately type WRKACTJOB or maybe WRKSYSACT and, to be honest, this is still the right thing to do as, in many cases, you will know your system well enough to spot the unusual.
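For reference, that first look is just two commands (WRKSYSACT needs the Performance Tools licensed program installed, and both need a decent elapsed interval before the numbers mean much):

    WRKACTJOB   /* Work with Active Jobs: F10 restarts the elapsed      */
                /* statistics; wait a minute or two, then F5 to refresh */
    WRKSYSACT   /* Work with System Activity, from Performance Tools    */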

But what these screens show you are just numbers. They don’t tell you whether the numbers you are seeing are good, bad or downright ugly.

So, if, for example, all I had to go on was the screenshot below from WRKSYSACT, then I would most likely point you to processor as your first issue rather than disk.

This is partly because the elapsed time interval is ludicrously short (but this is often what I get sent) but mostly it is because I have no frame of reference to know whether the number of I/Os reported is out of tolerance for the system it is running on.

To work that out, I’d have to get the rack configuration listing and then plough through a number of manuals on performance guidelines and I/O capabilities for this exact combination of server model, disk controller(s), disk arms, RAID type(s) and, of course, disk model(s). By which time both I and the users of my problem server would probably have lost the will to live.

Finding the problem job

Switching back to the issue at hand, which is finding the problem job: from the same browser interface, you can drill down into the data to get to the root cause. In this case, the next step was to select the Disk Health Indicator view.

Performance Tools 3

This showed that it was not that my disks were close to capacity, but that I was getting poor disk response times, possibly caused by the disks being too busy.

The next stage is to drill a little deeper and see which jobs are using the most disk I/O, and whether the problem is with reads or writes or both. This is, again, just a couple of clicks, and now we can get at the Wait Data. This is where it really starts to get interesting.

Performance Tools 4

For each job on the system, in real time, I get to see exactly what makes up the response time of the job. I can see how much time each job waits on processor, paging, disk, journaling, locks and more. So, in this case, I can quickly see that more than 50% of the job’s overall run time is spent waiting for disk writes to complete.

Now, assuming that this was a valid workload and that the client wanted it to run during normal hours, I know straight away that I need to increase their system’s ability to write data to disk.

This doesn’t mean that they need to migrate to SSDs. In fact, in this case, that investment, with these disk controllers, wouldn’t actually improve performance at all. Instead, I can see that an improved disk controller with a nice big write cache is where I should start.

If you are running v7.1 or later, regardless of what type of hardware you are running it on, the chances are that you already have everything you need to monitor performance. Better still, unless you’ve gone out of your way to stop it, the system will have been monitoring performance in the background so you will already have the data to analyse.

Check that you have the Admin web server running; there should be a series of ADMIN jobs in subsystem QHTTPSVR. If not, you can start it with STRTCPSVR SERVER(*HTTP) HTTPSVR(*ADMIN). Then point your browser to the IP address of your server on port 2001, i.e. https://YourServerIP:2001. You should get a login screen, and from there the fun begins.
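Spelled out as commands, that check and start look like this:

    WRKACTJOB SBS(QHTTPSVR)                  /* look for the ADMIN jobs  */
    STRTCPSVR SERVER(*HTTP) HTTPSVR(*ADMIN)  /* start the Admin instance */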

If the system you want to analyse is already running poorly, or is only at v6.1, then you can always save the performance data, restore it to a different system which has more “va va voom” and analyse it there. If you are doing the analytics on v7.1, get the latest PTFs: it really will help. But if you have the option, analyse on v7.2. It is loads better!
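If you do go down the save-and-restore route, a rough sketch is below. It assumes Collection Services is writing to its default QPFRDATA library (check your setup with CFGPFRCOL), and the save file and collection names here are made up for illustration:

    /* On the problem system: save the Collection Services library */
    CRTSAVF FILE(QGPL/PERFDTA)
    SAVLIB LIB(QPFRDATA) DEV(*SAVF) SAVF(QGPL/PERFDTA)

    /* On the analysis system: restore it, then, if the database  */
    /* files are not already there, generate them from the saved  */
    /* management collection object (the Q... name is an example) */
    RSTLIB SAVLIB(QPFRDATA) DEV(*SAVF) SAVF(QGPL/PERFDTA)
    CRTPFRDTA FROMMGTCOL(QPFRDATA/Q123456789) TOLIB(QPFRDATA)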

There is so much more in this tool that I really don’t know where to start, but I think it is worth at least one more article, and I would welcome your input as to where to focus next. You can contact me via the contact form below or through my website.

Finally, a quick reminder. We have a couple of i-UG meetings coming up, one on September 11 at Norton Grange, Rochdale (which you can also read about on PowerWire here) and another in mid-November in the Greater London Area. I hope to see you there: more details at www.i-ug.com.
