Subject : RX2800 sporadic disk I/O slowdowns
From : usenet (at) *nospam* cropcircledogs.com (Richard Jordan)
Newsgroups : comp.os.vms
Date : 18. Oct 2024, 19:26:53
Organisation : A noiseless patient Spider
Message-ID : <veu99d$3derp$1@dont-email.me>
User-Agent : Mozilla Thunderbird
RX2800 i4 server, 64GB RAM, 4 processors, P410i controller with ten 2TB disks in a RAID 6 set, broken into multiple volumes.
Periodically (sometimes steadily once a week, sometimes more often) one overnight batch job takes much longer than normal to run: a normal runtime of about 30-35 minutes stretches to 4.5-6.5 hours. Several images called by that job all run much slower than normal. At the end, the overall CPU and I/O counts are very close between a normal run and a long one.
The data files are very large indexed files. Records are read and updated but not added in this job; output is just tabulated reports.
We've run MONITOR for all classes and for the disks, and also built polling snapshot jobs that check for locked/busy files and other active batch jobs, plus automated checks through the system analyzer looking for any other processes accessing the busy files at the same time as the problem batch. Two data files show long busy periods, but we do not see any other process with channels to those files at the same time, except for backup (see next).
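For reference, the polling snapshot job is roughly along these lines (device name and interval here are placeholders, not our real ones):

  $ LOOP:
  $ ! list open files on the volume and the processes holding channels to them
  $   SHOW DEVICE/FILES DKA100:
  $   WAIT 00:05:00
  $   GOTO LOOP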
The backups start at the same time, but they do not reach the data disks until well after the problem job normally completes. That does cause concurrent access to the problem files, but only when the job has already run long, so it is not the cause. Overall backup time is about the same regardless of how long the problem batch takes.
MONITOR during a long run shows average and peak I/O rates to the disks holding the busy files at about half of what they are during normal runs. We can see that in the process snapshots too; the direct I/O count on a slow run climbs much more slowly than on a normal run, yet both normal and long runs end up with close to the same CPU time and total I/Os.
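For the curious, the disk monitoring is along these lines (qualifiers approximate):

  $ MONITOR DISK/ITEM=OPERATION_RATE/INTERVAL=5
  $ MONITOR DISK/ITEM=QUEUE_LENGTH/INTERVAL=5

Queue length is the one I keep coming back to: if the controller were saturated I'd expect it to climb during the long runs, and if the job is stalled somewhere above the disk it should stay near zero.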
Other jobs visible in MONITOR are somewhat slowed down, but nowhere near as much (and they do much less file access).
Before anyone asks: the indexed files could probably use a cleanup/rebuild, but if that's the cause, would we see periodic performance issues? I would expect them to be constant.
There is a backup server available, so I'm going to restore backups of the two problem files to it and rebuild them there to see how long it takes; that will determine how and when we can do the same on the production server.
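The rebuild itself would be the usual ANALYZE/CONVERT pass, something like this (file names are placeholders):

  $ ! generate an FDL, with analysis data, from the current file
  $ ANALYZE/RMS_FILE/FDL BIGFILE.IDX
  $ ! optionally tune the FDL with EDIT/FDL, then rebuild
  $ CONVERT/FDL=BIGFILE.FDL/STATISTICS BIGFILE.IDX BIGFILE_NEW.IDX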
So something is apparently causing the job to be I/O constrained, but so far we can't find it. The same concurrent processes are present, and other jobs don't appear to be slowed down much (though they may be much less I/O-sensitive, or may be using data on other disks; I've put that question to the devs).
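Next long run I'll also watch what state the job actually sits in, to separate waiting on disk I/O from being blocked on something else; roughly (the PID is a placeholder):

  $ SHOW SYSTEM/BATCH        ! note the STATE column (LEF vs RW* waits)
  $ SHOW PROCESS/CONTINUOUS/ID=<pid-of-the-slow-job>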
Is there anything in the background below VMS that could cause this? The controller doing drive checks or other maintenance activities?
Thanks for any ideas.