Subject : Re: Predictive failures
From : blockedofcourse (at) *nospam* foo.invalid (Don Y)
Newsgroups : sci.electronics.design
Date : 19. Apr 2024, 00:05:07
Organisation : A noiseless patient Spider
Message-ID : <uvs5eu$2g9e9$2@dont-email.me>
References : 1 2
User-Agent : Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
On 4/18/2024 10:18 AM, Buzz McCool wrote:
> On 4/15/2024 10:13 AM, Don Y wrote:
>> Is there a general rule of thumb for signalling the likelihood of
>> an "imminent" (for some value of "imminent") hardware failure?
> This reminded me of some past efforts in this area. It was never
> demonstrated to me (given ample opportunity) that this technology
> actually worked on intermittently failing hardware I had, so be
> cautious in applying it in any future endeavors.
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
whack-a-mole.
> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so its addition will
be welcome.
Sun has historically been aggressive in trying to increase availability,
especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).
I am now seeing similar features in Dell servers. But, the *actual*
implementation details are always shrouded in mystery.
But, it is obvious (for "always on" systems) that there are many things
that can silently fail and only manifest some time later -- if at
all -- possibly complicated by other failures that the original
fault precipitated.
Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked-in "hard limit".
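The trend-watching can be as simple as smoothing the metric's drift and
warning when it climbs. A minimal sketch -- the smoothing factor and
threshold here are illustrative, not taken from any real product:

```python
# Hypothetical sketch: flag a drifting metric (e.g., a supply voltage or
# an error-count rate) BEFORE it crosses a hard limit, by smoothing the
# sample-to-sample change with an exponentially weighted moving average.

class TrendWatch:
    def __init__(self, alpha=0.2, slope_warn=0.5):
        self.alpha = alpha            # EWMA smoothing factor
        self.slope_warn = slope_warn  # warn when smoothed drift exceeds this
        self.ewma = None              # smoothed value
        self.slope = 0.0              # smoothed per-sample change

    def update(self, sample):
        """Feed one sample; return True if the trend looks unhealthy."""
        if self.ewma is None:         # first sample just primes the filter
            self.ewma = sample
            return False
        delta = sample - self.ewma
        self.ewma += self.alpha * delta
        self.slope = (1 - self.alpha) * self.slope + self.alpha * delta
        return self.slope > self.slope_warn

watch = TrendWatch()
for i, v in enumerate([50, 50, 50, 51, 53, 56, 60, 65, 71, 78]):
    if watch.update(v):
        print(f"sample {i}: value {v} trending out of range")
```

Note that the alarm fires while every individual reading is still "legal";
the hard limit never enters into it.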
E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors! Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory
controller throws an error).
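A software analog of a patrol read is easy to sketch: walk the whole
region in chunks and compare each chunk against a CRC recorded while the
data was known good. The chunk size and the fault-injection line are
illustrative only:

```python
# Hypothetical patrol-read sketch: periodically re-read EVERY chunk of a
# region and check its CRC, so silent corruption is found now rather than
# when the data is finally referenced.

import zlib

CHUNK = 64  # bytes per patrol chunk (illustrative)

def record_crcs(buf):
    """CRC of each chunk, captured while the data is known good."""
    return [zlib.crc32(buf[i:i + CHUNK]) for i in range(0, len(buf), CHUNK)]

def scrub(buf, crcs):
    """Re-read every chunk; return indices whose CRC no longer matches."""
    return [n for n, i in enumerate(range(0, len(buf), CHUNK))
            if zlib.crc32(buf[i:i + CHUNK]) != crcs[n]]

state = bytearray(b"\x5a" * 1024)
good = record_crcs(state)
state[300] ^= 0x01                       # simulate one flipped bit ("rot")
print("bad chunks:", scrub(state, good)) # chunk 4 covers offsets 256-319
```

In real hardware the ECC machinery plays the role of the CRC, but the
principle is the same: nothing is checked until something *reads* it.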
This is paradoxically amusing; code to HANDLE errors is likely the least
accessed code in a product. So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event is then complicated by the attendant error in the
handler!
The more reliable your code (fewer faults), the more uncertain you
will be of the handlers' abilities to address faults that DO manifest!
The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that
technology]
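One way to sketch that scrubbing in software: hash every file against a
manifest recorded while the files were known good, and re-run the check
on a schedule. Paths and the manifest format here are hypothetical:

```python
# Hypothetical filesystem-scrub sketch: digest every file under a root
# and compare against digests recorded earlier, so a rarely-read file
# that has rotted is flagged NOW, not when it is finally needed.

import hashlib
import os
import tempfile

def sha256_file(path):
    """Streamed SHA-256 of one file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Record a digest for every file while they are known good."""
    manifest = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            manifest[p] = sha256_file(p)
    return manifest

def scrub(manifest):
    """Re-hash everything; return files that changed or vanished."""
    return [p for p, digest in manifest.items()
            if not os.path.exists(p) or sha256_file(p) != digest]

# demo against a throwaway directory
root = tempfile.mkdtemp()
with open(os.path.join(root, "archive.bin"), "wb") as f:
    f.write(os.urandom(4096))
manifest = build_manifest(root)
print("damaged files:", scrub(manifest))
```

RAID controllers that offer patrol reads do essentially this at the block
level; relying on the array without enabling that scrub leaves the same
blind spot the sketch closes.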