On 4/15/2024 1:32 PM, Edward Rawde wrote:
"Don Y" <blockedofcourse@foo.invalid> wrote in message
news:uvjn74$d54b$1@dont-email.me...
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations? If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA 1234? Without any data corruption
(CRC error) AND within the normal access time limits defined by the location
of those magnetic domains on the rotating medium?
Why should it attempt to retry this MORE than once?
Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?
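Something along these lines is all it would take for the drive -- or the
host -- to notice that retries have stopped being exceptional. A sketch
only; read_lba_once() and report_degradation() are hypothetical stand-ins
for whatever the real firmware/driver provides:

    /* Sketch only: count the retries a read needed and keep a lifetime
     * total; flag the device as suspect once retries become routine.
     * read_lba_once() and report_degradation() are hypothetical. */
    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_RETRIES      3
    #define SUSPECT_RETRIES  100   /* lifetime retries before we complain */

    extern int  read_lba_once(uint32_t lba, void *buf);      /* hypothetical */
    extern void report_degradation(const char *, uint32_t);  /* hypothetical */

    static uint32_t total_retries;

    bool read_lba(uint32_t lba, void *buf)
    {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            if (read_lba_once(lba, buf) == 0) {
                total_retries += attempt;       /* attempts beyond the first */
                if (total_retries > SUSPECT_RETRIES)
                    report_degradation("disk", total_retries);
                return true;
            }
        }
        return false;                           /* hard failure */
    }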
Assuming you have properly configured an EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?
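On a Linux host, at least, you can ask: the serial driver keeps per-port
error counts and, for UART drivers that implement it, hands them over via
the TIOCGICOUNT ioctl. A minimal sketch:

    /* Ask a Linux serial driver how many parity/framing/overrun errors
     * it has quietly absorbed.  Assumes the UART driver implements the
     * TIOCGICOUNT ioctl (many do, not all). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/serial.h>

    int main(void)
    {
        struct serial_icounter_struct ic;
        int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);

        if (fd < 0 || ioctl(fd, TIOCGICOUNT, &ic) < 0) {
            perror("serial error counters");
            return 1;
        }
        printf("parity=%d frame=%d overrun=%d buf_overrun=%d\n",
               ic.parity, ic.frame, ic.overrun, ic.buf_overrun);
        close(fd);
        return 0;
    }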
I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.
The controller COULD be watching this (cuz it knows when it
initiated the operation and there is an "end-of-stroke"
sensor available) and KNOW that the drive belt was stretching
to the point where it was impacting operation.
[And, that a stretched belt wasn't going to suddenly decide to
unstretch to fix the problem!]
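A sketch of what that controller could do with data it already has in
hand; start_tray_open(), end_of_stroke_hit(), millis() and log_warning()
are hypothetical stand-ins for the real firmware hooks:

    /* Sketch: time the tray from "go" to the end-of-stroke sensor and
     * complain when it takes much longer than it did when new. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NOMINAL_OPEN_MS  900   /* travel time with a fresh belt */
    #define WARN_FACTOR      2     /* complain at twice the nominal time */

    extern void     start_tray_open(void);      /* hypothetical */
    extern bool     end_of_stroke_hit(void);    /* hypothetical */
    extern uint32_t millis(void);               /* hypothetical tick counter */
    extern void     log_warning(const char *);  /* hypothetical */

    void open_tray(void)
    {
        uint32_t t0 = millis();

        start_tray_open();
        while (!end_of_stroke_hit())
            ;                                   /* real code would time out */

        if (millis() - t0 > (uint32_t)NOMINAL_OPEN_MS * WARN_FACTOR)
            log_warning("tray belt likely stretching; service soon");
    }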
Back in that era I was doing a lot of repair work when I should have been
doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
The goal isn't to predict ALL failures but, rather, to anticipate
LIKELY failures and treat them before they become an inconvenience
(or worse).
One morning, the (gas) furnace repeatedly tried to light as the
thermostat called for heat. Then, a few moments later, the
safeties would kick in and shut down the gas flow. This attracted my
attention as the LIT furnace should STAY LIT!
The furnace was too stupid to notice its behavior so would repeat
this cycle, endlessly.
I stepped in and overrode the thermostat to eliminate the call
for heat as this behavior couldn't be productive (if something
truly IS wrong, then why let it continue? and, if there is nothing
wrong with the controls/mechanism, then clearly it is unable to meet
my needs so why let it persist in trying?)
[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to
temperature as quickly as the designers had expected]
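All the controller lacked was a retry limit and a lockout instead of
cycling forever. A sketch of that logic; attempt_ignition(),
flame_proven() and lockout() are hypothetical controller hooks:

    /* Sketch of the missing logic: give up after a few failed ignition
     * attempts and flag the fault rather than repeating endlessly. */
    #include <stdbool.h>

    #define MAX_IGNITION_TRIES  3

    extern bool attempt_ignition(void);  /* hypothetical: true if flame lit */
    extern bool flame_proven(void);      /* hypothetical: flame still present */
    extern void lockout(const char *);   /* hypothetical: close gas, flag fault */

    void call_for_heat(void)
    {
        for (int tries = 0; tries < MAX_IGNITION_TRIES; tries++) {
            if (attempt_ignition() && flame_proven())
                return;                  /* burner running normally */
        }
        /* Something is wrong, or conditions won't support normal operation;
         * persisting can't be productive, so stop and say so. */
        lockout("repeated ignition failures");
    }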
A component could fail suddenly, such as a short-circuited diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor-quality assembly, but the exact cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
A component could fail suddenly as a side effect of another failure.
One output transistor goes short circuit, and several other components
burn up along with it.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
It would often be easy to locate the dried-up electrolytic, since it had
already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want to
be in a position to get the existing TV repaired or replaced as quickly as
possible.
Two TVs are affordable. Consider two controllers for a wire-EDM machine.
Or, the cost of having that wire-EDM machine *idle* (because you didn't
have a spare controller!)
My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance notice
that allows for preemptive action to be taken (and not SO much advance
notice that the user ends up replacing items prematurely).
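In practice that can be as simple as smoothing some normalized health
metric (tray travel time, retry rate, supply ripple...) and warning when
it drifts past a "get a spare" level well short of the level where the
function actually quits. A sketch, with made-up thresholds:

    /* Sketch: smooth a health metric normalized to 1.0-when-new and
     * raise a warning well before the level at which the function
     * actually quits.  All thresholds here are made up. */
    #include <stdio.h>

    #define ALPHA       0.05   /* smoothing factor */
    #define WARN_LEVEL  1.5    /* 150% of the as-new baseline */
    #define FAIL_LEVEL  2.5    /* where experience says it stops working */

    static double ewma = 1.0;  /* metric starts at its as-new baseline */

    void observe(double normalized_metric)
    {
        ewma = ALPHA * normalized_metric + (1.0 - ALPHA) * ewma;

        if (ewma >= FAIL_LEVEL)
            printf("failed (or about to): act now\n");
        else if (ewma >= WARN_LEVEL)
            printf("degrading: schedule replacement at leisure\n");
    }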
I don't think it would be possible to "watch" everything because it's rare
that you can properly test a component while it's part of a working system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?
You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?
[Why have ECC at all?]
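For the "telling you" part: on a Linux machine with the EDAC driver for
its memory controller loaded, the corrected and uncorrected counts sit in
sysfs waiting to be read. A minimal sketch:

    /* Read how many errors ECC has corrected (ce_count) and how many it
     * could only detect (ue_count).  Assumes a Linux box with the EDAC
     * driver for its memory controller loaded. */
    #include <stdio.h>

    static long read_count(const char *path)
    {
        long n = -1;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%ld", &n) != 1)
                n = -1;
            fclose(f);
        }
        return n;
    }

    int main(void)
    {
        printf("corrected:   %ld\n",
               read_count("/sys/devices/system/edac/mc/mc0/ce_count"));
        printf("uncorrected: %ld\n",
               read_count("/sys/devices/system/edac/mc/mc0/ue_count"));
        return 0;
    }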
There are (and have been) many efforts to *predict* lifetimes of
components (and, systems). And, some work to examine the state
of systems /in situ/ with an eye towards anticipating their
likelihood of future failure.
[The former has met with poor results -- predicting the future
without a position in its past is difficult. And, knowing how
a device is "stored" when not powered on also plays a role
in its future survival! (is there some reason YOUR devices
can't power themselves on, periodically; notice the environmental
conditions; log them and then power back off?)]
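A sketch of that periodic self-check; every hardware hook here is
hypothetical:

    /* Sketch: wake on an RTC alarm, record the storage environment,
     * go back to sleep. */
    #include <stdint.h>

    #define CHECK_INTERVAL_S  (24u * 60u * 60u)      /* once a day */

    extern float read_temperature_c(void);           /* hypothetical sensor */
    extern float read_humidity_pct(void);            /* hypothetical sensor */
    extern void  append_log(float t_c, float rh);    /* hypothetical NV log */
    extern void  rtc_schedule_wakeup(uint32_t secs); /* hypothetical RTC alarm */
    extern void  power_down(void);                   /* hypothetical; no return */

    void storage_checkin(void)                       /* runs at each wakeup */
    {
        append_log(read_temperature_c(), read_humidity_pct());
        rtc_schedule_wakeup(CHECK_INTERVAL_S);
        power_down();
    }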
The question is one of a practical nature: how much does it cost
you to add this capability to a device and how accurately can it
make those predictions (thus avoiding some future cost/inconvenience).
For small manufacturers, the research required is likely not cost-effective;
just take your best stab at it and let the customer "buy a replacement"
when the time comes (hopefully, outside of your warranty window).
But, anything you can do to minimize this TCO issue gives your product
an edge over competitors. Given that most devices are smart, nowadays,
it seems obvious that they should undertake as much of this task as
they can (conveniently) afford.
<https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>
<https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>
<https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>
<https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>
<https://ieeexplore.ieee.org/document/1656125>
<https://journals.sagepub.com/doi/10.1177/0142331208092031>
[Sorry, I can't publish links to the full articles]