On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA and diagnostics.
>
How commonly are such mechanisms implemented? And, how thoroughly?
>
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
>
Not all devices are covered by "regulations".
Well, if the device matters then there is implied liability
and nobody wants to admit doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and that software bugs are a major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
And among hardware failures, transient upsets like flipped
bits are more likely than permanent failures. For example,
That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that configuration registers
keep expected values.
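One common way to do that: keep shadow copies of everything written
to the peripheral and compare them on a schedule. A minimal sketch
(the register addresses are invented):

#include <stdint.h>

/* Invented PWM timer register addresses; substitute the real device's. */
#define PWM_PERIOD (*(volatile uint32_t *)0x40010000u)
#define PWM_DUTY   (*(volatile uint32_t *)0x40010004u)
#define PWM_CTRL   (*(volatile uint32_t *)0x40010008u)

/* Shadow copies of the last values written, kept in (checked) RAM. */
static uint32_t shadow_period, shadow_duty, shadow_ctrl;

void pwm_configure(uint32_t period, uint32_t duty, uint32_t ctrl)
{
    PWM_PERIOD = shadow_period = period;
    PWM_DUTY   = shadow_duty   = duty;
    PWM_CTRL   = shadow_ctrl   = ctrl;
}

/* Run periodically from the self-test slot: returns 0 if any register
   no longer holds what we last wrote, so the caller can reconfigure
   or shut down. */
int pwm_config_ok(void)
{
    return PWM_PERIOD == shadow_period
        && PWM_DUTY   == shadow_duty
        && PWM_CTRL   == shadow_ctrl;
}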
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flakey?
IIUC crystal oscillators are likely to fail
so you are supposed to regularly check for presence of the clock
and its frequency (this assumes hardware design with a backup
clock).
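One way to do that, assuming the CPU can keep running (e.g. from the
backup clock) while the crystal is in doubt: gate a counter clocked
by the crystal over a window timed by the independent RC oscillator
and check the ratio. The numbers and helper names are illustrative:

#include <stdint.h>

/* Hypothetical free-running counters: one clocked by the crystal,
   the other by the internal RC backup oscillator. */
extern uint32_t xtal_counter_read(void);
extern uint32_t rc_counter_read(void);

#define RC_WINDOW        1000u  /* backup-clock ticks per measurement */
#define RATIO_NOMINAL      24u  /* e.g. 24 MHz crystal vs. 1 MHz RC   */
#define RATIO_TOLERANCE     3u  /* generous: RC oscillators drift     */

/* Returns 0 if the crystal appears dead or far off frequency. */
int crystal_ok(void)
{
    uint32_t rc0   = rc_counter_read();
    uint32_t xtal0 = xtal_counter_read();

    /* Busy-wait for the RC window; acceptable inside a self-test slot. */
    while ((uint32_t)(rc_counter_read() - rc0) < RC_WINDOW)
        ;

    uint32_t ratio = (uint32_t)(xtal_counter_read() - xtal0) / RC_WINDOW;

    return ratio >= RATIO_NOMINAL - RATIO_TOLERANCE
        && ratio <= RATIO_NOMINAL + RATIO_TOLERANCE;
}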
Even if some safety critical software
does not contain them, nobody is going to admit violating regulations.
And things like PLCs are "dual use"; they may be used in a non-safety
role, but vendors claim compliance to safety standards.
>
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
I do not know how PLC manufacturers implement checks. Small
PLCs are based on MCUs with static, parity-protected RAM.
This may be deemed adequate. PLCs work in cycles and some
percentage of the cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
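A toy version of that walking check (one region per scan cycle,
stuck-at patterns only, not a full march test; the addresses and
sizes are made up, and the region holding the tester's own stack and
buffers must be excluded):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical bounds of the RAM area under test and region size. */
#define TEST_BASE    ((volatile uint32_t *)0x20000000u)
#define TEST_WORDS   (16u * 1024u)       /* 64 KB of 32-bit words   */
#define REGION_WORDS 64u                 /* words checked per cycle */

extern void irq_disable(void);
extern void irq_enable(void);

/* One self-test slot per control cycle: save a small region, write
   test patterns, verify, restore, then advance.  Returns 0 on a
   stuck bit. */
int ram_test_step(void)
{
    static size_t region = 0;
    volatile uint32_t *p = TEST_BASE + region * REGION_WORDS;
    uint32_t save[REGION_WORDS];
    int ok = 1;

    irq_disable();                       /* nobody else may touch it now */
    for (size_t i = 0; i < REGION_WORDS; i++)
        save[i] = p[i];

    for (size_t i = 0; i < REGION_WORDS; i++) {
        p[i] = 0xAAAAAAAAu;
        if (p[i] != 0xAAAAAAAAu) ok = 0;
        p[i] = 0x55555555u;
        if (p[i] != 0x55555555u) ok = 0;
    }

    for (size_t i = 0; i < REGION_WORDS; i++)
        p[i] = save[i];
    irq_enable();

    region = (region + 1) % (TEST_WORDS / REGION_WORDS);
    return ok;
}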
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
I am afraid that usually an inconsequential failure gets
promoted to a complete failure. Before 2000, checking showed
that several BIOSes "validated" the date and an "incorrect"
(that is, after 1999) date prevented boot.
If *a* failure resulted in a catastrophic failure, things would
be "acceptable" in that the user would KNOW that something is
wrong without the device having to tell them.
But, too often, faults can be "absorbed" or lead to unobservable
errors in operation. What then?
Somewhere, I have a paper where the researchers simulated faults
*in* various OS kernels to see how "tolerant" the OS was of these
faults (which we know *happen*). One would think that *any*
fault would cause a crash. Yet, MANY faults are sufferable
(depending on the OS).
Consider, if a single bit error converts a "JUMP" to a "JUMP IF CARRY"
but the carry happens to be set, then there is no difference in the
execution path. If that bit error converts a "saturday" into a
"sunday", then something that is intended to execute on weekdays (or
weekends) won't care. Etc.
Historically OSes had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
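In practice that decision ends up as a policy threshold: count the
corrected errors and act once the rate exceeds what your reliability
model tolerates. A sketch (the window and limit are purely
illustrative, not taken from any standard):

#include <stdint.h>

#define CE_WINDOW_SECONDS   3600u   /* monitoring window              */
#define CE_LIMIT_PER_WINDOW   10u   /* corrected errors we tolerate   */

static uint32_t ce_count;
static uint32_t window_start;

extern uint32_t seconds_now(void);  /* hypothetical monotonic clock   */

/* Call from the ECC "corrected error" interrupt/trap handler.
   Returns 1 when the corrected-error rate exceeds policy, i.e. it is
   time to schedule replacement or fail over, not merely log. */
int ecc_corrected_error_reported(void)
{
    uint32_t now = seconds_now();

    if ((uint32_t)(now - window_start) >= CE_WINDOW_SECONDS) {
        window_start = now;         /* start a fresh window */
        ce_count = 0;
    }
    return ++ce_count > CE_LIMIT_PER_WINDOW;
}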
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values that it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
>
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
Everything here is domain specific. In a cheap MCU-based device the main
source of failures is overvoltage/ESD on MCU pins. This may
kill the whole chip, in which case no software protection can
help. Or some pins fail; sometimes this may be detected by reading
the appropriate port. If you control an electric motor then you probably
do not want to send test signals during normal motor operation.
That depends on HOW you generate your test signals, what the hardware
actually looks like and how sensitive the "mechanism" is to such
"disturbances". Remember, "you" can see things faster than a mechanism
can often respond. I.e., if applying power to the motor doesn't
result in an observable load current (or "micromotion"), then the
motor is likely not responding.
But you are likely to have some feedback and can verify whether the
feedback agrees with expected values. If you get unexpected readings
you will probably stop the motor.
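Something along those lines, with all thresholds and helper names
invented for illustration:

#include <stdint.h>

/* Plausibility check of the kind described above: after commanding
   the motor, the measured load current must fall inside a window
   derived from the command. */
extern uint32_t motor_current_mA(void);   /* from shunt/ADC           */
extern void     motor_shutdown(void);     /* de-energize, latch fault */

#define I_MIN_RUNNING_mA   50u    /* below this, winding likely open  */
#define I_MAX_RUNNING_mA 4000u    /* above this, stalled or shorted   */
#define FAULT_PERSIST       5u    /* consecutive bad samples to trip  */

void motor_feedback_check(int commanded_on)
{
    static unsigned bad = 0;
    uint32_t i = motor_current_mA();

    int plausible = commanded_on
        ? (i >= I_MIN_RUNNING_mA && i <= I_MAX_RUNNING_mA)
        : (i < I_MIN_RUNNING_mA);         /* off => essentially no current */

    bad = plausible ? 0 : bad + 1;
    if (bad >= FAULT_PERSIST)
        motor_shutdown();                 /* unexpected reading: stop */
}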