Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
>> Don Y <blockedofcourse@foo.invalid> wrote:
>>> Typically, one performs some limited "confidence tests"
>>> at POST to catch gross failures. As this activity is
>>> "in series" with normal operation, it tends to be brief
>>> and not very thorough.
>>>
>>> Many products offer a BIST capability that the user can invoke
>>> for more thorough testing. This allows the user to decide
>>> when he can afford to live without the normal functioning of the
>>> device.
>>>
>>> And, if you are a "robust" designer, you often include invariants
>>> that verify hardware operations (esp to I/Os) are actually doing
>>> what they should -- e.g., verifying battery voltage increases
>>> when you activate the charging circuit, loopbacks on DIOs, etc.
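Such an invariant can be as simple as comparing an ADC reading of the
battery rail before and after switching the charger on. A minimal sketch
in C, assuming hypothetical HAL calls (adc_read_battery_mv(),
charger_enable(), delay_ms()) and a design-specific threshold:

#include <stdbool.h>
#include <stdint.h>

extern uint32_t adc_read_battery_mv(void);   /* assumed HAL functions */
extern void     charger_enable(bool on);
extern void     delay_ms(uint32_t ms);

#define MIN_CHARGE_DELTA_MV  50u   /* expected minimum rise; design-specific */

bool charger_invariant_ok(void)
{
    uint32_t before = adc_read_battery_mv();

    charger_enable(true);
    delay_ms(100);                 /* let the rail settle */

    uint32_t after = adc_read_battery_mv();

    /* Invariant: enabling the charger must visibly raise the battery
       voltage; otherwise flag the charging path as suspect. */
    return after >= before + MIN_CHARGE_DELTA_MV;
}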
>>> But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
>>> And, BIST might not always be convenient (as well as requiring the
>>> user's consent and participation).
>>>
>>> There, runtime diagnostics are the only alternative for hardware
>>> revalidation, PFA and diagnostics.
>>>
>>> How commonly are such mechanisms implemented? And, how thoroughly?
>> This is a strange question. AFAIK automatically run diagnostics/checks
>> are part of safety regulations.
> Not all devices are covered by "regulations".
Well, if the device matters then there is implied liability
and nobody wants to admit doing a bad job. If the device
does not matter, then the answer to the original question
also does not matter.
> And, the *extent* to which testing is done is the subject
> addressed; if I ensure "stuff" *WORKED* when the device was
> powered on (preventing it from continuing on to its normal
> functionality in the event that some failure was detected),
> what assurance does that give me that the device's integrity
> is still intact 8760 hours (1 yr) later? 720 hours
> (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software and that software bugs are the major source of misbehaviour.
And among hardware failures, transient upsets, like a flipped
bit, are more likely than permanent failures. For example,
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values. IIUC crystal oscillators are likely to fail,
so you are supposed to regularly check for the presence of the clock
and its frequency (this assumes a hardware design with a backup
clock).
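The register check is typically done by keeping a shadow copy of every
value written at init and comparing it against the live registers from
the periodic self-test slot. A minimal sketch (the peripheral layout,
base address and fault_handler() are made up for illustration):

#include <stdint.h>

typedef struct {
    volatile uint32_t CTRL;
    volatile uint32_t PRESCALER;
    volatile uint32_t PERIOD;      /* sets the PWM frequency */
} pwm_regs_t;

#define PWM ((pwm_regs_t *)0x40010000u)   /* made-up base address */

extern void fault_handler(void);   /* e.g. reinit or enter safe state */

static uint32_t shadow_ctrl, shadow_prescaler, shadow_period;

void pwm_init(uint32_t ctrl, uint32_t prescaler, uint32_t period)
{
    PWM->CTRL      = shadow_ctrl      = ctrl;
    PWM->PRESCALER = shadow_prescaler = prescaler;
    PWM->PERIOD    = shadow_period    = period;
}

/* Called periodically; any mismatch is treated as (possibly transient)
   corruption of the peripheral configuration. */
void pwm_config_check(void)
{
    if (PWM->CTRL      != shadow_ctrl      ||
        PWM->PRESCALER != shadow_prescaler ||
        PWM->PERIOD    != shadow_period)
        fault_handler();
}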
>> Even if some safety critical software
>> does not contain them, nobody is going to admit violating regulations.
>> And things like PLC-s are "dual use", they may be used in a non-safety
>> role, but vendors claim compliance with safety standards.
> So, if a bit in a RAM in said device *dies* some time after power on,
> is the device going to *know* that has happened? And, signal its
> unwillingness to continue operating? What is going to detect that
> failure?
I do not know how PLC manufacturers implement checks. Small
PLC-s are based on MCU-s with static, parity-protected RAM.
This may be deemed adequate. PLC-s work in cycles and some
percentage of the cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
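A cycle-sliced RAM check along those lines might look like the sketch
below: a destructive march over one small chunk per scan cycle, with the
original contents saved and restored. The address range and chunk size
are assumptions, and on a real target this has to run with interrupts
masked and must avoid the region holding its own stack and variables.

#include <stdint.h>
#include <stdbool.h>

#define RAM_START ((uintptr_t)0x20000000u)   /* hypothetical SRAM range */
#define RAM_END   ((uintptr_t)0x20010000u)
#define CHUNK     256u                       /* bytes tested per cycle */

static uintptr_t next_addr = RAM_START;

/* Test one CHUNK per scan cycle; returns false if any cell misbehaves. */
bool ram_test_step(void)
{
    volatile uint8_t *p = (volatile uint8_t *)next_addr;
    bool ok = true;

    for (uint32_t i = 0; i < CHUNK; i++) {
        uint8_t saved = p[i];

        p[i] = 0x55u; if (p[i] != 0x55u) ok = false;  /* each cell must hold */
        p[i] = 0xAAu; if (p[i] != 0xAAu) ok = false;  /* both patterns       */

        p[i] = saved;                                 /* restore contents    */
    }

    next_addr += CHUNK;
    if (next_addr >= RAM_END)
        next_addr = RAM_START;   /* wrap: start another pass over memory */

    return ok;
}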
> What if the bit's failure is inconsequential to the operation
> of the device? E.g., if the bit is part of some not-used
> feature? *Or*, if it has failed in the state it was *supposed*
> to be in??!
I am afraid that usually an inconsequential failure gets
promoted to a complete failure. Before 2000, checking showed
that several BIOS-es "validated" the date, and an "incorrect" (that
is, after 1999) date prevented boot.

Historically, OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
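The disk analogy would amount to something like a bitmap of retired
page frames that a (hypothetical) frame allocator consults; the names
and sizes below are purely illustrative:

#include <stdint.h>
#include <stdbool.h>

#define NUM_FRAMES 4096u                     /* illustrative frame count */

static uint32_t bad_frame_map[NUM_FRAMES / 32u];

/* Called when a runtime test (or an ECC scrubber) condemns a frame. */
void mark_frame_bad(uint32_t frame)
{
    bad_frame_map[frame / 32u] |= 1u << (frame % 32u);
}

/* The frame allocator would skip any frame for which this returns true,
   just as old filesystems skipped blocks listed in the bad-block map. */
bool frame_is_bad(uint32_t frame)
{
    return (bad_frame_map[frame / 32u] >> (frame % 32u)) & 1u;
}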
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since last time they were accessed). That the memory
is capable of storing any values that is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make high
reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
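The "data/code intact" part of such a POST is usually just a checksum
over the image compared against a reference stored at build time; a
minimal sketch (the flash base, image size and CRC location are
assumptions):

#include <stdint.h>
#include <stdbool.h>

#define APP_START ((const uint8_t *)0x08000000u)  /* hypothetical flash base */
#define APP_SIZE  (64u * 1024u)

/* Reference CRC placed by the build tools right after the image. */
#define APP_CRC   (*(const uint32_t *)(APP_START + APP_SIZE))

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32(const uint8_t *data, uint32_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *data++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Run at POST; could also be re-run periodically as a runtime diagnostic. */
bool image_intact(void)
{
    return crc32(APP_START, APP_SIZE) == APP_CRC;
}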
Everything here is domain specific. In a cheap MCU-based device the main
source of failures is overvoltage/ESD on MCU pins. This may
kill the whole chip, in which case no software protection can
help. Or some pins fail; sometimes this may be detected by reading
the appropriate port. If you control an electric motor then you probably
do not want to send test signals during normal motor operation.
But you are likely to have some feedback and can verify that the feedback
agrees with expected values. If you get unexpected readings
you will probably stop the motor.
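That kind of plausibility check can be as small as comparing commanded
versus measured speed each control cycle and tripping after a few
consecutive out-of-window samples; a sketch with hypothetical
encoder_read_rpm()/motor_stop() and made-up thresholds:

#include <stdint.h>
#include <stdlib.h>

#define MAX_RPM_ERROR  200    /* allowed deviation; design-specific */
#define MAX_BAD_CYCLES 10     /* consecutive out-of-window samples  */

extern int32_t encoder_read_rpm(void);   /* assumed feedback source   */
extern void    motor_stop(void);         /* assumed safe-state action */

/* Call once per control cycle with the current speed command. */
void motor_feedback_check(int32_t commanded_rpm)
{
    static uint32_t bad_cycles = 0;

    int32_t error = encoder_read_rpm() - commanded_rpm;

    if (abs(error) > MAX_RPM_ERROR) {
        if (++bad_cycles >= MAX_BAD_CYCLES)
            motor_stop();     /* feedback disagrees: go to a safe state */
    } else {
        bad_cycles = 0;       /* back in the window: reset the counter */
    }
}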
-- Waldek Hebisch