Subject: Re: Diagnostics
From: blockedofcourse (at) *nospam* foo.invalid (Don Y)
Newsgroups: comp.arch.embedded
Date: 19 Oct 2024, 00:15:30
Organization: A noiseless patient Spider
Message-ID: <veummc$3gbqs$1@dont-email.me>
References: 1 2
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
> Don Y <blockedofcourse@foo.invalid> wrote:
>> Typically, one performs some limited "confidence tests"
>> at POST to catch gross failures. As this activity is
>> "in series" with normal operation, it tends to be brief
>> and not very thorough.
>>
>> Many products offer a BIST capability that the user can invoke
>> for more thorough testing. This allows the user to decide
>> when he can afford to live without the normal functioning of the
>> device.
>>
>> And, if you are a "robust" designer, you often include invariants
>> that verify hardware operations (esp to I/Os) are actually doing
>> what they should -- e.g., verifying battery voltage increases
>> when you activate the charging circuit, loopbacks on DIOs, etc.
>>
>> But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
>> And, BIST might not always be convenient (as well as requiring the
>> user's consent and participation).
>>
>> There, runtime diagnostics are the only alternative for hardware
>> revalidation, PFA and diagnostics.
>>
>> How commonly are such mechanisms implemented? And, how thoroughly?
> This is a strange question. AFAIK automatically-run diagnostics/checks
> are part of safety regulations.
Not all devices are covered by "regulations".
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
[I.e., how long a device remains "up" is a function of the device,
its application, environment and user]
Do you just *hope* the device "happens" to fail in a noticeable
manner so a user is left with no doubt but that the device is
no longer operational?
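
By way of illustration -- none of this is from an actual product, and
all of the names below are assumptions -- a common pattern is to
rotate through a table of short, incremental test steps whenever the
CPU is otherwise idle, so every subsystem gets re-verified on some
bounded period instead of only at power-on:

    #include <stddef.h>

    /* Hypothetical sketch: cycle through short diagnostic steps
     * whenever the CPU has nothing better to do.  Each step bounds
     * its own execution time so normal operation is never
     * noticeably delayed.
     */
    typedef void (*diag_step_fn)(void);

    extern void ram_test_step(void);       /* assumed test steps */
    extern void flash_crc_step(void);
    extern void dio_loopback_step(void);

    static diag_step_fn const diag_steps[] = {
        ram_test_step,
        flash_crc_step,
        dio_loopback_step,
    };

    void idle_hook(void)   /* called from the scheduler's idle loop */
    {
        static size_t next;

        diag_steps[next]();
        next = (next + 1) % (sizeof diag_steps / sizeof diag_steps[0]);
    }
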
> Even if some safety-critical software
> does not contain them, nobody is going to admit violating regulations.
> And things like PLCs are "dual use": they may be used in a non-safety
> role, but vendors claim compliance with safety standards.
So, if a bit in a RAM in said device *dies* some time after power on,
is the device going to *know* that has happened? And, signal its
unwillingness to continue operating? What is going to detect that
failure?
What if the bit's failure is inconsequential to the operation
of the device? E.g., if the bit is part of some not-used
feature? *Or*, if it has failed in the state it was *supposed*
to be in??!
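
One answer: a background test that deliberately drives every cell to
*both* states (restoring the live contents afterward). A read-only
scrub can never expose a bit stuck at the value it happens to hold;
writing both polarities can. A minimal sketch -- region bounds,
chunk size and the failure hook are all assumptions, and on a live
region the walk must be protected against concurrent writers (e.g.,
with interrupts masked):

    #include <stdint.h>
    #include <stddef.h>

    #define CHUNK_WORDS 64   /* words tested per invocation */

    extern volatile uint32_t ram_test_start[];  /* region under test */
    extern volatile uint32_t ram_test_end[];
    extern void declare_failure(volatile void *addr);   /* app hook */

    /* Hypothetical sketch: incremental write-both-values RAM test.
     * Tests CHUNK_WORDS cells per call, resuming where it left off,
     * so the whole region is covered over many idle cycles.
     */
    void ram_test_step(void)
    {
        static volatile uint32_t *p;

        if (p == NULL || p >= ram_test_end)
            p = ram_test_start;

        for (size_t i = 0; i < CHUNK_WORDS && p < ram_test_end; i++, p++) {
            uint32_t saved = *p;      /* preserve live contents      */

            *p = 0xAAAAAAAAu;         /* every bit sees both a 1 and */
            if (*p != 0xAAAAAAAAu)    /* a 0, so a bit stuck in the  */
                declare_failure(p);   /* state it was *supposed* to  */
            *p = 0x55555555u;         /* hold is still caught        */
            if (*p != 0x55555555u)
                declare_failure(p);

            *p = saved;               /* restore and move on */
        }
    }
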
With a "good" POST design, you can reassure the user that the
device *appears* to be functional. That the data/code stored in it
are intact (since the last time they were accessed). That the memory
is capable of storing any values it is called on to preserve.
That the hardware I/Os can control and sense as intended, etc.
/But, you have no guarantee that this condition will persist!/
If it WAS guaranteed to persist, then the simple way to make
high-reliability devices would be just to /never turn them off/ to
take advantage of this "guarantee"!
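
For the "data/code are intact" element, one conventional mechanism
(sketched here under assumed symbol names) is a digest over the code
image, computed by the build tools and checked at boot -- and nothing
prevents re-running the very same check periodically at runtime:

    #include <stdint.h>
    #include <stddef.h>

    extern const uint8_t  __image_start[];  /* from the linker script */
    extern const uint8_t  __image_end[];    /* end of covered bytes   */
    extern const uint32_t __image_crc;      /* expected value, stored
                                               by the build tools     */

    /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); slow but
     * tiny, which suits an incremental background check.
     */
    static uint32_t crc32(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;

        while (n--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
        return ~crc;
    }

    int code_image_intact(void)  /* nonzero if the image checks out */
    {
        size_t len = (size_t)(__image_end - __image_start);

        return crc32(__image_start, len) == __image_crc;
    }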