Subject: Re: Diagnostics
From: antispam (at) *nospam* fricas.org (Waldek Hebisch)
Groups: comp.arch.embedded
Date: 19 Oct 2024, 15:53:33
Organization: To protect and to server
Message-ID: <vf0dkr$1t54l$1@paganini.bofh.team>
References: 1 2 3 4 5
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
Don Y <blockedofcourse@foo.invalid> wrote:
On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
There, runtime diagnostics are the only alternative for hardware
revalidation, PFA (predictive failure analysis) and diagnostics.
How commonly are such mechanisms implemented? And, how thoroughly?
This is a strange question. AFAIK automatically run diagnostics/checks
are part of safety regulations.
Not all devices are covered by "regulations".
Well, if the device matters then there is implied liability
and nobody wants to admit to doing a bad job. If the device
does not matter, then the answer to the original question
does not matter either.
In the US, ANYTHING can result in a lawsuit. But, "due diligence"
can insulate the manufacturer, to some extent. No one ever
*admits* to "doing a bad job".
If your doorbell malfunctions, what "damages" are you going
to claim? If your garage door doesn't open when commanded?
If your yard doesn't get watered? If you weren't promptly
notified that the mail had just been delivered? Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?
The costs of litigation are reasonably high. Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.
Each item above may contribute to a significant loss. And
there could be a push to litigation (say, by a consumer advocacy
group) basically to establish a precedent. So, better to have
a record of due diligence.
And, the *extent* to which testing is done is the subject
addressed; if I ensure "stuff" *WORKED* when the device was
powered on (preventing it from continuing on to its normal
functionality in the event that some failure was detected),
what assurance does that give me that the device's integrity
is still intact 8760 hours (1 yr) later? 720 hours
(1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????
What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
That hasn't been *proven*. And, "misbehavior" is not the same
as *failure*.
First, I mean relevant hardware, that is, hardware inside an MCU.
I think that there are strong arguments that such hardware is
more reliable than software. I have seen a claim, based on analysis
of discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (that leads to failure) per
1000 lines of code. This means that even a small MCU has enough
code space for a handful of bugs: at that rate, a modest 20k-line
firmware image would be expected to ship with about 20 of them.
And for bigger systems it gets worse.
And among hardware failures, transient upsets like a flipped
bit are more likely than permanent failures. For example,
That used to be the thinking with DRAM, but studies have shown
that *hard* failures are more common. These *can* be found...
*if* you go looking for them!
In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored changes
to a memory area that was supposed to be unmodified) were due to buggy
software. And DRAM is special.
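That kind of monitoring is cheap to replicate yourself; a minimal
sketch (the region size, fill pattern and check period are arbitrary
choices, not taken from any study):

    #include <stdint.h>
    #include <stddef.h>

    #define CANARY_WORDS   1024u         /* arbitrary region size */
    #define CANARY_PATTERN 0xA5A5A5A5u   /* arbitrary fill value  */

    /* Region that nothing is supposed to write after init. */
    static uint32_t canary[CANARY_WORDS];

    void canary_init(void)
    {
        for (size_t i = 0; i < CANARY_WORDS; i++)
            canary[i] = CANARY_PATTERN;
    }

    /* Call periodically; returns the number of corrupted words.
       Each hit is either a memory fault or a wild write from
       buggy software -- the check alone cannot tell which. */
    size_t canary_check(void)
    {
        size_t bad = 0;
        for (size_t i = 0; i < CANARY_WORDS; i++) {
            if (canary[i] != CANARY_PATTERN) {
                bad++;
                canary[i] = CANARY_PATTERN;  /* re-arm for next pass */
            }
        }
        return bad;
    }

Note the ambiguity in that last comment: a hit does not by itself
distinguish a hardware upset from a stray pointer, so apportioning
blame between the two takes extra evidence.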
E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"? And "now"?!
You are supposed to regularly verify a sufficiently strong checksum.
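For example (just a sketch: the linker symbols and the reference-CRC
location are invented, and CRC-32 stands in for whatever checksum
strength your safety level actually requires):

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320): slow but tiny. */
    static uint32_t crc32(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Hypothetical linker symbols bounding the code copied to RAM,
       plus a reference CRC computed over the FLASH original. */
    extern const uint8_t  __ram_code_start[], __ram_code_end[];
    extern const uint32_t ram_code_crc_ref;

    /* Run from a timer or the idle loop; nonzero means the RAM
       image no longer matches what was loaded from FLASH. */
    int ram_image_corrupted(void)
    {
        size_t len = (size_t)(__ram_code_end - __ram_code_start);
        return crc32(__ram_code_start, len) != ram_code_crc_ref;
    }

The scan period is the interesting parameter: it bounds how stale
the answer to "what about now?" can be, so it has to come from the
fault-tolerance time of the application.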
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that the configuration registers
keep their expected values.
Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?
First, such checking is not my idea, but one point from a checklist for
low-safety devices. Registers may change due to bugs, EMC events,
cosmic rays and the like.
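The usual shape of such a check is a shadow copy kept in RAM that is
compared against the live registers on a timer. A sketch, with
invented register addresses -- substitute the ones from your MCU's
header:

    #include <stdint.h>

    /* Hypothetical timer/PWM configuration registers. */
    #define TIM_PRESCALER (*(volatile uint32_t *)0x40010028u)
    #define TIM_PERIOD    (*(volatile uint32_t *)0x4001002Cu)

    /* Shadow copies, written only at configuration time. */
    static uint32_t shadow_prescaler, shadow_period;

    void pwm_configure(uint32_t prescaler, uint32_t period)
    {
        TIM_PRESCALER = shadow_prescaler = prescaler;
        TIM_PERIOD    = shadow_period    = period;
    }

    /* Periodic check: restore any register that drifted from its
       shadow and report the event, so the error rate gets logged. */
    int pwm_config_check(void)
    {
        int errors = 0;
        if (TIM_PRESCALER != shadow_prescaler) {
            TIM_PRESCALER = shadow_prescaler;
            errors++;
        }
        if (TIM_PERIOD != shadow_period) {
            TIM_PERIOD = shadow_period;
            errors++;
        }
        return errors;
    }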
Historically, OSes had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if a domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
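If one did want to retire bad DRAM the same way, the core of it is
small; a sketch of a frame allocator with a disc-style bad-block map
(the frame count and size are made up):

    #include <stdint.h>
    #include <stddef.h>

    #define NFRAMES 4096u                  /* e.g. 16 MiB of 4 KiB frames */
    static uint32_t bad_map[NFRAMES / 32];   /* set by the memory test */
    static uint32_t free_map[NFRAMES / 32];

    void frame_init(void)
    {
        for (size_t i = 0; i < NFRAMES / 32; i++)
            free_map[i] = 0xFFFFFFFFu;       /* everything starts free */
    }

    /* The memory test calls this on failure: the frame is never
       handed out again, exactly like a disc bad-block list. */
    void frame_mark_bad(size_t f)
    {
        bad_map[f / 32] |= 1u << (f % 32);
    }

    /* First-fit allocation that skips retired frames; -1 if none left. */
    long frame_alloc(void)
    {
        for (size_t f = 0; f < NFRAMES; f++) {
            uint32_t bit = 1u << (f % 32);
            if ((free_map[f / 32] & bit) && !(bad_map[f / 32] & bit)) {
                free_map[f / 32] &= ~bit;
                return (long)f;
            }
        }
        return -1;
    }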
Using ECC DRAM doesn't solve the problem. If you see errors
reported by your ECC RAM (corrected errors), when do you
decide you are seeing too many and lose confidence that
the ECC is actually *detecting* all multibit errors?
ECC is part of the solution. It may reduce the probability of error
enough that you consider the residual risk acceptable. And if you
really care, you may try to increase the error rate (say, by running
the RAM chips at increased temperature) and test that your detection
and recovery strategy works OK.
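Answering the "when do you decide?" question takes exactly that kind
of measured policy: count corrected events and alarm past a rate
threshold. A sketch (the window and threshold are illustrative, not
from any standard):

    #include <stdint.h>
    #include <stdbool.h>

    #define ECC_WINDOW_SECONDS 3600u   /* observation window: 1 hour */
    #define ECC_MAX_CORRECTED  10u     /* alarm threshold per window */

    static volatile uint32_t corrected_in_window;  /* bumped in IRQ */
    static uint32_t window_seconds;

    /* Call from the ECC "corrected error" interrupt handler. */
    void ecc_corrected_event(void)
    {
        corrected_in_window++;
    }

    /* Call once per second. Returns true when corrected errors come
       fast enough that confidence in multi-bit *detection* is gone
       and the system should degrade or shut down safely. (The
       read-then-clear race with the IRQ is ignored in this sketch.) */
    bool ecc_rate_alarm(void)
    {
        if (++window_seconds < ECC_WINDOW_SECONDS)
            return false;
        uint32_t n = corrected_in_window;
        corrected_in_window = 0;
        window_seconds = 0;
        return n > ECC_MAX_CORRECTED;
    }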
--
Waldek Hebisch