Subject: Re: Diagnostics
From: blockedofcourse (at) *nospam* foo.invalid (Don Y)
Newsgroups: comp.arch.embedded
Date: 23 Oct 2024, 14:53:44
Organization: A noiseless patient Spider
Message-ID: <vfarku$21lks$1@dont-email.me>
References: 1 2 3 4 5 6
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
On 10/19/2024 2:32 PM, Don Y wrote:
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
I am, rather, looking for ideas as to how (others) may have approached
it. Most of the research I've uncovered deals with servers and their
ilk. Or, historical information (e.g., MULTICS' "computing as a service"
philosophy). E.g., *scheduling* testing vs. opportunistic testing.
"Opportunistic" seems to work well -- *if* you declare the resources
you will need and wait until you can acquire them.
The downside is that you may NEVER be able to acquire them,
based on what processes are active on a node. You wouldn't want
the diagnostic task to have to KNOW those things!
As different tests may require different resources, this
becomes problematic; do you request the largest set? A
smaller set? Or design a mechanism that allows arbitrarily
complex combinations to be specified? <frown>
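
To make the "declare, then wait" idea concrete, here is a minimal sketch of
an all-or-nothing try-acquire over a bitmask of resources. Everything in it
is an assumption made for illustration: the resource names, `res_set_t`,
`res_try_acquire()` and `res_release()` are hypothetical, and the pool is
modeled as a single-threaded variable where a real node would need a lock
or an atomic update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical resource identifiers; a real node has its own list. */
enum {
    RES_DMA      = 1u << 0,
    RES_TIMER    = 1u << 1,
    RES_SCRATCH  = 1u << 2,   /* non-DRAM scratch memory */
    RES_BUS_LOCK = 1u << 3
};

typedef uint32_t res_set_t;

/* Resources currently free on this node (single-threaded model only). */
static res_set_t free_pool = RES_DMA | RES_TIMER | RES_SCRATCH | RES_BUS_LOCK;

/* Grant the whole declared set or nothing; never hold a partial set. */
bool res_try_acquire(res_set_t wanted)
{
    if ((free_pool & wanted) != wanted)
        return false;              /* at least one resource is busy */
    free_pool &= ~wanted;
    return true;
}

/* Return a previously granted set to the pool. */
void res_release(res_set_t held)
{
    assert((free_pool & held) == 0);   /* must not release what is free */
    free_pool |= held;
}
```

Each test would declare only the set it actually uses, e.g.
`res_try_acquire(RES_DMA | RES_SCRATCH)`, rather than the union of what
every test might need. The diagnostic task then never has to know WHO
holds a resource, only that the request failed.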
This became apparent when running the DRAM test using the
DRAM emulator (non-production board designed to validate the
DRAM test by allowing arbitrary fault injection, on demand).
While it was known that *some* tests could NOT be run out of
DRAM (which limits their efficacy in a running system), there
were other system resources that were "silently" called upon
that would have impacted other coexecuting tasks. <frown>
The good news (wrt DRAM testing) is that checking for "stuck at"
faults -- the most prevalent described in published research -- places
no special demands on resources, beyond access to DRAM!
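
As a rough illustration of why a stuck-at check needs so little, the sketch
below writes all-zeros and all-ones to each word, reads them back, and
restores the original contents. It is NOT the test discussed above, just a
generic example; `stuck_at_scan()` is a hypothetical name, and the sketch
assumes the caller has dealt with caching and guarantees that nothing else
touches the region while it runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a mask of data bits that failed to hold a 0 or a 1 somewhere
 * in the region; 0 means no stuck-at fault was observed.
 * Assumes uncached access and no concurrent users of the region. */
uint32_t stuck_at_scan(volatile uint32_t *base, size_t words)
{
    uint32_t bad = 0;

    assert(base);
    for (size_t i = 0; i < words; i++) {
        uint32_t saved = base[i];      /* preserve live contents */

        base[i] = 0x00000000u;
        bad |= base[i];                /* any 1 read back is stuck-at-1 */

        base[i] = 0xFFFFFFFFu;
        bad |= (uint32_t)~base[i];     /* any 0 read back is stuck-at-0 */

        base[i] = saved;               /* put the original word back */
    }
    return bad;
}
```

The only thing it "acquires" is the memory under test, which is what makes
this class of check a good fit for opportunistic use.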
Moral of the story: CAREFULLY enumerate (and declare) ALL such
resources. And, consider how realistic it is to expect
ALL of them to be available serendipitously on a given node.
Else, resort to *scheduling* the diagnostic (a "maintenance period").
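
One possible way to express that fallback is a small starvation counter:
keep trying opportunistically, and after some number of consecutive misses
ask for a maintenance period instead. This is only a sketch; the names
(`diag_policy_t`, `diag_decide()`) and the `max_misses` knob are
hypothetical, and how a window actually gets booked is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DIAG_RUN_NOW,          /* resources were acquired: run the test */
    DIAG_TRY_LATER,        /* missed, but keep waiting opportunistically */
    DIAG_SCHEDULE_WINDOW   /* starved too long: book a maintenance period */
} diag_action_t;

typedef struct {
    unsigned misses;       /* consecutive failed acquisitions so far */
    unsigned max_misses;   /* hypothetical tuning knob, must be > 0 */
} diag_policy_t;

/* Decide what to do after one acquisition attempt. */
diag_action_t diag_decide(diag_policy_t *p, bool acquired)
{
    assert(p && p->max_misses > 0);

    if (acquired) {
        p->misses = 0;
        return DIAG_RUN_NOW;
    }
    if (++p->misses >= p->max_misses) {
        p->misses = 0;                 /* start counting afresh */
        return DIAG_SCHEDULE_WINDOW;
    }
    return DIAG_TRY_LATER;
}
```

A test whose declared set is unlikely ever to be free on a busy node could
simply be given a small `max_misses`, so it goes to scheduling quickly.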