On 10/19/2024 12:25 PM, George Neuner wrote:
Same ol', same ol'. Nothing much new to report.
No news is good news!
On 10/18/2024 2:42 PM, George Neuner wrote:
But, if you *know* when certain aspects of a device will be "called on",
you can take advantage of that to schedule diagnostics when the device is
not "needed". And, in the event that some unexpected "need" arises,
can terminate or suspend the testing (possibly rendering the effort
moot if it hasn't yet run to a conclusion).
If you "know" a priori when some component will be needed, then you
can do whatever you want when it is not. The problem is that many
uses can't be easily anticipated.
Granted, I can't know when a user *might* want to do some
asynchronous task. But, the whole point of my system is to
watch and anticipate needs based on observed behaviors.
E.g., if the house is unoccupied, then it's not likely that
anyone will want to watch TV -- unless they have *scheduled*
a recording of a broadcast (in which case, I would know it).
If the occupants are asleep, then it's not likely they will be
going out for a drive.
Which circles back to testing priority: if the test is interruptible
and/or resumable, then it may be done whenever the component is
available ... as long as it won't tie up the component if and when it
becomes needed for something else.
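One way to sketch such an interruptible/resumable test is to keep a cursor so the diagnostic runs in small, budgeted slices and can be suspended the instant the component is needed, then picked up later where it left off. Everything here (the names, the per-unit test callback) is invented for illustration:

```c
#include <stddef.h>

/* Resumable diagnostic: the cursor survives between invocations, so
 * the test can be suspended when the component is needed and resumed
 * later from where it left off. */
typedef struct {
    size_t next;    /* next unit (page, valve, channel...) to test */
    size_t total;   /* units in the whole test */
    int    failed;  /* sticky failure flag */
} diag_state_t;

/* Stub per-unit test that always passes, for illustration. */
static int unit_ok(size_t i) { (void)i; return 1; }

/* Run at most `budget` test units.  Returns 1 when the whole test
 * has run to a conclusion (pass or fail), 0 if suspended partway. */
static int diag_run(diag_state_t *s, size_t budget,
                    int (*test_unit)(size_t))
{
    while (budget-- && s->next < s->total) {
        if (!test_unit(s->next))
            s->failed = 1;
        s->next++;
    }
    return s->next >= s->total;
}
```

The caller (scheduler) decides the budget per slice; a budget of zero is a pure "suspend" and costs nothing.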
Exactly. I already have to deal with that in my decisions to
power down nodes. If my decisions are incorrect, they introduce
a delay in getting "back" to whatever state I should have been in.
E.g., I scrub freed memory pages (zero fill) so information doesn't
leak across protection domains. As long as some minimum number
of *scrubbed* pages are available for use "on demand", why can't
I *test* the pages yet to be scrubbed?
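A concrete sketch of that policy (names, thresholds, and the free-list model are invented for illustration, not from any real kernel): reclaim a freed page with a cheap zero-fill when the scrubbed pool is running low, but spend the extra bandwidth on a test pass, which wipes the page as a side effect, whenever enough scrubbed pages are already on hand:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096
#define MIN_SCRUBBED 8      /* pages to keep ready "on demand" */

typedef struct { uint8_t data[PAGE_SIZE]; } page_t;

/* Zero-fill only: cheapest way to make a page safe to hand out. */
static void scrub(page_t *p) { memset(p, 0, sizeof *p); }

/* Simple pattern test: write/verify 0xAA, then 0x55, then zero.
 * A single pass also wipes the page's previous contents. */
static int test_and_scrub(page_t *p)
{
    size_t i;
    for (i = 0; i < PAGE_SIZE; i++) p->data[i] = 0xAA;
    for (i = 0; i < PAGE_SIZE; i++)
        if (p->data[i] != 0xAA) return 0;
    for (i = 0; i < PAGE_SIZE; i++) p->data[i] = 0x55;
    for (i = 0; i < PAGE_SIZE; i++)
        if (p->data[i] != 0x55) return 0;
    memset(p, 0, sizeof *p);
    return 1;               /* page good -- and now scrubbed */
}

/* Called when a freed page is reclaimed: if enough scrubbed pages
 * are already on hand, spend the bandwidth on a test pass;
 * otherwise just zero it and get it back in the pool quickly. */
static int reclaim(page_t *p, size_t scrubbed_avail)
{
    if (scrubbed_avail >= MIN_SCRUBBED)
        return test_and_scrub(p);
    scrub(p);
    return 1;
}
```

Either path leaves the page zeroed, so the "no leakage across protection domains" invariant holds regardless of which branch was taken.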
If you're testing memory pages, most likely you are tying up bandwidth
in the memory system and slowing progress of the real applications.
But, they wouldn't be scrubbed if there were higher "priority"
tasks demanding resources. I.e., some other "lower priority"
task would have been accessing memory.
Also because you can't accurately judge the "minimum" needed. BSD and
Linux both have this problem where a sudden burst of allocations
exhausts the pool of zeroed pages, forcing demand zeroing of new pages
prior to their re-assignment. Slows the system to a crawl when it
happens.
Yes, but you have live users arbitrarily deciding they "need" those
resources. And, you have considerably more pages at risk of use.
I've only got ~1G per node and (theoretically), a usage model of
what resources are needed, when (where).
*Not* clearing the pages leaves a side channel open for information
leakage so *that* isn't negotiable. Having some "deliberately
dirty" pages could be an issue but, even "dirty", they are wiped of
their previous contents after a single pass through the test.
If there is no anticipated short term need for irrigation, why
can't I momentarily activate individual valves and watch to see that
the expected amount of water is flowing?
Because then you are watering (however briefly) when it is not
expected. What if there was a pesticide application that should not
be wetted? What if a person is there and gets sprayed by your test?
Irrigation, here, is not airborne. The ground may be wetted in the
*immediate* vicinity of the emitters activated. But, they operate at
very low flow rates (liters per HOUR).
Your goal is to verify the master valve(s) operate (I do that by opening
the purge valve(s) and letting water drain into a sump); the individual
valves are operable; and that water *flows* when commanded.
Properly, valve testing should be done concurrently with a scheduled
watering. Check water is flowing when the valve should be open, and
not flowing when the valve should be closed.
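That check can be sketched in a few lines; the flow thresholds below are made up for illustration (the real numbers depend on the zone's emitters and the flow sensor's resolution):

```c
typedef enum { VALVE_CLOSED = 0, VALVE_OPEN = 1 } valve_state_t;

typedef enum {
    VALVE_OK,
    VALVE_FAILED_OPEN,     /* flow present when commanded closed */
    VALVE_FAILED_CLOSED    /* no flow when commanded open */
} valve_fault_t;

/* Illustrative thresholds, in liters per hour. */
#define FLOW_MIN_LPH  2.0  /* expected minimum when open */
#define FLOW_LEAK_LPH 0.5  /* anything above this when closed is a leak */

/* Compare commanded valve state against measured flow, during a
 * scheduled watering: flow when it should be open, none when shut. */
static valve_fault_t check_valve(valve_state_t commanded, double flow_lph)
{
    if (commanded == VALVE_OPEN && flow_lph < FLOW_MIN_LPH)
        return VALVE_FAILED_CLOSED;   /* stuck shut or line blocked */
    if (commanded == VALVE_CLOSED && flow_lph > FLOW_LEAK_LPH)
        return VALVE_FAILED_OPEN;     /* stuck open or leaking */
    return VALVE_OK;
}
```

Run once per zone per cycle, this distinguishes the two failure modes discussed above (rose-killing "failed closed" vs. cactus-killing "failed open") without any watering outside the schedule.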
That happens as part of normal operation. But, NOT knowing until that
time can lead to plant death. E.g., if the roses don't get watered twice
a day, they are toast (in this environment). If the cacti valves don't
*close*, they are toast. If a line is "failed open", then you've
a geyser in the yard (and *no* irrigation to those plants).
Repairs of this nature can be time consuming, depending on the nature
of the failure (and cost thousands of dollars in labor). The more I
can deduce about the nature of the failure, the quicker the service
can be brought back up to par and the less the "diagnostic cost"
of having someone do so, manually (digging up a yard to determine where
a line has been punctured; inspecting individual emitters to determine
which are blocked; visually monitoring for water flow per zone; etc.)
[Amazing how much these "minimum wage jobs" actually end up costing
when you have to hire someone! E.g., $160/month to have your "yard
cleaned" -- *if* you can find someone to do it at that rate! Irrigation
work starts at kilobucks and is relatively open-ended (as no one can
assess the nature of the job until they start on it)]
To ensure 100%
functionality at all times effectively requires use of redundant
hardware - which generally is too expensive for a non-safety-critical
device.
Apparently, there is noise about incorporating such hardware into
*automotive* designs (!). I would have thought the time between
POSTs would have rendered that largely ineffective. OTOH, if
you imagine a failure can occur ANY time, then "just after
putting the car in gear" is as good (bad!) a time as any!
Automotive is going the way of aircraft: standby running lockstep with
the primary and monitoring its data flow - able to reset the system if
they disagree, or take over if the primary fails.
The point here is that there is no "one fits all" philosophy you can
follow ... what is proper to do depends on what the (sub)system does,
its criticality, and on the components involved that may need to be
tested.
I am, rather, looking for ideas as to how (others) may have approached
it. Most of the research I've uncovered deals with servers and their
ilk. Or, historical information (e.g., MULTICS' "computing as a service"
philosophy). E.g., *scheduling* testing vs. opportunistic testing.