Subject : Re: DRAM accommodations
From : blockedofcourse (at) *nospam* foo.invalid (Don Y)
Newsgroups : sci.electronics.design
Date : 07. Sep 2024, 12:04:37
Organisation : A noiseless patient Spider
Message-ID : <vbhc0g$1bjbi$1@dont-email.me>
References : 1 2 3
User-Agent : Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
On 9/7/2024 2:56 AM, albert@spenarnc.xs4all.nl wrote:
>> Back of the napkin figures suggest many errors are (silently!) encountered
>> in an 8-hour shift. For XIP implementations, it's mainly data that is at
>> risk (though that can also include control flow information from, e.g.,
>> the pushdown stack). For implementations that load their application
>> into DRAM, then the code is suspect as well as the data!
>>
>> [Which is likely to cause more detectable/undetectable problems?]
>
> Running many day long computations for e.g. the euler project,
> involving giga byte memories, and require precise (not one off
> results),
> I have not encountered any wrong results caused by RAM failures.

Do you KNOW that there haven't been any that your ECC *has* silently
corrected for you? Or, that no uncorrected errors have tickled
vulnerabilities in your data/code?
[I've seen at least one study that deliberately injected memory faults
into running OSs in an attempt to determine how "resilient" they were
to such faults. If a conditional jump is replaced by an unconditional
jump (because of a corrupted opcode), you wouldn't notice a difference
/if the condition was true/! Likewise, data can be altered in ways that
are masked by the operations/tests performed on them. I.e., without
actively monitoring the ECC hardware, you are largely clueless as to
what is really happening in the memory]
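
To make the masking point concrete, here is a minimal sketch (not taken from the study mentioned above). It flips each bit of a 32-bit value in turn and checks whether a simple threshold test still gives the same answer. The names `flip_bit` and `masked_flips`, the value 5000 and the threshold 1000 are all made up for illustration.

```python
def flip_bit(value, bit):
    """Return value with a single bit inverted (a simulated memory upset)."""
    return value ^ (1 << bit)


def masked_flips(value, predicate, width=32):
    """List the bit positions whose flip leaves predicate(value) unchanged.

    A flip in one of these positions corrupts the datum, but the program
    takes the same branch as before, so nothing looks wrong.
    """
    expected = predicate(value)
    return [b for b in range(width)
            if predicate(flip_bit(value, b)) == expected]


def over_limit(x):
    # Hypothetical application test: "is the reading above the limit?"
    return x > 1000


masked = masked_flips(5000, over_limit)
print(f"{len(masked)} of 32 single-bit flips go unnoticed by the test")
```

For this particular value and test, 31 of the 32 possible single-bit flips leave the outcome unchanged; only a flip of bit 12 (worth 4096) drops the value below the limit. The datum is wrong in every case, but the code only "notices" one of them.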
Real-world studies show FITs (failures per 10^9 device-hours) of
1,000 - 70,000 / Mb. So, taking about 8,000 / Mb, 64,000,000 / GB.
For a small 4GB machine, that's 256,000,000 FIT, or a failure about every 4 hours.
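
The arithmetic behind that estimate, as a rough sketch. It assumes the 64,000,000 / GB figure corresponds to about 8,000 FIT/Mb with 1 GB rounded to 8,000 Mb; that reading, and the function name `hours_between_failures`, are assumptions for illustration.

```python
# FIT = failures in time: expected failures per 10^9 device-hours.
HOURS_PER_FIT_UNIT = 1e9


def hours_between_failures(fit_per_mbit, gigabytes, mbit_per_gbyte=8000):
    """Mean hours between failures for a memory of the given size.

    mbit_per_gbyte=8000 is a rounded figure (8 bits/byte, 1000 Mb per Gb),
    chosen to match the back-of-the-napkin numbers in the text.
    """
    total_fit = fit_per_mbit * mbit_per_gbyte * gigabytes
    return HOURS_PER_FIT_UNIT / total_fit


# Low end, the assumed working figure, and high end of the quoted range.
for rate in (1_000, 8_000, 70_000):
    hours = hours_between_failures(rate, gigabytes=4)
    print(f"{rate:>6} FIT/Mb, 4 GB: one failure every {hours:.2f} hours")
```

At 8,000 FIT/Mb this works out to roughly 3.9 hours for 4 GB. At the low end of the range (1,000 FIT/Mb) it is about 31 hours; at the high end it is well under an hour.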
Chances are, you are reloading an application and, thus, refreshing the values
in the DRAM -- TEXT and DATA. Let your application run for a week or two
(with const data) and see if it *provably* exhibits no errors -- correctable
or uncorrectable.
My machines are up 24/7 for months at a time. *But*, the contents of DRAM are
continuously being altered/updated. This effectively amounts to a scrubbing
operation. So, the chance of a datum being noticeably corrupt is reduced.
OTOH, loading code into DRAM (e.g., from NAND FLASH) and letting it sit there
/as if/ it was ROM leaves it vulnerable to bit-rot unless ALL of the code is
continuously reread (re-executed). And, left uncorrected, single-bit errors
accumulate into multiple-bit errors -- which SECDED can detect (two bits) but
can't correct.
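
A toy illustration of that limit, using an extended Hamming (8,4) code. Real DRAM ECC typically protects much wider words, so this is only a sketch of the principle, not of any particular memory controller; `encode` and `decode` are hypothetical helper names.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    parity_bad = sum(code) % 2 == 1  # odd count of 1s: odd number of flips
    if syndrome == 0 and not parity_bad:
        status = "clean"
    elif parity_bad:
        # Treated as a single-bit error; syndrome 0 means the overall
        # parity bit itself flipped. (Three flips would be miscorrected.)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                 # first upset: sits there until read or scrubbed
print(decode(word))          # still recoverable
word[2] ^= 1                 # second upset lands in the same, unscrubbed word
print(decode(word))          # detected, but the data is gone
```

The first flip is repaired on read. If nothing rereads (and rewrites) the word before a second flip arrives, the decoder can only report the damage.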
Do you *know* what happens in your OS when an ECC error is detected/corrected?
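
One way to start finding out, as a sketch only: on a Linux box with an EDAC driver loaded, the kernel exposes per-memory-controller counters of corrected and uncorrected errors under sysfs. This assumes Linux and the usual `/sys/devices/system/edac/mc` layout; other OSs report ECC events differently (if at all), and `read_ecc_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_ecc_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC counters.

    Returns an empty dict if the directory is absent, e.g. no EDAC driver
    is loaded or the machine has no ECC memory.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_ecc_counts()
    if not found:
        print("No EDAC counters found; ECC activity is invisible here")
    for name, (ce, ue) in found.items():
        print(f"{name}: {ce} corrected, {ue} uncorrected")
```

Sampling these counters before and after a long run is a crude way to tell "no errors occurred" apart from "errors occurred and were quietly corrected".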