Subject: Re: DRAM accommodations
From: antispam (at) *nospam* fricas.org (Waldek Hebisch)
Newsgroups: sci.electronics.design
Date: 17 Sep 2024, 01:33:31
Organization: To protect and to server
Message-ID: <vcaiop$makp$1@paganini.bofh.team>
References: 1 2
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
Don Y <blockedofcourse@foo.invalid> wrote:
> On 9/5/2024 3:54 PM, Don Y wrote:
>> Given the high rate of memory errors in DRAM, what steps
>> are folks taking to mitigate the effects of these?
>> Or, is ignorance truly bliss? <frown>
>
> From discussions with colleagues, apparently, adding (external) ECC to
> most MCUs is simply not possible; too much of the memory and DRAM
> controllers are in-built (unlike older multi-chip microprocessors).
> There's no easy way to generate a bus fault to rerun the bus cycle
> or delay for the write-after-read correction.
Are you really writing about MCU-s? My impression was that MCU-s
which allow external memory (most do not) typically have a WAIT signal
so that external memory can insert extra wait states. OTOH access
to external memory is much slower than to internal memory.
People frequently use chips designed for gadgets like smartphones
or TV-s; those tend to have an integrated DRAM controller and no
support for ECC.
> And, among those devices that *do* support ECC, it's just a conventional
> SECDED implementation. So, a fair number of UCEs will plague any
> design with an appreciable amount of DRAM (can you even BUY *small*
> amounts of DRAM??)
IIUC, if you need a small amount of memory you should use SRAM...
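For readers unfamiliar with the SECDED scheme mentioned in the quoted
text, here is a minimal sketch of the idea on a toy scale: an extended
Hamming(8,4) code protecting 4 data bits. The function names and the
bit layout are my own choices for illustration; real DRAM controllers
use wider codes (and the details vary by vendor), so treat this only as
a demonstration of the principle.

```python
# Toy SECDED: extended Hamming(8,4). Names and bit layout are illustrative
# assumptions, not taken from any particular DRAM controller.

def secded_encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # index 0 is the overall parity bit
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4,5,6,7
    bits[0] = sum(bits[1:]) & 1             # makes total parity even
    return sum(b << i for i, b in enumerate(bits))


def secded_decode(word: int):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos             # XOR of positions of set bits
    overall = sum(bits) & 1             # 0 when total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips, assumed to be one: the syndrome is its position
        # (0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        return None, "uncorrectable"
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status
```

One flipped bit is corrected and two flipped bits are detected but not
corrected (a UCE). Three or more flips in the same word can be
miscorrected or missed entirely, which is the "undetected errors
slipping through" concern raised in the quoted text.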
> For devices with PMMUs, it's possible to address the UCEs -- sort of.
> But, this places an additional burden on the software and raises
> the problem of "If you are getting UCEs, how sure are you that
> undetected CEs aren't slipping through??" (again, you can only
> detect the UCEs via an explicit effort so you pay the fee and take
> your chances!)
>
> For devices without PMMUs, you have to rely on POST or BIST. And,
> *hope* that everything works in the periods between (restart often! :> )
>
> Back of the napkin figures suggest many errors are (silently!) encountered
> in an 8-hour shift. For XIP implementations, it's mainly data that is at
> risk (though that can also include control flow information from, e.g.,
> the pushdown stack). For implementations that load their application
> into DRAM, then the code is suspect as well as the data!
>
> [Which is likely to cause more detectable/undetectable problems?]
I think that this estimate is somewhat pessimistic. The last case
I remember that could be explained by a memory error was about 25 years
ago: I unpacked the source of a newer gcc version and tried to compile it.
Compilation failed, and I tracked the trouble to a flipped bit in one
source file. Unpacking the sources again gave the correct value of the
affected bit (and no other bit changed). And compiling the second copy
worked OK. Earlier I saw segfaults that could be cured by exchanging DRAM
modules or downclocking the machine.
But it seems that machines got more reliable and I do not remember
any recent problem like this. And I regularly do large compiles,
where an error in the sources is very unlikely to go unnoticed. I have
also done large computations where any error had a nontrivial chance to
propagate to the final result. Some computations that I do are naturally
error tolerant, but the error tolerant part used a tiny amount of data,
while most data was "fragile" (an error there was likely to be detected).
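For what it's worth, a back-of-the-napkin estimate of this kind is
usually formed from a FIT rate (failures per 10^9 device-hours, here
taken per megabit) multiplied by memory size and run time. A minimal
sketch follows. The function names are mine, the numbers in the usage
comment are placeholders rather than measured values or the figures
behind the quoted estimate, and the second function assumes independent
errors at a constant rate (a Poisson model).

```python
import math


def expected_errors(fit_per_mbit: float, mbits: float, hours: float) -> float:
    """Expected number of errors, given a FIT rate per megabit.

    FIT = failures per 10**9 device-hours, so the expected count is
    rate * size * time / 10**9.
    """
    return fit_per_mbit * mbits * hours / 1e9


def prob_at_least_one(expected: float) -> float:
    """Probability of one or more errors, assuming a Poisson process."""
    return 1.0 - math.exp(-expected)


# Usage with placeholder inputs (NOT measured data): plug in whatever
# rate you believe for your parts.
# lam = expected_errors(fit_per_mbit=1000, mbits=1024, hours=8)
# p = prob_at_least_one(lam)
```

The result scales linearly with the assumed rate, so whether the answer
is "many errors per shift" or "one in decades" depends entirely on which
published rate one plugs in.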
Concerning doing something about memory errors: on the hardware side,
the devices with DRAM that I use are COTS devices. So the only thing
I can do is look at the reputation of the vendor, and general reputation
says nothing about DRAM errors. In other words, there is nothing
I can realistically do. On the software side I could try to add
some extra error tolerance. But I have various consistency checks
and they indicate various troubles. I deal with the troubles they
detect; DRAM errors are not in this category.
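To illustrate what a software-side consistency check could look like
(this is a generic sketch, not the checks I actually use), one can store
a CRC next to a block of data and verify it on every read. The class
name CheckedBuffer is made up for this example; zlib.crc32 is the
standard Python library function.

```python
import zlib


class CheckedBuffer:
    """Holds a block of bytes plus a CRC-32 taken when it was stored."""

    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)

    def read(self) -> bytes:
        # Recompute the CRC; a mismatch means the stored bytes (or the
        # stored CRC) changed after construction.
        if zlib.crc32(self._data) != self._crc:
            raise ValueError("checksum mismatch: data changed since it was stored")
        return bytes(self._data)
```

Such a check only detects corruption; it cannot say whether DRAM, a
software bug, or something else caused it, and it costs a pass over the
data on each read.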
-- Waldek Hebisch