Re: DRAM accommodations

Subject: Re: DRAM accommodations
From: antispam (at) *nospam* fricas.org
Newsgroups: sci.electronics.design
Date: 17 Sep 2024, 17:12:56
Organization: To protect and to server
Message-ID: <vcc9q6$p2mo$1@paganini.bofh.team>
References: 1 2 3 4
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
Don Y <blockedofcourse@foo.invalid> wrote:
On 9/16/2024 5:33 PM, Waldek Hebisch wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
On 9/5/2024 3:54 PM, Don Y wrote:
Given the high rate of memory errors in DRAM, what steps
are folks taking to mitigate the effects of these?
>
Or, is ignorance truly bliss?  <frown>
>
 From discussions with colleagues, apparently, adding (external) ECC to
most MCUs is simply not possible; too much of the memory and DRAM
controllers are in-built (unlike older multi-chip microprocessors).
There's no easy way to generate a bus fault to rerun the bus cycle
or delay for the write-after-read correction.
 
Are you really writing about MCUs?
 
Yes.  In particular, the Cortex A series.

I do not think that Cortex A is an MCU.  More precisely, the Cortex A
core is not intended to be used in MCUs and I think that most
chips using it are not MCUs.  The distinguishing feature here is whether
the chip is intended to be a complete system (MCU) or whether you are
expected to _always_ use it with external chips (a typical Cortex A
based SoC).

My impression was that MCUs
which allow external memory (most do not) typically have a WAIT signal
so that external memory can insert extra wait states.  OTOH access
to external memory is much slower than to internal memory.
 
You typically use external memory when there isn't enough
internal memory for your needs.  I'm looking at 1-2GB / device
(~500GB - 2TB per system)
 
People frequently use chips designed for gadgets like smartphones
or TVs; those tend to have integrated DRAM controllers and no
support for ECC.
 
Exactly.  As the DRAM controller is in-built, adding ECC isn't
an option.  Unless the syndrome logic is ALSO in-built (it is
on some offerings, but not all).
 
And, among those devices that *do* support ECC, it's just a conventional
SECDED implementation.  So, a fair number of UCEs will plague any
design with an appreciable amount of DRAM (can you even BUY *small*
amounts of DRAM??)
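
For concreteness, a toy Hamming-style SECDED encoder/decoder sketch;
real controllers use optimized (72,64) H-matrices and work per burst,
but the correct-one-flip/detect-two-flips logic is the same idea:

/* Toy SECDED Hamming(72,64): 64 data bits, 7 Hamming check bits at
   power-of-two positions, plus an overall parity bit at position 0. */
#include <stdint.h>
#include <stdio.h>

static int is_pow2(int n) { return n && !(n & (n - 1)); }

static void encode(uint64_t d, uint8_t cw[72])
{
    int pos, bit = 0, syn = 0, par = 0;
    for (pos = 1; pos < 72; pos++)            /* scatter data bits   */
        cw[pos] = is_pow2(pos) ? 0 : (uint8_t)((d >> bit++) & 1);
    for (pos = 1; pos < 72; pos++)            /* XOR of set positions */
        if (cw[pos]) syn ^= pos;
    for (int i = 0; i < 7; i++)               /* check bits zero the */
        cw[1 << i] = (uint8_t)((syn >> i) & 1);   /* syndrome        */
    for (pos = 1; pos < 72; pos++) par ^= cw[pos];
    cw[0] = (uint8_t)par;                     /* overall (DED) parity */
}

/* 0 = clean, 1 = single error corrected (CE), 2 = uncorrectable (UE) */
static int decode(uint8_t cw[72])
{
    int pos, syn = 0, par = 0;
    for (pos = 1; pos < 72; pos++) if (cw[pos]) syn ^= pos;
    for (pos = 0; pos < 72; pos++) par ^= cw[pos];
    if (!syn && !par) return 0;
    if (par) { cw[syn] ^= 1; return 1; }      /* odd flips: fix one  */
    return 2;                                 /* even flips: give up */
}

int main(void)
{
    uint8_t cw[72];
    encode(0xDEADBEEFCAFEF00Dull, cw);
    cw[37] ^= 1;                              /* one flip: CE */
    printf("one flip  -> %d\n", decode(cw));
    cw[37] ^= 1; cw[53] ^= 1;                 /* two flips: UE */
    printf("two flips -> %d\n", decode(cw));
    return 0;
}

Note the classic SECDED asymmetry: any single flip is silently
repaired, any double flip is only *detected* -- hence the UCEs.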
 
IIUC, if you need a small amount of memory you should use SRAM...
 
As above, 1-2GB isn't small, even by today's standards.
And, SRAM isn't without its own data integrity issues.
 
For devices with PMMUs, it's possible to address the UCEs -- sort of.
But, this places an additional burden on the software and raises
the problem of "If you are getting UCEs, how sure are you that
undetected CEs aren't slipping through??"  (again, you can only
detect the UCEs via an explicit effort so you pay the fee and take
your chances!)
>
For devices without PMMUs, you have to rely on POST or BIST.  And,
*hope* that everything works in the periods between (restart often!  :> )
>
Back of the napkin figures suggest many errors are (silently!) encountered
in an 8-hour shift.  For XIP implementations, it's mainly data that is at
risk (though that can also include control flow information from, e.g.,
the pushdown stack).  For implementations that load their application
into DRAM, then the code is suspect as well as the data!
>
[Which is likely to cause more detectable/undetectable problems?]
 
I think that this estimate is somewhat pessimistic.  The last case
I remember that could be explained by a memory error was about 25 years
ago: I unpacked the source of a newer gcc version and tried to compile it.
Compilation failed; I tracked the trouble to a flipped bit in one source
file.  Unpacking the sources again gave the correct value of the affected
bit (and no other bit changed).  And compiling the second copy worked OK.
 
You were likely doing this on a PC/workstation, right?  Nowadays,
having ECC *in* the PC is commonplace.

PCs, yes (small SBCs too, but those were not used for really heavy
computations).  Concerning ECC, most PCs that I used came without ECC.
IME ECC used to roughly double the price of a PC compared to a non-ECC
one.  So, important servers got ECC; other PCs were non-ECC.
 
    "We made two key observations: First, although most personal-computer
    users blame system failures on software problems (quirks of the
    operating system, browser, and so forth) or maybe on malware infections,
--> hardware was the main culprit. At Los Alamos, for instance, more than
    60 percent of machine outages came from hardware issues. Digging further,
    we found that the most common hardware problem was faulty DRAM. This
    meshes with the experience of people operating big data centers, DRAM
    modules being among the most frequently replaced components."

I remember a memory corruption study several years ago that said that
software was a significant issue.  In particular, bugs in the BIOS and
the Linux kernel led to random-looking memory corruption.  Hopefully,
the issues that they noticed are fixed now.  Exact percentages probably
do not matter much, because both hardware and software are changing.
The point is that there are both hardware errors and software errors,
and without deep investigation the software errors are hard to
distinguish from the hardware ones.

Earlier I saw segfaults that could be cured by exchanging DRAM
modules or downclocking the machine.
 
.. suggesting a memory integrity issue.

Exactly.

But it seems that machines got more reliable and I do not remember
any recent problem like this.  And I regularly do large compiles,
where an error in the sources is very unlikely to go unnoticed.  I did
But, again, the use case, in a workstation, is entirely different
than in a "device" where the code is unpacked (from NAND flash
or from a remote file server) into RAM and then *left* there to
execute for the duration of the device's operation (hours, days,
weeks, etc.).  I.e., the DRAM is used to emulate EPROM.
 
In a PC, when your application is "done", the DRAM effectively is
scrubbed by the loading of NEW data/text.  This tends not to be
the case for appliances/devices; such a reload only tends to happen
when the power is cycled and the contents of (volatile) DRAM are
obviously lost.

Well, most of my computations were on machines without ECC.  And
I also had some machines sitting idle for a long time.  They had
cached data in RAM and would use it when given some work to do.

Even without ECC, not all errors are consequential.  If a bit
flips and is never accessed, <shrug>.  If a bit flips and
it alters one opcode into another that is equivalent /in the
current program state/, <shrug>.  Ditto for data.

Yes.  In numerics using Newton-style iteration, a small number of
bit flips normally means that it needs to iterate longer, but it will
still converge to the correct result.
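
A toy illustration of that self-correcting behavior (the injected
mantissa-bit flip and the iteration counts are arbitrary choices of
mine; a flip in a sign or exponent bit can do worse):

/* Newton iteration for sqrt(2) with a simulated memory error:
   the flip only delays convergence, the final answer is unchanged. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    double x = 1.0;
    for (int i = 0; i < 40; i++) {
        if (i == 5) {                     /* simulate a bit flip  */
            uint64_t u;
            memcpy(&u, &x, sizeof u);
            u ^= 1ull << 40;              /* one mantissa bit     */
            memcpy(&x, &u, sizeof x);
        }
        x = 0.5 * (x + 2.0 / x);          /* Newton step, x^2 = 2 */
    }
    printf("x = %.17g, error = %g\n", x, x - sqrt(2.0));
    return 0;
}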

large computations where any error had a nontrivial chance to propagate
to the final result.  Some computations that I do are naturally error
tolerant, but the error-tolerant part used a tiny amount of data, while
most data was "fragile" (an error there was likely to be detected).
 
If you are reaccessing the memory (data or code) frequently, you
give the ECC a new chance to "refresh" the intended contents of
that memory (assuming your ECC hardware is configured for write-back
operation).

As I wrote, most computations were on non-ECC machines.

 So, the possibility of a second fault coming along
while the first fault is still in place (and uncorrected) is small.
 
OTOH, if the memory just sits there with the expectation that it
will retain its intended contents without a chance to be corrected
(by the ECC hardware), then bit-rot can continue increasing the
possibility of a second bit failing while the first is still failed.

Well, if you are concerned you can implement a low-priority process
that will read RAM, possibly doing some extra work (like detecting
unexpected changes).
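
A minimal sketch of that idea, assuming the watched region is
immutable (code or constant data); the region struct and the
60-second period are illustrative, not from any real API:

/* Low-priority patrol scrubber: periodically re-read a RAM region
   and compare a CRC against a reference taken at startup. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

static uint32_t crc32_buf(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;           /* standard CRC-32      */
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

struct region { const uint8_t *base; size_t len; uint32_t ref; };

static void *scrubber(void *arg)
{
    struct region *r = arg;
    for (;;) {
        if (crc32_buf(r->base, r->len) != r->ref)
            fprintf(stderr, "scrub: region %p corrupted!\n",
                    (const void *)r->base);
        sleep(60);                 /* low duty cycle; tune to taste */
    }
    return NULL;
}

int main(void)
{
    static const uint8_t code[4096];   /* stand-in for .text/.rodata */
    static struct region r = { code, sizeof code, 0 };
    pthread_t t;
    r.ref = crc32_buf(r.base, r.len);
    pthread_create(&t, NULL, scrubber, &r);
    pause();                           /* the real application here  */
    return 0;
}

For data that is legitimately written, the reference would have to be
maintained on every update, which is why hardware scrubbing (the ECC
write-back mentioned above) is so much more convenient.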

Remember, you don't even consider this possibility when you are
executing out of EPROM or NOR FLASH... you just assume bits
retain their state even when you aren't "looking at them".
 
Concerning doing something about memory errors: on the hardware side,
the devices with DRAM that I use are COTS devices.  So the only thing
I can do is look at the reputation of the vendor, and general reputation
says nothing about DRAM errors.  In other words, there is nothing
I can realistically do.
 
You can select MCUs that *do* have support for ECC instead of just
"hoping" the (inevitable) memory errors won't bite you.  Even so,
your best case scenario is just SECDED protection.

When working with MCUs I have internal SRAM instead of DRAM.
And in several cases manufacturers claim parity or ECC protection
for the SRAM.  But in the case of PCs, hardware and OS are commodities.
For price reasons I mostly deal with non-ECC hardware.

On the software side I could try to add
some extra error tolerance.  But I have various consistency checks
and they indicate various troubles.  I deal with detected troubles;
DRAM errors are not in this category.
 
I assume you *test* memory on POST?

On PCs that is the BIOS and OS, which I did not write.  And AFAIK the
BIOS POST detects memory size and does a few sanity checks
to catch gross troubles.  But I am not sure I would call those
"memory tests"; better memory tests tend to run for days.

 But, if your device runs
24/7/365, it can be a very long time between POSTs!  OTOH, you
could force a test cycle (either by deliberately restarting
your device -- "nightly maintenance") or, you could test memory
WHILE the application is running.
 
And, what specific criteria do you use to get alarmed at the results??

As I wrote, I am doing computations and the criteria are problem
specific.  For example, I have two multidigit numbers and one is
supposed to exactly divide the other.  If not, the software signals
an error.
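
A minimal sketch of that kind of check using GMP (the helper name is
mine; the values are placeholders chosen so the division is exact):

/* Exact-division consistency check: if the "exact" division leaves
   a remainder, some earlier step -- or the hardware -- lied. */
#include <gmp.h>
#include <stdio.h>
#include <stdlib.h>

static void checked_exact_div(mpz_t q, const mpz_t a, const mpz_t b)
{
    mpz_t r;
    mpz_init(r);
    mpz_tdiv_qr(q, r, a, b);             /* q = a/b, r = a mod b */
    if (mpz_sgn(r) != 0) {
        fprintf(stderr, "consistency check failed: not divisible\n");
        abort();                         /* signal the error */
    }
    mpz_clear(r);
}

int main(void)
{
    mpz_t a, b, q;
    mpz_inits(a, b, q, NULL);
    mpz_set_str(a, "123456789123456789123456789", 10);
    mpz_set_str(b, "123456789", 10);
    checked_exact_div(q, a, b);          /* exact: no error */
    gmp_printf("q = %Zd\n", q);
    mpz_clears(a, b, q, NULL);
    return 0;
}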

I read several papers about DRAM errors and I take seriously the
possibility that they can happen.  But simply at my scale they
do not seem to matter.

BTW: It seems that currently a large fraction of errors (both software
and hardware ones) appear semi-randomly.  So, to estimate
reliability one should use statistical methods.  But if you aim
at high reliability, then the needed sample size may be impractically
large.  You may be able to add mitigations for rare problems that
you can predict/guess, but you will be left with the unpredictable
ones.  In other words, it makes sense to concentrate on problems
that you see (you including your customers).  AFAIK some big
companies have worldwide automatic error-reporting systems.
If you set up such a thing you may get useful info.

You can try "error injection", that is run tests with extra
component that simulates memory errors.  Then you will have
some info about effects:
- do memory errors cause incorrect operation?
- is incorrect operation detected?
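
A minimal sketch of such a harness, assuming nothing beyond standard C:
the toy "application" keeps an XOR checksum over its state as its
consistency check, we flip random bits, and count how often the check
notices.  All names are illustrative.

/* Toy error-injection harness: flip random bits in the state and
   count how often the consistency check notices. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

enum { WORDS = 1024 };
static uint32_t state[WORDS];
static uint32_t sum;                       /* XOR over state[]      */

static void run_step(void)                 /* stand-in workload     */
{
    size_t i = (size_t)rand() % WORDS;
    sum ^= state[i];                       /* checksum: remove old  */
    state[i] += 1;
    sum ^= state[i];                       /* checksum: add new     */
}

static int consistent(void)                /* the consistency check */
{
    uint32_t s = 0;
    for (size_t i = 0; i < WORDS; i++) s ^= state[i];
    return s == sum;
}

int main(void)
{
    unsigned detected = 0, trials = 1000;
    for (unsigned t = 0; t < trials; t++) {
        for (int k = 0; k < 100; k++) run_step();
        state[rand() % WORDS] ^= 1u << (rand() % 32);   /* inject */
        if (!consistent()) detected++;
        sum = 0;                           /* resync checksum so    */
        for (size_t i = 0; i < WORDS; i++) /* trials stay           */
            sum ^= state[i];               /* independent           */
    }
    printf("detected %u of %u injected faults\n", detected, trials);
    return 0;
}

With an XOR checksum every single-bit flip is caught by construction;
the interesting detected/undetected split comes from substituting the
application's real (weaker) invariants for consistent().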

--
                              Waldek Hebisch
