From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: sci.electronics.design
Subject: Re: DRAM accommodations
Date: Mon, 16 Sep 2024 19:05:24 -0700
Message-ID: <vcao57$3a145$1@dont-email.me>
References: <vbdcrs$gp01$1@dont-email.me> <vbflbe$tlhp$7@dont-email.me> <vcaiop$makp$1@paganini.bofh.team>
In-Reply-To: <vcaiop$makp$1@paganini.bofh.team>

On 9/16/2024 5:33 PM, Waldek Hebisch wrote:
> Don Y <blockedofcourse@foo.invalid> wrote:
>> On 9/5/2024 3:54 PM, Don Y wrote:
>>> Given the high rate of memory errors in DRAM, what steps
>>> are folks taking to mitigate the effects of these?
>>>
>>> Or, is ignorance truly bliss?  <frown>
>>
>> From discussions with colleagues, apparently, adding (external) ECC to
>> most MCUs is simply not possible; too much of the memory and DRAM
>> controllers are in-built (unlike older multi-chip microprocessors).
>> There's no easy way to generate a bus fault to rerun the bus cycle
>> or delay for the write-after-read correction.
>
> Are you really writing about MCU-s?

Yes.  In particular, the Cortex A series.

> My impression was that MCU-s
> which allow external memory (most do not) typically have WAIT signal
> so that external memory can insert extra wait states.
> OTOH access
> to external memory is much slower than to internal memory.

You typically use external memory when there isn't enough
internal memory for your needs.  I'm looking at 1-2GB / device
(~500GB - 2TB per system)

> People frequently use chips designed for gadgets like smartphones
> or TV-s, those tend to have integrated DRAM controller and no
> support for ECC.

Exactly.  As the DRAM controller is in-built, adding ECC isn't an option.
Unless the syndrome logic is ALSO in-built (it is on some offerings, but
not all).

>> And, among those devices that *do* support ECC, it's just a conventional
>> SECDED implementation.  So, a fair number of UCEs will plague any
>> design with an appreciable amount of DRAM (can you even BUY *small*
>> amounts of DRAM??)
>
> IIUC, if you need small amount of memory you should use SRAM...

As above, 1-2GB isn't small, even by today's standards.
And, SRAM isn't without its own data integrity issues.

>> For devices with PMMUs, it's possible to address the UCEs -- sort of.
>> But, this places an additional burden on the software and raises
>> the problem of "If you are getting UCEs, how sure are you that
>> undetected CEs aren't slipping through??"  (again, you can only
>> detect the UCEs via an explicit effort so you pay the fee and take
>> your chances!)
>>
>> For devices without PMMUs, you have to rely on POST or BIST.  And,
>> *hope* that everything works in the periods between (restart often!  :> )
>>
>> Back of the napkin figures suggest many errors are (silently!) encountered
>> in an 8-hour shift.  For XIP implementations, it's mainly data that is at
>> risk (though that can also include control flow information from, e.g.,
>> the pushdown stack).  For implementations that load their application
>> into DRAM, then the code is suspect as well as the data!
>>
>> [Which is likely to cause more detectable/undetectable problems?]
>
> I think that this estimate is somewhat pessimistic.
> The last case
> I remember that could be explained by memory error was about 25 years
> ago: I unpacked source of newer gcc version and tried to compile it.
> Compilation failed, I tracked trouble to a flipped bit in one source
> file.  Unpacking sources again gave correct value of affected bit
> (and no other bit changed).  And compiling the second copy worked OK.

You were likely doing this on a PC/workstation, right?  Nowadays,
having ECC *in* the PC is commonplace.

   "We made two key observations: First, although most personal-computer
   users blame system failures on software problems (quirks of the
   operating system, browser, and so forth) or maybe on malware infections,
   --> hardware was the main culprit.  At Los Alamos, for instance, more
   than 60 percent of machine outages came from hardware issues.  Digging
   further, we found that the most common hardware problem was faulty DRAM.
   This meshes with the experience of people operating big data centers,
   DRAM modules being among the most frequently replaced components."

> Earlier I saw segfaults that could be cured by exchanging DRAM
> modules or downclocking the machine.

... suggesting a memory integrity issue.

> But it seems that machines got more reliable and I did not remember
> any recent problem like this.  And I regularly do large compiles,
> here error in sources is very unlikely to go unnoticed.  I did

But, again, the use case, in a workstation, is entirely different than
in a "device" where the code is unpacked (from NAND flash or from a
remote file server) into RAM and then *left* there to execute for the
duration of the device's operation (hours, days, weeks, etc.).  I.e.,
the DRAM is used to emulate EPROM.

In a PC, when your application is "done", the DRAM effectively is
scrubbed by the loading of NEW data/text.  This tends not to be the case
for appliances/devices; such a reload only tends to happen when the
power is cycled and the contents of (volatile) DRAM are obviously lost.
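
For a code image that sits in DRAM for days or weeks, one software-only way to at least *notice* rot is to record a checksum per block when the image is loaded and re-verify the blocks in the background. The following is a minimal sketch of that idea, not something described in this thread: the block size and the function names `record_checksums` and `find_corrupt_blocks` are hypothetical, and a `bytes` object stands in for the loaded image.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical granularity; a real system would tune this


def record_checksums(image, block_size=BLOCK_SIZE):
    """Compute one CRC-32 per block, immediately after the image is loaded."""
    return [zlib.crc32(image[i:i + block_size])
            for i in range(0, len(image), block_size)]


def find_corrupt_blocks(image, expected, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose current CRC-32 no longer matches."""
    bad = []
    for idx, want in enumerate(expected):
        start = idx * block_size
        if zlib.crc32(image[start:start + block_size]) != want:
            bad.append(idx)
    return bad


if __name__ == "__main__":
    # Stand-in for application text unpacked into DRAM.
    image = bytearray(b"\x90" * (3 * BLOCK_SIZE))
    sums = record_checksums(bytes(image))

    # Simulate a single bit flip somewhere in the second block.
    image[BLOCK_SIZE + 17] ^= 0x04

    print(find_corrupt_blocks(bytes(image), sums))
```

This only detects; recovery (reloading the block from flash or the file server) is a separate step. It also only makes sense for contents that are not supposed to change, such as code and constant data.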
Even without ECC, not all errors are consequential.  If a bit flips and is
never accessed, <shrug>.  If a bit flips and it alters one opcode into
another that is equivalent /in the current program state/, <shrug>.
Ditto for data.

> large computations where any error had nontrivial chance to propagate
> to final result.  Some computations that I do are naturally error
> tolerant, but error tolerant part used tiny amount of data, while
> most data was "fragile" (error there was likely to be detected).

If you are reaccessing the memory (data or code) frequently, you give
the ECC a new chance to "refresh" the intended contents of that memory
(assuming your ECC hardware is configured for write-back operation).
So, the possibility of a second fault coming along while the first fault
is still in place (and uncorrected) is small.

OTOH, if the memory just sits there with the expectation that it will
retain its intended contents without a chance to be corrected (by the
ECC hardware), then bit-rot can continue, increasing the possibility of
a second bit failing while the first is still failed.

Remember, you don't even consider this possibility when you are
executing out of EPROM or NOR FLASH... you just assume bits retain their
state even when you aren't "looking at them".

> Concerning doing something about memory errors: on hardware side
> devices with DRAM that I use are COTS devices.  So the only thing
> I have is to look at reputation of the vendor, and general reputation
> says nothing about DRAM errors.  In other words, there is nothing
> I can realistically do.

You can select MCUs that *do* have support for ECC instead of just
"hoping" the (inevitable) memory errors won't bite you.  Even so, your
best case scenario is just SECDED protection.

> On software side I could try to add
> some extra error tolerance.  But I have various consistency checks
> and they indicate various troubles.  I deal with detected troubles,
> DRAM errors are not in this category.
I assume you *test* memory on POST?  But, if your device runs 24/7/365,
it can be a very long time between POSTs!  OTOH, you could force a test
cycle (by deliberately restarting your device -- "nightly maintenance")
or you could test memory WHILE the application is running.

And, what specific criteria do you use to get alarmed at the results??

<https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf>
<https://www.cse.cuhk.edu.hk/~pclee/www/pubs/srds22.pdf>
<https://core.ac.uk/download/pdf/81577401.pdf>

[Lots more where those come from.  Seeing *observed* data instead of
theoretical is eye-opening!  Is a server environment more pandering of
DRAM?  Or, *less* than, for example, a device that is controlling a
large electro-mechanical load and the switching/mechanical transients
that it might induce?]

Of course, only systems with lots of memory available to test are good
candidates for such testing -- hence the nature of these studies.

Do systems with soldered down memory behave better than those with
socketed devices?  (How many insertion cycles are your sockets rated
for?  How many have been *used*???)  Is an 80F degree cold aisle with
lots of forced air cooling better or worse than a 100F ambient with
convection cooling?  Is a server secured to an equipment rack a more
hospitable environment than one in which the device is feeling the
effects of 10T shocks at 200Hz?

But, error *rates* can be extrapolated to an arbitrary memory size.
E.g., the Google study (old, now) noted FITs of 25,000-75,000 per Mb.
This is on the order of

========== REMAINDER OF ARTICLE TRUNCATED ==========
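
The kind of extrapolation mentioned above (a FIT rate per Mb scaled to a given memory size) can be sketched as follows. This is only an illustration under stated assumptions and does not reproduce the truncated text. It assumes that FIT means failures per 10^9 device-hours, that "Mb" means megabit, and that errors scale linearly with capacity and time. The function name `expected_errors` is hypothetical, and the memory sizes are the 1-2GB per device mentioned earlier in the post.

```python
def expected_errors(fit_per_mbit, mem_gib, hours):
    """Expected error count for a memory of mem_gib GiB over 'hours' hours.

    Assumes FIT = failures per 1e9 device-hours, quoted per megabit,
    and simple linear scaling with capacity and time.
    """
    mbits = mem_gib * 1024 * 8          # GiB -> MiB -> Mbit
    return fit_per_mbit * mbits * hours / 1e9


if __name__ == "__main__":
    shift_hours = 8                      # the "8-hour shift" from the post
    for fit in (25_000, 75_000):         # range quoted from the Google study
        for gib in (1, 2):               # per-device memory sizes from the post
            n = expected_errors(fit, gib, shift_hours)
            print(f"{fit:>6} FIT/Mbit, {gib} GiB, {shift_hours} h: {n:.2f} expected errors")
```

A figure like this is a fleet-wide average. It says nothing about how errors are distributed across individual devices, nor about how many of them would be correctable.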