From: Don Y
Newsgroups: sci.electronics.design
Subject: Re: DRAM accommodations
Date: Tue, 17 Sep 2024 07:57:39 -0700

On 9/17/2024 6:47 AM, Chris Jones wrote:
> On 6/09/2024 8:54 am, Don Y wrote:
>> Given the high rate of memory errors in DRAM, what steps
>> are folks taking to mitigate the effects of these?
>>
>> Or, is ignorance truly bliss?
>
> Do we know whether DRAM chips implement ECC internally?

Some do (some processors implement ECC on internal data pathways!).
But, I've never seen any details of the mechanism(s) employed
and it's not likely that manufacturers would be eager to release
those details (competitive advantage, leaks information about how
good their process is, how close to their technological capacity
they are operating, etc.).

> It seems an obvious
> thing for them to do. Of course it wouldn't help with bad solder joints on the
> DIMM, but it would help with many kinds of faults on the chip.

It also won't help with transfers between CPU and memory device,
subtle timing errors in the implementation, etc.

But, you have to remember: ECC isn't a panacea.
- It doesn't correct *all* errors (e.g., original SECDED just
  corrected a single bit error)
- It can MIScorrect errors
- It doesn't DETECT all errors (e.g., it only reliably detects
  TWO errors; for k-bit data, there will be 2^k code words that
  appear "correct" -- a number identical to the actual number of
  code words that *are* correct! -- yet have UNDETECTABLE errors),
  etc.

There is also often a cost to the ECC operation in terms of time,
power consumption, etc.

And, if you hide the functioning of the ECC inside the memory device,
then the application has no way of gauging how well the memory is
performing with/without the ECC functionality! You never know if
the ECC is only occasionally fixing stored data OR if it is fixing
EVERY access! (in the latter case, one should be wary of the number
of mistakes it is possibly making as well as the number of undetectable
errors that are slipping past it!)

Needless to say, there is a lot of research into alternative ECC schemes
that try to address different aspects of DRAM faults and failures.

But, naively expecting DRAM to store what you write to it is a fairy tale.
So, you should have, in place, a strategy to address those likely failures
in your product design (or, just blame it on "the software" :> )
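
To put numbers on the three caveats in the list above (limited correction, miscorrection, undetected errors), here is a minimal sketch using a toy extended Hamming (8,4) SECDED code. This is an assumption chosen for illustration: real memory parts and controllers use wider and often proprietary codes, and the names encode, decode and tally are made up for this example. The effect it shows is a general property of distance-4 codes, though.

```python
from collections import Counter
from itertools import combinations

def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) word (list of bits)."""
    w = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; check bits at 1, 2, 4.
    w[3], w[5], w[6], w[7] = [(nibble >> i) & 1 for i in range(4)]
    w[1] = w[3] ^ w[5] ^ w[7]
    w[2] = w[3] ^ w[6] ^ w[7]
    w[4] = w[5] ^ w[6] ^ w[7]
    w[0] = sum(w[1:]) & 1          # overall parity: whole word has even weight
    return w

def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'detected'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):        # XOR of the positions of all set bits
        if w[pos]:
            syndrome ^= pos
    odd = sum(w) & 1
    if odd:
        # Odd weight: the decoder ASSUMES exactly one bit flipped and "fixes" it
        # (syndrome 0 here means the overall parity bit itself).
        w[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        return "detected", None    # even weight, bad syndrome: uncorrectable
    else:
        status = "ok"
    return status, w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)

def tally(nibble, n_flips):
    """Flip every possible set of n_flips bits and classify the decoder's verdict."""
    good = encode(nibble)
    out = Counter()
    for positions in combinations(range(8), n_flips):
        bad = list(good)
        for p in positions:
            bad[p] ^= 1
        status, value = decode(bad)
        if status == "detected":
            out["detected"] += 1
        elif value == nibble:
            out["recovered"] += 1
        elif status == "corrected":
            out["miscorrected"] += 1   # decoder claims a fix, data is wrong
        else:
            out["undetected"] += 1     # decoder reports 'ok', data is wrong
    return out

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "bit(s) flipped:", dict(tally(0b1011, n)))
```

With this toy code, all 8 single-bit flips are recovered and all 28 double-bit flips are flagged. But every one of the 56 triple-bit flips gets "corrected" to the wrong data, and 14 of the 70 four-bit flips land on another valid code word and are reported as clean. From outside the decoder, those last two cases are indistinguishable from a good read.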