Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: sci.electronics.design
Subject: Re: DRAM accommodations
Date: Tue, 17 Sep 2024 11:57:56 -0700
Organization: A noiseless patient Spider
Lines: 191
Message-ID: <vccjfp$3lgt3$2@dont-email.me>
References: <vbdcrs$gp01$1@dont-email.me> <vbflbe$tlhp$7@dont-email.me>
<vcaiop$makp$1@paganini.bofh.team> <vcao57$3a145$1@dont-email.me>
<vcc9q6$p2mo$1@paganini.bofh.team>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 17 Sep 2024 20:58:01 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="32ff6696ce38bb0c4624127f2057158e";
logging-data="3851171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18u9fmTxMl6pbB7/F0cGRi4"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.2.2
Cancel-Lock: sha1:jqO04WqUXKr4lX1iS6+eqdW7sO4=
In-Reply-To: <vcc9q6$p2mo$1@paganini.bofh.team>
Content-Language: en-US
Bytes: 10258
On 9/17/2024 9:12 AM, antispam@fricas.org wrote:
>>> Are you really writing about MCUs?
>>
>> Yes. In particular, the Cortex A series.
>
> I do not think that Cortex A is an MCU. More precisely, the Cortex A
> core is not intended to be used in MCUs and I think that most
> chips using it are not MCUs. The distinguishing feature here is whether
> the chip is intended to be a complete system (MCU) or you are expected
> to _always_ use it with external chips (typical Cortex A based SoC).
An MCU integrates peripherals into the same package as the CPU.
By contrast, earlier devices (MPUs) had a family of peripherals
often assembled around the CPU/MPU.
An 8051 is an MCU. Yet, it can support external (static) memory.
The A-series devices support LCD displays, ethernet, serial ports,
counter/timers/PWM, internal static memory and FLASH as well as
controllers for external static and dynamic memory. One can build a
product with or without the addition of external memory (e.g., the
SAMA5D34 has 160KB of ROM, 128KB of SRAM, 32K of i/d cache as well
as an assortment of peripherals that would typically be EXTERNAL
to an MPU-based design).
SoCs differ in that they are *intended* to be complete systems
(yet typically *still* off-board the physical SDRAM).
At the other end of the spectrum are discrete CPUs -- requiring
external support chips to implement *anything* of use.
>>> I think that this estimate is somewhat pessimistic. The last case
>>> I remember that could be explained by a memory error was about 25 years
>>> ago: I unpacked the source of a newer gcc version and tried to compile it.
>>> Compilation failed, and I tracked the trouble to a flipped bit in one source
>>> file. Unpacking the sources again gave the correct value of the affected bit
>>> (and no other bit changed). And compiling the second copy worked OK.
>>
>> You were likely doing this on a PC/workstation, right? Nowadays,
>> having ECC *in* the PC is commonplace.
>
> PCs, yes (small SBCs too, but those were not used for really heavy
> computations). Concerning ECC, most PCs that I used came without ECC.
> IME ECC used to roughly double the price of a PC compared to a non-ECC
> one. So, important servers got ECC, other PCs were non-ECC.
Without ECC support in the memory, the hardware AND the OS, you can
never tell whether you are having memory errors.
>> "We made two key observations: First, although most personal-computer
>> users blame system failures on software problems (quirks of the
>> operating system, browser, and so forth) or maybe on malware infections,
>> --> hardware was the main culprit. At Los Alamos, for instance, more than
>> 60 percent of machine outages came from hardware issues. Digging further,
>> we found that the most common hardware problem was faulty DRAM. This
>> meshes with the experience of people operating big data centers, DRAM
>> modules being among the most frequently replaced components."
>
> I remember a memory corruption study from several years ago that said
> software was a significant issue. In particular, bugs in the BIOS and Linux
> kernel led to random-looking memory corruption. Hopefully, the issues
> that they noticed are fixed now. Exact percentages probably do
> not matter much, because both hardware and software are changing.
> The point is that there are both hardware errors and software errors,
> and without deep investigation the software ones are hard to
> distinguish from hardware ones.
Systems with ECC make it (relatively) easy to determine if you are having
memory errors. Those without it leave you simply guessing as to the
nature of any "problems" you may experience.
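
On a machine that *does* have ECC (and an OS that surfaces it), "are
we having memory errors?" has a concrete answer. On Linux, for
example, the EDAC subsystem keeps running counts of corrected and
uncorrected errors per memory controller. A minimal sketch in C,
assuming the edac driver is loaded and mc0 is the controller of
interest:

    /* Minimal sketch: read the corrected/uncorrected error counts
       the Linux EDAC subsystem exposes via sysfs.  Assumes an
       ECC-capable board with the edac driver loaded; mc0 is the
       first memory controller. */
    #include <stdio.h>

    static long read_count(const char *path)
    {
        FILE *f = fopen(path, "r");
        long n = -1;

        if (f) {
            if (fscanf(f, "%ld", &n) != 1)
                n = -1;
            fclose(f);
        }
        return n;
    }

    int main(void)
    {
        long ce = read_count("/sys/devices/system/edac/mc/mc0/ce_count");
        long ue = read_count("/sys/devices/system/edac/mc/mc0/ue_count");

        if (ce < 0 || ue < 0) {
            fprintf(stderr, "no EDAC counters -- non-ECC system?\n");
            return 1;
        }
        printf("corrected:   %ld\nuncorrected: %ld\n", ce, ue);
        return 0;
    }

A steadily climbing ce_count is your early warning; a nonzero
ue_count means the ECC has already been overwhelmed at least once.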
>> So, the possibility of a second fault coming along
>> while the first fault is still in place (and uncorrected) is small.
>>
>> OTOH, if the memory just sits there with the expectation that it
>> will retain its intended contents without a chance to be corrected
>> (by the ECC hardware), then bit-rot can continue increasing the
>> possibility of a second bit failing while the first is still failed.
>
> Well, if you are concerned you can implement a low priority process
> that will read RAM, possibly doing some extra work (like detecting
> unexpected changes).
You have to design your system (particularly the OS) with this capability
in mind. You ("it") need to know which regions of memory are invariant
over which intervals.
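
A sketch of what such a scrubber might look like, assuming the OS
maintains a table of regions it has declared invariant (the table,
crc32() and the reporting hook are all hypothetical here):

    /* Hypothetical sketch: a low-priority task walks a table of
       regions the OS has declared invariant and compares a stored
       CRC against a freshly computed one.  The table, crc32() and
       report_corruption() are assumed to exist elsewhere. */
    #include <stdint.h>
    #include <stddef.h>

    struct region {
        const uint8_t *base;  /* start of invariant region */
        size_t         len;   /* length in bytes */
        uint32_t       crc;   /* CRC captured when region went invariant */
    };

    extern struct region region_table[];
    extern size_t        region_count;
    extern uint32_t      crc32(const uint8_t *p, size_t n);
    extern void          report_corruption(const struct region *r);

    void scrub_pass(void)     /* run periodically, at lowest priority */
    {
        for (size_t i = 0; i < region_count; i++) {
            const struct region *r = &region_table[i];

            if (crc32(r->base, r->len) != r->crc)
                report_corruption(r);  /* bit-rot, or a stray write */
        }
    }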
OTOH, simple periodic *testing* of memory can reveal hard errors
(Google claimed hard errors to be more prevalent than soft ones).
But, this just gives you information; you still need mechanisms in
place that let you "retire" bad sections of memory (up to and
including ALL memory).
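
A sketch of the test-and-retire side, with a hypothetical
page_retire() hook into the allocator. Note the test is destructive,
so it may only run on pages currently out of service; a real test
would use proper march elements, not four constants:

    /* Hypothetical sketch of test-and-retire.  page_retire() is an
       assumed hook into the allocator; destructive test, so only run
       it on pages that are not currently allocated. */
    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    extern void page_retire(volatile uint32_t *page);

    static int page_test(volatile uint32_t *page)
    {
        static const uint32_t pat[] = { 0x00000000, 0xFFFFFFFF,
                                        0xAAAAAAAA, 0x55555555 };

        for (size_t p = 0; p < sizeof pat / sizeof pat[0]; p++) {
            for (size_t i = 0; i < PAGE_WORDS; i++)
                page[i] = pat[p];
            for (size_t i = 0; i < PAGE_WORDS; i++)
                if (page[i] != pat[p])
                    return -1;         /* hard error */
        }
        return 0;
    }

    void page_check(volatile uint32_t *page)
    {
        if (page_test(page) != 0)
            page_retire(page);         /* take it out of service */
    }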
>> You can select MCUs that *do* have support for ECC instead of just
>> "hoping" the (inevitable) memory errors won't bite you. Even so,
>> your best case scenario is just SECDED protection.
>
> When working with MCUs I have internal SRAM instead of DRAM.
SRAM is not without its own problems. As geometries shrink and
capacities increase, the reliability of memory suffers.
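
For concreteness, this is roughly what the SECDED mentioned above
buys you: one flipped bit per word is silently corrected, two flipped
bits are (only) detected. A toy software sketch over a 32-bit word
follows; real controllers do this in parallel hardware with optimized
check-bit matrices:

    /* Toy SECDED sketch: extended Hamming code over 32 data bits.
       Check bits sit at positions 1,2,4,8,16,32, data bits at the
       remaining positions 3..38, overall parity at position 0. */
    #include <stdint.h>

    static uint64_t secded_encode(uint32_t data)
    {
        uint64_t cw = 0;
        int p;

        /* scatter data into the non-power-of-two positions */
        for (int pos = 3, bit = 0; bit < 32; pos++) {
            if (pos & (pos - 1)) {          /* skip powers of two */
                if ((data >> bit) & 1)
                    cw |= 1ULL << pos;
                bit++;
            }
        }
        /* each check bit makes the set-bit positions XOR to zero */
        for (int i = 0; i < 6; i++) {
            p = 0;
            for (int j = 3; j <= 38; j++)
                if (((j >> i) & 1) && ((cw >> j) & 1))
                    p ^= 1;
            cw |= (uint64_t)p << (1 << i);
        }
        /* overall parity over the 38 Hamming bits */
        p = 0;
        for (int j = 1; j <= 38; j++)
            p ^= (int)((cw >> j) & 1);
        cw |= (uint64_t)p;
        return cw;
    }

    /* 0 = clean, 1 = single error corrected, -1 = double detected */
    static int secded_check(uint64_t *cw)
    {
        int syn = 0, par = 0;

        for (int j = 0; j <= 38; j++)
            if ((*cw >> j) & 1) {
                par ^= 1;
                syn ^= j;               /* XOR of set-bit positions */
            }

        if (syn == 0 && par == 0)
            return 0;                   /* no error */
        if (par == 1) {                 /* single flip: syndrome is */
            *cw ^= 1ULL << syn;         /* the offending position   */
            return 1;
        }
        return -1;                      /* two flips: detect only */
    }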
> And in several cases manufacturers claim parity or ECC protection
> for SRAM. But in the case of PCs the hardware and OS are commodity.
> Due to price reasons I mostly deal with non-ECC hardware.
>
>>> On the software side I could try to add
>>> some extra error tolerance. But I have various consistency checks
>>> and they indicate various troubles. I deal with detected troubles;
>>> DRAM errors are not in this category.
>>
>> I assume you *test* memory on POST?
>
> On PCs that is the BIOS and OS, which I did not write. And AFAIK BIOS
> POST detects memory size and does a few sanity checks
> to catch gross troubles. But I am not sure if I would call them
> "memory tests"; better memory tests tend to run for days.
Many machines have "extended diagnostics" (often on a hidden disk
partition) that will let you run such tests. Or, install an app to do so.
You have to know the types of errors you are attempting to detect
and design your tests to address those. E.g., in the days of 1K and
4K DRAMs, it was not uncommon to verify *refresh* was working properly
(as one often had to build the DRAM controller out of discrete logic,
so you needed assurance that the refresh aspect was operating as
intended).
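
E.g., a crude retention check of that sort might look like the
following; delay_ms() and the region bounds are assumptions, and the
pattern is address-derived so a stuck address line shows up as well:

    /* Sketch of a retention (refresh) test: fill a region, sit idle
       long enough that every row must have been refreshed many times
       over, then verify.  delay_ms() is an assumed helper. */
    #include <stdint.h>
    #include <stddef.h>

    extern void delay_ms(unsigned ms);

    int retention_test(volatile uint32_t *base, size_t words)
    {
        for (size_t i = 0; i < words; i++)
            base[i] = (uint32_t)i ^ 0xA5A5A5A5u;  /* address-in-address */

        delay_ms(2000);   /* many 64 ms refresh periods, no accesses */

        for (size_t i = 0; i < words; i++)
            if (base[i] != ((uint32_t)i ^ 0xA5A5A5A5u))
                return -1;                        /* retention failure */
        return 0;
    }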
>> But, if your device runs
>> 24/7/365, it can be a very long time between POSTs! OTOH, you
>> could force a test cycle (either by deliberately restarting
>> your device -- "nightly maintenance") or, you could test memory
>> WHILE the application is running.
>>
>> And, what specific criteria do you use to get alarmed at the results??
>
> As I wrote I am doing computations and criteria are problem specific.
> For example I have two multidigit numbers and one is supposed to
> exactly divide the other. If not, software signals an error.
But you still don't know *why* the error came about.
> I read several papers about DRAM errors and I take seriously the
> possibility that they can happen. But simply at my scale they
> do not seem to matter.
>
> BTW: It seems that currently a large fraction of errors (both software
> and hardware ones) appear semi-randomly. So, to estimate
> reliability one should use statistical methods. But if you aim
> at high reliability, then the needed sample size may be impractically
> large. You may be able to add mitigations for rare problems that
> you can predict/guess, but you will be left with unpredictable
You can only ever address the known unknowns. Discovery of unknown
unknowns is serendipitous.
PC applications (the whole PC environment) are likely more tolerant
of memory errors; new applications are always loaded over the previous
one, the user can always rerun an application/calculation that is
suspect, you can reboot, etc.
The more demanding scenario is using DRAM as "EPROM" for load-once
applications (appliances).
> ones. In other words, it makes sense to concentrate on problems
> that you see (you including your customers). AFAIK some big
> companies have worldwide automatic error reporting systems.
> If you set up such a thing then you may get useful info.
>
> You can try "error injection", that is, run tests with an extra
> component that simulates memory errors. Then you will have
> some info about the effects:
> - do memory errors cause incorrect operation?
> - is incorrect operation detected?
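A crude software stand-in for that extra component is to deliberately
flip a bit in a buffer between processing stages and count how often
the downstream checks notice. The function below is illustrative;
where you call it determines what you learn:

    /* Flip one randomly chosen bit in a buffer -- a software proxy
       for a single-event upset.  Call it between processing stages,
       then tally detected vs. silent corruptions over many trials. */
    #include <stdint.h>
    #include <stdlib.h>

    void inject_bit_error(uint8_t *buf, size_t len)
    {
        size_t byte = (size_t)rand() % len;
        int    bit  = rand() % 8;

        buf[byte] ^= (uint8_t)(1 << bit);
    }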
========== REMAINDER OF ARTICLE TRUNCATED ==========