Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Thu, 24 Oct 2024 14:28:44 -0700
Organization: A noiseless patient Spider
Lines: 180
Message-ID:
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 24 Oct 2024 23:28:54 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="7d3883bb28e29a199c42566abe9b37f1"; logging-data="2945688"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX184UQgWta6CPg5TlQA7xaEf"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2
Cancel-Lock: sha1:QnCz2hdmahYRLxOwIUaqFWRIfwM=
Content-Language: en-US
In-Reply-To:
Bytes: 9891

On 10/24/2024 9:34 AM, Waldek Hebisch wrote:
>>>> That hasn't been *proven*. And, "misbehavior" is not the same
>>>> as *failure*.
>>>
>>> First, I mean relevant hardware, that is hardware inside a MCU.
>>> I think that there are strong arguments that such hardware is
>>> more reliable than software. I have seen a claim, based on analysis
>>> of discovered failures, that software written to rigorous development
>>> standards exhibits on average about 1 bug (that leads to failure) per
>>> 1000 lines of code. This means that even a small MCU has enough
>>> space for a handful of bugs. And for bigger systems it gets worse.
>>
>> But bugs need not be consequential. They may be undesirable or
>> even annoying but need not have associated "costs".
>
> The point is that you cannot eliminate all bugs. Rather, you
> should have simple code with the aim of preventing the "cost" of bugs.

Code need only be "as simple as possible, /but no simpler/".
The problem defines the complexity of the solution.

>>>>> And among hardware failures transient upsets, like flipped
>>>>> bits, are more likely than permanent failures.
>>>>> For example,
>>>>
>>>> That used to be the thinking with DRAM but studies have shown
>>>> that *hard* failures are more common. These *can* be found...
>>>> *if* you go looking for them!
>>>
>>> In another place I wrote that one of the studies that I saw claimed
>>> that a significant number of the errors they detected (they monitored
>>> changes to a memory area that was supposed to be unmodified) were due
>>> to buggy software. And DRAM is special.
>>
>> If you have memory protection hardware (I do), then such changes
>> can't casually occur; the software has to make a deliberate
>> attempt to tell the memory controller to allow such a change.
>
> The tests were run on Linux boxes with normal memory protection.
> Memory protection does not prevent troubles due to bugs in
> privileged code. Of course, you can think that you can do
> better than Linux programmers.

Linux code is far from "as simple as possible". They are constantly
trying to make a GENERAL PURPOSE solution for a wide variety of
applications that THEY envision.

>>>> E.g., if you load code into RAM (from FLASH) for execution,
>>>> are you sure the image *in* the RAM is the image from the FLASH?
>>>> What about "now"? And "now"?!
>>>
>>> You are supposed to regularly verify a sufficiently strong checksum.
>>
>> Really? Wanna bet that doesn't happen? How many Linux-based devices
>> load applications and start a process to continuously verify the
>> integrity of the TEXT segment?
>
> Using something like Linux means that you do not care about rare
> problems (or are prepared to resolve them without help of the OS).

Using it means you don't care about any of the issues that the
developers considered unimportant or were incapable of/unwilling
to address.

>> What are they going to do if they notice a discrepancy? Reload
>> the application and hope it avoids any "soft spots" in memory?
>
> AFAICS the rules about checking the image originally were intended
> for devices executing code directly from flash; if your "primary
> truth" fails, possibilities are limited. With DRAM failures one
> can do much better. The question is mainly probabilities and
> effort.

One typically doesn't assume flash fails WHILE in use (though, of
course, it does). DRAM is documented to fail while in use. If you
have "little enough" of it then you can hope the failures are far
enough apart, in time, that they just look like "bugs". This is
especially true if your device is only running "part time" or is
unobserved for long stretches of time, as that "bug" can manifest
in numerous ways depending on the nature of the DRAM fault and *if*
the CPU happens to encounter it.

E.g., just like a RAID array, absent patrol reads, you never know
if a file that hasn't been referenced in months has suffered any
corruption.

In many applications, there are large swaths of code that get
executed once or infrequently. E.g., how often does the first line
of code after main() get executed? If it was corrupted after that
initial execution/examination, would you know? Or care?

Ah, but if you don't NOTICE that it has been corrupted, then you
will proceed gleefully ignorant of the fact that your memory system
is encountering problem(s) and, thus, won't take any steps to
address them.

>>>>> at low safety level you may assume that hardware of a counter
>>>>> generating PWM-ed signal works correctly, but you are
>>>>> supposed to periodically verify that configuration registers
>>>>> keep expected values.
>>>>
>>>> Why would you expect the registers to lose their settings?
>>>> Would you expect the CPU's registers to be similarly flakey?
>>>
>>> First, such checking is not my idea, but one point from a checklist
>>> for low safety devices. Registers may change due to bugs, EMC events,
>>> cosmic rays and similar.
>>
>> Then you are dealing with high reliability designs.
>> Do you really think my microwave oven, stove, furnace, telephone,
>> etc. are designed to be resilient to those types of faults?
>> Do you think the user could detect such an occurrence?
>
> IIUC microwave, stove and furnace should be. In a cell phone the
> BMS should be safe and the core radio is tightly regulated. Other
> parts seem to be at the quality/reliability level of PCs.
>
> You clearly want to make your devices more reliable. Bugs
> and various events happen and extra checking is actually
> quite cheap. It is for you to decide if you need/want it.

Unlike a phone or "small appliance" that you can carry in to a
service center -- or return to the store where purchased -- I can't
expect a user to just pick up a failed/suspect device and exchange
it for a "new" one.

Could you remove the PCB that controls your furnace and bring it in
to have someone tell you if it is "acting up" and in need of
replacement? Would you even THINK to do this?

Instead, if you were having problems (or suspected you were) with
your "furnace", you would call a service man to come to *it* and
have a look. Here (US), that is costly -- they aren't going to
drive to you without some sort of compensation (and they aren't
going to do so for "Uber rates").

E.g., a few winters past, the natural gas supply to our city was
"compromised"; it was unusually cold and demand exceeded the ability
of the system to deliver gas at sufficient pressure to the entire
city. Most furnaces rely on a certain flow of fuel to operate.
And, contain sensors to shut down the furnace if they sense an
inadequate fuel supply. So, much of the city had no heat.

This resulted in most plumbing contractors being overwhelmed with
calls for service. Of course, there was nothing they could *do* to
correct the problem. But, that didn't stop them from taking orders
for service, dispatching their trucks and BILLING each of these
customers.
One could argue that a more moral industry might have recommended
callers "wait a while as there is a citywide problem with the gas
supply". But, maybe there was something ELSE at fault for some of
those callers? And, it's a great opportunity to get into their
homes and try to sell unsuspecting homeowners an upgrade of their
"old equipment" (an HVAC system is in the $10K price range for a
nominal home).

Bring your phone into a store complaining of a problem and they
will likely show you what you are doing wrong. They *may* suggest
you upgrade that 3-year-old model -- but, if that is the sole
reason for their suggested upgrade, you will likely decline and
walk away. "It's just a phone; if it keeps giving me problems,
THEN I will upgrade."

That's not the case with something "magical" (in the eyes of a
homeowner) like an HVAC system. Upgrading later may be even less
convenient than it is now! And, it actually *is* an old system...
(who defines old?)

>> Studies suggest that temperature doesn't play the role that
>> was suspected. What ECC does is give you *data* about faults.
>> Without it, you have no way to know about faults /as they
>> occur/.
>
> Well, there is evidence that increased temperature increases the
> chance of errors. More precisely, expect errors when you operate
> DRAM close to the max allowed temperature. The point is that you
> can cause errors and that way test your recovery

========== REMAINDER OF ARTICLE TRUNCATED ==========