Article <vfdsu1$3cled$1@paganini.bofh.team>

Path: ...!feeds.phibee-telecom.net!2.eu.feeder.erje.net!feeder.erje.net!newsfeed.bofh.team!paganini.bofh.team!not-for-mail
From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Thu, 24 Oct 2024 16:34:11 -0000 (UTC)
Organization: To protect and to server
Message-ID: <vfdsu1$3cled$1@paganini.bofh.team>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team> <veummc$3gbqs$1@dont-email.me> <vev7cu$1rdu5$1@paganini.bofh.team> <vevb65$3mr5m$1@dont-email.me> <vf0dkr$1t54l$1@paganini.bofh.team> <vf0oak$3v9qi$1@dont-email.me>
Injection-Date: Thu, 24 Oct 2024 16:34:11 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="3560909"; posting-host="WwiNTD3IIceGeoS5hCc4+A.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
X-Notice: Filtered by postfilter v. 0.9.3
Bytes: 9446
Lines: 174

Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/19/2024 6:53 AM, Waldek Hebisch wrote:
>>>>> And, the *extent* to which testing is done is the subject
>>>>> addressed; if I ensure "stuff" *WORKED* when the device was
>>>>> powered on (preventing it from continuing on to its normal
>>>>> functionality in the event that some failure was detected),
>>>>> what assurance does that give me that the device's integrity
>>>>> is still intact 8760 hours (1 yr) later?  720 hours
>>>>> (1 mo)?  168 hours (1 wk)?  24 hours?  *1* hour????
>>>>
>>>> What to test is really domain-specific.  Traditional thinking
>>>> is that computer hardware is _much_ more reliable than
>>>> software and software bugs are a major source of misbehaviour.
>>>
>>> That hasn't been *proven*.  And, "misbehavior" is not the same
>>> as *failure*.
>> 
>> First, I mean relevant hardware, that is hardware inside an MCU.
>> I think that there are strong arguments that such hardware is
>> more reliable than software.  I have seen a claim, based on analysis
>> of discovered failures, that software written to rigorous development
>> standards exhibits on average about 1 bug (that leads to failure) per
>> 1000 lines of code.  This means that even a small MCU has enough
>> space for a handful of bugs.  And for bigger systems it gets worse.
> 
> But bugs need not be consequential.  They may be undesirable or
> even annoying but need not have associated "costs".

The point is that you cannot eliminate all bugs.  Rather, you
should have simple code, with the aim of preventing the "cost" of bugs.

>>>> And among hardware failures, transient upsets like a flipped
>>>> bit are more likely than permanent failure.  For example,
>>>
>>> That used to be the thinking with DRAM but studies have shown
>>> that *hard* failures are more common.  These *can* be found...
>>> *if* you go looking for them!
>> 
>> In another place I wrote that one of the studies I saw claimed that
>> a significant number of the errors they detected (they monitored changes
>> to a memory area that was supposed to be unmodified) were due to buggy
>> software.  And DRAM is special.
> 
> If you have memory protection hardware (I do), then such changes
> can't casually occur; the software has to make a deliberate
> attempt to tell the memory controller to allow such a change.

The tests were run on Linux boxes with normal memory protection.
Memory protection does not prevent troubles due to bugs in
privileged code.  Of course, you may think that you can do
better than Linux programmers.

>>> E.g., if you load code into RAM (from FLASH) for execution,
>>> are you sure the image *in* the RAM is the image from the FLASH?
>>> What about "now"?  And "now"?!
>> 
>> You are supposed to regularly verify a sufficiently strong checksum.
> 
> Really?  Wanna bet that doesn't happen?  How many Linux-based devices
> load applications and start a process to continuously verify the
> integrity of the TEXT segment?

Using something like Linux means that you do not care about rare
problems (or are prepared to resolve them without the help of the OS).

> What are they going to do if they notice a discrepancy?  Reload
> the application and hope it avoids any "soft spots" in memory?

AFAICS the rule about checking the image was originally intended
for devices executing code directly from flash; if your "primary
truth" fails, possibilities are limited.  With DRAM failures one
can do much better.  The question is mainly one of probabilities
and effort.
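
A minimal sketch of such a periodic check, assuming CRC-32 as the
"sufficiently strong" checksum and a check that is spread over many
small steps so that it fits into idle time.  The names image_check,
image_check_init and image_check_step are made up for illustration;
where the expected value comes from (build tool, loader) and what to
do on CHECK_FAIL are left open.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), updated incrementally. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical state for a background check of one code image. */
struct image_check {
    const uint8_t *base;   /* start of the image (flash or RAM copy) */
    size_t len;            /* image length in bytes */
    uint32_t expected;     /* CRC-32 recorded at build or load time */
    size_t pos;            /* bytes processed in the current pass */
    uint32_t crc;          /* running CRC of the current pass */
};

enum check_result { CHECK_IN_PROGRESS, CHECK_PASS, CHECK_FAIL };

static void image_check_init(struct image_check *c, const uint8_t *base,
                             size_t len, uint32_t expected)
{
    c->base = base;
    c->len = len;
    c->expected = expected;
    c->pos = 0;
    c->crc = 0xFFFFFFFFu;
}

/* Process at most `chunk` bytes; call from an idle loop or a
   low-priority task.  After a full pass the state restarts. */
static enum check_result image_check_step(struct image_check *c, size_t chunk)
{
    size_t n = c->len - c->pos;
    if (n > chunk)
        n = chunk;
    c->crc = crc32_update(c->crc, c->base + c->pos, n);
    c->pos += n;
    if (c->pos < c->len)
        return CHECK_IN_PROGRESS;

    uint32_t final = c->crc ^ 0xFFFFFFFFu;
    c->pos = 0;
    c->crc = 0xFFFFFFFFu;
    return final == c->expected ? CHECK_PASS : CHECK_FAIL;
}
```

The chunk size trades detection latency against CPU load: a smaller
chunk costs less per call but a full pass over the image takes longer.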

>>>> at low safety level you may assume that hardware of a counter
>>>> generating PWM-ed signal works correctly, but you are
>>>> supposed to periodically verify that configuration registers
>>>> keep expected values.
>>>
>>> Why would you expect the registers to lose their settings?
>>> Would you expect the CPU's registers to be similarly flakey?
>> 
>> First, such checking is not my idea, but one point from a checklist for
>> low safety devices.  Registers may change due to bugs, EMC events,
>> cosmic rays and similar.
> 
> Then you are dealing with high reliability designs.  Do you
> really think my microwave oven, stove, furnace, telephone,
> etc. are designed to be resilient to those types of faults?
> Do you think the user could detect such an occurrence?

IIUC microwave, stove and furnace should be.  In a cell phone the
BMS should be safe and the core radio is tightly regulated.  Other
parts seem to be at the quality/reliability level of PCs.

You clearly want to make your devices more reliable.  Bugs
and various events happen and extra checking is actually
quite cheap.  It is for you to decide if you need/want
it.
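
To illustrate how cheap, here is a minimal sketch of the periodic
configuration register check mentioned earlier.  The reg_watch
structure and reg_watch_scan function are hypothetical names, not
taken from any checklist or vendor library; the register addresses,
masks and expected values would come from your own initialization
code.

```c
#include <stddef.h>
#include <stdint.h>

/* One watched register: where it lives, which bits software owns,
   and the value those bits were configured to. */
struct reg_watch {
    volatile uint32_t *reg;
    uint32_t mask;       /* configuration bits only; status bits excluded */
    uint32_t expected;
};

/* Compare every watched register with its expected value and rewrite
   the configuration bits of any that differ.  Returns the number of
   mismatches found, so the caller can log or escalate. */
static unsigned reg_watch_scan(const struct reg_watch *tab, size_t n)
{
    unsigned bad = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = *tab[i].reg;
        if ((v & tab[i].mask) != (tab[i].expected & tab[i].mask)) {
            *tab[i].reg = (v & ~tab[i].mask)
                        | (tab[i].expected & tab[i].mask);
            bad++;
        }
    }
    return bad;
}
```

Rewriting the register is only one possible reaction and is an
assumption of this sketch.  Depending on the device, going to a safe
state or resetting may be the better answer to a nonzero count.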

>>>> Historically OS-es had a map of bad blocks on the disc and
>>>> avoided allocating them.  In principle, on a system with paging
>>>> hardware the same could be done for DRAM, but I do not think
>>>> anybody is doing this (if the domain is serious enough to worry
>>>> about DRAM failures, then it probably has redundant independent
>>>> computers with ECC DRAM).
>>>
>>> Using ECC DRAM doesn't solve the problem.  If you see errors
>>> reported by your ECC RAM (corrected errors), then when do
>>> you decide you are seeing too many and losing confidence that
>>> the ECC is actually *detecting* all multibit errors?
>> 
>> ECC is part of the solution.  It may reduce the probability of errors
>> enough that you no longer consider them a serious concern.  And if you
>> really care you may try to increase error rate (say by putting
>> RAM chips at increased temperature) and test that your detection
>> and recovery strategy works OK.
> 
> Studies suggest that temperature doesn't play the role that
> was suspected.  What ECC does is give you *data* about faults.
> Without it, you have no way to know about faults /as they
> occur/.

Well, there is evidence that increased temperature increases
the chance of errors.  More precisely, expect errors when you
operate DRAM close to the maximum allowed temperature.  The point
is that you can cause errors and that way test your recovery
strategy (untested recovery code is likely to fail when/if
it is needed).
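
The same idea can be tried in software before heating any chips.  The
following is only a sketch under stated assumptions: it injects
single-bit flips into a RAM working copy, uses Fletcher-16 as a
stand-in detector, and "recovers" by reloading from a master copy.
All function names are hypothetical, and a software flip is of course
not the same as a real DRAM upset; it only exercises the detection
and recovery path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fletcher-16 checksum (sums modulo 255). */
static uint16_t fletcher16(const uint8_t *p, size_t n)
{
    uint16_t a = 0, b = 0;
    while (n--) {
        a = (uint16_t)((a + *p++) % 255u);
        b = (uint16_t)((b + a) % 255u);
    }
    return (uint16_t)(((unsigned)b << 8) | a);
}

/* Recovery path under test: if the working copy no longer matches the
   expected checksum, reload it from the master copy.
   Returns 1 if a reload was needed, 0 otherwise. */
static int check_and_recover(uint8_t *work, const uint8_t *master,
                             size_t n, uint16_t expected)
{
    if (fletcher16(work, n) == expected)
        return 0;
    memcpy(work, master, n);
    return 1;
}

/* Fault injection: flip one bit of the working copy. */
static void inject_bit_flip(uint8_t *work, size_t bit_index)
{
    work[bit_index / 8] ^= (uint8_t)(1u << (bit_index % 8));
}

/* Flip every bit in turn and count the injected faults that were
   either not detected or not repaired. */
static size_t count_unrecovered(uint8_t *work, const uint8_t *master, size_t n)
{
    uint16_t expected = fletcher16(master, n);
    size_t missed = 0;

    memcpy(work, master, n);
    for (size_t bit = 0; bit < n * 8; bit++) {
        inject_bit_flip(work, bit);
        int detected = check_and_recover(work, master, n, expected);
        if (!detected || memcmp(work, master, n) != 0) {
            missed++;
            memcpy(work, master, n);   /* start the next trial clean */
        }
    }
    return missed;
}
```

A nonzero result from count_unrecovered would mean that the recovery
code has a hole, which is exactly what you want to learn on the bench
and not in the field.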

> Testing tries to address faults at different points in their
> lifespans.  Predictive Failure Analysis tries to alert to the
> likelihood of *impending* failures BEFORE they occur.  So,
> whatever remedial action you might take can happen BEFORE
> something has failed.  POST serves a similar role but tries to
> catch failures that have *occurred* before they can affect the
> operation of the device.  BIST gives the user a way of making
> that determination (or receiving reassurance) "on demand".
> Run time diagnostics address testing while the device wants
> to remain in operation.
> 
> What you *do* about a failure is up to you, your market and the
> expectations of your users.  If a battery fails in SOME of my
> UPSs, they simply won't power on (and, if the periodic run-time
> test is enabled, that test will cause them to unceremoniously
> power themselves OFF as they try to switch to battery power).
> Other UPSs will provide an alert (audible/visual/log message)
> of the fact but give me the option of continuing to POWER
> those devices in the absence of backup protection.
> 
> The latter is far more preferable to me as I can then decide
> when/if I want to replace the batteries without being forced
> to do so, *now*.
>
> The same is not true of smoke/CO detectors; when they detect
> a failed (failING) battery, they are increasingly annoying
> in their insistence that the problem be addressed, now.
> So much so, that it leads to deaths due to the detector
> being taken out of service to stop the damn bleating.
> 
> I have a great deal of latitude in how I handle failures.
> For example, I can busy-out more than 90% of the RAM in a device
> (if something suggested that it was unreliable) and *still*
> provide the functionality of that node -- by running the code
> on another node and leaving just the hardware drivers associated
> with *this* node in place.  So, I can alert a user that a
> particular device is in need of service -- yet, continue
> to provide the services that were associated with that device.
> IMO, this is the best of all possible "failure" scenarios;
> the worst being NOT knowing that something is misbehaving.

Good.

-- 
                              Waldek Hebisch