From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Sat, 19 Oct 2024 13:53:33 -0000 (UTC)

Don Y wrote:
> On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
>> Don Y wrote:
>>> On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
>>>> Don Y wrote:
>>>>> There, runtime diagnostics are the only alternative for hardware
>>>>> revalidation, PFA and diagnostics.
>>>>>
>>>>> How commonly are such mechanisms implemented?  And, how thoroughly?
>>>>
>>>> This is a strange question.  AFAIK automatically run diagnostics/checks
>>>> are part of safety regulations.
>>>
>>> Not all devices are covered by "regulations".
>>
>> Well, if a device matters then there is implied liability
>> and nobody wants to admit to doing a bad job.  If a device
>> does not matter, then the answer to the original question
>> also does not matter.
>
> In the US, ANYTHING can result in a lawsuit.  But, "due diligence"
> can insulate the manufacturer, to some extent.  No one ever
> *admits* to "doing a bad job".
>
> If your doorbell malfunctions, what "damages" are you going
> to claim?  If your garage door doesn't open when commanded?
> If your yard doesn't get watered?  If you weren't promptly
> notified that the mail had just been delivered?  Or, that
> the compressor in the freezer had failed and your foodstuffs
> had spoiled, as a result?
>
> The costs of litigation are reasonably high.  Lawyers want
> to see the LIKELIHOOD of a big payout before entertaining
> such litigation.

Each item above may contribute to a significant loss.  And there
could be a push to litigation (say by a consumer advocacy group)
basically to establish a precedent.  So, better to have a record
of due diligence.

>>> And, the *extent* to which testing is done is the subject
>>> addressed; if I ensure "stuff" *WORKED* when the device was
>>> powered on (preventing it from continuing on to its normal
>>> functionality in the event that some failure was detected),
>>> what assurance does that give me that the device's integrity
>>> is still intact 8760 hours (1 yr) later?  720 hours
>>> (1 mo)?  168 hours (1 wk)?  24 hours?  *1* hour????
>>
>> What to test is really domain-specific.  Traditional thinking
>> is that computer hardware is _much_ more reliable than
>> software and software bugs are a major source of misbehaviour.
>
> That hasn't been *proven*.  And, "misbehavior" is not the same
> as *failure*.

First, I mean relevant hardware, that is hardware inside an MCU.
I think that there are strong arguments that such hardware is more
reliable than software.  I have seen a claim, based on analysis of
discovered failures, that software written to rigorous development
standards exhibits on average about 1 bug (one that leads to failure)
per 1000 lines of code.  This means that even a small MCU holds
enough code for a handful of bugs.  And for bigger systems it gets
worse.

>> And among hardware failures transient upsets, like flipped
>> bits, are more likely than permanent failures.  For example,
>
> That used to be the thinking with DRAM but studies have shown
> that *hard* failures are more common.  These *can* be found...
> *if* you go looking for them!

In another place I wrote that one of the studies I saw claimed that
a significant number of the errors they detected (they monitored
changes to a memory area that was supposed to stay unmodified) were
due to buggy software.  And DRAM is special.

> E.g., if you load code into RAM (from FLASH) for execution,
> are you sure the image *in* the RAM is the image from the FLASH?
> What about "now"?  And "now"?!

You are supposed to regularly verify a sufficiently strong checksum.
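For illustration, a minimal sketch of such a periodic check in C,
assuming the RAM-resident image sits between two linker-provided
symbols and that a reference CRC is captured right after the copy
from flash (the symbol names, the CRC choice and the call frequency
are assumptions, not anything from this thread):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

extern const uint8_t __ram_code_start[];   /* hypothetical linker symbols   */
extern const uint8_t __ram_code_end[];     /* bracketing the RAM code image */

static uint32_t ram_code_ref_crc;          /* reference value, set once     */

/* Plain bitwise CRC-32 (polynomial 0xEDB88320); slow but small. */
static uint32_t crc32(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Call once, right after copying the code from flash into RAM. */
void ram_code_crc_init(void)
{
    ram_code_ref_crc = crc32(__ram_code_start,
                             (size_t)(__ram_code_end - __ram_code_start));
}

/* Call periodically (idle loop, low-priority task, ...).  Returns false
 * when the image no longer matches; the caller decides whether that
 * means re-copy, safe state or reset. */
bool ram_code_crc_check(void)
{
    return crc32(__ram_code_start,
                 (size_t)(__ram_code_end - __ram_code_start))
           == ram_code_ref_crc;
}

The same routine can obviously cover constant data as well; the point
is that the check keeps running, not that it runs once at power-on.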
>> at a low safety level you may assume that the hardware of a counter
>> generating a PWM-ed signal works correctly, but you are
>> supposed to periodically verify that configuration registers
>> keep expected values.
>
> Why would you expect the registers to lose their settings?
> Would you expect the CPU's registers to be similarly flakey?

First, such checking is not my idea, but one point from a checklist
for low-safety devices.  Registers may change due to bugs, EMC
events, cosmic rays and similar.

>> Historically OS-es had a map of bad blocks on the disc and
>> avoided allocating them.  In principle on a system with paging
>> hardware the same could be done for DRAM, but I do not think
>> anybody is doing this (if the domain is serious enough to worry
>> about DRAM failures, then it probably has redundant independent
>> computers with ECC DRAM).
>
> Using ECC DRAM doesn't solve the problem.  If you see errors
> reported by your ECC RAM (corrected errors), then when do
> you decide you are seeing too many and losing confidence that
> the ECC is actually *detecting* all multibit errors?

ECC is part of a solution.  It may reduce the probability of error
to the point where you consider it not serious enough to matter.
And if you really care, you may try to increase the error rate
(say by putting the RAM chips at an increased temperature) and
test that your detection and recovery strategy works OK.

--
Waldek Hebisch
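To make the "how many corrected errors is too many" bookkeeping
concrete, a minimal sketch in the same spirit, counting corrected
errors per time window against a limit (the hook name, the window
length and the limit are arbitrary choices for illustration, not
anything from the thread):

#include <stdint.h>
#include <stdbool.h>

#define ECC_WINDOW_SECONDS  (24u * 60u * 60u)   /* one-day window        */
#define ECC_MAX_CORRECTED   16u                 /* arbitrary example cap */

static uint32_t window_start;          /* seconds from a monotonic clock */
static uint32_t corrected_in_window;
static bool     ecc_confidence_lost;

/* Called from whatever "corrected error" notification the memory
 * controller provides (interrupt, scrubber log, ...). */
void ecc_corrected_error_hook(uint32_t now_seconds)
{
    if (now_seconds - window_start >= ECC_WINDOW_SECONDS) {
        window_start = now_seconds;     /* start a fresh window */
        corrected_in_window = 0;
    }
    if (++corrected_in_window > ECC_MAX_CORRECTED)
        ecc_confidence_lost = true;     /* escalate: log, degrade, reset */
}

bool ecc_ram_still_trusted(void)
{
    return !ecc_confidence_lost;
}

Whether losing trust should mean logging, a degraded mode or a forced
restart is exactly the policy question raised above.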