Path: ...!news.roellig-ltd.de!news.mb-net.net!open-news-network.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Fri, 18 Oct 2024 21:05:14 -0700
Organization: A noiseless patient Spider
Lines: 162
Message-ID: <vevb65$3mr5m$1@dont-email.me>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team>
 <veummc$3gbqs$1@dont-email.me> <vev7cu$1rdu5$1@paganini.bofh.team>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Oct 2024 06:05:27 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="89048b1778d1f63268abb85022497358";
	logging-data="3894454"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18AjDdTVHnAFIOpmTS0yEjv"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.2
Cancel-Lock: sha1:GnKBsqNKxeBjs076M6JQ4GLiXIQ=
In-Reply-To: <vev7cu$1rdu5$1@paganini.bofh.team>
Content-Language: en-US
Bytes: 8590

On 10/18/2024 8:00 PM, Waldek Hebisch wrote:
> Don Y <blockedofcourse@foo.invalid> wrote:
>> On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
>>> Don Y <blockedofcourse@foo.invalid> wrote:
>>>> There, runtime diagnostics are the only alternative for hardware
>>>> revalidation, PFA and diagnostics.
>>>>
>>>> How commonly are such mechanisms implemented?  And, how thoroughly?
>>>
>>> This is a strange question.  AFAIK automatically run diagnostics/checks
>>> are part of safety regulations.
>>
>> Not all devices are covered by "regulations".
> 
> Well, if the device matters then there is implied liability
> and nobody wants to admit doing a bad job.  If the device
> does not matter, then the answer to the original question
> also does not matter.

In the US, ANYTHING can result in a lawsuit.  But, "due diligence"
can insulate the manufacturer, to some extent.  No one ever
*admits* to "doing a bad job".

If your doorbell malfunctions, what "damages" are you going
to claim?  If your garage door doesn't open when commanded?
If your yard doesn't get watered?  If you weren't promptly
notified that the mail had just been delivered?  Or, that
the compressor in the freezer had failed and your foodstuffs
had spoiled, as a result?

The costs of litigation are reasonably high.  Lawyers want
to see the LIKELIHOOD of a big payout before entertaining
such litigation.

>> And, the *extent* to which testing is done is the subject
>> addressed; if I ensure "stuff" *WORKED* when the device was
>> powered on (preventing it from continuing on to its normal
>> functionality in the event that some failure was detected),
>> what assurance does that give me that the device's integrity
>> is still intact 8760 hours (1 yr) later?  720 hours
>> (1 mo)?  168 hours (1 wk)?  24 hours?  *1* hour????
> 
> What to test is really domain-specific.  Traditional thinking
> is that computer hardware is _much_ more reliable than
> software and software bugs are a major source of misbehaviour.

That hasn't been *proven*.  And, "misbehavior" is not the same
as *failure*.

> And among hardware failures, transient upsets, like a flipped
> bit, are more likely than permanent failure.  For example,

That used to be the thinking with DRAM but studies have shown
that *hard* failures are more common.  These *can* be found...
*if* you go looking for them!

E.g., if you load code into RAM (from FLASH) for execution,
are you sure the image *in* the RAM is the image from the FLASH?
What about "now"?  And "now"?!
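
A minimal sketch of what "going looking" might mean here, assuming the
expected CRC of the image is known (e.g., computed from the FLASH copy
at load time).  The names (image_check, image_check_step) and the slice
size are hypothetical; the idea is just to re-verify the RAM image a few
bytes at a time from an idle task or timer tick, forever:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); slow but table-free. */
uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* One-shot CRC of a whole buffer (e.g., the FLASH copy at load time). */
uint32_t crc32_image(const uint8_t *p, size_t n)
{
    return ~crc32_update(0xFFFFFFFFu, p, n);
}

enum check_result { CHECK_IN_PROGRESS, CHECK_PASS, CHECK_FAIL };

/* Hypothetical checker state: walks the RAM image one slice per call. */
struct image_check {
    const uint8_t *image;   /* start of the code image in RAM */
    size_t length;          /* image length in bytes */
    size_t slice;           /* bytes examined per call */
    uint32_t expected;      /* CRC of the FLASH copy (assumed known) */
    size_t offset;          /* progress through the current pass */
    uint32_t crc;           /* running CRC of the current pass */
};

void image_check_init(struct image_check *c, const uint8_t *image,
                      size_t length, size_t slice, uint32_t expected)
{
    c->image = image;
    c->length = length;
    c->slice = slice ? slice : 1;
    c->expected = expected;
    c->offset = 0;
    c->crc = 0xFFFFFFFFu;
}

/* Check the next slice; report PASS/FAIL at the end of each full pass,
 * then start over so the image is re-verified indefinitely. */
enum check_result image_check_step(struct image_check *c)
{
    size_t n = c->length - c->offset;
    if (n > c->slice)
        n = c->slice;
    c->crc = crc32_update(c->crc, c->image + c->offset, n);
    c->offset += n;
    if (c->offset < c->length)
        return CHECK_IN_PROGRESS;

    uint32_t final = ~c->crc;
    c->offset = 0;
    c->crc = 0xFFFFFFFFu;
    return (final == c->expected) ? CHECK_PASS : CHECK_FAIL;
}
```

Note that this only tells you the image *was* intact at some point
during the last pass; the slice size trades CPU time against how stale
that answer is.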

> at a low safety level you may assume that the hardware of a counter
> generating a PWM-ed signal works correctly, but you are
> supposed to periodically verify that configuration registers
> keep their expected values.

Why would you expect the registers to lose their settings?
Would you expect the CPU's registers to be similarly flaky?

>  IIUC crystal oscillators are likely to fail
> so you are supposed to regularly check for presence of the clock
> and its frequency (this assumes a hardware design with a backup
> clock).
> 
>>>   Even if some safety critical software
>>> does not contain them, nobody is going to admit violating regulations.
>>> And things like PLC-s are "dual use", they may be used in a non-safety
>>> role, but vendors claim compliance with safety standards.
>>
>> So, if a bit in a RAM in said device *dies* some time after power on,
>> is the device going to *know* that has happened?  And, signal its
>> unwillingness to continue operating?  What is going to detect that
>> failure?
> 
> I do not know how PLC manufacturers implement checks.  Small
> PLC-s are based on MCU-s with static parity protected RAM.
> This may be deemed adequate.  PLC-s work in cycles and some
> percentage of the cycle is dedicated to self-test.  So a big
> PLC may divide memory into smallish regions and in each
> cycle check a single region, walking through the whole memory.
> 
>> What if the bit's failure is inconsequential to the operation
>> of the device?  E.g., if the bit is part of some not-used
>> feature?  *Or*, if it has failed in the state it was *supposed*
>> to be in??!
> 
> I am afraid that usually an inconsequential failure gets
> promoted to complete failure.  Before 2000 checking showed
> that several BIOS-es "validated" the date and an "incorrect" (that
> is after 1999) date prevented boot.

If *a* failure resulted in a catastrophic failure, things would
be "acceptable" in that the user would KNOW that something is
wrong without the device having to tell them.

But, too often, faults can be "absorbed" or lead to unobservable
errors in operation.  What then?

Somewhere, I have a paper where the researchers simulated faults
*in* various OS kernels to see how "tolerant" the OS was of these
faults (which we know *happen*).  One would think that *any*
fault would cause a crash.  Yet, MANY faults are sufferable
(depending on the OS).

Consider, if a single bit error converts a "JUMP" to a "JUMP IF CARRY"
but the carry happens to be set, then there is no difference in the
execution path.  If that bit error converts a "saturday" into a
"sunday", then something that is intended to execute on weekdays (or
weekends) won't care.  Etc.

> Historically OS-es had a map of bad blocks on the disc and
> avoided allocating them.  In principle on a system with paging
> hardware the same could be done for DRAM, but I do not think
> anybody is doing this (if the domain is serious enough to worry
> about DRAM failures, then it probably has redundant independent
> computers with ECC DRAM).

Using ECC DRAM doesn't solve the problem.  If you see errors
reported by your ECC RAM (corrected errors), then when do
you decide you are seeing too many and losing confidence that
the ECC is actually *detecting* all multibit errors?
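
One way to at least make that decision explicit is a "leaky bucket" on
the corrected-error count.  This is only a sketch: the names are
hypothetical and the limit and leak rate are placeholders, since picking
them is exactly the open question above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical monitor: corrected errors fill the bucket, time drains it. */
struct ecc_monitor {
    uint32_t level;   /* current bucket level */
    uint32_t limit;   /* level at which the memory is deemed suspect */
    uint32_t leak;    /* amount drained per interval */
};

void ecc_monitor_init(struct ecc_monitor *m, uint32_t limit, uint32_t leak)
{
    m->level = 0;
    m->limit = limit;
    m->leak = leak;
}

/* Call once per interval with the number of corrected errors the ECC
 * hardware reported during that interval.  Returns true when the rate
 * of corrected errors suggests the memory should no longer be trusted. */
bool ecc_monitor_tick(struct ecc_monitor *m, uint32_t corrected)
{
    /* Drain first, so an occasional isolated upset is forgiven. */
    m->level = (m->level > m->leak) ? m->level - m->leak : 0;

    /* Add the new errors, saturating instead of wrapping. */
    if (corrected > UINT32_MAX - m->level)
        m->level = UINT32_MAX;
    else
        m->level += corrected;

    return m->level >= m->limit;
}
```

A sparse trickle of corrected errors never trips the limit; a sustained
rate (more suggestive of a hard fault) does.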

>> With a "good" POST design, you can reassure the user that the
>> device *appears* to be functional.  That the data/code stored in it
>> are intact (since last time they were accessed).  That the memory
>> is capable of storing any value that it is called on to preserve.
>> That the hardware I/Os can control and sense as intended, etc.
>>
>> /But, you have no guarantee that this condition will persist!/
>> If it WAS guaranteed to persist, then the simple way to make high
>> reliability devices would be just to /never turn them off/ to
>> take advantage of this "guarantee"!
> 
> Everything here is domain specific.  In a cheap MCU-based device the main
> source of failures is overvoltage/ESD on MCU pins.  This may
> kill the whole chip, in which case no software protection can
> help.  Or some pins fail; sometimes this may be detected by reading
> the appropriate port.  If you control an electric motor then you probably
> do not want to send test signals during normal motor operation.

That depends on HOW you generate your test signals, what the hardware
actually looks like and how sensitive the "mechanism" is to such
"disturbances".  Remember, "you" can see things faster than a mechanism
can often respond.  I.e., if applying power to the motor doesn't
result in an observable load current (or "micromotion"), then the
motor is likely not responding.
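
As a rough sketch of that idea (everything here is an assumption: the
motor_io hooks, the pulse width, the current thresholds, and the
simulated "hardware" that stands in for a real driver and current
sense), a pulse too short for the mechanism to follow can still be long
enough to see load current:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware hooks; a real design supplies its own. */
struct motor_io {
    void (*drive)(bool on);             /* energize / de-energize */
    uint16_t (*read_current_ma)(void);  /* load current sense, in mA */
    void (*delay_us)(uint32_t us);      /* short busy-wait */
};

/* Apply a brief pulse and report whether load current rose by at least
 * min_delta_ma.  The drive is always released before returning. */
bool motor_responds(const struct motor_io *io, uint32_t pulse_us,
                    uint16_t min_delta_ma)
{
    uint16_t idle = io->read_current_ma();
    io->drive(true);
    io->delay_us(pulse_us);
    uint16_t loaded = io->read_current_ma();
    io->drive(false);
    return loaded > idle && (uint16_t)(loaded - idle) >= min_delta_ma;
}

/* --- Simulated hardware, for illustration only --- */
static bool sim_on;          /* is the drive currently energized? */
static bool sim_connected;   /* is a working motor attached? */

static void sim_drive(bool on) { sim_on = on; }

static uint16_t sim_read_current_ma(void)
{
    /* 5 mA quiescent; 250 mA when a connected motor is driven. */
    return (sim_on && sim_connected) ? 250 : 5;
}

static void sim_delay_us(uint32_t us) { (void)us; }

static const struct motor_io sim_io = {
    sim_drive, sim_read_current_ma, sim_delay_us
};
```

With sim_connected set, motor_responds(&sim_io, 200, 50) reports true;
with it cleared (open winding, dead driver), it reports false.  Whether
a given mechanism tolerates such a pulse is, of course, something only
the actual hardware can answer.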

> But you are likely to have some feedback and can verify if feedback
> agrees with expected values.  If you get unexpected readings
> you probably will stop the motor.