Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Sat, 19 Oct 2024 09:55:35 -0700
Organization: A noiseless patient Spider
Lines: 197
Message-ID: <vf0oak$3v9qi$1@dont-email.me>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team>
 <veummc$3gbqs$1@dont-email.me> <vev7cu$1rdu5$1@paganini.bofh.team>
 <vevb65$3mr5m$1@dont-email.me> <vf0dkr$1t54l$1@paganini.bofh.team>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Oct 2024 18:55:49 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="89048b1778d1f63268abb85022497358";
	logging-data="4171602"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19R2sMN8jqOD7qI6i0oA07X"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.2
Cancel-Lock: sha1:IrUdUS875u8c6nnCPFlBoiqQPK0=
In-Reply-To: <vf0dkr$1t54l$1@paganini.bofh.team>
Content-Language: en-US
Bytes: 10603

On 10/19/2024 6:53 AM, Waldek Hebisch wrote:
>> In the US, ANYTHING can result in a lawsuit.  But, "due diligence"
>> can insulate the manufacturer, to some extent.  No one ever
>> *admits* to "doing a bad job".
>>
>> If your doorbell malfunctions, what "damages" are you going
>> to claim?  If your garage door doesn't open when commanded?
>> If your yard doesn't get watered?  If you weren't promptly
>> notified that the mail had just been delivered?  Or, that
>> the compressor in the freezer had failed and your foodstuffs
>> had spoiled, as a result?
>>
>> The costs of litigation are reasonably high.  Lawyers want
>> to see the LIKELIHOOD of a big payout before entertaining
>> such litigation.
> 
> Each item above may contribute to a significant loss.  And

Significant loss?  From a doorbell failing to ring?  Are you
sure YOUR doorbell has rung EVERY time someone pressed the button?

> there could be a push to litigation (say, by a consumer advocacy
> group), basically to establish a precedent.  So, better to have a
> record of due diligence.

But things can *still* "fail to perform".  That's the whole point of
runtime diagnostics: to notice a failure that the user may NOT!
If you can take remedial action, then you have a notification that
it is needed.  If this requires the assistance of the user, then
you can REQUEST that.  If you can offload some of the responsibilities
of the device (something that I can do, dynamically), then you
can elect to do so.  If you can do nothing to keep the device in
service, then you can alert the user of the need for replacement.

*NOT* knowing of a fault means you gleefully keep operating as
if everything was fine.

>>>> And, the *extent* to which testing is done is the subject
>>>> addressed; if I ensure "stuff" *WORKED* when the device was
>>>> powered on (preventing it from continuing on to its normal
>>>> functionality in the event that some failure was detected),
>>>> what assurance does that give me that the device's integrity
>>>> is still intact 8760 hours (1 yr) later?  720 hours
>>>> (1 mo)?  168 hours (1 wk)?  24 hours?  *1* hour????
>>>
>>> What to test is really domain-specific.  Traditional thinking
>>> is that computer hardware is _much_ more reliable than
>>> software and that software bugs are a major source of misbehaviour.
>>
>> That hasn't been *proven*.  And, "misbehavior" is not the same
>> as *failure*.
> 
> First, I mean relevant hardware, that is, hardware inside an MCU.
> I think that there are strong arguments that such hardware is
> more reliable than software.  I have seen a claim, based on analysis
> of discovered failures, that software written to rigorous development
> standards exhibits on average about 1 bug (that leads to failure) per
> 1000 lines of code.  This means that even a small MCU has enough
> code to harbor a handful of bugs.  And for bigger systems it gets worse.

But bugs need not be consequential.  They may be undesirable or
even annoying but need not have associated "costs".

I have a cassette deck (Nakamichi Dragon) that has a design flaw.
When the tape reaches the end of side "A", it is supposed to
autoreverse and play the "back side".  So, the revolutions
counter counts *up* while playing side A and then back down
while playing side B.

However, if you eject the tape just as side A finishes and
physically flip it over (so side B is the "front" side), then
press FORWARD PLAY (the direction the reels were moving while
the tape counter was counting UP), the tape will move FORWARD
but the counter will count backwards.

If you had removed the tape and placed some OTHER tape in
the mechanism, the same behavior results (obviously) -- moving
forward but counting backwards.  If you turn the deck OFF
and then back ON, the tape counter moves correctly.

How am I harmed by this?  To what monetary extent?  It's
a race in the hardware & software (the tape counter is
implemented in a separate MCU).  I can avoid the problem
by NOT ejecting the tape just after the completion of
side A...

>>> And among hardware failures, transient upsets, like a flipped
>>> bit, are more likely than permanent failures.  For example,
>>
>> That used to be the thinking with DRAM but studies have shown
>> that *hard* failures are more common.  These *can* be found...
>> *if* you go looking for them!
> 
> In another place I wrote that one of the studies I saw claimed that
> a significant number of the errors they detected (they monitored
> changes to a memory area that was supposed to be unmodified) were
> due to buggy software.  And DRAM is special.

If you have memory protection hardware (I do), then such changes
can't casually occur; the software has to make a deliberate
attempt to tell the memory controller to allow such a change.
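
A sketch of what "deliberate" means here, using POSIX mprotect() as a
stand-in for the memory controller (on an MCU the same idea maps to
vendor-specific MPU region registers; the names are illustrative):

/* Region is kept read-only; a write takes an explicit unlock/relock
   pair around the one sanctioned change.  'page' must be
   page-aligned. */
#include <stddef.h>
#include <sys/mman.h>

int protected_write(void *page, size_t len, void (*mutate)(void *))
{
    if (mprotect(page, len, PROT_READ | PROT_WRITE) != 0)
        return -1;                          /* unlock refused */
    mutate(page);                           /* the sanctioned change */
    return mprotect(page, len, PROT_READ);  /* relock read-only */
}

A stray pointer in buggy code trips the protection instead of
silently corrupting the "unmodified" region.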

>> E.g., if you load code into RAM (from FLASH) for execution,
>> are you sure the image *in* the RAM is the image from the FLASH?
>> What about "now"?  And "now"?!
> 
> You are supposed to regularly verify a sufficiently strong checksum.

Really?  Wanna bet that doesn't happen?  How many Linux-based devices
load applications and start a process to continuously verify the
integrity of the TEXT segment?

What are they going to do if they notice a discrepancy?  Reload
the application and hope it avoids any "soft spots" in memory?
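
For the record, the check itself is cheap to write.  A sketch,
assuming the linker script exports __text_start/__text_end (names
hypothetical) and some crc32() routine is available:

#include <stddef.h>
#include <stdint.h>

extern const uint8_t __text_start[], __text_end[];
extern uint32_t crc32(const void *buf, size_t len);

static uint32_t text_crc_at_boot;

void text_check_init(void)          /* call once, after load */
{
    text_crc_at_boot = crc32(__text_start,
                             (size_t)(__text_end - __text_start));
}

int text_check_run(void)            /* call from an idle task */
{
    return crc32(__text_start,
                 (size_t)(__text_end - __text_start))
        != text_crc_at_boot;        /* nonzero on mismatch */
}

The hard part isn't the code; it's deciding on the remedial action
when it returns nonzero.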

>>> at low safety level you may assume that hardware of a counter
>>> generating PWM-ed signal works correctly, but you are
>>> supposed to periodically verify that configuration registers
>>> keep expected values.
>>
>> Why would you expect the registers to lose their settings?
>> Would you expect the CPU's registers to be similarly flaky?
> 
> First, such checking is not my idea, but one point from a checklist
> for low-safety devices.  Registers may change due to bugs, EMC
> events, cosmic rays and the like.

Then you are dealing with high reliability designs.  Do you
really think my microwave oven, stove, furnace, telephone,
etc. are designed to be resilient to those types of faults?
Do you think the user could detect such an occurrence?
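
Not that the check itself is hard -- a sketch, keeping hypothetical
(address, mask, expected) entries in a table:

#include <stdint.h>

struct reg_check {
    volatile const uint32_t *addr;  /* register to verify */
    uint32_t mask;                  /* bits we care about */
    uint32_t expected;              /* value written at init */
};

/* Call periodically; returns index of first mismatch, or -1. */
int regs_verify(const struct reg_check *tbl, int n)
{
    for (int i = 0; i < n; i++)
        if ((*tbl[i].addr & tbl[i].mask) != tbl[i].expected)
            return i;
    return -1;
}

The question is whether a microwave oven's market justifies even
that much -- and what the oven would *do* on a mismatch.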

>>> Historically, OSes had a map of bad blocks on the disc and
>>> avoided allocating them.  In principle, on a system with paging
>>> hardware the same could be done for DRAM, but I do not think
>>> anybody is doing this (if the domain is serious enough to worry
>>> about DRAM failures, then it probably has redundant independent
>>> computers with ECC DRAM).
>>
>> Using ECC DRAM doesn't solve the problem.  If you see errors
>> reported by your ECC RAM (corrected errors), then when do
>> you decide you are seeing too many and losing confidence that
>> the ECC is actually *detecting* all multibit errors?
> 
> ECC is part of a solution.  It may reduce the probability of error
> enough that you consider the residual errors not serious.  And if
> you really care, you may try to increase the error rate (say, by
> putting RAM chips at increased temperature) and test that your
> detection and recovery strategy works OK.

Studies suggest that temperature doesn't play the role that
was suspected.  What ECC does is give you *data* about faults.
Without it, you have no way to know about faults /as they
occur/.
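
A sketch of turning that data into a decision -- the window and
threshold here are invented; a real policy needs field data:

#include <stdint.h>

#define ECC_WINDOW_SECS   (24u * 60u * 60u)  /* one day */
#define ECC_MAX_CORRECTED 10u                /* arbitrary threshold */

extern uint32_t uptime_secs(void);  /* assumed to exist */

static uint32_t window_start;
static uint32_t corrected_in_window;

/* Call from the "corrected ECC error" handler.  Returns nonzero
   when the correction rate suggests the part is degrading -- and
   confidence that multibit errors are still *detected* should
   drop. */
int ecc_corrected_event(void)
{
    uint32_t now = uptime_secs();
    if (now - window_start >= ECC_WINDOW_SECS) {
        window_start = now;
        corrected_in_window = 0;
    }
    return ++corrected_in_window > ECC_MAX_CORRECTED;
}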

Testing tries to address faults at different points in their
lifespans.  Predictive Failure Analysis tries to alert to the
likelihood of *impending* failures BEFORE they occur.  So,
whatever remedial action you might take can happen BEFORE
something has failed.  POST serves a similar role but tries to
catch failures that have *occurred* before they can affect the
operation of the device.  BIST gives the user a way of making
that determination (or receiving reassurance) "on demand".
Run-time diagnostics address testing while the device remains
in operation.

What you *do* about a failure is up to you, your market and the
expectations of your users.  If a battery fails in SOME of my
UPSs, they simply won't power on (and, if the periodic run-time
test is enabled, that test will cause them to unceremoniously
power themselves OFF as they try to switch to battery power).
Other UPSs will provide an alert (audible/visual/log message)
of the fact but give me the option of continuing to POWER
those devices in the absence of backup protection.

The latter is far preferable to me as I can then decide
========== REMAINDER OF ARTICLE TRUNCATED ==========