Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Fri, 18 Oct 2024 21:17:21 -0700
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <vevbss$3mr5m$2@dont-email.me>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team>
 <77k5hjprfq0ipjp6pcdd03lnph1i76ssuu@4ax.com> <veunj9$3gbqs$2@dont-email.me>
 <vev398$1r4v5$2@paganini.bofh.team> <vev635$3mf56$1@dont-email.me>
 <vevag2$1ricg$1@paganini.bofh.team>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Oct 2024 06:17:33 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="89048b1778d1f63268abb85022497358";
	logging-data="3894454"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18YbfBe4X3PkxPeLIeHGVFN"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.2
Cancel-Lock: sha1:n9OaHixBFeGrzGt13Pj+sWWSsuc=
Content-Language: en-US
In-Reply-To: <vevag2$1ricg$1@paganini.bofh.team>
Bytes: 6075

On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
>> One of the FETs that controls the shifting of the automatic
>> transmission has failed open.  How do you detect that /and recover
>> from it/?
> 
> Detecting such thing looks easy.  Recovery is tricky, because
> if you have spare FET and activate it there is good chance that
> it will fail due to the same reason that the first FET failed.
> OTOH, if you have a properly designed circuit around the FET,
> disturbance strong enough to kill the FET is likely to kill
> the controller too.

The immediate goal is to *detect* that a problem exists.
If you can't detect, then attempting to recover is a moot point.
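
As a minimal sketch of what "detect" could mean for the FET case,
assuming the driver stage offers some current-sense feedback (the
thread does not say it does), the commanded state can be compared
with the sensed load current.  The names and thresholds below are
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Outcome of comparing the commanded state with the sensed current. */
typedef enum { DRV_OK, DRV_OPEN, DRV_SHORT } drv_fault;

/*
 * commanded_on : state the controller is currently requesting
 * sense_ma     : measured load current in mA (assumed feedback path)
 * min_on_ma    : least current expected when the FET conducts
 * max_off_ma   : most leakage tolerated when the FET is off
 */
drv_fault drv_check(bool commanded_on, uint16_t sense_ma,
                    uint16_t min_on_ma, uint16_t max_off_ma)
{
    if (commanded_on && sense_ma < min_on_ma)
        return DRV_OPEN;   /* asked for current, none is flowing */
    if (!commanded_on && sense_ma > max_off_ma)
        return DRV_SHORT;  /* asked for none, current is flowing */
    return DRV_OK;
}
```

Note that a check like this cannot tell an open FET from an open
load or broken wiring; it only reports that the output stage is not
doing what was commanded.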

>> The camera/LIDAR that the self-drive feature uses is providing
>> incorrect data...  etc.
> 
> Use 3 (or more) and voting.  Of course, this increases cost and one
> has to judge if increase of cost is worth increase in safety

As well as the reliability of the additional "voting logic" itself.
And if the inputs are not a set of binary signals, determining which
value is the *correct* one can be problematic.
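
To illustrate the non-binary case, here is a minimal sketch of a
median-of-three voter for analog readings.  It is one common
approach, not something proposed in this thread; the tolerance
`tol` and all names are assumptions.  It also shows the problem:
when no two readings agree, there is no "correct" value to pick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of one voting round over three redundant readings. */
typedef struct {
    int32_t value;    /* median of the three readings */
    int     suspect;  /* index (0..2) of a lone outlier, else -1 */
    bool    valid;    /* false if no two readings agree within tol */
} vote3_result;

static int64_t absdiff(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;  /* widen to avoid overflow */
    return d < 0 ? -d : d;
}

vote3_result vote3(const int32_t r[3], int32_t tol)
{
    int32_t a = r[0], b = r[1], c = r[2];
    /* Median by exhaustive comparison. */
    int32_t med = (a > b) ? ((b > c) ? b : (a > c ? c : a))
                          : ((a > c) ? a : (b > c ? c : b));
    vote3_result out = { med, -1, true };
    int outliers = 0;

    for (int i = 0; i < 3; i++) {
        if (absdiff(r[i], med) > tol) {
            outliers++;
            out.suspect = i;
        }
    }
    if (outliers > 1) {   /* all three disagree: nothing to trust */
        out.valid = false;
        out.suspect = -1;
    }
    return out;
}
```

With readings {100, 102, 250} and a tolerance of 5 this returns the
median 102 and marks index 2 as suspect; with {0, 50, 100} and a
tolerance of 10 it reports the vote as invalid.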

> (in self-driving car using multiple sensors looks like no-brainer,
> but if this is just an assist to increase driver comfort then
> result may be different).

It is different only in the sense of liability and exposure to
loss.  I am not assigning values to those consequences but,
rather, looking to address the issue of run-time testing, in
general.

Even if NONE of the failures can result in injury or loss,
it is unlikely that a user WANTS to have a defective product.
If the user is technically unable to determine when the
product is "at fault" (vs. his own misunderstanding of how it
is *supposed* to work), then those failures contribute to
the user's frustration with the product.

>> There are innumerable failures that can occur to compromise
>> the "system" and no *easy*/inexpensive/reliable way to detect
>> and recover from *all* of them.
> 
> Sure.  But for common failures or serious failures having non-negligible
> probability redundancy may offer cheap way to increase reliability.
> 
>>> For critical functions a car could have 3 processors with
>>> voting circuitry.  With separate chips this would be more expensive
>>> than single processor, but increase of cost probably would be
>>> negligible compared to cost of the whole car.  And when integrated
>>> on a single chip cost difference would be tiny.
>>>
>>> IIUC car controller may "reboot" during a ride.  Instead of
>>> rebooting it could hand work over to a backup controller.
>>
>> How do you know the circuitry (and other mechanisms) that
>> implement this hand-over are operational?
> 
> It does not matter if handover _always_ works.  What matters is
> if system with handover has lower chance of failure than
> system without handover.  Having statistics of actual failures
> (which I do not have but manufacturers should have) and
> after some testing one can estimate failure probability of
> different designs and possibly decide to use handover.

Again, I am not interested in "recovery" as that varies with
the application and risk assessment.  What I want to concentrate
on is reliably *detecting* faults before they lead to product
failures.

I contend that the hardware in many devices has that capability
(to some extent) but that it is underutilized; that the issue
of detecting faults *after* POST is one that doesn't see much
attention.  The likely thinking being that POST will flag it the
next time the device is restarted.

And, that's not acceptable in long-running devices.
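
One way to repeat a POST-style check while the device keeps running
is to spread it over many short slices.  The sketch below re-checks
a CRC-32 over a memory image a few bytes per call; the image
location, its expected CRC, and the slice size are all assumptions,
and the bitwise CRC is chosen for brevity, not speed:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ROM_CHECK_BUSY, ROM_CHECK_PASS, ROM_CHECK_FAIL }
    rom_check_status;

/* State carried between slices of the background check. */
typedef struct {
    const uint8_t *base;  /* start of the image to verify */
    size_t   len;         /* image length in bytes */
    size_t   pos;         /* next byte to process */
    uint32_t crc;         /* running CRC-32 state */
    uint32_t expected;    /* CRC recorded at build time (assumed) */
} rom_check;

void rom_check_init(rom_check *c, const uint8_t *base, size_t len,
                    uint32_t expected)
{
    c->base = base;
    c->len = len;
    c->pos = 0;
    c->crc = 0xFFFFFFFFu;
    c->expected = expected;
}

/* Process at most max_bytes, then return so other work can run. */
rom_check_status rom_check_step(rom_check *c, size_t max_bytes)
{
    size_t end = (c->len - c->pos < max_bytes) ? c->len
                                               : c->pos + max_bytes;
    for (; c->pos < end; c->pos++) {
        c->crc ^= c->base[c->pos];
        for (int k = 0; k < 8; k++)  /* reflected poly 0xEDB88320 */
            c->crc = (c->crc >> 1)
                   ^ (0xEDB88320u & (0u - (c->crc & 1u)));
    }
    if (c->pos < c->len)
        return ROM_CHECK_BUSY;

    bool ok = (c->crc ^ 0xFFFFFFFFu) == c->expected;
    c->pos = 0;               /* restart for the next full pass */
    c->crc = 0xFFFFFFFFu;
    return ok ? ROM_CHECK_PASS : ROM_CHECK_FAIL;
}
```

Calling rom_check_step() from an idle loop or a low-priority task
gives a full pass every len/max_bytes calls, so a corrupted image is
reported within one pass instead of at the next restart.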

>> It is VERY difficult to design reliable systems.  I am not
>> attempting that.  Rather, I am trying to address the fact that
>> the reassurances of POST (and, at the user's prerogative, BIST)
>> are not guaranteed when a device runs "for long periods of time".
> 
> You may have tests essentially as part of normal operation.

I suspect most folks have designed devices with UARTs.  And,
having written a driver for it, have noted that framing, parity
and overrun errors are possible.

Ask yourself how many of those systems ever *use* that information!
Is there even a means of propagating it up out of the driver?
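
As a sketch of what "propagating it up" might look like: the driver
keeps per-error counters and exposes a simple health query to the
layers above.  The status-bit layout here is hypothetical (real
UARTs differ), as are the names and the per-1000 threshold:

```c
#include <stdint.h>

/* Hypothetical status bits; map these to the actual device. */
#define UART_ST_FRAMING  (1u << 0)
#define UART_ST_PARITY   (1u << 1)
#define UART_ST_OVERRUN  (1u << 2)
#define UART_ST_ANY_ERR  (UART_ST_FRAMING | UART_ST_PARITY | \
                          UART_ST_OVERRUN)

/* Counters the driver maintains instead of discarding the flags. */
typedef struct {
    uint32_t rx_ok;    /* characters received without error */
    uint32_t rx_bad;   /* characters with at least one error */
    uint32_t framing;
    uint32_t parity;
    uint32_t overrun;
} uart_stats;

/* Called by the receive path with the status for each character. */
void uart_note_status(uart_stats *s, uint32_t status)
{
    if ((status & UART_ST_ANY_ERR) == 0) {
        s->rx_ok++;
        return;
    }
    s->rx_bad++;
    if (status & UART_ST_FRAMING) s->framing++;
    if (status & UART_ST_PARITY)  s->parity++;
    if (status & UART_ST_OVERRUN) s->overrun++;
}

/* Nonzero if more than max_bad_per_1000 of every 1000 characters
 * seen so far arrived with an error. */
int uart_link_suspect(const uart_stats *s, uint32_t max_bad_per_1000)
{
    uint64_t total = (uint64_t)s->rx_ok + s->rx_bad;
    if (total == 0)
        return 0;
    return (uint64_t)s->rx_bad * 1000u
         > (uint64_t)max_bad_per_1000 * total;
}
```

A persistent rise in framing or parity errors points at the line or
the clocking, while overruns point at the software not keeping up;
keeping the counters separate preserves that distinction for
whatever diagnostic layer consumes them.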

> Of course, if you have single-tasked design with a task which
> must be "always" ready to respond, then running test becomes
> more complicated.  But in most designs you can spare enough
> time slots to run tests during normal operation.  Tests may
> interfere with normal operation, but here we are in domain
> specific territory: sometimes result of operation give enough
> assurance that device is operating correctly.  And if testing
> for correct operation is impossible, then there is nothing to
> do, I certainly do not promise to deliver impossible.