Article <vfe1g0$3csph$1@paganini.bofh.team>

Path: ...!weretis.net!feeder8.news.weretis.net!newsfeed.bofh.team!paganini.bofh.team!not-for-mail
From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Thu, 24 Oct 2024 17:52:02 -0000 (UTC)
Organization: To protect and to server
Message-ID: <vfe1g0$3csph$1@paganini.bofh.team>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team> <77k5hjprfq0ipjp6pcdd03lnph1i76ssuu@4ax.com> <veunj9$3gbqs$2@dont-email.me> <vev398$1r4v5$2@paganini.bofh.team> <vev635$3mf56$1@dont-email.me> <vevag2$1ricg$1@paganini.bofh.team> <vevbss$3mr5m$2@dont-email.me>
Injection-Date: Thu, 24 Oct 2024 17:52:02 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="3568433"; posting-host="WwiNTD3IIceGeoS5hCc4+A.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
X-Notice: Filtered by postfilter v. 0.9.3
Bytes: 6789
Lines: 124

Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
>>> One of the FETs that controls the shifting of the automatic
>>> transmission has failed open.  How do you detect that /and recover
>>> from it/?
>> 
>> Detecting such a thing looks easy.  Recovery is tricky, because
>> if you have a spare FET and activate it there is a good chance that
>> it will fail for the same reason that the first FET failed.
>> OTOH, if you have a properly designed circuit around the FET,
>> a disturbance strong enough to kill the FET is likely to kill
>> the controller too.
> 
> The immediate goal is to *detect* that a problem exists.
> If you can't detect, then attempting to recover is a moot point.

In a car you have signals from the wheels and the engine; you can use
those to compute the transmission ratio and check whether it is the
expected one.  Or simply have extra inputs which monitor the FET output.
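
A minimal sketch of the ratio check, in Python for brevity.  The gear
table, the tolerance and the low-speed cutoff are made-up values, and
the function names are hypothetical; a real controller would also have
to allow for converter slip and for the time a shift takes.

```python
# Hypothetical gear table: ratio = engine_rpm / wheel_rpm (final drive included).
GEAR_RATIOS = {1: 13.0, 2: 7.5, 3: 5.0, 4: 3.7, 5: 3.0}
TOLERANCE = 0.08      # allowed relative deviation (assumed value)
MIN_WHEEL_RPM = 30.0  # below this the ratio is too noisy to judge (assumed value)


def observed_ratio(engine_rpm, wheel_rpm):
    """Return engine/wheel speed ratio, or None when the car is nearly stopped."""
    if wheel_rpm < MIN_WHEEL_RPM:
        return None
    return engine_rpm / wheel_rpm


def gear_fault(commanded_gear, engine_rpm, wheel_rpm):
    """Return True if the measured ratio does not match the commanded gear."""
    ratio = observed_ratio(engine_rpm, wheel_rpm)
    if ratio is None:
        return False  # cannot judge at this speed, so report no fault
    expected = GEAR_RATIOS[commanded_gear]
    return abs(ratio - expected) / expected > TOLERANCE
```

If the controller commanded gear 3 but the shift FET failed open and
the box stayed in gear 2, the measured ratio stays near 7.5 instead of
5.0 and gear_fault() reports it.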
> 
>>> The camera/LIDAR that the self-drive feature uses is providing
>>> incorrect data...  etc.
>> 
>> Use 3 (or more) and voting.  Of course, this increases cost and one
>> has to judge whether the increase in cost is worth the increase in safety
> 
> As well as the reliability of the additional "voting logic".
> If not a set of binary signals, determining what the *correct*
> signal may be can be problematic.

Matching images is now a standard technology.  And in this case the
"voting logic" is likely to be software, so the main trouble is
possible bugs.
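
For scalar readings (say, distance to an obstacle as reported by three
independent sensors) software voting can be as small as the sketch
below.  This is only an illustration under that assumption: the name
vote() and the max_spread threshold are hypothetical, and matching
whole images is a much bigger job than this.

```python
def vote(readings, max_spread):
    """Median-vote over three readings.

    Returns (value, suspects): the median, plus the indices of sensors
    whose reading differs from the median by more than max_spread.
    """
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    median = sorted(readings)[1]
    suspects = [i for i, r in enumerate(readings)
                if abs(r - median) > max_spread]
    return median, suspects
```

The median tolerates one arbitrarily wrong sensor, and the suspects
list is the diagnostic part: it says which sensor disagreed, which
can be logged or reported long before a second one fails.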

>> (in self-driving car using multiple sensors looks like no-brainer,
>> but if this is just an assist to increase driver comfort then
>> result may be different).
> 
> It is different only in the sense of liability and exposure to
> loss.  I am not assigning values to those consequences but,
> rather, looking to address the issue of run-time testing, in
> general.

I doubt there are general solutions.  Various parts of your system
may have enough common features to allow a single strategy
within your system.  But it is unlikely to generalize to other
systems.  To put it differently, there are probabilities
of various events and associated costs.  Even if you
refuse to quantify the probabilities and costs, your design
decisions (assuming they are rational) will imply some
estimate of them.
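
To make the "probabilities and costs" point concrete, here is a toy
comparison of two designs.  All numbers are invented for illustration
and expected_cost() is a hypothetical helper, not anybody's actual
method.

```python
def expected_cost(unit_cost, failure_modes):
    """Unit cost plus the sum of probability * consequence cost.

    failure_modes is a list of (probability, consequence_cost) pairs,
    with the probability taken over the product lifetime.
    """
    return unit_cost + sum(p * c for p, c in failure_modes)


# Invented numbers: redundancy adds 20 to the unit cost and cuts the
# probability of a failure that costs 5000 from 1% to 0.1%.
plain = expected_cost(100.0, [(0.01, 5000.0)])
redundant = expected_cost(120.0, [(0.001, 5000.0)])
```

With these numbers the redundant design comes out cheaper (about 125
versus about 150).  Choosing the plain design anyway amounts to
claiming that the failure is rarer or cheaper than assumed here.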

> Even if NONE of the failures can result in injury or loss,
> it is unlikely that a user WANTS to have a defective product.
> If the user is technically unable to determine when the
> product is "at fault" (vs. his own misunderstanding of how it
> is *supposed* to work), then those failures contribute to
> the users' frustrations with the product.
> 
>>> There are innumerable failures that can occur to compromise
>>> the "system" and no *easy*/inexpensive/reliable way to detect
>>> and recover from *all* of them.
>> 
>> Sure.  But for common failures, or serious failures having non-negligible
>> probability, redundancy may offer a cheap way to increase reliability.
>> 
>>>> For critical functions a car could have 3 processors with
>>>> voting circuitry.  With separate chips this would be more expensive
>>>> than single processor, but increase of cost probably would be
>>>> negligible compared to cost of the whole car.  And when integrated
>>>> on a single chip cost difference would be tiny.
>>>>
>>>> IIUC car controller may "reboot" during a ride.  Instead of
>>>> rebooting it could hand work over to a backup controller.
>>>
>>> How do you know the circuitry (and other mechanisms) that
>>> implement this hand-over are operational?
>> 
>> It does not matter if handover _always_ works.  What matters is
>> whether a system with handover has a lower chance of failure than
>> a system without handover.  Having statistics of actual failures
>> (which I do not have but manufacturers should have) and
>> after some testing one can estimate the failure probability of
>> different designs and possibly decide to use handover.
> 
> Again, I am not interested in "recovery" as that varies with
> the application and risk assessment.  What I want to concentrate
> on is reliably *detecting* faults before they lead to product
> failures.
> 
> I contend that the hardware in many devices has that capability
> (to some extent) but that it is underutilized; that the issue
> of detecting faults *after* POST is one that doesn't see much
> attention.  The likely thinking being that POST will flag it the
> next time the device is restarted.
> 
> And, that's not acceptable in long-running devices.

Well, you write that you are not trying to build a high reliability
device.  However, a device which operates correctly for years
without interruption is considered a "high availability" device,
which is a kind of high reliability.  And techniques for high
reliability seem appropriate here.

>>> It is VERY difficult to design reliable systems.  I am not
>>> attempting that.  Rather, I am trying to address the fact that
>>> the reassurances POST (and, at the user's prerogative, BIST)
>>> are not guaranteed when a device runs "for long periods of time".
>> 
>> You may have tests essentially as part of normal operation.
> 
> I suspect most folks have designed devices with UARTs.  And,
> having written a driver for it, have noted that framing, parity
> and overrun errors are possible.
> 
> Ask yourself how many of those systems ever *use* that information!
> Is there even a means of propagating it up out of the driver?

Well, I always use no-parity transmission mode.  The standard way is
to use checksums and acknowledgments.  That way you know if
transmission is working correctly.  What extra info do you expect
from looking at detailed error info from the UART?
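
A minimal sketch of the checksum plus acknowledgment scheme.  The frame
layout (1-byte sequence number, payload, CRC-32) is an assumption made
for this example, and `channel` is a hypothetical stand-in for the
UART link: it takes the bytes sent and returns the bytes that arrive
at the far end.

```python
import struct
import zlib


def make_frame(seq, payload):
    """Build a frame: 1-byte sequence number, payload, big-endian CRC-32."""
    body = bytes([seq & 0xFF]) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def parse_frame(frame):
    """Return (seq, payload) if the CRC matches, else None."""
    if len(frame) < 5:
        return None
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        return None
    return body[0], body[1:]


def send_reliable(seq, payload, channel, max_tries=3):
    """Resend until the far end would acknowledge the frame.

    Returns the number of tries used, or None if all tries failed.
    The acknowledgment is simulated: a frame that parses correctly
    at the far end counts as acknowledged.
    """
    frame = make_frame(seq, payload)
    for attempt in range(1, max_tries + 1):
        if parse_frame(channel(frame)) == (seq & 0xFF, payload):
            return attempt
    return None
```

The number of tries returned by send_reliable() already works as a
run-time health indicator for the link: a counter of retries that
starts growing says something is degrading, whatever the cause.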

-- 
                              Waldek Hebisch