Path: ...!news.mixmin.net!weretis.net!feeder8.news.weretis.net!newsfeed.bofh.team!paganini.bofh.team!not-for-mail
From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Sat, 19 Oct 2024 03:53:40 -0000 (UTC)
Organization: To protect and to server
Message-ID: <vevag2$1ricg$1@paganini.bofh.team>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team> <77k5hjprfq0ipjp6pcdd03lnph1i76ssuu@4ax.com> <veunj9$3gbqs$2@dont-email.me> <vev398$1r4v5$2@paganini.bofh.team> <vev635$3mf56$1@dont-email.me>
Injection-Date: Sat, 19 Oct 2024 03:53:40 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="1952144"; posting-host="WwiNTD3IIceGeoS5hCc4+A.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
X-Notice: Filtered by postfilter v. 0.9.3
Bytes: 5546
Lines: 95

Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/18/2024 6:50 PM, Waldek Hebisch wrote:
>> Don Y <blockedofcourse@foo.invalid> wrote:
>>> On 10/18/2024 2:42 PM, George Neuner wrote:
>>>
>>>>   To ensure 100%
>>>> functionality at all times effectively requires use of redundant
>>>> hardware - which generally is too expensive for a non safety critical
>>>> device.
>>>
>>> Apparently, there is noise about incorporating such hardware into
>>> *automotive* designs (!).  I would have thought the time between
>>> POSTs would have rendered that largely ineffective.  OTOH, if
>>> you imagine a failure can occur ANY time, then "just after
>>> putting the car in gear" is as good (bad!) a time as any!
>> 
>> TI has for several years offered nice processors with two cores,
>> which are almost in sync, but one is something like one cycle behind
>> the other.  And there is circuitry to check that both cores
>> produce the same result.  This does not cover failures of the
>> whole chip, but dramatically lowers the chance of undetected errors
>> due to some transient condition.
> 
> The 4th bit in memory location XYZ has failed "stuck at zero".
> How are you going to detect that?

The chips that I mentioned use static memory with ECC.  Of course,
the ECC circuitry may fail.  There may be errors undetected by ECC.
The two cores may have the same error, or the comparison circuitry
may fail to detect the difference.  Each of these may happen, but each
is much less likely than a simple transient error.
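
A minimal sketch of how an error-correcting code exposes a stuck bit.
The Hamming(7,4) code below is only a stand-in chosen for brevity; it
is an assumption, not the scheme the chips mentioned above actually
use (real parts typically protect wider words and also detect double
errors).  The function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into 7 bits; bit (p - 1) of the result is position p."""
    bits = [0] * 8                        # index 0 unused, positions 1..7
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # parity over positions 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # parity over positions 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # parity over positions 4,5,6,7
    return sum(bits[p] << (p - 1) for p in range(1, 8))


def hamming74_decode(word):
    """Return (nibble, position); position 0 means no error was seen."""
    bits = [0] + [(word >> (p - 1)) & 1 for p in range(1, 8)]
    syndrome = 0
    for p in range(1, 8):
        if bits[p]:
            syndrome ^= p                 # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome] ^= 1               # assume a single error and flip it back
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, syndrome


STUCK_AT_ZERO = ~0b1000                   # position 4 of the cell always reads 0

word = hamming74_encode(0b0010)           # this codeword has position 4 set
print(hamming74_decode(word & STUCK_AT_ZERO))
```

Two limits are visible even in this toy.  A stuck-at-zero cell is only
noticed when the stored codeword wants a one there, so the fault can
stay latent for a while.  And two wrong bits in one word are silently
"corrected" into the wrong value, which is one example of an error
undetected by ECC.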

> One of the FETs that controls the shifting of the automatic
> transmission has failed open.  How do you detect that /and recover
> from it/?

Detecting such a thing looks easy.  Recovery is tricky, because
if you have a spare FET and activate it, there is a good chance that
it will fail for the same reason that the first FET failed.
OTOH, if you have a properly designed circuit around the FET,
a disturbance strong enough to kill the FET is likely to kill
the controller too.

> The camera/LIDAR that the self-drive feature uses is providing
> incorrect data...  etc.

Use 3 (or more) and voting.  Of course, this increases cost and one
has to judge if the increase in cost is worth the increase in safety
(in a self-driving car using multiple sensors looks like a no-brainer,
but if this is just an assist to increase driver comfort then
the result may be different).
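
A minimal sketch of such voting, assuming each sensor delivers one
scalar reading (say, a distance) and that readings within some
tolerance count as agreeing.  The function name and the tolerance
are hypothetical; real camera/LIDAR fusion is far more involved.

```python
def vote3(a, b, c, tolerance):
    """Median-of-three voter.

    Returns (value, outliers): the median reading, plus the indices
    (0, 1 or 2) of sensors that differ from it by more than tolerance.
    """
    readings = (a, b, c)
    median = sorted(readings)[1]
    outliers = [i for i, r in enumerate(readings) if abs(r - median) > tolerance]
    return median, outliers


value, suspects = vote3(10.0, 10.2, 55.0, tolerance=0.5)
print(value, suspects)
```

The median is never the reading of a single wildly wrong sensor, so
one bad sensor cannot drag the result away.  If two sensors fail in
the same way the voter is fooled, which is part of the cost versus
safety judgement above.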

> There are innumerable failures that can occur to compromise
> the "system" and no *easy*/inexpensive/reliable way to detect
> and recover from *all* of them.

Sure.  But for common failures, or serious failures having non-negligible
probability, redundancy may offer a cheap way to increase reliability.

>> For critical functions a car could have 3 processors with
>> voting circuitry.  With separate chips this would be more expensive
>> than a single processor, but the increase in cost probably would be
>> negligible compared to the cost of the whole car.  And when integrated
>> on a single chip the cost difference would be tiny.
>> 
>> IIUC a car controller may "reboot" during a ride.  Instead of
>> rebooting it could hand work over to a backup controller.
> 
> How do you know the circuitry (and other mechanisms) that
> implement this hand-over are operational?

It does not matter if handover _always_ works.  What matters is
whether a system with handover has a lower chance of failure than
a system without handover.  With statistics of actual failures
(which I do not have but manufacturers should have) and
after some testing, one can estimate the failure probability of
different designs and possibly decide to use handover.
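
A rough sketch of that comparison.  The model is an assumption made
for illustration: failures are taken as independent, and the handover
logic may itself break a healthy system.  All parameter names and
numbers are hypothetical, not measured data.

```python
def failure_without_handover(p_primary):
    """The system fails whenever the single controller fails."""
    return p_primary


def failure_with_handover(p_primary, p_handover_works, p_backup, p_spurious):
    """Failure probability of a primary + backup design.

    p_primary        -- primary controller fails during the ride
    p_handover_works -- handover succeeds, given that the primary failed
    p_backup         -- backup fails after taking over
    p_spurious       -- handover logic disturbs a healthy primary
    """
    # Primary fails and either the handover does not work or the backup fails too.
    unrecovered = p_primary * ((1 - p_handover_works) + p_handover_works * p_backup)
    # Primary is fine, but the extra circuitry causes a failure by itself.
    self_inflicted = (1 - p_primary) * p_spurious
    return unrecovered + self_inflicted


plain = failure_without_handover(1e-3)
good = failure_with_handover(1e-3, p_handover_works=0.9, p_backup=1e-3, p_spurious=1e-5)
poor = failure_with_handover(1e-3, p_handover_works=0.5, p_backup=1e-3, p_spurious=1e-3)
print(plain, good, poor)
```

With the first set of made-up numbers a handover that works only 90%
of the time still lowers the failure probability by almost an order
of magnitude.  With the second set the handover circuitry makes
things worse, which is why real failure statistics are needed before
deciding.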

> It is VERY difficult to design reliable systems.  I am not
> attempting that.  Rather, I am trying to address the fact that
> the reassurances POST (and, at the user's prerogative, BIST)
> are not guaranteed when a device runs "for long periods of time".

You may have tests essentially as part of normal operation.
Of course, if you have a single-tasked design with a task which
must be "always" ready to respond, then running tests becomes
more complicated.  But in most designs you can spare enough
time slots to run tests during normal operation.  Tests may
interfere with normal operation, but here we are in domain
specific territory: sometimes the result of an operation gives enough
assurance that the device is operating correctly.  And if testing
for correct operation is impossible, then there is nothing to
do; I certainly do not promise to deliver the impossible.
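
One way to spend spare time slots on testing, sketched as a
simulation: check a program image against a known CRC a small slice
at a time, so that no single slice delays normal work for long.  The
class name, the slice size and the idea of a stored reference CRC are
assumptions for illustration; on a real controller this would be C
or a hardware CRC unit walking over flash.

```python
import zlib


class BackgroundCrcCheck:
    """Incremental CRC-32 check of an image, one slice per call."""

    def __init__(self, image, expected_crc, slice_size=64):
        self.image = image
        self.expected_crc = expected_crc
        self.slice_size = slice_size
        self.offset = 0
        self.crc = 0

    def step(self):
        """Process one slice.

        Returns None while a pass is still in progress, and True or
        False (image good or bad) when a full pass has just finished.
        """
        chunk = self.image[self.offset:self.offset + self.slice_size]
        self.crc = zlib.crc32(chunk, self.crc)   # continue the running CRC
        self.offset += len(chunk)
        if self.offset < len(self.image):
            return None
        ok = self.crc == self.expected_crc
        self.offset, self.crc = 0, 0             # restart for the next pass
        return ok


image = bytearray(range(256)) * 4                # stand-in for program memory
check = BackgroundCrcCheck(image, zlib.crc32(image), slice_size=100)

image[500] ^= 0x08                               # a bit flips while running
verdict = None
while verdict is None:                           # one call per spare time slot
    verdict = check.step()
print("image ok" if verdict else "image corrupted")
```

The detection latency is one full pass, so the slice size trades
interference with normal operation against how long a fault can go
unnoticed.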

-- 
                              Waldek Hebisch