From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Thu, 24 Oct 2024 14:49:42 -0700
Message-ID: <vfefdu$2psko$2@dont-email.me>

On 10/24/2024 10:52 AM, Waldek Hebisch wrote:
> Don Y <blockedofcourse@foo.invalid> wrote:
>> On 10/18/2024 8:53 PM, Waldek Hebisch wrote:
>>>> One of the FETs that controls the shifting of the automatic
>>>> transmission has failed open.  How do you detect that /and recover
>>>> from it/?
>>>
>>> Detecting such a thing looks easy.  Recovery is tricky, because
>>> if you have a spare FET and activate it there is a good chance that
>>> it will fail for the same reason that the first FET failed.
>>> OTOH, if you have a properly designed circuit around the FET,
>>> a disturbance strong enough to kill the FET is likely to kill
>>> the controller too.
>>
>> The immediate goal is to *detect* that a problem exists.
>> If you can't detect, then attempting to recover is a moot point.
> 
> In a car you have signals from wheels and engine, you can use
> those to compute transmission ratio and check if it is the expected
> one.  Or simply have extra inputs which monitor the FET output.

But a *user* can't do that.  They can only claim "something doesn't
feel right about the drive"...

So, if the controller doesn't do it, what recourse?
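
A minimal sketch of the ratio check described in the quoted text, as the
controller itself might run it.  The gear table, the speed threshold and
every name here are hypothetical, and torque-converter slip and shift
transients are ignored, so a real controller would likely also require
several consecutive mismatches before reporting a fault:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical gear table: engine revs per output-shaft rev, index 0 = 1st. */
static const double GEAR_RATIO[] = { 3.50, 2.10, 1.40, 1.00, 0.75 };
#define NUM_GEARS (sizeof GEAR_RATIO / sizeof GEAR_RATIO[0])

/* Below this output speed the quotient is too noisy to judge (assumed). */
#define MIN_OUTPUT_RPM 100.0

typedef enum { RATIO_OK, RATIO_MISMATCH, RATIO_UNKNOWN } ratio_check_t;

/* Compare the measured engine/output speed ratio against the ratio of the
 * gear that was commanded.  rel_tol is a relative tolerance, e.g. 0.05. */
ratio_check_t check_gear_ratio(size_t commanded_gear, double engine_rpm,
                               double output_rpm, double rel_tol)
{
    assert(commanded_gear < NUM_GEARS);

    if (output_rpm < MIN_OUTPUT_RPM)
        return RATIO_UNKNOWN;               /* cannot decide; not a fault */

    double expected = GEAR_RATIO[commanded_gear];
    double error = engine_rpm / output_rpm - expected;
    if (error < 0.0)
        error = -error;

    return (error <= rel_tol * expected) ? RATIO_OK : RATIO_MISMATCH;
}
```

A RATIO_MISMATCH while 2nd gear is commanded, with a measured ratio near
3.5, would be consistent with a shift that never happened.  The point is
only that the controller has the inputs to notice; the user does not.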

>>>> The camera/LIDAR that the self-drive feature uses is providing
>>>> incorrect data...  etc.
>>>
>>> Use 3 (or more) and voting.  Of course, this increases cost and one
>>> has to judge if the increase in cost is worth the increase in safety
>>
>> As well as the reliability of the additional "voting logic".
>> If not a set of binary signals, determining what the *correct*
>> signal may be can be problematic.
> 
> Matching images is now a standard technology.  And in this case
> "voting logic" is likely to be software and the main trouble is
> possible bugs.

The data must be available concurrently in order to "vote" on
them.  And, must be "close enough" to not consider them to differ.
For high reliability applications, you often *compute* the results
in different ways / algorithms -- to highlight any issues in
one implementation over the other.  So, the temporal path to
"their solutions" isn't the same.

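To make the "concurrent" and "close enough" conditions concrete, here is
a rough sketch of a 2-of-3 voter over scalar readings.  The sample_t
layout, the tolerance, the skew limit and all names are assumptions for
illustration only; voting on images or point clouds is a much harder
matching problem than this:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    double   value;         /* reading from one channel            */
    uint32_t timestamp_ms;  /* when that channel produced the value */
} sample_t;

typedef enum { VOTE_OK, VOTE_DEGRADED, VOTE_FAILED } vote_status_t;

/* Two samples agree only if they are close in value AND close in time. */
static bool agree(const sample_t *a, const sample_t *b,
                  double tol, uint32_t max_skew_ms)
{
    double dv = a->value - b->value;
    if (dv < 0.0)
        dv = -dv;

    uint32_t dt = a->timestamp_ms - b->timestamp_ms;  /* wraps safely */
    if (dt > 0x7FFFFFFFu)
        dt = 0u - dt;

    return dv <= tol && dt <= max_skew_ms;
}

static double median3(double a, double b, double c)
{
    if (a > b) { double t = a; a = b; b = t; }
    if (b > c) { double t = b; b = c; c = t; }
    if (a > b) { double t = a; a = b; b = t; }
    return b;
}

/* *suspect is the index of a channel that agrees with nobody, or -1. */
vote_status_t vote3(const sample_t s[3], double tol, uint32_t max_skew_ms,
                    double *result, int *suspect)
{
    bool ab = agree(&s[0], &s[1], tol, max_skew_ms);
    bool bc = agree(&s[1], &s[2], tol, max_skew_ms);
    bool ac = agree(&s[0], &s[2], tol, max_skew_ms);

    *suspect = -1;
    if (ab && bc && ac) {
        *result = median3(s[0].value, s[1].value, s[2].value);
        return VOTE_OK;
    }
    if (ab) {
        *result = (s[0].value + s[1].value) / 2.0;
        if (!bc && !ac) *suspect = 2;
        return VOTE_DEGRADED;
    }
    if (bc) {
        *result = (s[1].value + s[2].value) / 2.0;
        if (!ac) *suspect = 0;
        return VOTE_DEGRADED;
    }
    if (ac) {
        *result = (s[0].value + s[2].value) / 2.0;
        *suspect = 1;
        return VOTE_DEGRADED;
    }
    return VOTE_FAILED;     /* no two channels agree; *result untouched */
}
```

Note that a channel delivering the right value too late is outvoted just
like one delivering a wrong value, which is exactly the difficulty when
the redundant implementations take different amounts of time.
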
>>> (in self-driving car using multiple sensors looks like no-brainer,
>>> but if this is just an assist to increase driver comfort then
>>> result may be different).
>>
>> It is different only in the sense of liability and exposure to
>> loss.  I am not assigning values to those consequences but,
>> rather, looking to address the issue of run-time testing, in
>> general.
> 
> I doubt there are general solutions.  Various parts of your system
> may have enough common features to allow a single strategy
> in your system.  But it is unlikely to generalize to other
> systems.  To put it differently, there are probabilities
> of various events and associated costs.  Even if you
> refuse to quantify probabilities and costs, your design
> decisions (assuming they are rational) will give some
> estimate of them.

I've asked for other people's experiences.  I've not expected
them to have solved *my* problem.  Nor do I expect my solution
to solve theirs.  Likewise, that's why something like Linux
wouldn't have "the" solution.

>> Again, I am not interested in "recovery" as that varies with
>> the application and risk assessment.  What I want to concentrate
>> on is reliably *detecting* faults before they lead to product
>> failures.
>>
>> I contend that the hardware in many devices has that capability
>> (to some extent) but that it is underutilized; that the issue
>> of detecting faults *after* POST is one that doesn't see much
>> attention.  The likely thinking being that POST will flag it the
>> next time the device is restarted.
>>
>> And, that's not acceptable in long-running devices.
> 
> Well, you write that you do not try to build a high reliability
> device.  However, a device which correctly operates for years
> without interruption is considered a "high availability" device,
> which is a kind of high reliability.  And techniques for high
> reliability seem appropriate here.

No.  Most devices can't afford the cost/complexity of a true
high reliability/redundant solution.

Your car has SOME redundancy in how it handles braking (two
chamber master cylinder plus "emergency/parking" brake PLUS
using the engine to slow the vehicle).  Yet, absolutely NO
protection against a catastrophic failure of the steering!

Redundancy in braking is relatively easy to provide -- esp
in the volumes produced.  So, adds little to the cost and
complexity of the vehicle.  Adding a redundant steering
mechanism... where would you even BEGIN to address that?

Cars are a great example of the tradeoffs involved.  You invest
a *little* to detect and report problems instead of a LOT to
continue operating in their presence.  Why not have duplicate
turn signal indicators (front, rear and side) to guard against
a bulb failure?  Much easier and cheaper to detect that a filament
has opened and report that to the driver (and HOPE he gets around
to fixing it).

If the driver can't be TOLD of faults and failures, then he
is in the dark as to how effectively his device is performing its
required actions.  "CHECK ENGINE" really does mean that the
engine *needs* attention.  How is "CHECK DRAM" any different?
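
As one illustration of what could sit behind a "CHECK DRAM" indicator,
here is a sketch of a non-destructive memory test that runs a few words
at a time from a background task.  All names are hypothetical.  It
assumes the caller guarantees exclusive access to each word while it is
under test (e.g., interrupts masked, cache bypassed), and it only
catches stuck bits in the cell being tested, not addressing or coupling
faults:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    volatile uint32_t *base;   /* start of the region under test       */
    size_t words;              /* region length in 32-bit words        */
    size_t next;               /* where the next step resumes          */
    bool   fault_latched;      /* drives the "CHECK DRAM" indicator    */
    size_t fault_index;        /* most recent failing word, if any     */
} ramtest_t;

/* Write two complementary patterns, read each back, restore the word. */
static bool test_word(volatile uint32_t *p)
{
    uint32_t saved = *p;
    bool ok = true;

    *p = 0xAAAAAAAAu;
    if (*p != 0xAAAAAAAAu) ok = false;
    *p = 0x55555555u;
    if (*p != 0x55555555u) ok = false;
    *p = saved;
    if (*p != saved) ok = false;

    return ok;
}

/* Test up to `budget` words, wrapping at the end of the region.
 * Returns false once any fault has ever been seen (latched). */
bool ramtest_step(ramtest_t *t, size_t budget)
{
    assert(t->words > 0);

    for (size_t i = 0; i < budget; i++) {
        if (!test_word(&t->base[t->next])) {
            t->fault_latched = true;
            t->fault_index = t->next;
        }
        t->next = (t->next + 1) % t->words;
    }
    return !t->fault_latched;
}
```

The budget lets the test be squeezed into idle time; the latch means a
transient failure still gets reported instead of vanishing before anyone
looks.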

>>>> It is VERY difficult to design reliable systems.  I am not
>>>> attempting that.  Rather, I am trying to address the fact that
>>>> the reassurances of POST (and, at the user's prerogative, BIST)
>>>> are not guaranteed when a device runs "for long periods of time".
>>>
>>> You may have tests essentially as part of normal operation.
>>
>> I suspect most folks have designed devices with UARTs.  And,
>> having written a driver for it, have noted that framing, parity
>> and overrun errors are possible.
>>
>> Ask yourself how many of those systems ever *use* that information!
>> Is there even a means of propagating it up out of the driver?
> 
> Well, I always use no parity transmission mode.  The standard way is
> to use checksums and acknowledgments.  That way you know if
> transmission is working correctly.  What extra info do you expect
> from looking at detailed error info from the UART?

That assumes you can control the messages exchanged.  If I
attach a TTY to the console -- routed through a serial port -- on
my computer, what should the checksum be when I see the "login: "
message?  When I type my name, what checksum should I append
to the identifier?

I.e., serial port protocols don't *require* these things.
If the computer sees "do~n" -- where '~' indicates an overrun
error in the preceding character's reception -- it KNOWS
that I haven't typed exactly three characters:  d, o, n.
So, it shouldn't even ASK for my password, choosing, instead,
to reissue the login: banner (because it wouldn't know which
password to validate).

Likewise, if I saw "log~in: " on my TTY, I *know* that it
isn't saying "login: " because AT LEAST one character has
been omitted in that '~'.

This is easy to fix -- in ALL interactions.  But, requires
the driver to propagate these errors up the stack and the
application layer to act on them.  I.e., if the application
layer encounters lots of overrun (or parity/framing) errors,
SOMETHING is wrong with the link and/or the driver.
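
A sketch of what "propagate these errors up the stack" could look like.
It assumes a driver that hands up each received character together with
its error flags; the rx_char_t type, the flag names and both functions
are made up for illustration and do not correspond to any particular
UART or OS.  Data from a flagged character is simply discarded here,
which is a simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    RX_ERR_FRAMING = 1u << 0,
    RX_ERR_PARITY  = 1u << 1,
    RX_ERR_OVERRUN = 1u << 2
};

typedef struct {
    uint8_t data;     /* received byte                       */
    uint8_t errors;   /* RX_ERR_* flags latched for that byte */
} rx_char_t;

typedef struct {
    unsigned long chars;   /* characters examined      */
    unsigned long errors;  /* characters with any flag */
} link_stats_t;

typedef enum { LINE_OK, LINE_CORRUPT, LINE_TOO_LONG } line_status_t;

/* Build one CR/LF-terminated line.  Any flagged character taints the
 * whole line, so the caller knows not to trust what it was given. */
line_status_t assemble_line(const rx_char_t *rx, size_t n,
                            char *out, size_t out_size, link_stats_t *stats)
{
    size_t len = 0;
    bool tainted = false;

    assert(out_size > 0);
    for (size_t i = 0; i < n; i++) {
        stats->chars++;
        if (rx[i].errors != 0) {
            stats->errors++;
            tainted = true;
            continue;                      /* drop the suspect byte */
        }
        if (rx[i].data == '\r' || rx[i].data == '\n')
            break;
        if (len + 1 >= out_size) {
            out[len] = '\0';
            return LINE_TOO_LONG;
        }
        out[len++] = (char)rx[i].data;
    }
    out[len] = '\0';
    return tainted ? LINE_CORRUPT : LINE_OK;
}

/* True if the error rate exceeds `per_thousand` errors per 1000 chars. */
bool link_suspect(const link_stats_t *stats, unsigned long per_thousand)
{
    return stats->errors * 1000ul > stats->chars * per_thousand;
}
```

A login handler built on this would reissue the "login: " banner on
LINE_CORRUPT instead of prompting for a password, and something
watching link_suspect() could raise the serial-port equivalent of
"CHECK ENGINE".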