Path: ...!3.eu.feeder.erje.net!feeder.erje.net!newsfeed.bofh.team!paganini.bofh.team!not-for-mail
From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Sat, 19 Oct 2024 03:00:48 -0000 (UTC)
Organization: To protect and to server
Message-ID: <vev7cu$1rdu5$1@paganini.bofh.team>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team> <veummc$3gbqs$1@dont-email.me>
Injection-Date: Sat, 19 Oct 2024 03:00:48 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="1947589"; posting-host="WwiNTD3IIceGeoS5hCc4+A.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
X-Notice: Filtered by postfilter v. 0.9.3
Bytes: 6378
Lines: 114

Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
>> Don Y <blockedofcourse@foo.invalid> wrote:
>>> Typically, one performs some limited "confidence tests"
>>> at POST to catch gross failures.  As this activity is
>>> "in series" with normal operation, it tends to be brief
>>> and not very thorough.
>>>
>>> Many products offer a BIST capability that the user can invoke
>>> for more thorough testing.  This allows the user to decide
>>> when he can afford to live without the normal functioning of the
>>> device.
>>>
>>> And, if you are a "robust" designer, you often include invariants
>>> that verify hardware operations (esp to I/Os) are actually doing
>>> what they should -- e.g., verifying battery voltage increases
>>> when you activate the charging circuit, loopbacks on DIOs, etc.
>>>
>>> But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
>>> And, BIST might not always be convenient (as well as requiring the
>>> user's consent and participation).
>>>
>>> There, runtime diagnostics are the only alternative for hardware
>>> revalidation, PFA and diagnostics.
>>>
>>> How commonly are such mechanisms implemented?  And, how thoroughly?
>> 
>> This is strange question.  AFAIK automatically run diagnostics/checks
>> are part of safety regulations.
> 
> Not all devices are covered by "regulations".

Well, if the device matters then there is implied liability
and nobody wants to admit doing a bad job.  If the device
does not matter, then the answer to the original question
also does not matter.

> And, the *extent* to which testing is done is the subject
> addressed; if I ensure "stuff" *WORKED* when the device was
> powered on (preventing it from continuing on to its normal
> functionality in the event that some failure was detected),
> what assurance does that give me that the device's integrity
> is still intact 8760 hours (1 yr) later?  720 hours
> (1 mo)?  168 hours (1 wk)?  24 hours?  *1* hour????

What to test is really domain-specific.  Traditional thinking
is that computer hardware is _much_ more reliable than
software and that software bugs are the major source of
misbehaviour.  And among hardware failures, transient upsets
like a flipped bit are more likely than permanent failures.
For example, at a low safety level you may assume that the
hardware of a counter generating a PWM signal works correctly,
but you are supposed to periodically verify that the
configuration registers keep their expected values.  IIUC
crystal oscillators are likely to fail, so you are supposed to
regularly check for the presence of the clock and its frequency
(this assumes a hardware design with a backup clock).
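
To make the register check concrete, here is a minimal sketch in C.
It is only an illustration: the table layout, the names cfg_check
and cfg_verify, and the choice to rewrite a drifted register are my
assumptions, not any vendor's API or what a given safety standard
prescribes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per configuration register to watch (hypothetical layout). */
struct cfg_check {
    volatile uint32_t *reg;  /* address of the register */
    uint32_t mask;           /* bits that must not change after init */
    uint32_t expected;       /* value written at init time */
};

/* Compare every watched register against its expected value.
   Returns the number of registers whose masked bits drifted.
   If repair is true, the masked bits are rewritten; other bits
   (e.g. status flags) are left as read. */
size_t cfg_verify(const struct cfg_check *tab, size_t n, bool repair)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = *tab[i].reg;
        uint32_t want = tab[i].expected & tab[i].mask;
        if ((v & tab[i].mask) != want) {
            bad++;
            if (repair)
                *tab[i].reg = (v & ~tab[i].mask) | want;
        }
    }
    return bad;
}
```

One would call cfg_verify() from a periodic task with a table
filled in at init time; whether a nonzero result means "repair and
continue" or "go to safe state" depends on the domain.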

>>  Even if some safety critical software
>> does not contain them, nobody is going to admit violating regulations.
>> And things like PLC-s are "dual use", they may be used in non-safety
>> role, but vendors claim compliance to safety standards.
> 
> So, if a bit in a RAM in said device *dies* some time after power on,
> is the device going to *know* that has happened?  And, signal its
> unwillingness to continue operating?  What is going to detect that
> failure?

I do not know how PLC manufacturers implement checks.  Small
PLC-s are based on MCU-s with static, parity-protected RAM.
This may be deemed adequate.  PLC-s work in cycles and some
percentage of each cycle is dedicated to self-test.  So a big
PLC may divide memory into smallish regions and in each
cycle check a single region, walking through the whole memory.
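
The region-per-cycle walk could look roughly like the sketch below.
This is my guess at the general shape, not how any PLC actually
does it.  The names (ram_walker, ram_walk_step) and the simple
0x55/0xAA pattern test are assumptions, and on real hardware the
per-word test would have to run with interrupts masked and with
caches taken into account, which is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ram_walker {
    volatile uint32_t *base;  /* start of the memory under test */
    size_t words;             /* total size in 32-bit words */
    size_t region_words;      /* words checked per cycle */
    size_t next;              /* index of the next word to check */
};

/* Non-destructive pattern test of one word: the original content
   is saved first and restored afterwards. */
static bool word_ok(volatile uint32_t *p)
{
    uint32_t saved = *p;
    bool ok = true;
    *p = 0x55555555u;
    if (*p != 0x55555555u) ok = false;
    *p = 0xAAAAAAAAu;
    if (*p != 0xAAAAAAAAu) ok = false;
    *p = saved;
    if (*p != saved) ok = false;
    return ok;
}

/* Check one region, then advance; wraps to the start after the
   last region.  Returns false if any word in the region failed. */
bool ram_walk_step(struct ram_walker *w)
{
    bool ok = true;
    size_t end = w->next + w->region_words;
    if (end > w->words)
        end = w->words;
    for (size_t i = w->next; i < end; i++)
        if (!word_ok(&w->base[i]))
            ok = false;
    w->next = (end >= w->words) ? 0 : end;
    return ok;
}
```

Calling ram_walk_step() once per cycle bounds the self-test time
per cycle by region_words, at the cost of a longer delay before a
fault in a given word is noticed.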

> What if the bit's failure is inconsequential to the operation
> of the device?  E.g., if the bit is part of some not-used
> feature?  *Or*, if it has failed in the state it was *supposed*
> to be in??!

I am afraid that usually an inconsequential failure gets
promoted to a complete failure.  Before 2000, checking showed
that several BIOS-es "validated" the date, and an "incorrect"
date (that is, one after 1999) prevented boot.

Historically OS-es kept a map of bad blocks on the disc and
avoided allocating them.  In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).
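
Purely as an illustration of the idea, a page-frame allocator that
skips frames marked bad might be sketched as follows.  Everything
here is hypothetical: the 64-frame bitmap, the names frame_map,
frame_mark_bad and frame_alloc, and the first-fit policy are chosen
only to keep the example small.

```c
#include <stddef.h>
#include <stdint.h>

#define NFRAMES 64u  /* hypothetical number of physical page frames */

struct frame_map {
    uint64_t bad;   /* bit i set: frame i failed a memory test */
    uint64_t used;  /* bit i set: frame i is currently allocated */
};

/* Record that a frame must never be handed out again. */
void frame_mark_bad(struct frame_map *m, unsigned f)
{
    if (f < NFRAMES)
        m->bad |= (uint64_t)1 << f;
}

/* Returns the lowest frame that is neither bad nor in use and
   marks it as used, or -1 if no such frame is left. */
int frame_alloc(struct frame_map *m)
{
    for (unsigned f = 0; f < NFRAMES; f++) {
        uint64_t bit = (uint64_t)1 << f;
        if (!((m->bad | m->used) & bit)) {
            m->used |= bit;
            return (int)f;
        }
    }
    return -1;
}
```

This is the same bookkeeping as a disc bad-block map; the hard part
in practice would be deciding when a frame is bad and what to do
with data already stored in it.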

> With a "good" POST design, you can reassure the user that the
> device *appears* to be functional.  That the data/code stored in it
> are intact (since last time they were accessed).  That the memory
> is capable of storing any value that it is called on to preserve.
> That the hardware I/Os can control and sense as intended, etc.
> 
> /But, you have no guarantee that this condition will persist!/
> If it WAS guaranteed to persist, then the simple way to make high
> reliability devices would be just to /never turn them off/ to
> take advantage of this "guarantee"!

Everything here is domain-specific.  In a cheap MCU-based device
the main source of failures is overvoltage/ESD on MCU pins.  This
may kill the whole chip, in which case no software protection can
help.  Or only some pins fail; sometimes this can be detected by
reading the appropriate port.  If you control an electric motor
then you probably do not want to send test signals during normal
motor operation.  But you are likely to have some feedback and can
verify that the feedback agrees with expected values.  If you get
unexpected readings you will probably stop the motor.
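
A feedback plausibility check of this kind might be sketched as
below.  The names (fb_monitor, fb_monitor_update), the fixed
tolerance, and the rule of tripping only after several consecutive
bad samples are my assumptions; a real drive would pick these from
its own dynamics and safety requirements.

```c
#include <stdbool.h>
#include <stdint.h>

struct fb_monitor {
    int32_t tolerance;    /* allowed |measured - expected| */
    uint8_t max_strikes;  /* consecutive bad samples before tripping */
    uint8_t strikes;      /* current run of bad samples */
    bool tripped;         /* latched until the caller resets it */
};

/* Feed one sample.  Returns true while the motor may keep running,
   false once the monitor has tripped. */
bool fb_monitor_update(struct fb_monitor *m, int32_t expected,
                       int32_t measured)
{
    int64_t diff = (int64_t)measured - (int64_t)expected;
    if (diff < 0)
        diff = -diff;
    if (diff > m->tolerance) {
        if (m->strikes < m->max_strikes)
            m->strikes++;
        if (m->strikes >= m->max_strikes)
            m->tripped = true;
    } else {
        m->strikes = 0;  /* a good sample ends the run */
    }
    return !m->tripped;
}
```

Requiring a run of bad samples avoids stopping the motor on a
single noisy reading; latching the trip means the motor stays
stopped until something deliberately clears the fault.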

-- 
                              Waldek Hebisch