Path: ...!3.eu.feeder.erje.net!feeder.erje.net!newsfeed.bofh.team!paganini.bofh.team!not-for-mail
From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch.embedded
Subject: Re: Diagnostics
Date: Sat, 19 Oct 2024 03:00:48 -0000 (UTC)
Organization: To protect and to server
Message-ID: <vev7cu$1rdu5$1@paganini.bofh.team>
References: <veekcp$9rsj$1@dont-email.me> <veuggc$1l5eo$1@paganini.bofh.team> <veummc$3gbqs$1@dont-email.me>
Injection-Date: Sat, 19 Oct 2024 03:00:48 -0000 (UTC)
Injection-Info: paganini.bofh.team; logging-data="1947589"; posting-host="WwiNTD3IIceGeoS5hCc4+A.user.paganini.bofh.team"; mail-complaints-to="usenet@bofh.team"; posting-account="9dIQLXBM7WM9KzA+yjdR4A";
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
X-Notice: Filtered by postfilter v. 0.9.3
Bytes: 6378
Lines: 114

Don Y <blockedofcourse@foo.invalid> wrote:
> On 10/18/2024 1:30 PM, Waldek Hebisch wrote:
>> Don Y <blockedofcourse@foo.invalid> wrote:
>>> Typically, one performs some limited "confidence tests"
>>> at POST to catch gross failures. As this activity is
>>> "in series" with normal operation, it tends to be brief
>>> and not very thorough.
>>>
>>> Many products offer a BIST capability that the user can invoke
>>> for more thorough testing. This allows the user to decide
>>> when he can afford to live without the normal functioning of the
>>> device.
>>>
>>> And, if you are a "robust" designer, you often include invariants
>>> that verify hardware operations (esp to I/Os) are actually doing
>>> what they should -- e.g., verifying battery voltage increases
>>> when you activate the charging circuit, loopbacks on DIOs, etc.
>>>
>>> But, for 24/7/365 boxes, POST is a "once-in-a-lifetime" activity.
>>> And, BIST might not always be convenient (as well as requiring the
>>> user's consent and participation).
>>>
>>> There, runtime diagnostics are the only alternative for hardware
>>> revalidation, PFA and diagnostics.
>>>
>>> How commonly are such mechanisms implemented? And, how thoroughly?
>>
>> This is strange question. AFAIK automatically run diagnostics/checks
>> are part of safety regulations.
>
> Not all devices are covered by "regulations".

Well, if a device matters then there is implied liability and
nobody wants to admit doing a bad job. If the device does not matter,
then the answer to the original question also does not matter.

> And, the *extent* to which testing is done is the subject
> addressed; if I ensure "stuff" *WORKED* when the device was
> powered on (preventing it from continuing on to its normal
> functionality in the event that some failure was detected),
> what assurance does that give me that the device's integrity
> is still intact 8760 hours (1 yr) hours later? 720 hours
> (1 mo)? 168 hours (1 wk)? 24 hours? *1* hour????

What to test is really domain-specific. Traditional thinking
is that computer hardware is _much_ more reliable than
software, and that software bugs are the major source of misbehaviour.
And among hardware failures, transient upsets like a flipped
bit are more likely than permanent failures. For example,
at a low safety level you may assume that the hardware of a counter
generating a PWM-ed signal works correctly, but you are
supposed to periodically verify that its configuration registers
keep their expected values. IIUC crystal oscillators are a likely
point of failure, so you are supposed to regularly check for the
presence of the clock and its frequency (this assumes a hardware
design with a backup clock).

>> Even if some safety critical software
>> does not contain them, nobody is going to admit violating regulations.
>> And things like PLC-s are "dual use", they may be used in non-safety
>> role, but vendors claim compliance to safety standards.
>
> So, if a bit in a RAM in said device *dies* some time after power on,
> is the device going to *know* that has happened? And, signal its
> unwillingness to continue operating? What is going to detect that
> failure?

I do not know how PLC manufacturers implement checks. Small
PLC-s are based on MCU-s with static, parity-protected RAM.
This may be deemed adequate. PLC-s work in cycles and some
percentage of each cycle is dedicated to self-test. So a big
PLC may divide memory into smallish regions and check a single
region in each cycle, walking through the whole memory.

> What if the bit's failure is inconsequential to the operation
> of the device? E.g., if the bit is part of some not-used
> feature? *Or*, if it has failed in the state it was *supposed*
> to be in??!

I am afraid that usually an inconsequential failure gets
promoted to a complete failure. Before 2000, checking showed
that several BIOS-es "validated" the date, and an "incorrect"
(that is, after 1999) date prevented boot.

Historically OS-es had a map of bad blocks on the disc and
avoided allocating them. In principle, on a system with paging
hardware the same could be done for DRAM, but I do not think
anybody is doing this (if the domain is serious enough to worry
about DRAM failures, then it probably has redundant independent
computers with ECC DRAM).

> With a "good" POST design, you can reassure the user that the
> device *appears* to be functional. That the data/code stored in it
> are intact (since last time they were accessed). That the memory
> is capable of storing any values that is called on to preserve.
> That the hardware I/Os can control and sense as intended, etc.
>
> /But, you have no guarantee that this condition will persist!/
> If it WAS guaranteed to persist, then the simple way to make high
> reliability devices would be just to /never turn them off/ to
> take advantage of this "guarantee"!

Everything here is domain specific. In a cheap MCU-based device
the main source of failures is overvoltage/ESD on MCU pins. This
may kill the whole chip, in which case no software protection
can help. Or only some pins fail; sometimes this can be detected
by reading the appropriate port.
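
A minimal sketch of that port read-back idea, in C. The port model
here (sim_port_t, its fields, and both function names) is made up
for illustration and simulated in software; on a real MCU the read
would come from the port's own input register, whose name and
layout depend on the chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an 8-bit GPIO port (not any real MCU's layout). */
typedef struct {
    uint8_t out_latch;   /* value software last wrote to the outputs   */
    uint8_t dir_mask;    /* 1 = pin is configured as an output         */
    uint8_t stuck_low;   /* simulation only: damaged pins that read 0  */
    uint8_t stuck_high;  /* simulation only: damaged pins that read 1  */
} sim_port_t;

/* Simulated read of the actual pin levels. On hardware this would be
 * a volatile read of the port's input register. Input pins read 0 in
 * this model unless they are stuck high. */
static uint8_t sim_port_read_pins(const sim_port_t *p)
{
    uint8_t level = (uint8_t)(p->out_latch & p->dir_mask);
    level &= (uint8_t)~p->stuck_low;
    level |= p->stuck_high;
    return level;
}

/* Return a mask of output pins whose level disagrees with the latch.
 * Zero means every output pin reads back what was written. */
static uint8_t port_check_outputs(const sim_port_t *p)
{
    assert(p != 0);
    uint8_t pins = sim_port_read_pins(p);
    return (uint8_t)((pins ^ p->out_latch) & p->dir_mask);
}
```

Note that a pin stuck in the state it is currently driven to gives
no mismatch, so a check like this only catches the fault once the
output is commanded to the other level.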

If you control an electric motor then you probably do not want to
send test signals during normal motor operation. But you are
likely to have some feedback and can verify that the feedback agrees
with expected values. If you get unexpected readings you will
probably stop the motor.

-- 
Waldek Hebisch