From: boB
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Fri, 19 Apr 2024 11:16:02 -0700

On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:

>On 4/18/2024 10:18 AM, Buzz McCool wrote:
>> On 4/15/2024 10:13 AM, Don Y wrote:
>>> Is there a general rule of thumb for signalling the likelihood of
>>> an "imminent" (for some value of "imminent") hardware failure?
>>
>> This reminded me of some past efforts in this area. It was never
>> demonstrated to me (given ample opportunity) that this technology
>> actually worked on intermittently failing hardware I had, so be
>> cautious in applying it in any future endeavors.
>
>Intermittent failures are the bane of all designers. Until something
>is reliably observable, trying to address the problem is largely
>whack-a-mole.
>

The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.

>> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
>
>Thanks for that. I didn't find it in my collection so its addition will
>be welcome.

Yes, neat paper.

boB

>
>Sun has historically been aggressive in trying to increase availability,
>especially on big iron. In fact, such a "prediction" led me to discard
>a small server, yesterday (no time to dick with failing hardware!).
>
>I am now seeing similar features in Dell servers. But, the *actual*
>implementation details are always shrouded in mystery.
>
>But, it is obvious (for "always on" systems) that there are many things
>that can silently fail and only manifest some time later -- if at all,
>and possibly complicated by other failures that the original fault
>precipitated.
>
>Sorting out WHAT to monitor is the tricky part. Then, having the
>ability to watch for trends can give you an inkling that something is
>headed in the wrong direction -- before it actually exceeds some
>baked-in "hard limit".
>
>E.g., only the memory that you actively REFERENCE in a product is ever
>checked for errors! Bit rot may not be detected until some time after
>it has occurred -- when you eventually access that memory (and the
>memory controller throws an error).
>
>This is paradoxically amusing; the code that HANDLES errors is likely
>the least accessed code in a product. So, bit rot IN that code is more
>likely to go unnoticed -- until it is referenced (by some error
>condition) and the error event is complicated by the attendant error
>in the handler! The more reliable your code (fewer faults), the more
>uncertain you will be of the handlers' ability to address faults that
>DO manifest!
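The cheap insurance for the "unreferenced memory" problem is a patrol
scrub. Server memory controllers will usually do it in hardware; on
smaller parts you end up rolling it yourself. A rough sketch in C (the
region base/size and the function name are made up for illustration --
point it at whatever ECC-protected region your part actually has):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical ECC-protected region; substitute the real base/size,
   or iterate whatever regions your BSP reports. */
#define SCRUB_BASE  ((volatile uint32_t *)0x80000000u)
#define SCRUB_WORDS ((64u * 1024u * 1024u) / sizeof(uint32_t))

/* Call periodically from a low-priority task or timer tick.  Each call
   reads a small chunk so the scrub never hogs the bus; over many calls
   every word gets read, which forces the controller to check (and
   report or correct) latent errors instead of waiting for the
   application to stumble over them. */
void scrub_step(void)
{
    static size_t next = 0;
    volatile uint32_t sink = 0;
    size_t i;

    for (i = 0; i < 4096 && next < SCRUB_WORDS; i++, next++)
        sink = SCRUB_BASE[next];        /* the read is the whole point */

    (void)sink;
    if (next >= SCRUB_WORDS)
        next = 0;                       /* wrap; start the next pass */
}

Same idea, different granularity, for the error handlers themselves:
periodically checksum the code image (or exercise the handlers with a
synthetic fault now and then) so you find the rot before a real fault
does.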
>The same applies to secondary storage media. How will you know if
>some-rarely-accessed-file is intact and ready to be referenced
>WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
>verify that it is intact, NOW?
>
>[One common flaw with RAID implementations and naive reliance on that
>technology]
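The same "read it before you need it" logic works at the file level,
and a patrol read can be a trivial scheduled job. A sketch in C (the
manifest format, the default file name, and the choice of FNV-1a as the
checksum are just illustration -- use whatever hash/manifest you already
have). It re-reads every listed file and compares against a stored
hash, so a rotted sector shows up on your schedule instead of when the
file is finally needed:

#include <stdio.h>
#include <stdint.h>

/* FNV-1a 64-bit hash of a whole file; *ok is cleared on any I/O error. */
static uint64_t file_hash(const char *path, int *ok)
{
    uint64_t h = 0xcbf29ce484222325ull;       /* FNV offset basis */
    unsigned char buf[65536];
    size_t n, i;
    FILE *f = fopen(path, "rb");

    *ok = 0;
    if (!f)
        return 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ull;            /* FNV prime */
        }
    }
    *ok = !ferror(f);
    fclose(f);
    return h;
}

/* Manifest lines look like:  <hex hash> <path>
   (format invented here).  Flag anything unreadable or changed. */
int main(int argc, char **argv)
{
    char path[4096];
    unsigned long long want;
    FILE *m = fopen(argc > 1 ? argv[1] : "manifest.txt", "r");

    if (!m) { perror("manifest"); return 1; }
    while (fscanf(m, "%llx %4095[^\n]", &want, path) == 2) {
        int ok;
        uint64_t got = file_hash(path, &ok);
        if (!ok)
            printf("READ ERROR  %s\n", path);
        else if (got != (uint64_t)want)
            printf("MISMATCH    %s\n", path);
    }
    fclose(m);
    return 0;
}

That's also the answer to the RAID remark: the array can only rebuild
around a bad block it has been told about, so a scheduled scrub/patrol
read of the members is what turns "silent" rot into a correctable event.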