Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Thu, 18 Apr 2024 15:05:07 -0700
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <uvs5eu$2g9e9$2@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me> <uvrkki$2c9fl$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 19 Apr 2024 00:05:20 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a10f2b25949bbce175e159a01160a168";
	logging-data="2631113"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18K+g3OajlH+F+H37S7kJ9Q"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.2
Cancel-Lock: sha1:DP/bQnSKDBMsUzA4cwIr/H3avnI=
In-Reply-To: <uvrkki$2c9fl$1@dont-email.me>
Content-Language: en-US
Bytes: 3693

On 4/18/2024 10:18 AM, Buzz McCool wrote:
> On 4/15/2024 10:13 AM, Don Y wrote:
>> Is there a general rule of thumb for signalling the likelihood of
>> an "imminent" (for some value of "imminent") hardware failure?
> 
> This reminded me of some past efforts in this area. It was never demonstrated 
> to me (given ample opportunity) that this technology actually worked on 
> intermittently failing hardware I had, so be cautious in applying it in any 
> future endeavors.

Intermittent failures are the bane of all designers.  Until something
is reliably observable, trying to address the problem is largely
whack-a-mole.

> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

Thanks for that.  I didn't find it in my collection so its addition
will be welcome.

Sun has historically been aggressive in trying to increase availability,
especially on big iron.  In fact, such a "prediction" led me to discard
a small server yesterday (no time to dick with failing hardware!).

I am now seeing similar features in Dell servers.  But, the *actual*
implementation details are always shrouded in mystery.

But, it is obvious (for "always on" systems) that there are many things
that can silently fail and only manifest some time later -- if at
all -- possibly complicated by other failures that the original fault
itself precipitated.

Sorting out WHAT to monitor is the tricky part.  Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked-in "hard limit".
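Something like the following is often enough -- fit a line to recent
samples of some health metric (temperature, ECC-correction counts,
reallocated sectors, whatever) and warn when the projection crosses
the limit within your horizon.  All the names and the window size here
are purely illustrative:

#include <stddef.h>

#define WINDOW 32

/* Least-squares slope of the last WINDOW samples (one per tick). */
static double trend_slope(const double s[WINDOW])
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < WINDOW; i++) {
        sx  += i;
        sy  += s[i];
        sxx += (double)i * i;
        sxy += (double)i * s[i];
    }
    return (WINDOW * sxy - sx * sy) / (WINDOW * sxx - sx * sx);
}

/* Flag trouble while still BELOW the hard limit: project the current
   trend forward and warn if the limit would be crossed within
   'horizon' ticks. */
int heading_wrong_way(const double s[WINDOW], double hard_limit,
                      double horizon)
{
    double m = trend_slope(s);
    if (m <= 0)
        return 0;          /* flat or improving -- nothing to report */
    return s[WINDOW - 1] + m * horizon >= hard_limit;
}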

E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors!  Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory
controller throws an error).
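A background scrub is the usual answer: touch every word so the ECC
hardware gets a look at it NOW, instead of whenever demand happens to.
A minimal sketch (the region bounds are assumptions; a real scrubber
would pace itself and mind caches/MMU):

#include <stdint.h>
#include <stddef.h>

/* Walk a RAM region, forcing a read of every word so latent bit rot
   is detected (and reported) by the ECC logic instead of lying in
   wait for a demand access. */
void scrub_pass(volatile const uint32_t *base, size_t words)
{
    volatile uint32_t sink;
    for (size_t i = 0; i < words; i++)
        sink = base[i];   /* read forces the ECC check; value unused */
    (void)sink;
}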

This is paradoxically amusing; code to HANDLE errors is likely the least
exercised code in a product.  So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event is complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the less certain you can be
of the handlers' abilities to address faults that DO manifest!
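You can at least detect rot in the handlers before they're needed by
checksumming their image periodically.  A sketch, assuming hypothetical
linker symbols bracketing the handler section and a reference CRC
recorded at build/install time:

#include <stdint.h>
#include <stddef.h>

extern const uint8_t __handlers_start[];  /* hypothetical linker syms */
extern const uint8_t __handlers_end[];

/* Plain bitwise CRC-32 (reflected, poly 0xEDB88320). */
static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}

/* Nonzero if the handler image still matches the reference CRC. */
int handlers_intact(uint32_t expected)
{
    return crc32(__handlers_start,
                 (size_t)(__handlers_end - __handlers_start)) == expected;
}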

The same applies to secondary storage media.  How will you know if
some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?

[One common flaw with RAID implementations and naive reliance on that
technology]
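The patrol read itself is trivial -- the point is doing it BEFORE the
data is needed.  A sketch (the checksum is a stand-in for a real
CRC/hash recorded when the file was known good):

#include <stdio.h>
#include <stdint.h>

/* Read a file end to end NOW, so the drive has to face any latent
   media errors before the data is needed.  Returns 0 if every byte
   was readable (checksum left in *sum_out for comparison against a
   value recorded earlier), -1 on any I/O error. */
int patrol_read(const char *path, uint32_t *sum_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    uint32_t sum = 0;      /* stand-in for a proper CRC/hash */
    uint8_t buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++)
            sum = sum * 31 + buf[i];

    int bad = ferror(f);
    fclose(f);
    if (bad)
        return -1;
    *sum_out = sum;
    return 0;
}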