From: boB
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Fri, 19 Apr 2024 11:16:02 -0700

On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:

>On 4/18/2024 10:18 AM, Buzz McCool wrote:
>> On 4/15/2024 10:13 AM, Don Y wrote:
>>> Is there a general rule of thumb for signalling the likelihood of
>>> an "imminent" (for some value of "imminent") hardware failure?
>>
>> This reminded me of some past efforts in this area. It was never
>> demonstrated to me (given ample opportunity) that this technology
>> actually worked on intermittently failing hardware I had, so be
>> cautious in applying it in any future endeavors.
>
>Intermittent failures are the bane of all designers. Until something
>is reliably observable, trying to address the problem is largely
>whack-a-mole.
>

The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.

>> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
>
>Thanks for that. I didn't find it in my collection so its addition will
>be welcome.

Yes, neat paper.

boB

>
>Sun has historically been aggressive in trying to increase availability,
>especially on big iron. In fact, such a "prediction" led me to discard
>a small server, yesterday (no time to dick with failing hardware!).
>
>I am now seeing similar features in Dell servers. But, the *actual*
>implementation details are always shrouded in mystery.
>
>But, it is obvious (for "always on" systems) that there are many things
>that can silently fail and only manifest some time later -- if at all,
>and possibly complicated by other failures that the original fault
>precipitated.
>
>Sorting out WHAT to monitor is the tricky part. Then, having the
>ability to watch for trends can give you an inkling that something is
>headed in the wrong direction -- before it actually exceeds some
>baked-in "hard limit".
>
>E.g., only the memory that you actively REFERENCE in a product is ever
>checked for errors! Bit rot may not be detected until some time after
>it has occurred -- when you eventually access that memory (and the
>memory controller throws an error).
>
>This is paradoxically amusing; the code that HANDLES errors is likely
>the least accessed code in a product. So, bit rot IN that code is more
>likely to go unnoticed -- until it is referenced (by some error
>condition) and the error event is complicated by the attendant error
>in the handler! The more reliable your code (fewer faults), the more
>uncertain you will be of the handlers' ability to address faults that
>DO manifest!
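The cheap insurance for the "unreferenced memory" problem is a patrol
scrub. Server memory controllers will usually do it in hardware; on
smaller parts you end up rolling it yourself. A rough sketch in C (the
region base/size and the function name are made up for illustration --
point it at whatever ECC-protected region your part actually has):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical ECC-protected region; substitute the real base/size,
   or iterate whatever regions your BSP reports. */
#define SCRUB_BASE  ((volatile uint32_t *)0x80000000u)
#define SCRUB_WORDS ((64u * 1024u * 1024u) / sizeof(uint32_t))

/* Call periodically from a low-priority task or timer tick.  Each call
   reads a small chunk so the scrub never hogs the bus; over many calls
   every word gets read, which forces the controller to check (and
   report or correct) latent errors instead of waiting for the
   application to stumble over them. */
void scrub_step(void)
{
    static size_t next = 0;
    volatile uint32_t sink = 0;
    size_t i;

    for (i = 0; i < 4096 && next < SCRUB_WORDS; i++, next++)
        sink = SCRUB_BASE[next];        /* the read is the whole point */

    (void)sink;
    if (next >= SCRUB_WORDS)
        next = 0;                       /* wrap; start the next pass */
}

Same idea, different granularity, for the error handlers themselves:
periodically checksum the code image (or exercise the handlers with a
synthetic fault now and then) so you find the rot before a real fault
does.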
>The same applies to secondary storage media. How will you know if
>some-rarely-accessed-file is intact and ready to be referenced
>WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
>verify that it is intact, NOW?
>
>[One common flaw with RAID implementations and naive reliance on that
>technology]
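The same "read it before you need it" logic works at the file level,
and a patrol read can be a trivial scheduled job. A sketch in C (the
manifest format, the default file name, and the choice of FNV-1a as the
checksum are just illustration -- use whatever hash/manifest you already
have). It re-reads every listed file and compares against a stored
hash, so a rotted sector shows up on your schedule instead of when the
file is finally needed:

#include <stdio.h>
#include <stdint.h>

/* FNV-1a 64-bit hash of a whole file; *ok is cleared on any I/O error. */
static uint64_t file_hash(const char *path, int *ok)
{
    uint64_t h = 0xcbf29ce484222325ull;       /* FNV offset basis */
    unsigned char buf[65536];
    size_t n, i;
    FILE *f = fopen(path, "rb");

    *ok = 0;
    if (!f)
        return 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ull;            /* FNV prime */
        }
    }
    *ok = !ferror(f);
    fclose(f);
    return h;
}

/* Manifest lines look like:  <hex hash> <path>
   (format invented here).  Flag anything unreadable or changed. */
int main(int argc, char **argv)
{
    char path[4096];
    unsigned long long want;
    FILE *m = fopen(argc > 1 ? argv[1] : "manifest.txt", "r");

    if (!m) { perror("manifest"); return 1; }
    while (fscanf(m, "%llx %4095[^\n]", &want, path) == 2) {
        int ok;
        uint64_t got = file_hash(path, &ok);
        if (!ok)
            printf("READ ERROR  %s\n", path);
        else if (got != (uint64_t)want)
            printf("MISMATCH    %s\n", path);
    }
    fclose(m);
    return 0;
}

That's also the answer to the RAID remark: the array can only rebuild
around a bad block it has been told about, so a scheduled scrub/patrol
read of the members is what turns "silent" rot into a correctable event.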