From: Don Y
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Thu, 18 Apr 2024 15:05:07 -0700

On 4/18/2024 10:18 AM, Buzz McCool wrote:
> On 4/15/2024 10:13 AM, Don Y wrote:
>> Is there a general rule of thumb for signalling the likelihood of
>> an "imminent" (for some value of "imminent") hardware failure?
>
> This reminded me of some past efforts in this area. It was never
> demonstrated to me (given ample opportunity) that this technology
> actually worked on intermittently failing hardware I had, so be
> cautious in applying it in any future endeavors.

Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
whack-a-mole.

> https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

Thanks for that. I didn't find it in my collection, so its addition
will be welcome.

Sun has historically been aggressive in trying to increase
availability, especially on big iron. In fact, such a "prediction"
led me to discard a small server yesterday (no time to dick with
failing hardware!). I am now seeing similar features in Dell servers.
But the *actual* implementation details are always shrouded in
mystery.

It is obvious (for "always on" systems) that there are many things
that can silently fail and only manifest some time later -- if at
all, and possibly complicated by other failures that may have been
precipitated by the original one. Sorting out WHAT to monitor is the
tricky part.

Then, having the ability to watch for trends can give you an inkling
that something is headed in the wrong direction -- before it actually
exceeds some baked-in "hard limit".

E.g., only the memory that you actively REFERENCE in a product is
ever checked for errors! Bit rot may not be detected until some time
after it has occurred -- when you eventually access that memory (and
the memory controller throws an error).

This is paradoxically amusing; the code to HANDLE errors is likely
the least-accessed code in a product. So, bit rot IN that code is
more likely to go unnoticed -- until it is referenced (by some error
condition) and the error event is complicated by the attendant error
in the handler! The more reliable your code (fewer faults), the more
uncertain you will be of the handlers' abilities to address faults
that DO manifest!

The same applies to secondary storage media. How will you know if
some rarely-accessed file is intact and ready to be referenced WHEN
NEEDED -- if you aren't doing patrol reads/scrubbing to verify that
it is intact, NOW?

[One common flaw with RAID implementations and naive reliance on
that technology]
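
To make the "watch for trends" point above a bit more concrete, here is
a minimal sketch (not any vendor's actual predictor): fit a straight
line to a sliding window of some health metric and flag it when the
extrapolation is headed across a limit within a chosen horizon, rather
than waiting for the hard limit itself. The metric, window size,
limit, and horizon below are all illustrative.

    /* Hypothetical trend watcher: least-squares fit over the last N
     * samples of a health metric; warn if the extrapolated value would
     * cross `limit` within `horizon` further sample periods. */
    #include <stdio.h>

    #define N 16                 /* samples in the sliding window */

    static int trend_warns(const double y[N], double limit, double horizon)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;

        for (int i = 0; i < N; i++) {     /* x = 0..N-1 (sample index) */
            sx  += i;
            sy  += y[i];
            sxx += (double)i * i;
            sxy += (double)i * y[i];
        }

        double slope = (N * sxy - sx * sy) / (N * sxx - sx * sx);
        double icept = (sy - slope * sx) / N;

        /* value predicted `horizon` periods past the end of the window */
        double predicted = icept + slope * ((N - 1) + horizon);

        return (slope > 0) && (predicted >= limit);
    }

    int main(void)
    {
        /* e.g., a temperature creeping upward, still well below 85C */
        double temp[N] = { 61, 62, 62, 63, 64, 64, 65, 66,
                           67, 67, 68, 69, 70, 71, 71, 72 };

        if (trend_warns(temp, 85.0, 30.0))
            printf("warning: metric trending toward limit\n");

        return 0;
    }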
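The "only referenced memory gets checked" problem is what background
scrubbing is for. A minimal sketch, assuming ECC RAM and a controller
that checks on every read; on real hardware the region, stride, pacing,
and where you read back the error counts are all platform-specific.

    /* Illustrative background scrubber: touch every word of a RAM
     * region so the ECC logic gets a chance to detect/correct latent
     * single-bit errors before the application happens to need them.
     * Error counts would normally be read from the memory controller's
     * ECC status registers, not inferred here. */
    #include <stdint.h>
    #include <stddef.h>

    static void scrub_region(volatile const uint64_t *base, size_t words)
    {
        volatile uint64_t sink;

        for (size_t i = 0; i < words; i++) {
            sink = base[i];   /* the read forces an ECC check           */
            (void)sink;       /* value itself is discarded              */
            /* a real scrub task would yield/sleep every few KB so it
             * stays in the noise on the memory bus                     */
        }
    }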
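And for the secondary-storage case, the patrol read amounts to: re-read
the file end-to-end NOW and compare against what it looked like when it
was known-good. This is only a sketch -- the additive checksum, the
source of the expected value, and the scheduling are placeholders; a
real scrubber would use a proper hash and walk the media on a timer.

    /* Illustrative "patrol read" of one file. Returns 0 if the file
     * reads cleanly and matches `expected`, -1 otherwise. */
    #include <stdio.h>
    #include <stdint.h>

    static int patrol_read(const char *path, uint32_t expected)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return -1;                    /* can't even open it: flag it */

        uint32_t sum = 0;
        unsigned char buf[4096];
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            for (size_t i = 0; i < n; i++)
                sum += buf[i];            /* toy checksum; use a real hash */

        int bad = ferror(fp);             /* media/read error during scan? */
        fclose(fp);

        return (!bad && sum == expected) ? 0 : -1;
    }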