From: Martin Brown <'''newspam'''@nonad.co.uk>
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Tue, 16 Apr 2024 11:46:20 +0100

On 16/04/2024 05:14, Edward Rawde wrote:
> "Don Y" wrote in message
> news:uvkqqu$o5co$1@dont-email.me...
>> On 4/15/2024 1:32 PM, Edward Rawde wrote:
>>> "Don Y" wrote in message
>>> news:uvjn74$d54b$1@dont-email.me...
>>>> Is there a general rule of thumb for signalling the likelihood of
>>>> an "imminent" (for some value of "imminent") hardware failure?
>>>
>>> My conclusion would be no.
>>> Some of my reasons are given below.
>>>
>>> It always puzzled me how HAL could know that the AE-35 would fail in the
>>> near future, but maybe HAL had a motive for lying.
>>
>> Why does your PC retry failed disk operations?
>
> Because the software designer didn't understand hardware.
> The correct approach is to mark that part of the disk as unusable and, if
> possible, move any data from it elsewhere quick.
>
>> If I ask the drive to give
>> me LBA 1234, shouldn't it ALWAYS give me LBA 1234? Without any data
>> corruption (CRC error) AND within the normal access time limits defined
>> by the location of those magnetic domains on the rotating medium?
>>
>> Why should it attempt to retry this MORE than once?
>>
>> Now, if you knew your disk drive was repeatedly retrying operations,
>> would your confidence in it be unchanged from times when it did not
>> exhibit such behavior?
>
> I'd have put an SSD in by now, along with an off site backup of the same
> data :)
>
>>
>> Assuming you have properly configured an EIA232 interface, why would you
>> ever get a parity error? (OVERRUN errors can be the result of an i/f
>> that is running too fast for the system on the receiving end) How would
>> you even KNOW this was happening?
>>
>> I suspect everyone who has owned a DVD/CD drive has encountered a
>> "slow tray" as the mechanism aged. Or, a tray that wouldn't
>> open (of its own accord) as soon/quickly as it used to.
>
> If it hasn't been used for some time then I'm ready with a tiny screwdriver
> blade to help it open.
> But I forget when I last used an optical drive.
>
>>
>> The controller COULD be watching this (cuz it knows when it
>> initiated the operation and there is an "end-of-stroke"
>> sensor available) and KNOW that the drive belt was stretching
>> to the point where it was impacting operation.
>>
>> [And, that a stretched belt wasn't going to suddenly decide to
>> unstretch to fix the problem!]
>>
>>> Back in that era I was doing a lot of repair work when I should have been
>>> doing my homework.
>>> So I knew that there were many unrelated kinds of hardware failure.
>>
>> The goal isn't to predict ALL failures but, rather, to anticipate
>> LIKELY failures and treat them before they become an inconvenience
>> (or worse).
>>
>> One morning, the (gas) furnace repeatedly tried to light as the
>> thermostat called for heat. Then, a few moments later, the
>> safeties would kick in and shut down the gas flow. This attracted my
>> attention as the LIT furnace should STAY LIT!
>>
>> The furnace was too stupid to notice its behavior so would repeat
>> this cycle, endlessly.
>>
>> I stepped in and overrode the thermostat to eliminate the call
>> for heat as this behavior couldn't be productive (if something
>> truly IS wrong, then why let it continue? and, if there is nothing
>> wrong with the controls/mechanism, then clearly it is unable to meet
>> my needs so why let it persist in trying?)
>>
>> [Turns out, there was a city-wide gas shortage so there was enough
>> gas available to light the furnace but not enough to bring it up to
>> temperature as quickly as the designers had expected]
>
> That's why the furnace designers couldn't have anticipated it.
> They did not know that such a condition might occur so never tested for it.
>
>>
>>> A component could fail suddenly, such as a short circuit diode, and
>>> everything would work fine after replacing it.
>>> The cause could perhaps have been a manufacturing defect, such as
>>> insufficient cooling due to poor quality assembly, but the exact real
>>> cause would never be known.
>>
>> You don't care about the real cause. Or, even the failure mode.
>> You (as user) just don't want to be inconvenienced by the sudden
>> loss of the functionality/convenience that the device provided.
>
> There will always be sudden unexpected loss of functionality for reasons
> which could not easily be predicted.
> People who service lawn mowers in the area where I live are very busy right
> now.
>
>>
>>> A component could fail suddenly as a side effect of another failure.
>>> One short circuit output transistor and several other components could
>>> also burn up.
>>
>> So, if you could predict the OTHER failure...
>> Or, that such a failure might occur and lead to the followup failure...
>>
>>> A component could fail slowly and only become apparent when it got to the
>>> stage of causing an audible or visible effect.
>>
>> But, likely, there was something observable *in* the circuit that
>> just hadn't made it to the level of human perception.
>
> Yes a power supply ripple detection circuit could have turned on a warning
> LED but that never happened for at least two reasons.
> 1. The detection circuit would have increased the cost of the equipment and
> thus diminished the profit of the manufacturer.
> 2. The user would not have understood and would have ignored the warning
> anyway.
>
>>
>>> It would often be easy to locate the dried up electrolytic due to it
>>> having already let go of some of its contents.
>>>
>>> So I concluded that if I wanted to be sure that I could always watch my
>>> favourite TV show, we would have to have at least two TVs in the house.
>>>
>>> If it's not possible to have the equivalent of two TVs then you will want
>>> to be in a position to get the existing TV repaired or replaced as
>>> quickly as possible.
>>
>> Two TVs are affordable. Consider two controllers for a wire-EDM machine.
>>
>> Or, the cost of having that wire-EDM machine *idle* (because you didn't
>> have a spare controller!)
>>
>>> My home wireless Internet system doesn't care if one access point fails,
>>> and I would not expect to be able to do anything to predict a time of
>>> failure.
>>> Experience says a dead unit has power supply issues. Usually external but
>>> could be internal.
>>
>> Again, the goal isn't to predict "time of failure". But, rather, to be
>> able to know that "this isn't going to end well" -- with some advance
>> notice that allows for preemptive action to be taken (and not TOO much
>> advance notice that the user ends up replacing items prematurely).
>
> Get feedback from the people who use your equipment.
>
>>
>>> I don't think it would be possible to "watch" everything because it's
>>> rare that you can properly test a component while it's part of a working
>>> system.
>>
>> You don't have to -- as long as you can observe its effects on other
>> parts of the system.
>> E.g., there's no easy/inexpensive way to
>> check to see how much the belt on that CD/DVD player has stretched.
>> But, you can notice that it HAS stretched (or, some less likely
>> change has occurred that similarly interferes with the tray's actions)
>> by noting how the activity that it is used for has changed.
>
> Sure but you have to be the operator for that.
> So you can be ready to help the tray open when needed.

========== REMAINDER OF ARTICLE TRUNCATED ==========
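
For what it's worth, the tray-timing idea quoted above (the controller knows when it started the open operation and has an end-of-stroke sensor) can be sketched in a few lines. This is only an illustration under stated assumptions, not anything posted in the thread: the function name tray_is_degrading, the parameters baseline_n, recent_n, k and min_margin_s, and their default values are all hypothetical. It assumes the controller logs the time from the open command to the end-of-stroke sensor firing, treats the earliest readings as the healthy baseline, and flags the drive once the average of the latest readings sits well above that baseline.

```python
from statistics import mean, stdev


def tray_is_degrading(open_times_s, baseline_n=20, recent_n=5,
                      k=3.0, min_margin_s=0.05):
    """Return True if recent tray-open times have drifted above baseline.

    open_times_s: durations in seconds, oldest first, each measured from
    the open command to the end-of-stroke sensor firing (assumed logging).
    All parameter names and defaults here are illustrative assumptions.
    """
    if len(open_times_s) < baseline_n + recent_n:
        return False  # not enough history to judge either way

    baseline = open_times_s[:baseline_n]   # readings from when the unit was new
    recent = open_times_s[-recent_n:]      # the latest few operations

    # Allow k standard deviations of normal scatter, but never less than a
    # fixed floor, so a very quiet baseline does not trigger on tiny noise.
    margin = max(k * stdev(baseline), min_margin_s)
    return mean(recent) > mean(baseline) + margin
```

Averaging the last few operations rather than reacting to a single slow one reflects the point made in the thread: a stretched belt does not unstretch, so a sustained shift matters more than one outlier. The same comparison would apply to any operation with a known start and a detectable end, such as a count of disk retries per thousand reads.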