Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Martin Brown <'''newspam'''@nonad.co.uk>
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Tue, 16 Apr 2024 11:46:20 +0100
Organization: A noiseless patient Spider
Lines: 252
Message-ID: <uvlktt$tad0$1@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me>
 <uvk2sk$1p01$1@nnrp.usenet.blueworldhosting.com>
 <uvkqqu$o5co$1@dont-email.me>
 <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 16 Apr 2024 12:46:21 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="24ee2e361efeca0239e60e430996748a";
	logging-data="960928"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/m8R3spCjCJP7SMARAWu8b6N2xqU/qu2Qq9RS5drCWwQ=="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:xn8qWErodZJ79TsbvPKW7lgyuAc=
In-Reply-To: <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>
Content-Language: en-GB
Bytes: 11754

On 16/04/2024 05:14, Edward Rawde wrote:
> "Don Y" <blockedofcourse@foo.invalid> wrote in message
> news:uvkqqu$o5co$1@dont-email.me...
>> On 4/15/2024 1:32 PM, Edward Rawde wrote:
>>> "Don Y" <blockedofcourse@foo.invalid> wrote in message
>>> news:uvjn74$d54b$1@dont-email.me...
>>>> Is there a general rule of thumb for signalling the likelihood of
>>>> an "imminent" (for some value of "imminent") hardware failure?
>>>
>>> My conclusion would be no.
>>> Some of my reasons are given below.
>>>
>>> It always puzzled me how HAL could know that the AE-35 would fail in the
>>> near future, but maybe HAL had a motive for lying.
>>
>> Why does your PC retry failed disk operations?
> 
> Because the software designer didn't understand hardware.
> The correct approach is to mark that part of the disk as unusable and, if
> possible, move any data from it elsewhere quickly.
> 
>> If I ask the drive to give
>> me LBA 1234, shouldn't it ALWAYS give me LBA 1234?  Without any data
>> corruption (CRC error) AND within the normal access time limits defined
>> by the location of those magnetic domains on the rotating medium?
>>
>> Why should it attempt to retry this MORE than once?
>>
>> Now, if you knew your disk drive was repeatedly retrying operations,
>> would your confidence in it be unchanged from times when it did not
>> exhibit such behavior?
> 
> I'd have put an SSD in by now, along with an off site backup of the same
> data :)
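
The retry question above can be turned into a number. What follows is a minimal sketch, not anything a real drive or OS exposes: the class name RetryRateTracker, the smoothing factor alpha and the warn_rate threshold are all assumptions chosen for illustration. It keeps an exponentially weighted estimate of the fraction of operations that needed at least one retry and reports when that fraction is above a limit.

```python
class RetryRateTracker:
    """Hypothetical helper: tracks how often an operation needed a retry."""

    def __init__(self, alpha=0.1, warn_rate=0.2):
        self.alpha = alpha          # weight given to the newest observation
        self.warn_rate = warn_rate  # assumed threshold, tune per device
        self.rate = 0.0             # smoothed fraction of operations retried

    def record(self, retries):
        """Record one completed operation and how many retries it took."""
        needed_retry = 1.0 if retries > 0 else 0.0
        # Exponentially weighted moving average of the retry indicator.
        self.rate += self.alpha * (needed_retry - self.rate)

    def suspicious(self):
        """True when the smoothed retry rate exceeds the warning threshold."""
        return self.rate > self.warn_rate
```

With these assumed defaults a single isolated retry does not trip the warning, but a run of retries does, and the estimate decays again if the device goes back to behaving.
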
> 
>>
>> Assuming you have properly configured an EIA232 interface, why would you
>> ever get a parity error?  (OVERRUN errors can be the result of an i/f
>> that is running too fast for the system on the receiving end)  How would
>> you even KNOW this was happening?
>>
>> I suspect everyone who has owned a DVD/CD drive has encountered a
>> "slow tray" as the mechanism aged.  Or, a tray that wouldn't
>> open (of its own accord) as soon/quickly as it used to.
> 
> If it hasn't been used for some time then I'm ready with a tiny screwdriver
> blade to help it open.
> But I forget when I last used an optical drive.
> 
>>
>> The controller COULD be watching this (cuz it knows when it
>> initiated the operation and there is an "end-of-stroke"
>> sensor available) and KNOW that the drive belt was stretching
>> to the point where it was impacting operation.
>>
>> [And, that a stretched belt wasn't going to suddenly decide to
>> unstretch to fix the problem!]
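
The tray idea quoted above (time from command to end-of-stroke sensor) can be sketched in a few lines. This is only an illustration under stated assumptions: the class StrokeTimeMonitor, the sample counts and the 1.5x slowdown ratio are invented here, not taken from any real drive firmware. It learns a baseline from the first few strokes and then compares a rolling average of recent strokes against it.

```python
from collections import deque
from statistics import mean


class StrokeTimeMonitor:
    """Hypothetical helper: flags a mechanism that has become slower."""

    def __init__(self, baseline_samples=5, window=5, slowdown_ratio=1.5):
        self.baseline_samples = baseline_samples  # strokes used as "new" reference
        self.slowdown_ratio = slowdown_ratio      # assumed threshold
        self._baseline = []
        self._recent = deque(maxlen=window)       # most recent stroke times

    def record(self, seconds):
        """Record the time from command issued to end-of-stroke sensor."""
        if len(self._baseline) < self.baseline_samples:
            self._baseline.append(seconds)
        else:
            self._recent.append(seconds)

    def degraded(self):
        """True once recent strokes are markedly slower than the baseline."""
        if len(self._recent) < self._recent.maxlen:
            return False  # not enough recent data to judge
        return mean(self._recent) > self.slowdown_ratio * mean(self._baseline)
```

Averaging over a window rather than reacting to one slow stroke is a deliberate choice in this sketch: a single sticky opening after months of disuse should not count the same as a steady drift.
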
>>
>>> Back in that era I was doing a lot of repair work when I should have been
>>> doing my homework.
>>> So I knew that there were many unrelated kinds of hardware failure.
>>
>> The goal isn't to predict ALL failures but, rather, to anticipate
>> LIKELY failures and treat them before they become an inconvenience
>> (or worse).
>>
>> One morning, the (gas) furnace repeatedly tried to light as the
>> thermostat called for heat.  Then, a few moments later, the
>> safeties would kick in and shut down the gas flow.  This attracted my
>> attention as the LIT furnace should STAY LIT!
>>
>> The furnace was too stupid to notice its behavior so would repeat
>> this cycle, endlessly.
>>
>> I stepped in and overrode the thermostat to eliminate the call
>> for heat as this behavior couldn't be productive (if something
>> truly IS wrong, then why let it continue?  and, if there is nothing
>> wrong with the controls/mechanism, then clearly it is unable to meet
>> my needs so why let it persist in trying?)
>>
>> [Turns out, there was a city-wide gas shortage so there was enough
>> gas available to light the furnace but not enough to bring it up to
>> temperature as quickly as the designers had expected]
> 
> That's why the furnace designers couldn't have anticipated it.
> They did not know that such a condition might occur so never tested for it.
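
The furnace in the story did not need to know about gas shortages to notice that it was repeating itself. A rough sketch of that idea, with the name IgnitionSupervisor and the limit of three failed cycles being assumptions made up for this example rather than anything from a real furnace controller:

```python
class IgnitionSupervisor:
    """Hypothetical helper: stops retrying after repeated failed light-ups."""

    def __init__(self, max_failed_cycles=3):
        self.max_failed_cycles = max_failed_cycles  # assumed limit
        self.failed_cycles = 0                      # consecutive failures so far
        self.locked_out = False

    def report_cycle(self, stayed_lit):
        """Record one light-up attempt; return True if another may be tried."""
        if self.locked_out:
            return False
        if stayed_lit:
            self.failed_cycles = 0  # a good cycle clears the count
        else:
            self.failed_cycles += 1
            if self.failed_cycles >= self.max_failed_cycles:
                self.locked_out = True  # stop and wait for a human
        return not self.locked_out
```

The point of the sketch is that the controller reacts to the observed pattern (lit, then shut down, repeatedly) and not to the cause, which it has no way of knowing.
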
> 
>>
>>> A component could fail suddenly, such as a short circuit diode, and
>>> everything would work fine after replacing it.
>>> The cause could perhaps have been a manufacturing defect, such as
>>> insufficient cooling due to poor quality assembly, but the exact real
>>> cause would never be known.
>>
>> You don't care about the real cause.  Or, even the failure mode.
>> You (as user) just don't want to be inconvenienced by the sudden
>> loss of the functionality/convenience that the device provided.
> 
> There will always be sudden unexpected loss of functionality for reasons
> which could not easily be predicted.
> People who service lawn mowers in the area where I live are very busy right
> now.
> 
>>
>>> A component could fail suddenly as a side effect of another failure.
>>> One short circuit output transistor and several other components could
>>> also burn up.
>>
>> So, if you could predict the OTHER failure...
>> Or, that such a failure might occur and lead to the followup failure...
>>
>>> A component could fail slowly and only become apparent when it got to the
>>> stage of causing an audible or visible effect.
>>
>> But, likely, there was something observable *in* the circuit that
>> just hadn't made it to the level of human perception.
> 
> Yes a power supply ripple detection circuit could have turned on a warning
> LED but that never happened for at least two reasons.
> 1. The detection circuit would have increased the cost of the equipment and
> thus diminished the profit of the manufacturer.
> 2. The user would not have understood and would have ignored the warning
> anyway.
> 
>>
>>> It would often be easy to locate the dried up electrolytic due to it
>>> having already let go of some of its contents.
>>>
>>> So I concluded that if I wanted to be sure that I could always watch my
>>> favourite TV show, we would have to have at least two TVs in the house.
>>>
>>> If it's not possible to have the equivalent of two TVs then you will want
>>> to be in a position to get the existing TV repaired or replaced as quickly
>>> as possible.
>>
>> Two TVs are affordable.  Consider two controllers for a wire-EDM machine.
>>
>> Or, the cost of having that wire-EDM machine *idle* (because you didn't
>> have a spare controller!)
>>
>>> My home wireless Internet system doesn't care if one access point fails,
>>> and I would not expect to be able to do anything to predict a time of
>>> failure.
>>> Experience says a dead unit has power supply issues. Usually external but
>>> could be internal.
>>
>> Again, the goal isn't to predict "time of failure".  But, rather, to be
>> able to know that "this isn't going to end well" -- with some advance
>> notice that allows for preemptive action to be taken (and not TOO much
>> advance notice that the user ends up replacing items prematurely).
> 
> Get feedback from the people who use your equipment.
> 
>>
>>> I don't think it would be possible to "watch" everything because it's
>>> rare that you can properly test a component while it's part of a working
>>> system.
>>
>> You don't have to -- as long as you can observe its effects on other
>> parts of the system.  E.g., there's no easy/inexpensive way to
>> check to see how much the belt on that CD/DVD player has stretched.
>> But, you can notice that it HAS stretched (or, some less likely
>> change has occurred that similarly interferes with the tray's actions)
>> by noting how the activity that it is used for has changed.
> 
> Sure but you have to be the operator for that.
> So you can be ready to help the tray open when needed.
========== REMAINDER OF ARTICLE TRUNCATED ==========