Article <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>

Path: ...!news.misty.com!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!nnrp.usenet.blueworldhosting.com!.POSTED!not-for-mail
From: "Edward Rawde" <invalid@invalid.invalid>
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Tue, 16 Apr 2024 00:14:13 -0400
Organization: BWH Usenet Archive (https://usenet.blueworldhosting.com)
Lines: 247
Message-ID: <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>
References: <uvjn74$d54b$1@dont-email.me> <uvk2sk$1p01$1@nnrp.usenet.blueworldhosting.com> <uvkqqu$o5co$1@dont-email.me>
Injection-Date: Tue, 16 Apr 2024 04:14:15 -0000 (UTC)
Injection-Info: nnrp.usenet.blueworldhosting.com;
	logging-data="86643"; mail-complaints-to="usenet@blueworldhosting.com"
Cancel-Lock: sha1:DI85A2uKP7OK2Ks//uDXOgXGqJ8= sha256:8IOnhv+PUTACyWcZgrk0/8VTn7hPfizXaroOnBYb80w=
	sha1:4yw2yxZVv7xNAqRYzPJpvEko3Yo= sha256:9aw9cdGCz+cBdfhqJgDfGOW+XsVk140nmisHWvWCK+Q=
X-Newsreader: Microsoft Outlook Express 6.00.2900.5931
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
X-MSMail-Priority: Normal
X-Priority: 3
X-RFC2646: Format=Flowed; Response
Bytes: 11471

"Don Y" <blockedofcourse@foo.invalid> wrote in message 
news:uvkqqu$o5co$1@dont-email.me...
> On 4/15/2024 1:32 PM, Edward Rawde wrote:
>> "Don Y" <blockedofcourse@foo.invalid> wrote in message
>> news:uvjn74$d54b$1@dont-email.me...
>>> Is there a general rule of thumb for signalling the likelihood of
>>> an "imminent" (for some value of "imminent") hardware failure?
>>
>> My conclusion would be no.
>> Some of my reasons are given below.
>>
>> It always puzzled me how HAL could know that the AE-35 would fail in the
>> near future, but maybe HAL had a motive for lying.
>
> Why does your PC retry failed disk operations?

Because the software designer didn't understand hardware.
The correct approach is to mark that part of the disk as unusable and, if 
possible, move any data from it elsewhere quickly.

> If I ask the drive to give
> me LBA 1234, shouldn't it ALWAYS give me LBA 1234?  Without any data
> corruption (CRC error) AND within the normal access time limits defined
> by the location of those magnetic domains on the rotating medium?
>
> Why should it attempt to retry this MORE than once?
>
> Now, if you knew your disk drive was repeatedly retrying operations,
> would your confidence in it be unchanged from times when it did not
> exhibit such behavior?

I'd have put an SSD in by now, along with an off site backup of the same 
data :)
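
For anyone who would rather watch the retries than swap the drive, here is a minimal sketch in Python of the kind of bookkeeping Don is describing. The class name, the window size and the warning threshold are all assumptions made up for illustration; it does not reflect how any real drive, driver or OS reports retries.

```python
from collections import deque


class RetryMonitor:
    """Track the fraction of recent reads that needed at least one retry."""

    def __init__(self, window=1000, warn_fraction=0.01):
        # window and warn_fraction are illustrative guesses, not drive specs
        self.recent = deque(maxlen=window)
        self.warn_fraction = warn_fraction

    def record(self, retries):
        """Log one completed read; retries is how many extra attempts it took."""
        self.recent.append(1 if retries > 0 else 0)

    def retry_fraction(self):
        """Fraction of reads in the window that needed a retry."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def suspicious(self):
        # Only judge once the window is full, to avoid alarms on tiny samples
        return (len(self.recent) == self.recent.maxlen
                and self.retry_fraction() > self.warn_fraction)
```

The point of the sliding window is that a single retried read years ago should not count against the drive forever, while a rising retry rate over recent reads should.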

>
> Assuming you have properly configured an EIA232 interface, why would you
> ever get a parity error?  (OVERRUN errors can be the result of an i/f
> that is running too fast for the system on the receiving end)  How would
> you even KNOW this was happening?
>
> I suspect everyone who has owned a DVD/CD drive has encountered a
> "slow tray" as the mechanism aged.  Or, a tray that wouldn't
> open (of its own accord) as soon/quickly as it used to.

If it hasn't been used for some time then I'm ready with a tiny screwdriver 
blade to help it open.
But I forget when I last used an optical drive.

>
> The controller COULD be watching this (cuz it knows when it
> initiated the operation and there is an "end-of-stroke"
> sensor available) and KNOW that the drive belt was stretching
> to the point where it was impacting operation.
>
> [And, that a stretched belt wasn't going to suddenly decide to
> unstretch to fix the problem!]
>
>> Back in that era I was doing a lot of repair work when I should have been
>> doing my homework.
>> So I knew that there were many unrelated kinds of hardware failure.
>
> The goal isn't to predict ALL failures but, rather, to anticipate
> LIKELY failures and treat them before they become an inconvenience
> (or worse).
>
> One morning, the (gas) furnace repeatedly tried to light as the
> thermostat called for heat.  Then, a few moments later, the
> safeties would kick in and shut down the gas flow.  This attracted my
> attention as the LIT furnace should STAY LIT!
>
> The furnace was too stupid to notice its behavior so would repeat
> this cycle, endlessly.
>
> I stepped in and overrode the thermostat to eliminate the call
> for heat as this behavior couldn't be productive (if something
> truly IS wrong, then why let it continue?  and, if there is nothing
> wrong with the controls/mechanism, then clearly it is unable to meet
> my needs so why let it persist in trying?)
>
> [Turns out, there was a city-wide gas shortage so there was enough
> gas available to light the furnace but not enough to bring it up to
> temperature as quickly as the designers had expected]

That's why the furnace designers couldn't have anticipated it.
They did not know that such a condition might occur so never tested for it.
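
That said, noticing "lit, then shut down, then lit again" does not require knowing the cause. A minimal sketch in Python of a short-cycle counter follows. The class name, the failure limit and the time window are hypothetical values chosen for illustration, not taken from any furnace controller.

```python
class ShortCycleDetector:
    """Flag a lockout after too many ignitions that fail to stay lit."""

    def __init__(self, max_failures=3, window_s=1800.0):
        # Both limits are made-up illustrative values
        self.max_failures = max_failures
        self.window_s = window_s
        self.failure_times = []

    def record_failed_ignition(self, now_s):
        """Call when the safeties shut off the gas shortly after lighting."""
        self.failure_times.append(now_s)
        # Forget failures that are older than the window
        self.failure_times = [t for t in self.failure_times
                              if now_s - t <= self.window_s]

    def locked_out(self):
        """True once enough failures have piled up inside the window."""
        return len(self.failure_times) >= self.max_failures
```

A controller using something like this would stop calling for ignition once `locked_out()` is true and raise a fault instead of cycling endlessly.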

>
>> A component could fail suddenly, such as a short circuit diode, and
>> everything would work fine after replacing it.
>> The cause could perhaps have been a manufacturing defect, such as
>> insufficient cooling due to poor quality assembly, but the exact real
>> cause would never be known.
>
> You don't care about the real cause.  Or, even the failure mode.
> You (as user) just don't want to be inconvenienced by the sudden
> loss of the functionality/convenience that the device provided.

There will always be sudden unexpected loss of functionality for reasons 
which could not easily be predicted.
People who service lawn mowers in the area where I live are very busy right 
now.

>
>> A component could fail suddenly as a side effect of another failure.
>> One short circuit output transistor and several other components could
>> also burn up.
>
> So, if you could predict the OTHER failure...
> Or, that such a failure might occur and lead to the followup failure...
>
>> A component could fail slowly and only become apparent when it got to the
>> stage of causing an audible or visible effect.
>
> But, likely, there was something observable *in* the circuit that
> just hadn't made it to the level of human perception.

Yes, a power supply ripple detection circuit could have turned on a warning 
LED, but that never happened for at least two reasons.
1. The detection circuit would have increased the cost of the equipment and 
thus diminished the profit of the manufacturer.
2. The user would not have understood and would have ignored the warning 
anyway.
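
Where a system already digitises its supply rail, the same check costs almost nothing in software. A minimal sketch in Python, assuming a list of sampled rail voltages is available; the function names and the 0.1 V limit are arbitrary choices for illustration, not a specification.

```python
def ripple_peak_to_peak(samples):
    """Peak-to-peak ripple of a list of rail voltage samples, in volts."""
    return max(samples) - min(samples)


def ripple_warning(samples, limit_v=0.1):
    """True if the ripple exceeds limit_v (an arbitrary illustrative limit)."""
    return ripple_peak_to_peak(samples) > limit_v
```

Whether anyone would act on the resulting warning is, of course, the second problem above.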

>
>> It would often be easy to locate the dried up electrolytic due to it
>> having already let go of some of its contents.
>>
>> So I concluded that if I wanted to be sure that I could always watch my
>> favourite TV show, we would have to have at least two TVs in the house.
>>
>> If it's not possible to have the equivalent of two TVs then you will want
>> to be in a position to get the existing TV repaired or replaced as quickly
>> as possible.
>
> Two TVs are affordable.  Consider two controllers for a wire-EDM machine.
>
> Or, the cost of having that wire-EDM machine *idle* (because you didn't
> have a spare controller!)
>
>> My home wireless Internet system doesn't care if one access point fails,
>> and I would not expect to be able to do anything to predict a time of
>> failure.
>> Experience says a dead unit has power supply issues. Usually external but
>> could be internal.
>
> Again, the goal isn't to predict "time of failure".  But, rather, to be
> able to know that "this isn't going to end well" -- with some advance
> notice that allows for preemptive action to be taken (and not TOO much
> advance notice that the user ends up replacing items prematurely).

Get feedback from the people who use your equipment.

>
>> I don't think it would be possible to "watch" everything because it's
>> rare that you can properly test a component while it's part of a working
>> system.
>
> You don't have to -- as long as you can observe its effects on other
> parts of the system.  E.g., there's no easy/inexpensive way to
> check to see how much the belt on that CD/DVD player has stretched.
> But, you can notice that it HAS stretched (or, some less likely
> change has occurred that similarly interferes with the tray's actions)
> by noting how the activity that it is used for has changed.

Sure, but you have to be the operator for that.
So you can be ready to help the tray open when needed.
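
For comparison, the controller-side version of the tray idea might look like the following minimal Python sketch. It assumes the controller logs the time from issuing "open" to the end-of-stroke sensor tripping; the function name, the sample counts and the 1.5x ratio are hypothetical values picked for illustration.

```python
from statistics import mean


def tray_slowing(open_times_s, baseline_n=5, recent_n=5, ratio=1.5):
    """Return True if recent tray-open times are well above the early baseline.

    open_times_s is a chronological list of seconds from the open command
    to the end-of-stroke sensor. All thresholds are illustrative guesses.
    """
    if len(open_times_s) < baseline_n + recent_n:
        return False  # not enough history to compare
    baseline = mean(open_times_s[:baseline_n])
    recent = mean(open_times_s[-recent_n:])
    return recent > ratio * baseline
```

Comparing averages of the first and last few operations is only one crude way to spot the drift; it says nothing about whether the cause is the belt or something else.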

>
>> These days I would expect to have fun with management asking for software 
>> to
========== REMAINDER OF ARTICLE TRUNCATED ==========