Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Don Y <blockedofcourse@foo.invalid>
Newsgroups: sci.electronics.design
Subject: Re: Predictive failures
Date: Mon, 15 Apr 2024 22:40:29 -0700
Organization: A noiseless patient Spider
Lines: 253
Message-ID: <uvl30j$phap$3@dont-email.me>
References: <uvjn74$d54b$1@dont-email.me>
 <uvk2sk$1p01$1@nnrp.usenet.blueworldhosting.com>
 <uvkqqu$o5co$1@dont-email.me>
 <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 16 Apr 2024 07:40:38 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="8f873457a009428ae193cacdeebfb978";
	logging-data="836953"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX180lebDKiCccgAVAVlm/yQ1"
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.2.2
Cancel-Lock: sha1:StYHbMsZPCFz9UApf34JIU2gVZ4=
Content-Language: en-US
In-Reply-To: <uvktun$2kjj$1@nnrp.usenet.blueworldhosting.com>
Bytes: 11832

On 4/15/2024 9:14 PM, Edward Rawde wrote:
>>> It always puzzled me how HAL could know that the AE-35 would fail in the
>>> near future, but maybe HAL had a motive for lying.
>>
>> Why does your PC retry failed disk operations?
> 
> Because the software designer didn't understand hardware.

Actually, he DID understand the hardware, which is why he retried
it instead of ASSUMING every operation would proceed correctly.

[Why bother testing the result code if you never expect a failure?]

> The correct approach is to mark that part of the disk as unusable and, if
> possible, move any data from it elsewhere quickly.

That only makes sense if the error is *persistent*.  "Shit
happens" and you can get an occasional failed operation when
nothing is truly "broken".

(how do you know the HBA isn't the culprit?)
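
A minimal sketch of that distinction, in Python.  Every name here
(`read_with_retry`, `MAX_ATTEMPTS`, `SUSPECT_THRESHOLD`, the caller-supplied
`read_block`) is hypothetical, and the thresholds are assumptions rather
than values from any real drive or driver.  The idea: retry a bounded
number of times, count every failure, and flag a block only when the
failures persist.

```python
from collections import Counter

MAX_ATTEMPTS = 3        # assumed retry budget for a single operation
SUSPECT_THRESHOLD = 5   # assumed lifetime failure count before flagging

failure_counts = Counter()  # lba -> failed attempts observed so far
suspect_blocks = set()      # blocks whose errors look persistent


def read_with_retry(read_block, lba):
    """Call read_block(lba), retrying on OSError up to MAX_ATTEMPTS times.

    read_block is a hypothetical caller-supplied function that returns
    the data or raises OSError.
    """
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return read_block(lba)
        except OSError as err:
            last_error = err
            failure_counts[lba] += 1
            # Occasional failures are tolerated; an accumulating count is not.
            if failure_counts[lba] >= SUSPECT_THRESHOLD:
                suspect_blocks.add(lba)
    # Every attempt failed: treat the error as persistent for this block.
    suspect_blocks.add(lba)
    raise OSError(f"LBA {lba} failed {MAX_ATTEMPTS} attempts") from last_error
```

A single glitch leaves the block in service but on record; note that
nothing in this sketch can tell a bad block from a bad HBA, which would
show up as failures spread across many unrelated blocks.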

>> If I ask the drive to give me LBA 1234, shouldn't it ALWAYS give me
>> LBA 1234?  Without any data corruption (CRC error) AND within the
>> normal access time limits defined by the location of those magnetic
>> domains on the rotating medium?
>>
>> Why should it attempt to retry this MORE than once?
>>
>> Now, if you knew your disk drive was repeatedly retrying operations,
>> would your confidence in it be unchanged from times when it did not
>> exhibit such behavior?
> 
> I'd have put an SSD in by now, along with an off site backup of the same
> data :)

So, any problems you have with your SSD, today, should be solved by using the
technology that will be invented 10 years hence!  Ah, that's a sound strategy!

>> Assuming you have properly configured an EIA232 interface, why would you
>> ever get a parity error?  (OVERRUN errors can be the result of an i/f
>> that is running too fast for the system on the receiving end)  How would
>> you even KNOW this was happening?
>>
>> I suspect everyone who has owned a DVD/CD drive has encountered a
>> "slow tray" as the mechanism aged.  Or, a tray that wouldn't
>> open (of its own accord) as soon/quickly as it used to.
> 
> If it hasn't been used for some time then I'm ready with a tiny screwdriver
> blade to help it open.

Why don't they ship such drives with tiny screwdrivers to make it
easier for EVERY customer to address this problem?

> But I forget when I last used an optical drive.

When the firmware in your SSD corrupts your data, what remedy will
you use?

You're missing the forest for the trees.

>> [Turns out, there was a city-wide gas shortage so there was enough
>> gas available to light the furnace but not enough to bring it up to
>> temperature as quickly as the designers had expected]
> 
> That's why the furnace designers couldn't have anticipated it.

Really?  You can't anticipate the "gas shutoff" not being in the ON
position?  (which would yield the same endless retry cycle)

> They did not know that such a condition might occur so never tested for it.

If they planned on ENDLESSLY retrying, then they must have imagined
some condition COULD occur that would lead to such an outcome.
Else, why not just retry *once* and then give up?  Or, not
retry at all?
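
For contrast, a bounded version of that retry loop might look like the
following sketch.  `run_ignition` and `try_ignite` are hypothetical names,
and the default of three attempts is an assumption, not anything taken
from a real furnace controller.

```python
def run_ignition(try_ignite, max_attempts=3):
    """Attempt ignition up to max_attempts times, then lock out.

    try_ignite is a hypothetical callable that returns True when the
    furnace comes up to temperature within its allotted time.
    Returns a (state, attempts_used) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        if try_ignite():
            return ("RUNNING", attempt)
    # Give up and report instead of cycling forever.
    return ("LOCKOUT", max_attempts)
```

The attempt count is returned deliberately: a furnace that routinely needs
two or three tries is telling you something, even on the days it lights.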

>>> A component could fail suddenly, such as a short circuit diode, and
>>> everything would work fine after replacing it.
>>> The cause could perhaps have been a manufacturing defect, such as
>>> insufficient cooling due to poor quality assembly, but the exact real
>>> cause would never be known.
>>
>> You don't care about the real cause.  Or, even the failure mode.
>> You (as user) just don't want to be inconvenienced by the sudden
>> loss of the functionality/convenience that the device provided.
> 
> There will always be sudden unexpected loss of functionality for reasons
> which could not easily be predicted.

And if they CAN'T be predicted, then they aren't germane to this
discussion, eh?

My concern is for the set of failure modes that can realistically
be anticipated.

I *know* the inverters in my monitors are going to fail.  It
would be nice if I knew before I was actively using one when
it went dark!

[But, most users would only use this indication to tell them
to purchase another monitor; "You have been warned!"]
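
One way to get that sort of warning, sketched under assumptions: pick some
parameter you can actually measure (tray-open time, in the optical drive
example), smooth it, and complain when it drifts too far from what it was
when the unit was healthy.  `DriftMonitor`, the 25% margin and the
smoothing factor are all hypothetical choices made for illustration.

```python
class DriftMonitor:
    """Warn when a smoothed measurement drifts above its healthy baseline."""

    def __init__(self, baseline, warn_ratio=1.25, alpha=0.2):
        self.baseline = baseline      # value measured when the unit was healthy
        self.warn_ratio = warn_ratio  # assumed margin: warn at 25% over baseline
        self.alpha = alpha            # smoothing factor, 0 < alpha <= 1
        self.smoothed = baseline      # exponentially weighted moving average

    def update(self, sample):
        """Fold in one sample; return True if a warning is due."""
        self.smoothed += self.alpha * (sample - self.smoothed)
        return self.smoothed > self.baseline * self.warn_ratio
```

The smoothing is what keeps one odd reading from raising the alarm; the
margin is what trades off "too late to be useful" against "replaced
prematurely".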

> People who service lawn mowers in the area where I live are very busy right
> now.
> 
>>> A component could fail suddenly as a side effect of another failure.
>>> One short circuit output transistor and several other components could
>>> also burn up.
>>
>> So, if you could predict the OTHER failure...
>> Or, that such a failure might occur and lead to the followup failure...
>>
>>> A component could fail slowly and only become apparent when it got to the
>>> stage of causing an audible or visible effect.
>>
>> But, likely, there was something observable *in* the circuit that
>> just hadn't made it to the level of human perception.
> 
> Yes a power supply ripple detection circuit could have turned on a warning
> LED but that never happened for at least two reasons.
> 1. The detection circuit would have increased the cost of the equipment and
> thus diminished the profit of the manufacturer.

That would depend on the market, right?  Most of my computers have redundant
"smart" (i.e., internal monitoring and reporting) power supplies.  Because
they were marketed to folks who wanted that sort of reliability.  Because
a manufacturer who didn't provide that level of AVAILABILITY would quickly
lose market share.  The cost of the added components and "handling" is
small compared to the cost of lost opportunity (sales).

> 2. The user would not have understood and would have ignored the warning
> anyway.

That makes assumptions about the market AND the user.

If one of my machines signals a fault, I look to see what it is complaining
about:  is it a power supply failure (in which case, I'm now reliant on
a single power supply)?  is it a memory failure (in which case, a bank
of memory may have been disabled which means the machine will thrash
more and throughput will drop)?  is it a link aggregation error (and
network traffic will suffer)?

If I can't understand these errors, then I either don't buy a product
with that level of reliability *or* have someone on hand who CAN
understand the errors and provide remedies/advice.

Consumers will replace a PC because of malware, trashed registry,
creeping cruft, etc.  That's a problem with the consumer buying the
"wrong" sort of computing equipment for his likely method of use.
(buy a Mac?)

>>> My home wireless Internet system doesn't care if one access point fails,
>>> and I would not expect to be able to do anything to predict a time of
>>> failure.
>>> Experience says a dead unit has power supply issues. Usually external but
>>> could be internal.
>>
>> Again, the goal isn't to predict "time of failure".  But, rather, to be
>> able to know that "this isn't going to end well" -- with some advance
>> notice that allows for preemptive action to be taken (and not TOO much
>> advance notice that the user ends up replacing items prematurely).
> 
> Get feedback from the people who use your equipment.

Users often don't understand when a device is malfunctioning.
Or, how to report the conditions and symptoms in a meaningful way.

I recall a woman I worked with ~45 years ago sitting, patiently,
========== REMAINDER OF ARTICLE TRUNCATED ==========