Article <2024Sep13.082347@mips.complang.tuwien.ac.at>

Path: ...!2.eu.feeder.erje.net!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: big, fast, etc, was is Vax addressing sane today
Date: Fri, 13 Sep 2024 06:23:47 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 91
Message-ID: <2024Sep13.082347@mips.complang.tuwien.ac.at>
References: <vbd6b9$g147$1@dont-email.me> <2024Sep11.113204@mips.complang.tuwien.ac.at> <vbsh3q$3n09p$1@dont-email.me> <vbtqib$2sce$2@dont-email.me> <vbvhs3$2std$1@gal.iecc.com>
Injection-Date: Fri, 13 Sep 2024 10:13:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="da068ec287f3285dee7d4af894aeb142";
	logging-data="821640"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX182fF+jekf+BxqGszcleFTf"
Cancel-Lock: sha1:6iFQGHienWvcNjTNssGOfT1roEM=
X-newsreader: xrn 10.11
Bytes: 5904

John Levine <johnl@taugh.com> writes:
>According to Lawrence D'Oliveiro  <ldo@nz.invalid>:
>>You solve that by having multiple units of the cheap machines to achieve 
>>the same level of redundancy, or even more. That ends up being more cost-
>>effective than the mainframe.
>
>That's fine for workloads that work that way.
>
>Airline reservation systems historically ran on mainframes because
>when they were invented that's all there was (original SABRE ran on
>two 7090s) and they are business critical so they need to be very
>reliable.
>
>About 30 years ago some guys at MIT realized that route and fare
>search, which are some of the most demanding things that CRS do, are
>easy to parallelize and don't have to be particularly reliable -- if
>your search system crashes and restarts and reruns the search and the
>result is a couple of seconds late, that's OK. So they started ITA
>software which used racks of PC servers running parallel applications
>written in Lisp (they were from MIT) and blew away the competition.
>
>However, that's just the search part. Actually booking the seats and
>selling tickets stays on a mainframe or an Oracle system because
>double booking or giving away free tickets would be really bad.

Booking flights or seats can easily be distributed: each flight is
assigned to one computer.  To avoid double booking or free tickets
even in case of a computer crash, you use the usual transaction
processing approach, and report completion of the booking only when
the booking has reached persistent memory.  For persistent memory you
use SSDs with power-loss protection.
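
A minimal sketch of that approach, under stated assumptions: the
names owner_of and FlightLedger, the one-seat-per-line log format and
the CRC-based flight-to-machine mapping are hypothetical illustrations,
not taken from any real reservation system.  The point it shows is the
ordering: the booking is appended to a log and forced to storage with
fsync before success is reported, and a restart replays the log.  A
real system would also need record framing or checksums to cope with a
partially written last record.

```python
import os
import zlib


def owner_of(flight_id: str, n_machines: int) -> int:
    """Map a flight to exactly one machine (hypothetical scheme)."""
    return zlib.crc32(flight_id.encode()) % n_machines


class FlightLedger:
    """Seat bookings for one flight, backed by an append-only log."""

    def __init__(self, log_path: str, capacity: int):
        self.capacity = capacity
        self.booked = set()
        # Recovery: replay every booking that reached the log earlier.
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    seat = line.strip()
                    if seat:
                        self.booked.add(seat)
        self.log = open(log_path, "a")

    def book(self, seat: str) -> bool:
        """Return True only after the booking is on persistent storage."""
        if seat in self.booked or len(self.booked) >= self.capacity:
            return False  # refuse double booking and overselling
        self.log.write(seat + "\n")
        self.log.flush()              # push Python's buffer to the OS
        os.fsync(self.log.fileno())   # ask the OS to push it to the device
        self.booked.add(seat)
        return True

    def close(self) -> None:
        self.log.close()
```

Whether fsync really means "persistent" depends on the device honouring
the flush, which is where the power-loss protection of the SSD comes in.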

These SSDs, ECC RAM, RAID-1, redundant power supplies and UPSs protect
against most hardware failures, but availability is still a concern
(e.g., motherboard or CPU failure; that normally does not affect data
integrity if the other measures are taken, but it affects
availability).  To increase availability, you can use, e.g., DRBD
(distributed replicated block device) to get the data on multiple
machines.

Concerning "really bad": Airlines overbook their flights as a matter of
policy to increase their revenue.  If they had a booking system that
double-booked, say, 1ppm of all bookings, they probably would not even
notice, and would deal with it in the same way they deal with the
cases when the overbooking actually results in too many passengers
arriving for the flight.  Likewise, free tickets are not an issue if
they occur rarely enough.  Do they want to spend $1M on a
mainframe to avoid a revenue loss of $100k?  But in any case, that's
not the problem with cheap hardware.

The problems are: When the persistent storage fails, you lose all
transactions since the latest backup.  To avoid that, you can use
RAID-1, redundant distributed storage like DRBD, or a redundant
distributed transaction system.  And if you want more availability
than a single system with RAID-1 (with a spare system standing by)
provides, you have to go for one of the redundant distributed
approaches.

However, my impression from booking flights online is that reliability
of the booking platform is not at all a concern for the airlines.  And
as a customer, I find little difference between the booking front-end
erroring out or the transaction back-end being unavailable.

>There's also a rule of thumb about databases that says one system of
>performance 100 is much better than 100 systems of performance 1
>because those 100 systems will spend all their time contending for
>database locks.

If you handle each flight on one system, the contention for locks is
only within that one system.  And I expect that there is not that much
contention.  How many people book the same flight within the same
millisecond (or however long the lock is held)?
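
A back-of-the-envelope sketch of that question.  The numbers are
hypothetical (400 bookings for one flight squeezed into a single hour,
lock held for 1ms), and it assumes bookings arrive independently
(Poisson arrivals), in which case the chance that a booking finds the
per-flight lock already held is roughly the fraction of time the lock
is held.

```python
def lock_busy_fraction(bookings_per_second: float,
                       hold_seconds: float) -> float:
    """Fraction of time the per-flight lock is held (rate * hold time)."""
    return bookings_per_second * hold_seconds


# Hypothetical load: 400 bookings for one flight within one hour.
rate = 400 / 3600
busy = lock_busy_fraction(rate, hold_seconds=0.001)
print(f"chance a booking has to wait for the lock: {busy:.4%}")
```

Even under this deliberately bursty assumption, only about one booking
in 9000 would have to wait for the lock.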

Concerning performance 100 vs. performance 1, about what systems are
you thinking?  z17 will have 32*8=256 cores (of unknown performance
that is likely to be disappointing, or IBM would not disallow
publishing benchmark results), compared to similar numbers of cores on
servers with AMD or Intel CPUs, or 16-24 cores on systems based on
desktop chips (with Intel you pay a heavy premium these days if you
want ECC memory, however).

Interestingly, with the increasing number of cores per socket in recent
years, the number of sockets is going down.  E.g., the successor to
the HPE Superdome Flex with up to 32 sockets (up to 32*28=896 cores)
is the HPE Compute Scale-Up Server 3200 with up to 16 sockets
(16*60=960 cores).  Either there is little demand for single systems
with more cores, or there are technical difficulties (probably both).

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>