Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Lynn Wheeler <lynn@garlic.com>
Newsgroups: comp.arch
Subject: Re: Architectural implications of locate mode I/O
Date: Tue, 02 Jul 2024 17:36:50 -1000
Organization: Wheeler&Wheeler
Lines: 110
Message-ID: <875xtn87u5.fsf@localhost>
References: <v61jeh$k6d$1@gal.iecc.com> <v61oc8$1pf3p$1@dont-email.me> <HYZgO.719$xA%e.597@fx17.iad> <8bfe4d34bae396114050ad1000f4f31c@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain
User-Agent: Gnus/5.13 (Gnus v5.13)

mitchalsup@aol.com (MitchAlsup1) writes:
> Once you recognize that I/O is eating up your precious CPU, and you
> get to the point you are willing to expend another fixed programmed
> device to make the I/O burden manageable, then you basically have
> CDC 6600 Peripheral Processors, programmed in code or microcode.

The QSAM library does serialization for the application ... GET/PUT calls do "wait" operations inside the library for I/O completion. The BSAM library has the application performing serialization with its own "wait" operations for READ/WRITE calls (the application handling the overlap of processing with I/O). Recent IBM articles mention that the QSAM default multiple buffering was established years ago as "five" ... but current recommendations are more like 150 (for QSAM to keep processing highly overlapped with I/O). Note that while they differentiate between application buffers and "system" buffers (for move & locate mode), the QSAM ("system") buffers are part of the application address space but are managed by the QSAM library code.
Both the QSAM & BSAM libraries build the (application) channel programs ... and since the OS/360 move to virtual memory for all 370s, those channel programs contain (application address space) virtual addresses. When the library code passes the channel program to EXCP/SVC0, a copy of the passed channel program is made, replacing the virtual addresses in the CCWs with real addresses. QSAM GET can return an address within its own buffers, the ones involved in the actual I/O ("locate" mode), or copy data from its buffers to the application's buffers ("move" mode). The references on the web all seem to speak of "system" and "application" buffers, but I think it would be more appropriate to call them "library" and "application" buffers.

The 370/158 had "integrated channels" ... the 158 engine ran both the 370 instruction-set microcode and the integrated-channel microcode. When Future System imploded, there was a mad rush to get stuff back into the 370 product pipelines, including kicking off the quick&dirty 303x & 3081 efforts in parallel. For the 303x they created "external channels" by taking a 158 engine with just the integrated-channel microcode (and no 370 microcode) as the 303x "channel director". A 3031 was two 158 engines, one with just the 370 microcode and a second with just the integrated-channel microcode. A 3032 was a 168-3 remapped to use channel directors for external channels. A 3033 started out as 168-3 logic remapped to 20% faster chips.

Jan1979, I had lots of use of an early engineering 4341 and was con'ed into doing a (CDC 6600) benchmark for a national lab that was looking at 70 4341s for a compute farm (sort of the leading edge of the coming cluster-supercomputing tsunami). The benchmark was Fortran compute doing no I/O, executed with nothing else running:

4341: 36.21secs, 3031: 37.03secs, 158: 45.64secs

Now, integrated channel microcode ... the 158, even with no I/O running, was still 45.64secs, compared with the same hardware in a 3031 but without the channel microcode: 37.03secs. I had a channel efficiency benchmark ...
basically, how fast can the channel handle each channel command word (CCW) in a channel program (the channel architecture required that CCWs be fetched, decoded, and executed purely sequentially/synchronously). The test was a channel program with two ("chained") disk read CCWs for two consecutive records. Then add a CCW between the two disk read CCWs ... which results in a complete extra revolution before the 2nd data record can be read (because of the latency, while the disk keeps spinning, in handling the extra CCW separating the two read CCWs). Then reformat the disk to add a dummy record between each pair of data records, gradually increasing the size of the dummy record until the two data records can be read in a single revolution.

The dummy-record size required for single-revolution reading of the two records was largest for the 158 integrated channel as well as for all the 303x channel directors. The original 168 external channels could do it with the smallest possible dummy record (but a 168 with a channel director, aka the 3032, couldn't, nor could the 3033) ... the 4341 integrated-channel microcode could also do it with the smallest possible dummy record. The 3081's CCW processing latency was more like the 158 integrated channel (and the 303x channel directors).

Second half of the 80s, I was a member of Chesson's XTP TAB ... and found a comparison showing that a typical UNIX TCP/IP implementation of the time took on the order of 5k instructions and five buffer copies, while the comparable mainframe protocol in VTAM took 160k instructions and 15 buffer copies (with the larger buffers on high-end cache machines, the cache misses for the 15 buffer copies could exceed the processor cycles for the 160k instructions). XTP was working on no buffer copies and streaming I/O ... attempting to process TCP as close as possible to no-buffer-copy disk I/O: scatter/gather I/O for separate header and data ... and also a move from a header-CRC protocol .... to a trailer-CRC protocol ...
instead of software prescanning the buffer to calculate the CRC (for placing in the header) .... the data is processed outboard as it streams through, computing the CRC and then appending it to the end of the record.

When doing IBM's HA/CMP and working with the major RDBMS vendors on cluster scale-up in the late 80s/early 90s, there were lots of references to POSIX light-weight threads and asynchronous I/O for RDBMS (with no buffer copies) and the RDBMS managing a large record cache.

--
virtualization experience starting Jan1968, online at home since Mar1970