From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 23:11:34 +0000
Message-ID: <4f63a339527a85e67bcd85c6f5388bfa@www.novabbs.org>
References: <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <20240327012715.0000125c@yahoo.com>

Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>BGB wrote:
>>
>>> On 3/26/2024 5:27 PM, Michael S wrote:
>>>>
>>>>
>>>> For slightly less than 20 years ARM managed OK without integer divide.
>>>> Then in 2004 they added integer divide instruction in ARMv7 (including
>>>> ARMv7-M variant intended for small microcontroller cores like
>>>> Cortex-M3) and for the following 20 years instead of merely OK they are
>>>> doing great :-)
>>>>
>>
>>> OK.
>>
>>The point is they are doing better now after adding IDIV and FDIV.
>>
>>> I think both modern ARM and AMD Zen went over to "actually fast" integer
>>> divide.
>>
>>> I think for a long time, the de-facto integer divide was ~ 36-40 cycles
>>> for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I
>>> can get from a shift-add unit.
>>
>>While those numbers are acceptable for shift-subtract division (including
>>SRT variants).
>>
>>What I don't get is the reluctance for using the FP multiplier as a fast
>>divisor (IBM 360/91).
>>AMD Opteron used this means to achieve 17-cycle
>>FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20-cycles ??
>>and with special casing of leading 1s and 0s average around 10-cycles ???

> Empirically, the ARM CortexM7 udiv instruction requires 3+[s/2] cycles
> (where s is the number of significant digits in the quotient).

I submit that a 5+2×ln8(s) is faster still.
32-bits = 15 cycles
64-bits = 17 cycles

{Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant
digits (9.2 bits are correct) and double the significant bits each iteration
(2-cycles). }

5 comes from looking at numerator and denominator to find the first bit of
significance, and then shifting numerator and denominator so that the FDIV
algorithm can work.

> https://www.quinapalus.com/cm7cycles.html
>>
>>I submit that at 10-cycles for average latency, the need to invent screwy
>>forms of even faster division fall by the wayside {accurate or not}.
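
For anyone who wants to see the normalize-then-iterate scheme in software, here
is a rough C sketch of a 32-bit unsigned divide done that way. It is only an
illustration under my own assumptions, not how the Cortex-M7, Opteron, or any
other part actually does it. The names clz32 and udiv32_nr are made up, the Q30
fixed-point format is an arbitrary choice that keeps everything inside 64-bit
arithmetic, and the seed is a linear approximation good to about 4 bits rather
than the 8-to-9-bit table lookup described above, so it takes three
Newton-Raphson steps instead of two. A short fix-up loop at the end absorbs the
truncation error so the quotient comes out exact.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count leading zeros of a nonzero 32-bit value. */
static int clz32(uint32_t v)
{
    int n = 0;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical sketch: unsigned 32-bit divide via a Newton-Raphson
 * reciprocal. Requires d != 0. */
uint32_t udiv32_nr(uint32_t n, uint32_t d)
{
    assert(d != 0);

    /* Normalize: find the first significant bit of the denominator and
     * shift it to the top, so D = dn / 2^32 lies in [0.5, 1). */
    int shift = clz32(d);
    uint64_t dn = (uint64_t)d << shift;

    /* Seed X0 = 48/17 - (32/17) * D, held in Q30 (30 fraction bits).
     * Assumption: a linear seed stands in for a lookup table here. */
    uint64_t x = (((uint64_t)48 << 30) / 17) - (dn * 8) / 17;

    /* Newton-Raphson: X <- X * (2 - D * X). Each step roughly doubles
     * the number of correct bits: about 4 -> 8 -> 16 -> 30. */
    for (int i = 0; i < 3; i++) {
        uint64_t t = (dn * x) >> 30;          /* D * X in Q32     */
        uint64_t e = ((uint64_t)2 << 32) - t; /* 2 - D * X in Q32 */
        x = (x * e) >> 32;                    /* back to Q30      */
    }

    /* n / d = n * X * 2^shift / 2^32, and X = x / 2^30. */
    uint64_t q = ((uint64_t)n * x) >> (62 - shift);

    /* Fix-up: the estimate can be off by a few units because of
     * truncation, so adjust until 0 <= remainder < d. */
    int64_t r = (int64_t)n - (int64_t)(q * d);
    while (r < 0) {
        q--;
        r += d;
    }
    while (r >= (int64_t)d) {
        q++;
        r -= d;
    }
    return (uint32_t)q;
}
```

Calling udiv32_nr(100, 7) returns 14. In hardware the leading-zero count, the
shifts, and the fix-up would be a few cycles of fixed overhead around the
multiplier iterations, which is the shape of the cycle estimate above; the loop
counts in this sketch say nothing about real latencies.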