From: "Paul A. Clayton"
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Fri, 30 Aug 2024 20:11:23 -0400
Organization: A noiseless patient Spider
References: <2644ef96e12b369c5fce9231bfc8030d@www.novabbs.org> <2f1a154a34f72709b0a23ac8e750b02b@www.novabbs.org>

On 8/28/24 11:36 PM, BGB wrote:
> On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
[snip]
>> My 1-wide machines does ENTER and EXIT at 4 registers per cycle.
>> Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
>
> It likely isn't going to happen because a 1-wide machine isn't
> going to have the needed register ports.

For an in-order implementation, banking could be used to save a
contiguous range of registers with no bank conflicts.

Mitch Alsup chose to provide four read/write ports, with the typical
instruction using three reads and one write. This not only facilitates
faster register save/restore for function calls (and context
switches/interrupts) but also presents the opportunity for limited
dual issue ("CoIssue"). I do not know the power and area costs of
read/write versus dedicated ports, nor of four ports versus three.
I suspect three-read, one-write instructions are not common generally,
and often a read can be taken from the forwarding network or by
stealing a read port from a later instruction that only needs one
read port. (One could argue that avoiding a performance hiccup in the
uncommon cases would justify the modest cost of the extra port.
Performance would fall to two-thirds only in the case where every
instruction is three-read and a read can only be stolen from a
following three-read instruction; even then, port stealing requires
buffering the reads and adds scheduling complexity, which might hurt
frequency.)

At some issue width, undersupplying register ports makes sense, both
because port cost increases with count and because a wide machine is
unlikely to support N-wide issue for every instruction type anyway
(and larger instruction samples are more likely to have fewer cases
where the structural hazard of fewer-than-worst-case register ports
is encountered). Adding out-of-order execution further reduces the
performance impact of such hazards. (A simple one-wide pipeline
stalls on any hazard. An in-order two-wide pipeline would not always
dual issue even with a perfect cache, due to dependencies, so the
extra stalls from having only four register read ports, e.g., would
hurt performance less than a two-read-port limit hurts a scalar
design. Out-of-order execution tends to further average out resource
use.)

[I think more cleverly managing communication and storage has
potential for area and power savings. Repeating myself, any-to-any
communication seems expensive, and much communication is more local.
The memory hierarchy typically places substantial emphasis on
"network locality" and not just spatial and temporal locality
(separate instruction and data caches, per-core caches/registers),
but I believe there is potential for improving communication and
storage "network locality" within a core. Sadly, I have not *worked
out* how communication might be improved.]
> But, if one doesn't have the register ports, there is likely no
> viable way to move 4 registers/cycle to/from memory (and it
> wouldn't make sense for the register file to have a path to memory
> that is wider than what the pipeline has).

My 66000's VVM encourages wide cache access even with relatively
narrow execution resources. A simple vector MADD could use four times
the cache bandwidth (in register widths) of its execution bandwidth
(in scalar operations), so loading/storing four sequential-in-memory
values per cycle could keep a single MADD unit busy.