From: Don Y
Newsgroups: sci.electronics.design
Subject: Re: OT: Programming Languages
Date: Sat, 2 Nov 2024 13:48:26 -0700

On 11/2/2024 11:01 AM, Martin Brown wrote:
> Most compilers these days are smart enough to move loop invariants
> outside of a loop and then dispose of the loop. You must have side
> effects in any code that you want to benchmark. Optimisers can be
> *really* smart about rearranging code for maximum performance by
> avoiding pipeline stalls. Only the very best humans can match them now.

You need a few more asterisks stressing "really"!

What's most amusing is how the folks who write "clever"/obscure code
fragments THINKING they are "optimizing" them just annoy the compiler.
On any substantial piece of code, "you" simply can't outperform it.
Your mind gets tired. You make mistakes. The compiler just plows
ahead -- EVERY TIME IT IS INVOKED!

> Every now and then you stumble upon a construct that on certain
> platforms is unreasonably fast (2x or 4x). Increasingly because it
> has vectorised a loop on the fly when all go faster stripes are
> enabled.

A developer /aware of the platform on which the code will execute/
can often design a better *algorithm* to beat the compiler's
optimizations of a poorer algorithm. I spend a lot of time thinking
about how my data is structured to exploit features present in the
hardware.

E.g., a traditional mind would group all of a task's state into a
single struct. But, that will almost certainly span a couple of
cache lines. So, when looking at scheduling decisions, the
*processor* will be flitting around between many cache lines --
examining *a* piece of data in each. That means more trips to memory
to fill those other cache lines, each fetch largely "wasted" on the
one datum it delivers.

Instead, group the parameters from MANY tasks in such a way that
examining the datum for the first task drags similar data for the
*next* task into that same cache line; leverage the effort already
expended on THAT cache line instead of (likely) discarding it in
favor of fetching another. (See the sketch below.)

So, instead of just knowing, e.g., when to use a particular type of
search or sort algorithm (based on a characterization of the data to
be searched/sorted), you think about the "hardware algorithm" your
code invokes beneath whatever your software is doing.

Note how large caches have become on modern processors. And, the
wasted opportunities they represent for multithreaded implementations
("Gee, all of that data in the cache that I thought I could make use
of is now largely useless, as the next task isn't likely to benefit
from it!")

[Another argument affecting the choice of implementation languages:
locality of reference. Stack computers, anyone??]
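To make that concrete, a toy sketch in C. All of the names here are
mine, invented purely for illustration, and I'm assuming a 64-byte
cache line:

#define NTASKS 64

/* Array-of-structs: each task's state bundled together.  Scanning
   just the priority of NTASKS tasks touches NTASKS different cache
   lines; the rest of each struct rides along, wasted. */
struct task {
    int   priority;
    int   deadline;
    void *stack;
    char  name[32];
    /* ... lots more per-task state ... */
};
struct task tasks_aos[NTASKS];

/* Struct-of-arrays: like parameters from MANY tasks packed together.
   Fetching tasks_prio[0] drags tasks_prio[1..15] into the same
   64-byte line "for free". */
int   tasks_prio[NTASKS];
int   tasks_deadline[NTASKS];
void *tasks_stack[NTASKS];

int pick_next(void) {
    int best = 0;
    for (int i = 1; i < NTASKS; i++)   /* sequential, cache-friendly */
        if (tasks_prio[i] > tasks_prio[best])
            best = i;
    return best;
}

Same information either way; but the scheduler's hot scan now walks
one small array sequentially instead of striding through fat structs.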
> Precision timers and benchmarking tools are available on most
> platforms; no need to use a stop watch unless you enjoy watching
> paint dry.

This is complicated if you have interrupts that can nickel-and-dime
your execution time. Benchmarking, in general, is fraught with perils
for the naive. Few truly understand the scope of their *select*
"micro-optimizations". (A sketch of what I mean appears at the end
of this post.)

In the '80s, there was an ongoing split in the Motogorilla vs. Inhell
camps over *system* designs. Folks would make arbitrary claims
(backed up with REAL data) to support why X was better than Y. But,
they rarely looked at the whole picture. And, a product *is* "the
whole picture".

"Yeah, it's nice that one opcode fetch allows you to push/pop *all*
of the processor state (vs. having to fetch an instruction to
push/pop each individual item). But, your code is more than just
push/pop operations. If you are constantly going back to memory to
store the (temporary?) result of the last instruction, then having
internal state that can be used to eliminate that memory access
allows the *process* to run faster. 'Let's hold memory DOLLARS
constant and see how well things perform...'"

[Gee, this processor runs at 8MHz while this other runs at 2MHz...
which is MORE PRODUCTIVE?]
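FWIW, a minimal sketch of the sort of harness I mean, assuming a
POSIX platform with clock_gettime() and CLOCK_MONOTONIC. Taking the
*minimum* over many repetitions is one crude way to shave off the
interrupts' nickels and dimes, and the volatile sink gives the loop
the side effect it needs to survive the optimizer:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

volatile uint64_t sink;    /* side effect: keeps the work "live" */

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

int main(void) {
    uint64_t best = UINT64_MAX;

    for (int rep = 0; rep < 1000; rep++) {
        uint64_t t0 = now_ns();

        uint64_t acc = 0;                 /* the code under test */
        for (uint64_t i = 0; i < 100000; i++)
            acc += i * i;
        sink = acc;                       /* observable side effect */

        uint64_t dt = now_ns() - t0;
        if (dt < best)                    /* min() tends to discard    */
            best = dt;                    /* interrupt-inflated runs   */
    }
    printf("best: %llu ns\n", (unsigned long long)best);
    return 0;
}

It's crude -- it won't save you from frequency scaling, cache
warm-up, or your own bad assumptions -- but it beats a stop watch.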