Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: Tim Rentsch Newsgroups: comp.lang.c Subject: Re: filling area by color atack safety Date: Thu, 21 Mar 2024 09:47:15 -0700 Organization: A noiseless patient Spider Lines: 80 Message-ID: <86msqrlegc.fsf@linuxsc.com> References: <86h6h3nvyz.fsf@linuxsc.com> <865xxiok09.fsf@linuxsc.com> <20240319131842.00002138@yahoo.com> <20240319191859.00004bc8@yahoo.com> <86v85glspp.fsf@linuxsc.com> <20240321153645.00002fdc@yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Injection-Info: dont-email.me; posting-host="0bd4e75010dfbc7f5eb72cff193a6122"; logging-data="2447740"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18yh7cVz/Q9T/rMiF+FFarflobbFFHMCaw=" User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux) Cancel-Lock: sha1:YTrs6xioZRGU9N3lpVs5bRmuEEg= sha1:rKzBvcA/3Xz90mqRGdS2QqEWZRA= Bytes: 4731 Michael S writes: > On Wed, 20 Mar 2024 10:26:58 -0700 > Tim Rentsch wrote: > >> Michael S writes: >> >> [...] >> >>> I did a little more investigation gradually modifying Tim's code >>> for improved performance without changing the basic principle of >>> the algorithm. [...] >> >> Here is a rendition of my latest and fastest refinement. > > WOW, you really opened up your bag of tricks! That I did, that I did. :) I can do this kind of stuff when I need to. Usually I don't need to. > Power-of-two sized circular buffers is something that I tend to > use on smaller systems, like DSPs or MCUs rather than on "big" > computers. But, of course, on "big" computers it also helps. Bitwise '&' is simply faster than '%'. Also, bitwise '&' works on unsigned types in the event that there is wraparound, but '%' probably doesn't. > Packing {x,y} into 32-bit word is a bit dirty. I'd guess that we > can justify it by claiming that original code although has similar > limitation of width*height <= INT_MAX. Yes, it is a bit dirty. In practice pixel fields almost never get above 16 bits in each direction, and the code is easy enough to change (by putting two 32-bit quantities into a 64-bit type) if it becomes necessary to accommodate an enormous pixel field. > Removal of FIFO empty and almost-full tests in the inner loop helps > solid shapes, but appears to slow down "drawn" shapes. Since solid > shapes are the slowest to fill, it is probably a good trade-off. Taking those tests out of the inner loop helps when there is big frontier set, because the tests don't have to be done as often. When the frontier set is small, as we would expect for long skinny shapes, doing that doesn't help as much (and of course other overhead may make it worse in such cases). > Overall, it is faster than my implementation of your algorithm. > Esp. so for solid shapes. Esp. of esp. so on Intel Skylake CPUs > where speed up is up to 1.75x. > > More complicated 'St. George Cross' algorithms are still faster > for solid shapes and for shapes dominated by long horizontal or > long vertical lines. But they are ... well ... more complicated. > And [on Skylake] their worst case ('slalom' shape) is somewhat > slower in absolute sense than the worst case of your code (a solid > bar). I played around with a "non-local" (in your terminology) version of my most recently posted code, and discovered some things. First the non-local version is somewhat faster on some shapes, but noticeably slower on others. The non-local version is more sensitive to which starting point is chosen. In a way it looks similar to what happens with compression algorithms - some cases get better, others get decidedly worse. I didn't do a lot of experiments in an effort to determine what the range or relative proportions of the different behaviors are. After thinking about it a bit, it seems to me that a local-only method is "more queue-like" and a non-local method is "more stack-like". Using a pure queue plods along very predictably, never getting much better or much worse. Being more stack-like sometimes gets a speedup, but sometimes stumbles into the pit of despair, which in the worse case needs a lot of memory and does more memory shuffling. So for a general algorithm I'd opt for a local-only method. Later on it might be good to use a more tailored algorithm for special cases, if we can identify which cases are special in a way that isn't too expensive.