Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: filling area by color atack safety
Date: Thu, 21 Mar 2024 09:47:15 -0700
Organization: A noiseless patient Spider
Lines: 80
Message-ID: <86msqrlegc.fsf@linuxsc.com>
References: <ut3669$21eur$1@i2pn2.org> <86h6h3nvyz.fsf@linuxsc.com> <865xxiok09.fsf@linuxsc.com> <20240319131842.00002138@yahoo.com> <utbuk3$qed7$1@dont-email.me> <20240319191859.00004bc8@yahoo.com> <86v85glspp.fsf@linuxsc.com> <20240321153645.00002fdc@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: dont-email.me; posting-host="0bd4e75010dfbc7f5eb72cff193a6122";
	logging-data="2447740"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18yh7cVz/Q9T/rMiF+FFarflobbFFHMCaw="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:YTrs6xioZRGU9N3lpVs5bRmuEEg=
	sha1:rKzBvcA/3Xz90mqRGdS2QqEWZRA=
Bytes: 4731

Michael S <already5chosen@yahoo.com> writes:

> On Wed, 20 Mar 2024 10:26:58 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>
>> [...]
>>
>>> I did a little more investigation gradually modifying Tim's code
>>> for improved performance without changing the basic principle of
>>> the algorithm.  [...]
>>
>> Here is a rendition of my latest and fastest refinement.
>
> WOW, you really opened up your bag of tricks!

That I did, that I did. :)

I can do this kind of stuff when I need to.  Usually I don't need
to.

> Power-of-two sized circular buffers is something that I tend to
> use on smaller systems, like DSPs or MCUs rather than on "big"
> computers.  But, of course, on "big" computers it also helps.

Bitwise '&' is simply faster than '%'.  Also, bitwise '&' works
on unsigned types in the event that there is wraparound, but '%'
probably doesn't.

> Packing {x,y} into 32-bit word is a bit dirty.  I'd guess that we
> can justify it by claiming that original code although has similar
> limitation of width*height <= INT_MAX.

Yes, it is a bit dirty.  In practice pixel fields almost never get
above 16 bits in each direction, and the code is easy enough to
change (by putting two 32-bit quantities into a 64-bit type) if
it becomes necessary to accommodate an enormous pixel field.

> Removal of FIFO empty and almost-full tests in the inner loop helps
> solid shapes, but appears to slow down "drawn" shapes.  Since solid
> shapes are the slowest to fill, it is probably a good trade-off.

Taking those tests out of the inner loop helps when there is big
frontier set, because the tests don't have to be done as often.
When the frontier set is small, as we would expect for long
skinny shapes, doing that doesn't help as much (and of course
other overhead may make it worse in such cases).

> Overall, it is faster than my implementation of your algorithm.
> Esp. so for solid shapes.  Esp. of esp. so on Intel Skylake CPUs
> where speed up is up to 1.75x.
>
> More complicated 'St. George Cross' algorithms are still faster
> for solid shapes and for shapes dominated by long horizontal or
> long vertical lines.  But they are ... well ... more complicated.
> And [on Skylake] their worst case ('slalom' shape) is somewhat
> slower in absolute sense than the worst case of your code (a solid
> bar).

I played around with a "non-local" (in your terminology) version
of my most recently posted code, and discovered some things.
First the non-local version is somewhat faster on some shapes,
but noticeably slower on others.  The non-local version is more
sensitive to which starting point is chosen.  In a way it looks
similar to what happens with compression algorithms - some cases
get better, others get decidedly worse.  I didn't do a lot of
experiments in an effort to determine what the range or relative
proportions of the different behaviors are.

After thinking about it a bit, it seems to me that a local-only
method is "more queue-like" and a non-local method is "more
stack-like".  Using a pure queue plods along very predictably,
never getting much better or much worse.  Being more stack-like
sometimes gets a speedup, but sometimes stumbles into the pit of
despair, which in the worse case needs a lot of memory and does
more memory shuffling.  So for a general algorithm I'd opt for a
local-only method.  Later on it might be good to use a more
tailored algorithm for special cases, if we can identify which
cases are special in a way that isn't too expensive.