From: Rich
Newsgroups: comp.lang.tcl
Subject: Re: Does cheating produce faster searches?
Date: Wed, 25 Sep 2024 20:36:23 -0000 (UTC)
References: <20240925170149.3463bec9@lud1.home>

Luc wrote:
> Suppose I have a large list. Very large list. Then I want to search
> for an item in that string:
>
> % lsearch $list "string"
>
> Now, suppose I have many lists instead. One list contains all the items
> that begin with the letter a, another list contains all the items that
> begin with the letter b, another list contains all the items that begin
> with the letter c, and so on. Then I see what the first character in
> "string" is and only search for it in the one corresponding list.

What you have described is, at a very high level, similar to how B-tree
indexes work:

https://en.wikipedia.org/wiki/B-tree

They are the most common index type used in databases for faster data
retrieval.

> Would that be faster? I somehow suspect the answer is 'no.'

The answer is actually "it depends".  You'd have to define "very
large".  For some value of "very large", the small extra cost of
picking which sublist to search is more than repaid, and the bucketed
search will most likely be faster than one search across a single list
containing all the items.

But, as is always the case, the algorithm matters.  Basic lsearch does
a linear search by default: it starts at index 0 and looks at each item
sequentially until it finds a match or hits the end of the list.
Against this algorithm, your "separate lists based on first character"
method becomes faster at a not-so-big value of "very large".  (There is
a rough sketch of both variants at the end of this post.)

But lsearch also has the "-sorted" option.  For a "very large" list
that you can maintain in sorted order, "-sorted" makes lsearch perform
a binary search on the list:

https://en.wikipedia.org/wiki/Binary_search

Against this algorithm, "very large" has to be much larger than in the
linear case before the "separate lists" method wins, because a binary
search over one big sorted list is already very fast.

However, "-sorted" leaves you needing the "very large list" to actually
be sorted, and sorting a very large list can itself take substantial
time.  So you'd want to either sort once and search many times, or, if
the list is being modified, insert new items at their sorted positions
so the list stays sorted instead of being re-sorted over and over.
(There is also a sketch of that below.)

> Bonus question: what about sqlite operations? Would they be faster if
> I had one separate table for each initial letter/character?

Again, it depends.  If you have no indexes on your tables, then likely
yes.  If the tables have indexes that the query you are running can
use, then the tables will need to be much, much larger before the
"split up" version might become faster.  (A sqlite sketch is below as
well.)
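
To make the "it depends" concrete, here is a rough sketch of the two
linear-search variants.  The variable names (biglist, needle) are just
made up for illustration, and note that plain lsearch does glob
matching by default, so I use -exact:

    # Build one sublist ("bucket") per leading character.
    set buckets [dict create]
    foreach item $biglist {
        dict lappend buckets [string index $item 0] $item
    }

    # Variant 1: one linear search over the whole list.
    set found [lsearch -exact $biglist $needle]

    # Variant 2: linear search over only the matching bucket.
    # (The index returned is within the bucket, not the original list.)
    set key [string index $needle 0]
    set found -1
    if {[dict exists $buckets $key]} {
        set found [lsearch -exact [dict get $buckets $key] $needle]
    }

    # 'time' runs a script N times and reports microseconds per run,
    # so you can measure both variants against your real data:
    puts [time {lsearch -exact $biglist $needle} 1000]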
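
And a sketch of the "-sorted" route, including one way to keep the
list sorted as items are added instead of re-sorting it every time.
The -bisect option needs Tcl 8.6 or later; again, the variable names
are invented:

    # Sort once, then every lookup is a binary search:
    set sortedlist [lsort $biglist]
    set found [lsearch -sorted $sortedlist $needle]

    # To add an item without re-sorting, find its insertion point with
    # -bisect (it returns the index of the last element <= the pattern,
    # or -1 if the new item belongs at the front) and splice it in:
    set at [lsearch -bisect $sortedlist $newitem]
    set sortedlist [linsert $sortedlist [expr {$at + 1}] $newitem]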
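
For the sqlite bonus question, the usual approach is one table plus an
index rather than one table per letter.  A minimal sketch using the
Tcl sqlite3 package; the file name, table name and column name are all
made up:

    package require sqlite3
    sqlite3 db /tmp/example.db

    db eval {CREATE TABLE IF NOT EXISTS items(name TEXT)}
    db eval {CREATE INDEX IF NOT EXISTS idx_items_name ON items(name)}

    # With the index in place, an equality lookup on 'name' walks the
    # index's B-tree instead of scanning the whole table:
    set needle "string"
    set rows [db eval {SELECT name FROM items WHERE name = $needle}]

    db close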