Path: nntp.eternal-september.org!news.eternal-september.org!eternal-september.org!feeder3.eternal-september.org!weretis.net!feeder8.news.weretis.net!reader5.news.weretis.net!news.solani.org!.POSTED!not-for-mail
From: Mild Shock
Newsgroups: comp.lang.prolog
Subject: Do Prologers know the Unicode Range? (Was: Most radical approach is Novacore from Dogelog Player)
Date: Fri, 27 Jun 2025 13:21:06 +0200
Message-ID: <103luqv$1cbpu$1@solani.org>
References: <103bos1$164mt$1@solani.org> <103bpdh$164t1$1@solani.org> <103bqc8$165f2$1@solani.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 27 Jun 2025 11:21:03 -0000 (UTC)
Injection-Info: solani.org; logging-data="1453886"; mail-complaints-to="abuse@news.solani.org"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0 SeaMonkey/2.53.21
Cancel-Lock: sha1:icQDln4wfWSfHfdGomxZhjR5Nkg=
X-User-ID: eJwNy8EBwCAIA8CVQEmEcQRk/xHa+x82lXWMoGEw7F6eEBXWzcdTzzKX2HhgOd67yr2atke0C3FF3f4TDo3+AFK2FQA=
In-Reply-To: <103bqc8$165f2$1@solani.org>

The official replacement character is 0xFFFD:

> Replacement Character
> https://www.compart.com/de/unicode/U+FFFD

Well, that is what people did in the past: replace non-printables
by the ever same code, instead of using ‘\uXXXX’ notation.

I have studied library(portray_text) extensively, and my conclusion
is still that it is extremely ancient. For example I find:

mostly_codes([H|T], Yes, No, MinFactor) :-
    integer(H),
    H >= 0,
    H =< 0x1ffff,
    [...]
    ;   catch(code_type(H, print), error(_,_), fail),
    [...]

https://github.com/SWI-Prolog/swipl-devel/blob/eddbde61be09b95eb3ca2e160e73c2340744a3d2/library/portray_text.pl#L235

Why 0x1ffff and not 0x10ffff? This is a bug. Do you want to starve
is_text_code/1? The official Unicode range is 0x0 to 0x10ffff.
Ulrich Neumerkel often confused the range in some of his code
snippets, maybe based on a limited interpretation of Unicode.

But if one switched to chars, one could easily support any Unicode
code point even without knowing the range. Just do this (two small
sketches follow at the very end of this post):

mostly_chars([H|T], Yes, No, MinFactor) :-
    atom(H),
    atom_length(H, 1),
    [...]
    ;   /* printable check not needed */
    [...]

Mild Shock wrote:
> Hi,
>
> The most radical approach is Novacore from
> Dogelog Player. It consists of the following
> major incisions in the ISO core standard:
>
> - We do not forbid chars, like for example
>   using lists of the form [a,b,c]; we also
>   provide the char_code/2 predicate bidirectionally.
>
> - We do not provide any _chars built-in
>   predicates, and there is also nothing _strings.
>   The Prolog system is clever enough not to put
>   every atom it sees into an atom table. There
>   is only a predicate table.
>
> - Some host languages have garbage collection that
>   deduplicates strings. For example some Java
>   versions have an option to do that. But we do
>   not make any effort to deduplicate atoms,
>   which are simply plain strings.
>
> - Some languages have constant pools. For example
>   the Java byte code format includes a constant
>   pool in every class header. We do not do that
>   during transpilation, but we could of course.
>   It begs the question: why only deduplicate
>   strings and not other constant expressions as well?
>
> - We are totally happy that we have only codes;
>   there are chances that the host languages use
>   tagged pointers to represent them. So they are
>   represented similarly to the tagged pointers
>   in SWI-Prolog, which work for small integers.
>
> - But the tagged pointer argument is moot,
>   since atom length=1 entities can also be
>   represented as tagged pointers, and some
>   programming languages do that. Dogelog Player
>   would use such tagged pointers without
>   polluting the atom table.
>
> - What else?
>
> Bye
>
> Mild Shock wrote:
>>
>> Technically SWI-Prolog doesn't prefer codes.
>> Library `library(pure_input)` might prefer codes.
>> But this is again an issue of improving the
>> library by some non-existent SWI-Prolog community.
>>
>> The ISO core standard is silent about a flag
>> back_quotes, but has a lot of API requirements
>> that support both codes and chars; for example it
>> requires atom_codes/2 and atom_chars/2.
>>
>> Implementation-wise there can be an issue:
>> one might decide to implement the atoms
>> of length=1 more efficiently, since with Unicode
>> there is now an explosion.
>>
>> Not sure whether Trealla Prolog and Scryer
>> Prolog thought about this problem, that the
>> atom table gets quite large, whereas codes don't
>> eat the atom table. Maybe they forbid predicates
>> that have an atom of length=1 in the head:
>>
>> h(X) :-
>>      write('Hello '), write(X), write('!'), nl.
>>
>> Does this still work?
>>
>> Mild Shock wrote:
>>> Concerning library(portray_text), which is in limbo:
>>>
>>>  > Libraries are (often) written for either
>>> and thus the libraries make the choice.
>>>
>>> But who writes these libraries? The SWI-Prolog
>>> community. And who doesn't improve these libraries,
>>> but instead floods the web with workaround tips?
>>> The SWI-Prolog community.
>>>
>>> Conclusion: the SWI-Prolog community has trapped
>>> itself in an ancient status quo, creating an island.
>>> It cannot improve its own tooling and is not willing
>>> to support code from elsewhere that uses chars.
>>>
>>> Same with the missed AI boom.
>>>
>>> (*) Code from elsewhere is dangerous: people
>>> might use other Prolog systems than only SWI-Prolog,
>>> like for example Trealla Prolog and Scryer Prolog.
>>>
>>> (**) Keeping the status quo is comfy. No need to
>>> think in terms of program code. It's like biology
>>> teachers versus pathology staff; biology teachers
>>> do not see opened corpses every day.
>>>
>>>
>>> Mild Shock wrote:
>>>>
>>>> Inductive logic programming at 30
>>>> https://arxiv.org/abs/2102.10556
>>>>
>>>> The paper contains not a single reference to autoencoders!
>>>> Still they show this example:
>>>>
>>>> Fig. 1: ILP systems struggle with structured examples that
>>>> exhibit observational noise. All three examples clearly
>>>> spell the word "ILP", with some alterations: 3 noisy pixels,
>>>> shifted and elongated letters. If we were to learn a
>>>> program that simply draws "ILP" in the middle of the picture,
>>>> without noisy pixels and elongated letters, that would
>>>> be a correct program.
>>>>
>>>> I guess ILP is 30 years behind the AI boom. An early autoencoder
>>>> turned into a transformer was already reported here (*):
>>>>
>>>> SERIAL ORDER, Michael I. Jordan - May 1986
>>>> https://cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604-OCRed.pdf
>>>>
>>>> Well, ILP might have its merits; maybe we should not ask
>>>> for a marriage of LLM and Prolog, but of autoencoders and ILP.
>>>> But it's tricky. I am still trying to decode the da Vinci code of
>>>> things like stacked tensors: are they related to k-literal clauses?
>>>> The paper I referenced is found in this excellent video:
>>>>
>>>> The Making of ChatGPT (35 Year History)
>>>> https://www.youtube.com/watch?v=OFS90-FX6pg

========== REMAINDER OF ARTICLE TRUNCATED ==========
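
PS: here is the first sketch referenced above, a range check over the
full official Unicode range. This is only a sketch, not the code from
library(portray_text); the helper name is_unicode_code/1 is made up
here purely for illustration.

% Hypothetical helper, not part of library(portray_text):
% succeeds iff Code is an integer code point inside the
% official Unicode range 0x0 .. 0x10FFFF.
is_unicode_code(Code) :-
    integer(Code),
    Code >= 0x0,
    Code =< 0x10FFFF.    % full range, not the library's 0x1ffff

% Example: U+1F642 lies above 0x1ffff, so a 0x1ffff cutoff would
% reject it, while the full range accepts it.
% ?- is_unicode_code(0x1F642).
% true.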
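And the second sketch: a self-contained char-based variant. The
interface mostly_chars/2 with a fraction threshold is simplified and
only assumed; the Yes/No bookkeeping of the real mostly_codes/4 may
differ. The point is that no code point range appears anywhere.

% Sketch only, simplified interface: succeed if at least MinFactor
% of the list elements are chars, i.e. atoms of length 1. No upper
% bound on any code point is needed.
mostly_chars(List, MinFactor) :-
    count_chars(List, 0, 0, Yes, No),
    Yes + No > 0,
    Yes >= MinFactor * (Yes + No).

% count_chars(+List, +Yes0, +No0, -Yes, -No):
% count the char elements (Yes) and the non-char elements (No).
count_chars([], Yes, No, Yes, No).
count_chars([H|T], Yes0, No0, Yes, No) :-
    (   atom(H), atom_length(H, 1)
    ->  Yes1 is Yes0 + 1, No1 = No0
    ;   Yes1 = Yes0, No1 is No0 + 1
    ),
    count_chars(T, Yes1, No1, Yes, No).

% Works for any Unicode character without knowing the range:
% ?- mostly_chars([h,e,l,l,o,'🙂'], 0.8).
% true.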