Path: ...!news.roellig-ltd.de!open-news-network.org!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
From: ram@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.misc
Subject: Wrong ideas about chatbots
Date: 8 Jun 2025 00:39:21 GMT
Organization: Stefan Ram
Lines: 241
Expires: 1 Jun 2026 11:59:58 GMT
Message-ID: <parsing-20250608013417@ram.dialup.fu-berlin.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de +2sUOXY+kpLJlgMvFL8n0wcllNctxRrxlvsQ6j/Xzy4wlB
Cancel-Lock: sha1:qxqbt9tJU6/P3n+sEyMaj96QzNA= sha256:6RPW+O4/gr5BxhVevTD1+rPvoIyobrx6Kp8CllxMKRc=
X-Copyright: (C) Copyright 2025 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
X-No-Html: yes
Content-Language: en-US
Bytes: 10100
Ben Collver <bencollver@tilde.pink> wrote or quoted:
| you'd start with an enormous quantity of text, then do a lot
|of computationally-intense statistical analysis to map out which
|words and phrases are most likely to appear near to one another.
  I had already explained why that was off, but let me give you
all a recent example from a chat of mine with a chatbot.
I asked the chatbot to write a program for left-associative parsing
of English. He must have mixed up my "left-associative" with the
more common "left-recursive" or figured I just said it wrong.
He clearly did not know about this specific left-associative
parsing method for natural languages. Even after I gave him
the exact source, he still did not get it right.
Then I laid it out for him [1]. After that, he wrote a program [2]
for left-associative parsing of English. Here is what he produced:
[3]. I also asked him for an explanation of the approach for
laymen, so if you want to learn more about it, see [4].
So tell me, how is real understanding like this supposed
to happen if chatbots just work based on "a statistical
analysis of which words often show up together"?
[1] How I explained it to him, after his first program was
not what I wanted
Maybe you're misled by applying standard terms to the special
NEWCAT approach.
It starts with an empty object and then repeatedly appends the
next word until the end of the text is reached. No big recursion
there.
We get: empty + "the"
Now, we need a grammar rule to see if that combination is
possible. For this purpose, each of "empty" and "the" has
a category, which is a data structure with their attributes,
and the grammar rule then checks whether two things with these
categories can be combined; if so, it creates the new
sentence start "empty + 'the'" with a new category given by
the rule.
Then we try to add "cat" to the sentence start. So the sentence
is built left-to-right; that's what's "left-associative" about it.
Finally, we add the "." after "fish" and then we may have
a complete sentence if everything was allowed by the rules.
This parsing is trivial. The crucial thing is to write the
categories and the rules so that all continuations of a
sentence start that are legal in English are allowed by the
rules and all others are rejected.
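The loop described above can be sketched in a few lines. The
rule table and the category names (DET, N, V) here are made-up
stand-ins for illustration, not the actual NEWCAT categories,
which are richer feature structures:

```python
# A minimal sketch of the left-associative loop: start with the
# empty object, always try to combine the state so far with the
# next word, fail as soon as no rule allows the continuation.
# Lexicon and rule table are hypothetical toy stand-ins.
LEX = {'the': 'DET', 'cat': 'N', 'eats': 'V'}
RULES = {
    (None, 'DET'): 'DET',   # empty + "the"     -> sentence start
    ('DET', 'N'): 'NP',     # "the" + "cat"     -> noun phrase
    ('NP', 'V'): 'S',       # "the cat" + "eats" -> sentence
}

def parse(words):
    state = None                # the empty object
    for w in words:             # append the next word each step
        state = RULES.get((state, LEX[w]))
        if state is None:       # no rule allows this continuation
            return False
    return state == 'S'         # complete sentence reached?
```

For example, parse(['the', 'cat', 'eats']) succeeds, while
parse(['cat', 'eats']) fails at the first word because no rule
combines the empty object with a bare noun.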
[2] The parser for a tiny subset of English that he then wrote,
following my explanation above
# Toy lexicon
LEXICON = {
    'the': ('DET', {}),
    'cat': ('N', {'number': 'sg'}),
    'cats': ('N', {'number': 'pl'}),
    'dog': ('N', {'number': 'sg'}),
    'dogs': ('N', {'number': 'pl'}),
    'and': ('CONJ', {}),
    'eats': ('V', {'number': 'sg'}),
    'eat': ('V', {'number': 'pl'}),
}

def get_category(word):
    if word not in LEXICON:
        raise ValueError(f"Unknown word: {word}")
    return LEXICON[word]

def combine(state, next_token):
    cat, features = next_token
    print(f" [combine] State: {state}, Next: ({cat}, {features})")
    # If state is None, we're at the start
    if state is None:
        if cat == 'DET':
            print(" [combine] Start DET")
            return ('subj', {'number': None, 'is_complete': False, 'needs_conj': False, 'pending_det': True})
        elif cat == 'N':
            print(" [combine] Start bare N (fail)")
            return None
        else:
            print(" [combine] Start fail")
            return None
    # If we're building a subject
    if state[0] == 'subj':
        subj = state[1]
        # --- Coordination context: CONJ + DET + N ---
        if subj.get('needs_conj') and subj.get('pending_det') and cat == 'N':
            print(" [combine] CONJ + DET + N -> coordinated NP (plural)")
            return ('subj', {'number': 'pl', 'is_complete': True, 'needs_conj': False, 'pending_det': False})
        # --- Plain DET + N (not in coordination) ---
        if subj.get('pending_det') and cat == 'N':
            print(" [combine] DET + N -> NP")
            return ('subj', {'number': features['number'], 'is_complete': True, 'needs_conj': False, 'pending_det': False})
        # --- NP + CONJ ---
        if subj.get('is_complete') and cat == 'CONJ':
            print(" [combine] NP + CONJ")
            return ('subj', {'number': subj['number'], 'is_complete': False, 'needs_conj': True, 'pending_det': False})
        # --- CONJ + DET ---
        if subj.get('needs_conj') and cat == 'DET':
            print(" [combine] CONJ + DET")
            return ('subj', {'number': None, 'is_complete': False, 'needs_conj': True, 'pending_det': True, 'prev_number': subj['number']})
        # --- NP + V (agreement check) ---
        if subj.get('is_complete') and cat == 'V':
            print(f" [combine] NP + V: subj number {subj['number']}, verb number {features['number']}")
            if subj['number'] == features['number']:
                print(" [combine] Agreement OK")
                return ('sent', {'number': subj['number']})
            else:
                print(" [combine] Agreement FAIL")
                return None
        print(" [combine] No rule applies (fail)")
        return None
    # If we already have a sentence, nothing more allowed
    if state[0] == 'sent':
        print(" [combine] Already sent (fail)")
        return None
    print(" [combine] No rule applies (fail)")
    return None

def parse(sentence):
    words = sentence.lower().split()
    state = None
    print(f"\nParsing: {sentence}")
    for word in words:
        print(f" State: {state}, Next: {word}")
        cat = get_category(word)
        state = combine(state, cat)
        if state is None:
            print(f" Failed at: {word}")
            return False
    print(f" Final state: {state}")
    return state is not None and state[0] == 'sent'

# Test sentences
sentences = [
    "the cat eats",             # ✔️ singular
    "the cats eat",             # ✔️ plural
    "the cat and the dog eat",  # ✔️ coordination (plural)
    "the cat and the dogs eat", # ✔️ coordination (plural)
    "the cat and the dog eats", # ❌ subject-verb disagreement
    "the cats eats",            # ❌ subject-verb disagreement
    "the cat and eat",          # ❌ incomplete NP
    "cat eats",                 # ❌ missing DET
    "the cat and the",          # ❌ incomplete NP
]

for s in sentences:
    result = parse(s)
    print(f"Result: {'ACCEPT' if result else 'REJECT'}")
[3] The abridged output of the parser
Parsing: the cat eats
Result: ACCEPT
Parsing: the cats eat
Result: ACCEPT
Parsing: the cat and the dog eat
Result: ACCEPT
========== REMAINDER OF ARTICLE TRUNCATED ==========