Article <vjsh6r$1s5j5$1@dont-email.me>

Deutsch English Français Italiano
<vjsh6r$1s5j5$1@dont-email.me>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: transpiling to low level C
Date: Tue, 17 Dec 2024 12:51:04 -0600
Organization: A noiseless patient Spider
Lines: 203
Message-ID: <vjsh6r$1s5j5$1@dont-email.me>
References: <vjlh19$8j4k$1@dont-email.me>
 <vjn9g5$n0vl$1@raubtier-asyl.eternal-september.org>
 <vjnhsq$oh1f$1@dont-email.me> <vjnq5s$pubt$1@dont-email.me>
 <vjq5nr$1aari$3@dont-email.me> <vjrpcd$1nqeb$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 17 Dec 2024 19:51:08 +0100 (CET)
Injection-Info: dont-email.me; posting-host="10e7c9ead457ab57c374f57f86cb174f";
	logging-data="1971813"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18l6/Kp0Ciq7oJt/QHxnRtNtJfgapTy2bE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:NTUqlB0MIDT+KRWKFOBUJvME0kM=
In-Reply-To: <vjrpcd$1nqeb$1@dont-email.me>
Content-Language: en-US
Bytes: 9189

On 12/17/2024 6:04 AM, bart wrote:
> On 16/12/2024 21:23, Lawrence D'Oliveiro wrote:
>> On Sun, 15 Dec 2024 17:53:30 -0600, BGB wrote:
>>
>>> As an IL, even C is a little overkill, unless turned into a restricted
>>> subset ...
>>
>> Why not use WASM as your IL?
> 
> Have you tried it? I mean, directly generating WASM from a compiler 
> front-end, not just using somebody else's tool to do so.
> 
> WASM is a stack-based language, but one that supposedly doesn't even 
> have branching, although there is a 'br' statement, with some restrictions.
> 
> Information about it is quite elusive; it took me 5 minutes to even get 
> examples of what it looks like (and I've seen it before).
> 
> C can apparently compile to WASM via Clang, so I tried this program:
> 
>   void F(void) {
>      int i=0;
>      while (i<10000) ++i;
>   }
> 
> which compiled to 128 lines of WASM (technically, some form of 'WAT', as 
> WASM is a binary format). The 60 lines correspondoing to F are shown 
> below, and below that, is my own stack IL code.
> 
> So, what do you with your WASM/WAT program once generated? I've no idea, 
> except that WASM is inextricably typed up with with browsers and with 
> JavaScript, in which I have no interest.
> 
> With C, you run a compiler; with ASM, an assembler; these formats are 
> well understood.
> 
> You can appreciate that it can be easier to devise your own format and 
> your own tools that you understand 100%.
> 
> 

Hmm... It looks like the WASM example is already trying to follow SSA 
rules, then mapped to a stack IL... Not necessarily the best way to do 
it IMO.


But, yeah, in BGBCC I am also using a stack-based IL (RIL), which 
follows rules more in a similar category to .NET CIL (in that, stack 
items carry type, and the stack is generally fully emptied on branch).


In my IL, labels are identified with a LABEL opcode (with an immediate), 
and things like branches work by having the branch target and label 
having the same immediate (label ID).

I ended up considering this preferable to byte offsets, as:
   Easier to generate from the front-end;
   LABEL also marks the start/end of basic blocks;
   ...

There are also opcodes to convey the source filename and line number, 
these don't generate any output but merely serve to transport filename 
and line number information (useful for debugging).


RIL was a little weird in that functions and variables are themselves 
defined via bytecode operations. This is unlike both JVM and .NET CIL, 
which had used external metadata/structures for defining functions and 
variables (nevermind the significant differences between JVM and .NET in 
this area).

This is pros/cons, main downside of the current format is that it 
requires the bytecode modules to be loaded sequentially and fully. This 
works OK for a compiler on a modern PC, but does impose on RAM somewhat 
for a compiler on a more memory-constrained target. One idea would be to 
individually wrap functions and have a mechanism so that they can be 
loaded dynamically. But, this hasn't really been done for my existing 
IL. Most likely option is that metadata continues to be defined via 
bytecode operations, just that each function is separately wrapped, and 
there may be an index to map function names to the corresponding "lump" 
(say, if using a WAD variant as the top-level container).

Say:
   Lump name is "FNC01234" (IWAD) or "func_1234" (WAD2).
   And there is a table mapping "FOO_SomeFunction" to "FNC01234" or 
"func_1234".
But, this sort of things, along with past ideas to try moving this over 
to a format along similar lines to RIFF/AVI, have generally fizzled 
(along with possible debate over to to the merits of a WAD-like or 
RIFF-like format).

Though, an arguably simpler option might be to just individually wrap 
the bytecode for each translation unit, and have an manifest of what 
symbols are present. In this way, it would function more like a 
traditional static library (as opposed to the current strategy of 
globing all of the translation units in the library into a single large 
blob of bytecode); and probably dumping the bytecode for each 
translation unit into a WAD (again, possibly either IWAD or WAD2, though 
probably WAD2 in this case, as the comparably larger lumps would 
eliminate most concern over the larger directory entries).



When converting to the 3AC IR, there is the quirk that function calls 
are split into multiple parts:
   The CALL operation, which ends the current basic-block;
   A CSRV operation, which is at the start of a new basic block.
     CSRV = Caller Save Return Value.

In cases where the 3AC was being interpreted, this was better, as the 
CSRV operation serves to save the return value from the called function 
to the correct place in the caller's frame (where the interpreter does 
not use recursion for its own operation).
Internal conversion to 3AC was faster than trying to directly interpret 
a stack bytecode (as well as 3AC being a better format for code generation).

> 
> 
> --------------------------------------
> F:                                      # @F
>      .functype    F () -> ()
>      .local      i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, 
> i32, i32, i32
> # %bb.0:
>      global.get    __stack_pointer
>      local.set    0
>      i32.const    16
>      local.set    1
>      local.get    0
>      local.get    1
>      i32.sub
>      local.set    2
>      i32.const    0
>      local.set    3
>      local.get    2
>      local.get    3
>      i32.store    12
> .LBB0_1:                                # =>This Inner Loop Header: Depth=1
>      block
>      loop                                        # label1:
>      local.get    2
>      i32.load    12
>      local.set    4
>      i32.const    10000
>      local.set    5
>      local.get    4
>      local.set    6
>      local.get    5
>      local.set    7
>      local.get    6
>      local.get    7
>      i32.lt_s
>      local.set    8
>      i32.const    1
>      local.set    9
>      local.get    8
>      local.get    9
>      i32.and
>      local.set    10
>      local.get    10
>      i32.eqz
>      br_if       1                               # 1: down to label0
> # %bb.2:                                #   in Loop: Header=BB0_1 Depth=1
>      local.get    2
>      i32.load    12
>      local.set    11
>      i32.const    1
>      local.set    12
>      local.get    11
>      local.get    12
>      i32.add
>      local.set    13
>      local.get    2
>      local.get    13
>      i32.store    12
>      br          0                               # 0: up to label1
========== REMAINDER OF ARTICLE TRUNCATED ==========