Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 2 Feb 2025 23:33:33 -0600
Organization: A noiseless patient Spider
Lines: 207
Message-ID: <vnpkfj$14e4b$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
 <2025Feb2.184458@mips.complang.tuwien.ac.at> <vnocer$q8bq$1@dont-email.me>
 <vnou2r$t5qd$1@dont-email.me> <YtVnP.202231$HxS1.48250@fx39.iad>
 <NNVnP.88582$oCrf.56776@fx33.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 03 Feb 2025 06:33:39 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f4fe680962b51b8bca49b43582a45572";
	logging-data="1194123"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/l3IuH2rV1bwF5n+DQP+AyGqfy4UWUzPs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/8yfcYRwKUp0FM+X5aILc/K45o4=
In-Reply-To: <NNVnP.88582$oCrf.56776@fx33.iad>
Content-Language: en-US
Bytes: 9592

On 2/2/2025 8:24 PM, EricP wrote:
> EricP wrote:
>> BGB wrote:
>>> On 2/2/2025 12:10 PM, Thomas Koenig wrote:
>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>
>>>>> The OS must also be able to keep both pages in physical memory until
>>>>> the access is complete, or there will be no progress.  Should not be a
>>>>> problem these days, but the 48 pages or so potentially needed by VAX
>>>>> complicated the OS.
>>>>
>>>> 48 pages?  What instruction would need that?
>>>
>>> Hmm...
>>>
>>>
>>> I ended up with a 4-way set associative TLB as it ended up being 
>>> needed to avoid the CPU getting stuck in a TLB-miss loop in the 
>>> worst-case scenario:
>>> An instruction fetch where the line-pair crosses a page boundary (and 
>>> L1 I$ misses) for an instruction accessing a memory address where the 
>>> line-pair also crosses a page boundary (and the L1 D$ misses).
>>>
>>> One can almost get away with two-way, except that almost inevitably 
>>> the CPU would encounter and get stuck in an infinite TLB miss loop 
>>> (despite the seeming rarity, happens roughly once every few seconds 
>>> or so).
>>>
>>> ....
>>>
>>
>> That is because you have a software managed TLB so all PTE's
>> referenced by an instruction must be resident in TLB for success.
>> If three PTE are required by an instruction and they map to
>> the same 2-way row and conflict evict then bzzzzt livelock loop.
>>
>> So you need at least as many set assoc TLB ways as the worst case VA's
>> referenced by any instruction.
> 
> And this just accounts for the instruction that TLB-miss'ed.
> If the TLB-miss handler code or data itself can possibly conflict
> on the same TLB row then you have to add 2, 3 or 4 more ways for it.
> 

The interrupt handlers always run with the MMU disabled.
   As such, the interrupt handlers themselves cannot take TLB misses.

But, any memory accesses into the virtual address space need to be 
emulated in software (via page-walks and a soft-TLB).
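
As a rough illustration of what "page-walk plus soft-TLB" could look like 
(this is a sketch, not TestKern's actual code): the names soft_translate 
and demo_walk, the 64-entry direct-mapped table, and the 16K page size 
are all assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  14u                              /* assumed: 16K pages */
#define PAGE_MASK   ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define SOFT_TLB_N  64u                              /* hypothetical size, power of two */

typedef struct {
    uint64_t vpn;     /* virtual page number */
    uint64_t ppn;     /* physical page number */
    bool     valid;
} soft_tlb_entry;

static soft_tlb_entry soft_tlb[SOFT_TLB_N];

/* Hypothetical page-walk hook: returns true and sets *ppn if vpn is mapped. */
typedef bool (*page_walk_fn)(uint64_t vpn, uint64_t *ppn);

/* Translate va to a physical address, refilling the soft-TLB on a miss. */
static bool soft_translate(uint64_t va, page_walk_fn walk, uint64_t *pa)
{
    assert(walk != NULL && pa != NULL);

    uint64_t vpn = va >> PAGE_SHIFT;
    soft_tlb_entry *e = &soft_tlb[vpn & (SOFT_TLB_N - 1)];

    if (!e->valid || e->vpn != vpn) {
        uint64_t ppn;
        if (!walk(vpn, &ppn))
            return false;          /* not mapped: caller deals with the fault */
        e->vpn = vpn;
        e->ppn = ppn;
        e->valid = true;
    }
    *pa = (e->ppn << PAGE_SHIFT) | (va & PAGE_MASK);
    return true;
}

/* Stand-in page walk for demonstration only: maps the first 16 virtual
 * pages to physical pages 0x100..0x10F and counts how often it is called. */
static int demo_walk_calls;

static bool demo_walk(uint64_t vpn, uint64_t *ppn)
{
    demo_walk_calls++;
    if (vpn >= 16)
        return false;
    *ppn = vpn + 0x100;
    return true;
}
```

A handler would call soft_translate() before touching a user-space 
address, then access the returned physical address directly (the MMU 
being off). A second access to the same page hits the table and skips 
the walk.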

If the interrupt handlers ran with MMU enabled, the CPU would also need 
to be able to deal with recursive interrupts. At present, this is not a 
thing, and the design of the interrupt mechanism can't currently allow 
for this (and other interrupts are effectively blocked until the handler 
finishes, with a "General Fault" that happens within an interrupt 
handler stalling the CPU core until an external RESET signal is asserted).

In most cases, the interrupt handlers are short lived, with more general 
long-lived operations (such as syscall handling) being performed via a 
context switch.

Currently, page-fault handling occurs within the TLB-miss interrupt. I 
had gone back and forth as to whether to instead handle page faults like 
a system call, initiating a context switch to a dedicated page-fault 
handler task.

Isn't great, but basically works.


> Also assumes FIFO or LRU reuse of ways in a row. If victim way is
> random selected then you need extra ways to add some spare pad and
> the odds in succeeding become statistical.
> 

I ended up with a relatively naive TLB scheme:
   Normal access is simply Mod-N;
     May be XOR'ed by bits from the ASID for part of the VAS range.

Because, yeah, hashing the address may lead to edge-case scenarios that 
exceed the capabilities of a 4-way TLB (would need 8-way to fully deal 
with this).

With Mod-N and Mod-N XOR ASID, it can be statically known that no two 
adjacent pages will map to the same index in the TLB.

Where, for Addr(47:32):
* 0001..3FFF: Mod-N (Global VAS)
* 4000..7FFF: Mod-N or Mod-N ^ ASID (Local, *1);
* 8000..BFFF: Mod-N (Kernel Space)
* C000..CFFF: Physical (NOMMU)
* D000..DFFF: Physical (NOMMU+NoCache)
* E000..EFFF: Reserved, probably PCIe stuff.
* F000..FFFF: MMIO Space

There are no MMU+NoCache ranges, though this may be specified via the PTE's.
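
The index selection above could be sketched roughly as follows. This is 
only an illustration under assumptions: 16K pages, 256 sets, a 16-bit 
ASID whose low 8 bits are XOR'ed in, and the XOR always being applied in 
the 4000..7FFF range (the table allows either there). The function names 
are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 14u    /* assumed: 16K pages */
#define TLB_SETS   256u   /* 256 sets x 4 ways */

/* Set index for a 48-bit virtual address.
 * Assumption: the low 8 ASID bits are XOR'ed in for the whole
 * 4000..7FFF (local) range; which bits are really used is not stated. */
static unsigned tlb_set_index(uint64_t va, uint16_t asid)
{
    unsigned hi  = (unsigned)((va >> 32) & 0xFFFF);                 /* Addr(47:32) */
    unsigned idx = (unsigned)((va >> PAGE_SHIFT) & (TLB_SETS - 1)); /* Mod-N */

    if (hi >= 0x4000 && hi <= 0x7FFF)
        idx ^= asid & (TLB_SETS - 1);
    return idx;
}

/* Nonzero if the page holding va and the page after it use different sets. */
static int adjacent_pages_differ(uint64_t va, uint16_t asid)
{
    uint64_t next = va + (UINT64_C(1) << PAGE_SHIFT);
    assert(((va ^ next) >> 32) == 0);   /* only meaningful within one region */
    return tlb_set_index(va, asid) != tlb_set_index(next, asid);
}
```

Within a region, adjacent pages differ in the low bits of the page 
number, and XOR with a fixed ASID value keeps distinct indices distinct, 
so the pair never collides. In this sketch that guarantee does not 
extend across the 3FFF/4000 or 7FFF/8000 boundaries, where the two pages 
would use different schemes.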


*1: Not currently used by TestKern, but the thinking is that 
process-local memory could be allocated in this address range.

Possibly, the global parts of the page table could be shared across 
every process, whereas the top-level of the page-table and local areas 
would be local to each process.


Not sure how this would be mapped to the B-Tree page-tables, but for 
48-bit addressing, conventional page-tables have a lower overhead. 
Conventional page tables don't scale well to a 96-bit sparse VAS, but 
the use of a 96-bit address space likely isn't worth the hassle at this 
point in time.

Decided to leave out going into the specifics of the 96-bit VAS wonk, 
but for now I am not bothering with it, as it is too far overkill 
relative to what I am doing here.


Well, and I had managed to get RV ELF binaries working in TestKern's 
existing 48-bit VAS by coercing GCC into building makeshift 
"static PIE" binaries.

Never mind that getting stuff working with "actual glibc" is a harder 
problem (e.g., it might be nicer if I could just pretend all this stuff 
was an RV64G Linux build, but that isn't really going to work if 
"ld-linux-so" just instantly explodes).

The other "sorta almost works" strategy being to have BGBCC fake GCC's 
interface enough that one can coerce GNU autoconf into using it as a 
cross compiler (had worked at least for some fairly trivial programs).

Doesn't get that far in a general sense though, as BGBCC doesn't really 
support C++, and even C code almost invariably contains "blatant GCCisms"...



>> With a HW table walker you can just let it evict and reload.
> 

I have on/off considered a HW page walker a few times, but what holds it 
back is mostly inertia and cost concerns at this point.


The average time spent in the TLB Miss handler is low enough that it 
isn't too much of an issue.

Though, the 256x 4-way TLB (1024 total TLBEs) is apparently "abnormally 
large" for this class of processor.

But, this was because:
   256x 4-way with 16K pages: TLB miss rate tends to be pretty low.
   64x 4-way with 16K or 4K pages: TLB miss rate is drastically higher.

Main factor being that the TLB needs to be big enough to cover the main 
part of the working set to keep the rate low, and most of my test 
programs tend to have less than 16MB in the core working set.
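
For reference, the coverage ("reach") of each configuration is just 
sets x ways x page size, assuming every TLBE maps a distinct page. The 
helper name below is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space covered when every TLB entry maps a distinct page. */
static uint64_t tlb_reach_bytes(unsigned sets, unsigned ways, unsigned page_kib)
{
    return (uint64_t)sets * ways * page_kib * 1024u;
}

/* The configurations discussed above. */
static void tlb_reach_examples(void)
{
    assert(tlb_reach_bytes(256, 4, 16) == 16u * 1024 * 1024); /* 16 MB */
    assert(tlb_reach_bytes(64, 4, 16)  ==  4u * 1024 * 1024); /*  4 MB */
    assert(tlb_reach_bytes(64, 4, 4)   ==  1u * 1024 * 1024); /*  1 MB */
}
```

So the 256x 4-way TLB with 16K pages covers 16MB, which lines up with 
the working-set size above, whereas the 64x 4-way cases only cover 4MB 
or 1MB.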


But, can also note that currently 32MB is allocated towards virtual 
memory pages, so exceeding 32MB of working set would also lead to a 
sharp increase in page faults (with most of the kernel operating in 
physically mapped pages).

Generally, executable code uses direct mapping (a part of the virtual 
address space is used, but directly assigned to pages without being 
mapped to the page file).

But, this was mostly because (for semi-unknown reasons) trying to put 
".text" sections into pagefile backed memory is prone to cause stuff to 
explode (*2).


*2: This behavior occurs in both the Verilog implementation and emulator 
and doesn't seem to care which ISA is used. It is most likely a software 
issue as there is little reason the CPU (or emulator) should actually 
care about this (and the I$ is virtual tagged, ...). I remember I tested 
in the past and it didn't matter whether the region was above or below 
the 4GB mark, ...
========== REMAINDER OF ARTICLE TRUNCATED ==========