Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 2 Feb 2025 23:33:33 -0600
Organization: A noiseless patient Spider
Lines: 207
Message-ID: <vnpkfj$14e4b$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
<2025Feb2.184458@mips.complang.tuwien.ac.at> <vnocer$q8bq$1@dont-email.me>
<vnou2r$t5qd$1@dont-email.me> <YtVnP.202231$HxS1.48250@fx39.iad>
<NNVnP.88582$oCrf.56776@fx33.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 03 Feb 2025 06:33:39 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f4fe680962b51b8bca49b43582a45572";
logging-data="1194123"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/l3IuH2rV1bwF5n+DQP+AyGqfy4UWUzPs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/8yfcYRwKUp0FM+X5aILc/K45o4=
In-Reply-To: <NNVnP.88582$oCrf.56776@fx33.iad>
Content-Language: en-US
Bytes: 9592
On 2/2/2025 8:24 PM, EricP wrote:
> EricP wrote:
>> BGB wrote:
>>> On 2/2/2025 12:10 PM, Thomas Koenig wrote:
>>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>>>>
>>>>> The OS must also be able to keep both pages in physical memory until
>>>>> the access is complete, or there will be no progress. Should not be a
>>>>> problem these days, but the 48 pages or so potentially needed by VAX
>>>>> complicated the OS.
>>>>
>>>> 48 pages? What instruction would need that?
>>>
>>> Hmm...
>>>
>>>
>>> I ended up with a 4-way set associative TLB as it ended up being
>>> needed to avoid the CPU getting stuck in a TLB-miss loop in the
>>> worst-case scenario:
>>> An instruction fetch where the line-pair crosses a page boundary (and
>>> L1 I$ misses) for an instruction accessing a memory address where the
>>> line-pair also crosses a page boundary (and the L1 D$ misses).
>>>
>>> One can almost get away with two-way, except that almost inevitably
>>> the CPU would encounter and get stuck in an infinite TLB miss loop
>>> (despite the seeming rarity, happens roughly once every few seconds
>>> or so).
>>>
>>> ....
>>>
>>
>> That is because you have a software managed TLB so all PTE's
>> referenced by an instruction must be resident in TLB for success.
>> If three PTE are required by an instruction and they map to
>> the same 2-way row and conflict evict then bzzzzt livelock loop.
>>
>> So you need at least as many set assoc TLB ways as the worst case VA's
>> referenced by any instruction.
>
> And this just accounts for the instruction that TLB-miss'ed.
> If the TLB-miss handler code or data itself can possibly conflict
> on the same TLB row then you have to add 2, 3 or 4 more ways for it.
>
The interrupt handlers always run with the MMU disabled.
In this mode, the interrupt handlers cannot take any TLB misses.
But, any memory accesses into the virtual address space need to be
emulated in software (via page-walks and a soft-TLB).
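As a rough illustration of what "page-walks and a soft-TLB" means for a handler that runs with the MMU off, here is a minimal sketch. Everything in it is an assumption made for illustration: the nested-dict page table, the 3-level layout, the 11 index bits per level, and the names `walk` and `SoftTLB` are hypothetical and are not taken from TestKern. Only the 16K page size comes from the post.

```python
PAGE_SHIFT = 14                      # 16K pages (from the post)
PAGE_MASK = (1 << PAGE_SHIFT) - 1
LEVEL_BITS = 11                      # hypothetical index width per level
LEVELS = 3                           # hypothetical table depth


class PageFault(Exception):
    """Raised when the walk finds no mapping for the address."""


def walk(root, vaddr):
    """Walk a nested-dict page table and return the physical page number."""
    vpn = vaddr >> PAGE_SHIFT
    node = root
    for level in reversed(range(LEVELS)):
        idx = (vpn >> (level * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
        node = node.get(idx)
        if node is None:
            raise PageFault(hex(vaddr))
    return node


class SoftTLB:
    """Software cache of translations, so repeated accesses skip the walk."""

    def __init__(self, root):
        self.root = root
        self.cache = {}              # vpn -> ppn
        self.walks = 0               # number of full page walks performed

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        ppn = self.cache.get(vpn)
        if ppn is None:
            self.walks += 1
            ppn = walk(self.root, vaddr)
            self.cache[vpn] = ppn
        return (ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK)


# One mapping: virtual page 5 -> physical page 0x123.
root = {0: {0: {5: 0x123}}}
stlb = SoftTLB(root)
first = stlb.translate((5 << PAGE_SHIFT) | 0x10)
second = stlb.translate((5 << PAGE_SHIFT) | 0x20)   # served from the soft-TLB
```

The second lookup hits the cached entry, so only one page walk is counted. An address with no mapping raises `PageFault` instead of returning a translation.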
If the interrupt handlers ran with MMU enabled, the CPU would also need
to be able to deal with recursive interrupts. At present, this is not a
thing, and the design of the interrupt mechanism can't currently allow
for this. Other interrupts are effectively blocked until the handler
finishes, and a "General Fault" that happens within an interrupt
handler stalls the CPU core until an external RESET signal is asserted.
In most cases, the interrupt handlers are short lived, with more general
long-lived operations (such as syscall handling) being performed via a
context switch.
Currently, page-fault handling does occur within the TLB-miss interrupt.
I had gone back and forth as to whether to instead handle page faults
like a system call and initiate a context switch to a dedicated
page-fault handler task.
It isn't great, but basically works.
> Also assumes FIFO or LRU reuse of ways in a row. If victim way is
> random selected then you need extra ways to add some spare pad and
> the odds in succeeding become statistical.
>
I ended up with a relatively naive TLB scheme:
Normal access is simply Mod-N;
The index may be XOR'ed with bits from the ASID for part of the VAS range.
Because, yeah, hashing the address may lead to edge-case scenarios that
exceed the capabilities of a 4-way TLB (would need 8-way to fully deal
with this).
With Mod-N and Mod-N XOR ASID, it can be statically known that no two
adjacent pages will map to the same index in the TLB.
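The adjacent-page property can be checked with a small sketch. The 16K page size and 256 sets are from the post; which ASID bits get mixed in, and the XOR-folding hash used for contrast, are made up for illustration and are not the actual hardware logic.

```python
PAGE_SHIFT = 14     # 16K pages (from the post)
TLB_SETS = 256      # 256 sets x 4 ways (from the post)


def index_mod_n(vaddr):
    """Mod-N: the low bits of the virtual page number select the set."""
    return (vaddr >> PAGE_SHIFT) % TLB_SETS


def index_mod_n_xor_asid(vaddr, asid):
    """Mod-N XOR ASID; using the low ASID bits is an assumption."""
    return index_mod_n(vaddr) ^ (asid % TLB_SETS)


def index_hashed(vaddr):
    """A made-up XOR-folding hash of the page number, only for contrast."""
    vpn = vaddr >> PAGE_SHIFT
    return (vpn ^ (vpn >> 8) ^ (vpn >> 16)) % TLB_SETS


def adjacent_collisions(index_fn, n_pages):
    """Count pages in [0, n_pages) whose next page lands in the same set."""
    return sum(
        1
        for vpn in range(n_pages)
        if index_fn(vpn << PAGE_SHIFT) == index_fn((vpn + 1) << PAGE_SHIFT)
    )


mod_n = adjacent_collisions(index_mod_n, 1 << 16)
xor_asid = adjacent_collisions(lambda v: index_mod_n_xor_asid(v, 0x5A), 1 << 16)
hashed = adjacent_collisions(index_hashed, 1 << 16)
```

Mod-N never collides for neighbours because consecutive page numbers differ in their low bits, and XOR with a fixed ASID value just permutes the sets, so the property survives. With this particular made-up hash, pages 0x7FFF and 0x8000 fold to the same set, which is the kind of edge case that a line-pair crossing a page boundary would hit.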
Where, for Addr(47:32):
* 0001..3FFF: Mod-N (Global VAS)
* 4000..7FFF: Mod-N or Mod-N ^ ASID (Local, *1);
* 8000..BFFF: Mod-N (Kernel Space)
* C000..CFFF: Physical (NOMMU)
* D000..DFFF: Physical (NOMMU+NoCache)
* E000..EFFF: Reserved, probably PCIe stuff.
* F000..FFFF: MMIO Space
There are no MMU+NoCache ranges, though NoCache may be specified via the PTEs.
*1: Not currently used by TestKern, but the thinking is that
process-local memory could be allocated in this address range.
Possibly, the global parts of the page table could be shared across
every process, whereas the top-level of the page-table and local areas
would be local to each process.
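The Addr(47:32) ranges listed above amount to a simple table lookup. A minimal sketch, with the table copied from the list; the function name `classify` and the `None` result for 0000 (which the list does not cover) are choices made here for illustration.

```python
# (first, last, description) for Addr(47:32), copied from the list above.
REGIONS = [
    (0x0001, 0x3FFF, "Mod-N (Global VAS)"),
    (0x4000, 0x7FFF, "Mod-N or Mod-N ^ ASID (Local)"),
    (0x8000, 0xBFFF, "Mod-N (Kernel Space)"),
    (0xC000, 0xCFFF, "Physical (NOMMU)"),
    (0xD000, 0xDFFF, "Physical (NOMMU+NoCache)"),
    (0xE000, 0xEFFF, "Reserved"),
    (0xF000, 0xFFFF, "MMIO Space"),
]


def classify(vaddr):
    """Return the region description for a 48-bit address, or None."""
    top = (vaddr >> 32) & 0xFFFF
    for first, last, name in REGIONS:
        if first <= top <= last:
            return name
    return None  # Addr(47:32) == 0000 is not listed in the post


local = classify(0x4000_0000_0000)
nommu = classify(0xC000_1234_5678)
unlisted = classify(0x0000_0000_1000)
```

Only the 4000..7FFF range would need the ASID at lookup time, so the decode can decide from the top bits whether to apply the XOR.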
Not sure how this would be mapped to the B-Tree page-tables, but for
48-bit addressing, conventional page-tables have a lower overhead.
Conventional page tables don't scale well to a 96-bit sparse VAS, but
the use of a 96-bit address space likely isn't worth the hassle at this
point in time.
I decided to leave out the specifics of the 96-bit VAS wonk; for now
I am not bothering with it, as it is far too much overkill
relative to what I am doing here.
Well, and I had managed to get RV ELF binaries working in TestKern's
existing 48-bit VAS by coercing GCC into building makeshift
"static PIE" binaries.
Never mind that getting stuff working with "actual glibc" is a harder
problem (e.g., it might be nicer if I could just pretend all this stuff
was an RV64G Linux build, but that is not really gonna work if
"ld-linux-so" just instantly explodes).
The other "sorta almost works" strategy is to have BGBCC fake GCC's
interface enough that one can coerce GNU autoconf into using it as a
cross compiler (had worked at least for some fairly trivial programs).
Doesn't get that far in a general sense though, as BGBCC doesn't really
support C++, and even C code almost invariably contains "blatant GCCisms"...
>> With a HW table walker you can just let it evict and reload.
>
I have considered a HW page walker on and off a few times, but what
holds it back is mostly inertia and cost concerns at this point.
The average time spent in the TLB Miss handler is low enough that it
isn't too much of an issue.
Though, the 256x 4-way TLB (1024 total TLBEs) is apparently "abnormally
large" for this class of processor.
But, this was because:
256x 4-way with 16K pages: TLB miss rate tends to be pretty low.
64x 4-way with 16K or 4K pages: TLB miss rate is drastically higher.
Main factor being that the TLB needs to be big enough to cover the main
part of the working set to keep the rate low, and most of my test
programs tend to have less than 16MB in the core working set.
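The working-set argument can be checked with quick arithmetic on TLB reach (entries times page size). The configurations are the ones named above; the helper name `tlb_reach_bytes` is just for illustration.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(sets, ways, page_bytes):
    """Memory covered when every TLB entry maps a distinct page."""
    return sets * ways * page_bytes


reach_256x4_16k = tlb_reach_bytes(256, 4, 16 * KB)   # 16 MB
reach_64x4_16k = tlb_reach_bytes(64, 4, 16 * KB)     # 4 MB
reach_64x4_4k = tlb_reach_bytes(64, 4, 4 * KB)       # 1 MB
```

So the 256x 4-way TLB with 16K pages reaches 16MB, about the size of the stated core working set, while the 64x 4-way configurations reach only 4MB or 1MB. This is best-case reach; set conflicts can make the effective coverage lower.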
But, can also note that currently 32MB is allocated towards virtual
memory pages, so exceeding 32MB of working set would also lead to a
sharp increase in page faults (with most of the kernel operating in
physically mapped pages).
Generally, executable code uses direct-mapping (a part of the
virtual address space is used, but directly assigned to pages without
being mapped to the page file).
But, this was mostly because (for semi-unknown reasons) trying to put
".text" sections into pagefile backed memory is prone to cause stuff to
explode (*1).
*1: This behavior occurs in both the Verilog implementation and emulator
and doesn't seem to care which ISA is used. It is most likely a software
issue as there is little reason the CPU (or emulator) should actually
care about this (and the I$ is virtual tagged, ...). I remember I tested
in the past and it didn't matter whether the region was above or below
the 4GB mark, ...
========== REMAINDER OF ARTICLE TRUNCATED ==========