Subject: Re: Tonight's tradeoff
From: robfi680 (at) *nospam* gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: 07 Mar 2024, 14:25:55
Organization: A noiseless patient Spider
Message-ID: <uscf92$12ifq$1@dont-email.me>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
User-Agent: Mozilla Thunderbird
On 2024-03-07 1:39 a.m., BGB wrote:
On 3/6/2024 7:28 PM, MitchAlsup1 wrote:
BGB wrote:
>
On 3/6/2024 8:42 AM, Robert Finch wrote:
>
>
>
In my case, access is figured out on cache-line fetch, and is precooked:
NR, NW, NX, NU, NC: NoRead/NoWrite/NoExecute/NoUser/NoCache.
Though, some combinations of these flags are special.
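A rough C sketch of how such precooked per-line flags might be checked (names and packing here are my own assumptions, not the actual design):

    /* Sketch: per-line access flags precomputed at cache-line fetch. */
    #include <stdbool.h>
    #include <stdint.h>

    enum {
        LINE_NR = 1 << 0,  /* NoRead    */
        LINE_NW = 1 << 1,  /* NoWrite   */
        LINE_NX = 1 << 2,  /* NoExecute */
        LINE_NU = 1 << 3,  /* NoUser    */
        LINE_NC = 1 << 4   /* NoCache   */
    };

    /* Computed once from the TLB entry (and mode) when the line is
       fetched; the per-access check is then trivial. */
    static inline bool access_ok(uint8_t flags, bool is_write,
                                 bool is_exec, bool is_user)
    {
        if (is_user && (flags & LINE_NU)) return false;
        if (is_write) return !(flags & LINE_NW);
        if (is_exec)  return !(flags & LINE_NX);
        return !(flags & LINE_NR);
    }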
>
Is there a reason these flags (other than user) are inverted ??
{{And even noUser can be changed into Super.}}
>
Historical quirk...
Off-hand, I don't remember why it is this way.
Seems this was one of the parts I designed, but as for why the bits were logically inverted, dunno.
In terms of the main page flags, they are also inverted. But, in terms of VUGID and ACL checks, they are not inverted.
In addition, I think you will want to be able to specify which level of
cache {L1, L2, LLC} this line is stored at, prefetched to, and pushed out
to.
>
Possibly, but not really a thing ATM.
It mostly affects the L1 cache, and (indirectly) the newer set-associative V$ thing.
My 66000 is using ASID instead of something like Super/Global because I
don't want to have to flush the TLB on a hypervisor context switch --
where one GuestOS's Super/Global is not the same as another GuestOS's. When a
GuestOS is accessing one of its user applications, AGEN automagically
uses the application ASID instead of the GuestOS ASID. {Similar for HV accessing
GuestOS -- while switching from 1-level translation to 2-level.}
>
This is why I have "ASID Groups"...
If normal processes are in ASID Groups 00..1F, and VM's are in groups 38..3F, then global pages in the normal process groups will not be visible in the VM groups (avoiding the need for a TLB flush).
But, yeah, I had debated whether to have global pages at all.
<snip>
>
The L1 cache only hits if the current mode matches the mode that was in effect at the time the cache-line was fetched, and if KRR has not changed (as determined by a hash value), ...
>
s/mode/ASID/
>
Both will affect hit/miss in my case.
User/Supervisor/ISR;
What KRR contains;
Which ISA mode is running;
ASID;
...
All these may cause the L1 caches to miss.
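As a rough sketch of the resulting hit condition (field names like krr_hash are assumptions; the point is just that a mismatch in any of the factors above forces a miss):

    #include <stdbool.h>
    #include <stdint.h>

    struct l1_tag {
        uint64_t addr_tag;   /* usual address tag                        */
        uint16_t asid;       /* ASID in effect when the line was fetched */
        uint8_t  cpu_mode;   /* User / Supervisor / ISR                  */
        uint8_t  isa_mode;   /* which ISA mode was running               */
        uint16_t krr_hash;   /* hash of KRR contents at fetch time       */
    };

    static bool l1_hit(const struct l1_tag *t, uint64_t addr_tag,
                       uint16_t asid, uint8_t cpu_mode,
                       uint8_t isa_mode, uint16_t krr_hash)
    {
        return t->addr_tag == addr_tag && t->asid == asid &&
               t->cpu_mode == cpu_mode && t->isa_mode == isa_mode &&
               t->krr_hash == krr_hash;
    }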
For my system the ACL is not part of the PTE; it is part of the software-managed page information, along with share counts. I do not see the ACL for a page being different depending on the page table.
>
>
In my case, ACL handling is done via a combination of keyring register (KRR), and a small fully-associative cache (4 entry at present, 6 could be better in theory; luckily each entry is comparably small).
>
The ACLID is tied to the TLBE, so the intersection of the ACLID and KRR entry are used to figure out access in the ACL cache (or, ignored/disabled if the low 16 bits of KRR are 0).
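In sketch form, something like the following, with the keyring simplified to a single key for illustration (the entry layout and names are my guesses):

    #include <stdbool.h>
    #include <stdint.h>

    #define ACL_WAYS 4              /* small fully-associative cache */

    struct acl_entry {
        uint16_t aclid;             /* ACL ID, as held in the TLBE     */
        uint16_t keyid;             /* key taken from KRR              */
        uint8_t  perm;              /* allowed access for this pairing */
        bool     valid;
    };

    static bool acl_check(const struct acl_entry acl[ACL_WAYS],
                          uint64_t krr, uint16_t aclid, uint8_t req)
    {
        uint16_t keyid = (uint16_t)(krr & 0xFFFF);
        if (keyid == 0)
            return true;            /* low 16 bits of KRR zero: check disabled */
        for (int i = 0; i < ACL_WAYS; i++)
            if (acl[i].valid && acl[i].aclid == aclid &&
                acl[i].keyid == keyid)
                return (acl[i].perm & req) == req;
        return false;               /* miss: refill the cache or deny */
    }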
>
>
I have dedicated some of the block RAMs for the page management information, so it may be read out in parallel with a memory access. So I shifted the block-RAM usage from the TLB to the PMT. This makes the TLB smaller. It also reduces the memory usage. The page management information only needs one copy for each page of memory. If the information were in the TLBEs / PTEs there would be multiple copies of the information in the page tables. How do you keep things coherent if there are multiple copies in page tables?
>
>
>
The access ID for pages is kept in sync with the memory address, since both are uploaded to the TLB at the same time.
>
However, as for ACL checks themselves, these are handled with a separate cache. So, say, changing the access rights for an ACLID, and flushing the corresponding entry from the ACL cache, will automatically apply to any pages previously loaded into the TLB.
>
There was also the older VUGID system, which used traditional Unix-style permissions. If I were designing it now, I would likely design things around using exclusively ACL checking, which (ironically) also needs fewer bits to encode.
>
>
>
Generally, software TLB miss handling is used in my case.
>
There is no automatic way to keep the TLB in sync with the page table (if the page table entry is modified).
>
My 66000 has a coherent TLB.
>
Usual thing is that if the current page table is updated, then one needs to forge a special dummy entry, and then upload this entry to the TLB multiple times (via the LDTLB instruction) to knock the prior contents out of the TLB (or use the INVTLB instruction, but this currently invalidates the entire TLB; which is a bad situation for software-managed TLB...).
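A hedged sketch of that ritual (the __ldtlb/__invtlb names stand in for the LDTLB/INVTLB instructions; the repeat count and dummy-entry format are assumptions):

    #include <stdint.h>

    extern void     __ldtlb(uint64_t tlbe);                      /* stand-in for LDTLB  */
    extern void     __invtlb(void);                              /* stand-in for INVTLB */
    extern uint64_t make_dummy_tlbe(uint64_t va, uint16_t asid); /* hypothetical helper */

    #define TLB_WAYS 4   /* assumed associativity, purely illustrative */

    static void pte_changed(uint64_t vaddr, uint16_t asid)
    {
        uint64_t dummy = make_dummy_tlbe(vaddr, asid);
        for (int i = 0; i < TLB_WAYS; i++)
            __ldtlb(dummy);     /* knock any stale entry for this VA out */
        /* heavier alternative: __invtlb();  -- invalidates the whole TLB */
    }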
>
See how much easier a coherent TLB is ??
>
Possible, but generally only the kernel is going to be updating the page tables, and the kernel can know that it needs to invoke a special ritual whenever updating the page table to avoid stale page-table entries being used...
Meanwhile, like with coherent caches, coherent TLB would require some sort of "spooky action at a distance" (like, somehow, the TLB needs to know that memory corresponding to a particular part of the page-table was updated).
This is possibly even harder to implement than something like TSO would be (since, at least with TSO, there is a more obvious correlation between writing to a cache line and needing to have every other copy at the same address first written back to main memory).
It is easier, from the hardware-design front, to throw up one's hands and be like "Yeah, the OS can deal with it somehow...".
Never mind that apparently my coherence model is even weaker than the RISC-V model, as they are like "well, there is a FENCE instruction", and I am left without any good idea of how to deal with it other than "trap and let the trap-handler sort it out..." (presumably by flushing the L1 caches...).
Granted this is a crap solution...
Granted, it appears that "Trap and flush the L1 caches" is still a valid implementation strategy for "Zifencei".
Generally, the assumption is that all pages in a mapping will have the same ACLID (generally corresponding to the "owner" of the mapping).
>
An unsupported assumption if one wants to keep TLB flushes minimized.
>
Possible, but this is more for the OS to care about.
The hardware doesn't care either way.
If using multiple page tables for context switching, it will be necessary to use ASIDs.
>
See how much easier it is for HW to perform context switches en masse
>
One could also do a TLB flush on every context switch, but this would ruin performance (and/or require a hardware page-walker to reduce how badly this would ruin performance...).
It is possible to share global pages across "ASID groups", but currently there are not "truly global" pages (and, implicitly, some groups may disallow global pages).
>
Where, say, the ASID is a 16-bit field:
(15:10): ASID Group
( 9: 0): ASID ID
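In sketch form (my naming), a global page ignores the low 10 bits but still has to match on the group, which is what keeps it invisible to other ASID groups:

    #include <stdbool.h>
    #include <stdint.h>

    static inline unsigned asid_group(uint16_t asid) { return asid >> 10; }

    static bool tlbe_asid_match(uint16_t tlbe_asid, bool tlbe_global,
                                uint16_t cur_asid)
    {
        if (tlbe_global)
            return asid_group(tlbe_asid) == asid_group(cur_asid);
        return tlbe_asid == cur_asid;
    }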
>
At present, for most normal use, the idea is that the ASID and ACL/KRR ID's will be aliased to a process's PID.
>
Not aliased to but accessed from !
>
In the simple case:
A program has its local mappings, which only it will access.
So, one can assume that the PID also serves as the ASID and ACLID, each process having access to mappings it owns, and not to other processes' mappings.
This assumption will break for shared and global mappings.
Say, with Groups 00..1F (in both ASID and ACLID space) being used for the PID aliased range (20..37 for special use, and 38..3F for selective one-off entries).
>
Although completely under SW control, I am assuming that ASID = 0 is the
hypervisor, that ASID = {1..255} is Guest HV, and {256..65535} is for GuestOS
use.
>
No separate HV or Guest OS in this case...
Just kernel running on bare metal.
Kernel can decide to some extent how it uses ASIDs/etc, apart from the 6:10 split being currently fixed in hardware, where setting a page as global will cause the low 10 bits of the ASID to be ignored.
Possibly the OS could enforce that Global not be set for groups 38..3F, but this is more OS responsibility.
Currently, threads also eat PID's, but this is likely to change, say:
TPID (Task ID):
(31:16): PID
(15: 0): Thread ID (local to a given PID)
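I.e., trivially (helper names are mine):

    #include <stdint.h>

    static inline uint32_t make_tpid(uint16_t pid, uint16_t tid)
        { return ((uint32_t)pid << 16) | tid; }
    static inline uint16_t tpid_pid(uint32_t tpid) { return (uint16_t)(tpid >> 16); }
    static inline uint16_t tpid_tid(uint32_t tpid) { return (uint16_t)(tpid & 0xFFFF); }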
>
PIDs are GuestOS defined and used.
OS defined here as well, only that there is a single level.
I guess it is possible I could consider trying to write a wrapper layer to try to boot an RV64 Linux distro on top of my existing CPU design, effectively faking the SiFive style memory map and Privileged Spec stuff in software (effectively running the whole OS in virtual memory and in usermode).
In other news, my recent attempts to improve code density seem to instead be improving performance (though, recently, various compiler tweaks have shaved around 5% off the size of the binaries).
Currently the delta in ".text" size has dropped to around 11% (".text" in the PEL binaries is still around 11% bigger than ".text" in the RV64G ELF binaries, though the PEL binary as a whole is significantly smaller than the ELF binaries).
Currently looking at a 308K binary for Doom.
Meanwhile, Doom has gotten a little faster (29.5 vs 18.0 fps).
As of the most recent test, slightly ahead in Dhrystone as well:
XG2 : 94K
RV64: 88K
Most of this seems due to fiddling with stuff in my compiler.
Mostly stuff related to register handling and similar:
For operations targeting function arguments, target the output directly to the argument register when possible;
Allow temporary variables originating from function returns to be temporarily assigned to R2 rather than allocating a normal callee-save register and copying R2 into this register (see the small example after this list);
Storing a register to a variable that is not currently held in a register may store directly to the target memory location rather than pulling the variable into a register (and then spilling the register), if there are no subsequent uses of this variable in the same basic block;
...
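For the return-value case mentioned above, a small hand-written illustration (not actual compiler output; R2 as the return register is per the above):

    extern int  f(void);
    extern void g(int x);

    void example(void)
    {
        int t = f();   /* value arrives in R2 (the return register)       */
        g(t + 1);      /* t's only use: the argument can be computed      */
                       /* straight from R2, so no callee-save copy of R2  */
                       /* and no extra spill/reload is needed             */
    }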
Though, ironically, some of these register assignment tricks are hindered by the globally assigned registers, where a variable with a globally assigned register can't also be temporarily stored in a different register.
Though, a potential option would be to add a mechanism to allow temporarily putting a statically-assigned variable into a scratch register, despite its "base of operations" having been statically assigned to a different register. At present, though, the register-allocation logic lacks a way for a variable to be assigned to two locations at once.
Well, along with occasionally finding stuff where code generation was "pointlessly inefficient" for whatever reason, such as by not making effective use of existing instructions or encodings (a lot of cases being stuff that wasn't available in earlier versions of the ISA).
Occasionally, it is tweaking rules for when to use one encoding over another.
But, increasingly, this is getting into stuff that is "kind of a pain".
I don't know how much further I will get in this direction.
...
Bigfoot uses a whole byte for access rights, with separate read-write-execute for user and supervisor, and write protect for hypervisor and machine modes. He also uses 4 bits for the cache-ability which match the cache-ability bits on the bus.
I got rid of the global bit after reading some posts. IMO I do not think it buys a lot. It might make a difference if the TLB is small, but Bigfoot’s TLB has 1024 entries. A global page is just a page shared to every process. It could be handled like any other shared page. The page share count would be a big number. Bigfoot uses just 16-bits for the share count.
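Roughly, as I read it (guessing at names; the access byte and cacheability bits sitting in the PTE/TLBE, the share count and ACL in the per-page management info):

    #include <stdint.h>

    /* access-rights byte + cacheability, as described above */
    struct pte_rights {
        unsigned ur : 1, uw : 1, ux : 1;   /* user R/W/X                 */
        unsigned sr : 1, sw : 1, sx : 1;   /* supervisor R/W/X           */
        unsigned hv_wp : 1, m_wp : 1;      /* write protect: HV, machine */
        unsigned cacheability : 4;         /* matches the bus encoding   */
    };

    /* per-physical-page management info (one copy per page) */
    struct page_info {
        uint16_t share_count;              /* 16-bit share count         */
        uint16_t acl;                      /* ACL (width is a guess)     */
    };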
I am using the value of zero for the ASID to represent the machine mode’s ASID. A lot of hardware is initialized to zero at reset, so it’s automatically the machine mode’s. Other than zero the ASID could be anything assigned by the OS.