Re: Microarch Club

Subject: Re: Microarch Club
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date : 26. Mar 2024, 05:32:13
Organization: A noiseless patient Spider
Message-ID : <uttfk3$1j3o3$1@dont-email.me>
References : 1 2 3
User-Agent : Mozilla Thunderbird
On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/21/2024 2:34 PM, George Musk wrote:
>>> Thought this may be interesting:
>>> https://microarch.club/
>>> https://www.youtube.com/@MicroarchClub/videos
>>
>> At least sort of interesting...
>>
>> I guess one of the guys on there did a manycore VLIW architecture with the memory local to each of the cores. Seems like an interesting approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first started working on this stuff, but it had drifted more towards a slightly more conventional design.
>>
>> But, admittedly, this is for small-N cores, 16/32K of L1 with a shared L2, seemed like a better option than cores with a very large shared L1 cache.
>
> You appear to be "starting to get it"; congratulations.
 
I had experimented with L1 cache sizes before, and "big L1 caches" seemed to be worse in most regards. Hit rate goes into diminishing-returns territory, and timing isn't too happy either.
At least for my workloads, 32K seemed like the local optimum.
Say, checking hit rates (in Doom):
     8K: 67%,  16K: 78%,
    32K: 85%,  64K: 87%
   128K: 88%
This being for a direct-mapped cache configuration with even/odd paired 16-byte cache lines.
Other programs seem similar.
A direct-mapped L1 cache has an issue with conflict misses. I was able to add a small side cache to absorb the ~ 1-2% of misses that were due to conflicts, which also had the (seemingly more obvious) effect of reducing L2 misses (the L2 cache also being direct-mapped). A set-associative L2 cache could likely have addressed this issue as well, but probably with a higher cost impact.
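
As a rough illustration of the idea above, here is a minimal sketch of a direct-mapped cache with 16-byte lines backed by a small fully-associative side buffer that catches lines recently evicted by conflicts. The side-buffer organization, its size (VICTIM_WAYS), the round-robin replacement, and all names are assumptions for illustration; the even/odd line pairing from the post is not modeled, and this is not the actual hardware design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  16u    /* 16-byte cache lines, as in the post */
#define VICTIM_WAYS 4u     /* assumed size of the small side cache */
#define MAX_LINES   8192u  /* enough slots for 128K of 16-byte lines */

typedef struct {
    uint32_t num_lines;          /* number of slots, a power of two */
    uint32_t tag[MAX_LINES];     /* line address held in each slot */
    uint8_t  valid[MAX_LINES];
    uint32_t vtag[VICTIM_WAYS];  /* side buffer: recently evicted lines */
    uint8_t  vvalid[VICTIM_WAYS];
    uint32_t vnext;              /* round-robin replacement pointer */
    uint64_t hits, victim_hits, misses;
} cache_t;

static void cache_init(cache_t *c, uint32_t size_bytes) {
    memset(c, 0, sizeof *c);
    c->num_lines = size_bytes / LINE_BYTES;
    assert(c->num_lines > 0 && c->num_lines <= MAX_LINES);
    assert((c->num_lines & (c->num_lines - 1)) == 0);
}

static void cache_access(cache_t *c, uint32_t addr) {
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line & (c->num_lines - 1);

    if (c->valid[idx] && c->tag[idx] == line) {
        c->hits++;
        return;
    }
    /* Missed the main array: look in the side buffer. */
    for (uint32_t i = 0; i < VICTIM_WAYS; i++) {
        if (c->vvalid[i] && c->vtag[i] == line) {
            /* Swap: the displaced line takes this side-buffer entry. */
            c->vtag[i]  = c->tag[idx];
            c->tag[idx] = line;
            c->victim_hits++;
            return;
        }
    }
    /* Full miss: keep the evicted line around in the side buffer. */
    if (c->valid[idx]) {
        c->vtag[c->vnext]   = c->tag[idx];
        c->vvalid[c->vnext] = 1;
        c->vnext = (c->vnext + 1) % VICTIM_WAYS;
    }
    c->tag[idx]   = line;
    c->valid[idx] = 1;
    c->misses++;
}
```

Two addresses that alias in a 32K direct-mapped cache (say 0x0000 and 0x8000) would miss on every alternating access without the side buffer; with it, only the first touch of each goes out to the next level.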

>> I am not sure that abandoning a global address space is such a great idea, as a lot of the "merits" can be gained instead by using weak coherence models (possibly with a shared 256K or 512K or so for each group of 4 cores, at which point it goes out to a higher latency global bus). In this case, the division into independent memory regions could be done in software.
> Most of the last 50 years has been towards a single global address space.
 
Yeah.
From what I can gather, the guy in the video had an architecture which gives each CPU its own 128K and needs explicit message passing to access anything outside of this (with a global address space faked in software, at a significant performance penalty). As I see it, this does not seem like such a great idea...
Something like weak coherence can get most of the same savings, with much less impact on how one writes code (albeit, it does mean that mutex locking may still be painfully slow).
But, this does mean it is better to try to approach software in a way that neither requires TSO semantics nor frequent mutex locking.

>> It is unclear if my approach is "sufficiently minimal". There is more complexity than I would like in my ISA (and effectively turning it into the common superset of both my original design and RV64G, doesn't really help matters here).
>>
>> If going for a more minimal core optimized for perf/area, some stuff might be dropped. Would likely drop integer and floating-point divide
> I think this is pound foolish even if penny wise.
 
The "shift and add" unit isn't free, and the relative gains are small.
For integer divide, granted, it is faster than the pure software version in the general case. For FPU divide, N-R is faster, but shift-add can give an exact result. Most other / faster hardware divide strategies seem to be more expensive than a shift-and-add unit.
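
For reference, a minimal software sketch of the kind of bit-serial divide being discussed: a restoring shift-and-subtract loop that produces one quotient bit per step, so 32 steps for a 32-bit operand. The function name and the 32-bit width are assumptions for illustration; this is the textbook algorithm, not the actual unit in the core.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring shift-and-subtract unsigned divide, one quotient bit per step. */
static uint32_t udiv_shift_sub(uint32_t n, uint32_t d, uint32_t *rem) {
    assert(d != 0);      /* divide-by-zero handling is left to the caller */
    uint64_t r = 0;      /* partial remainder, wide enough for the shift */
    uint32_t q = 0;

    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);  /* bring down the next dividend bit */
        if (r >= d) {                    /* trial subtract succeeds */
            r -= d;
            q |= 1u << i;
        }
    }
    if (rem)
        *rem = (uint32_t)r;
    return q;
}
```

The result is exact by construction, which is the property noted above; the cost is a fixed number of steps proportional to the operand width.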
My confidence in hardware divide isn't too high, noting for example that the AMD K10 and Bulldozer/15h had painfully slow divide operations (to such a degree that doing it in software was often faster). This implies that divide cost/performance is still not really a "solved" issue, even if one has the resources to throw at it.
One can avoid the cost of the shift-and-add unit via "trap and emulate", but then the performance is worse.
Say, "we have an instruction, but it is a boat anchor" isn't an ideal situation (unless it serves as a placeholder for if/when it stops being a boat anchor).

>> again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the lack of a zero register (though, encodings are comparably cheap, as the
> I got an effective zero register without having to waste a register name to "get it". My 66000 gives you 32 registers of 64-bits each and you can put any bit pattern in any register and treat it as you like.
> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
> available.
 
I guess offloading this to the compiler can also make sense.
Least common denominator would be, say, not providing things like NEG instructions and similar (pretending as-if one had a zero register), and if a program needs to do a NEG or similar, it can load 0 into a register itself.
In the extreme case (say, one also lacks a designated "load immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy to zero a register...
Say:
   XOR R14, R14, R14  //Designate R14 as pseudo-zero...
   ...
   ADD R14, 0x123, R8  //Load 0x123 into R8
Though, likely still makes sense in this case to provide some "convenience" instructions.

>> internal uArch has a zero register, and effectively treats immediate values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.
> The problem is universal constants. RISCs being notably poor in their
> support--however this is better than addressing modes which require
> µCode.
 
Yeah.
I ended up with jumbo-prefixes. Still not perfect, and not perfectly orthogonal, but mostly works.
Allows, say:
   ADD R4, 0x12345678, R6
To be performed in potentially 1 clock-cycle and with a 64-bit encoding, which is better than, say:
   LUI X8, 0x12345
   ADD X8, X8, 0x678
   ADD X12, X10, X8
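
A side note on why the three-instruction form is fiddly: in RISC-V, the 12-bit add-immediate is sign-extended, so the upper 20 bits given to LUI have to be rounded up whenever bit 11 of the constant is set. A minimal sketch of that split, modeling 32-bit arithmetic only (the struct and function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t hi20;  /* immediate for LUI (lands in bits 31..12) */
    int32_t  lo12;  /* sign-extended 12-bit immediate for the add */
} lui_addi_t;

static lui_addi_t split_const32(uint32_t value) {
    lui_addi_t s;
    /* Adding 0x800 rounds the upper part up when bit 11 is set,
       compensating for the low part then being negative. */
    s.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    s.lo12 = (int32_t)(value & 0xFFFu);
    if (s.lo12 >= 0x800)
        s.lo12 -= 0x1000;
    return s;
}

static uint32_t rebuild_const32(lui_addi_t s) {
    assert(s.lo12 >= -2048 && s.lo12 <= 2047);
    return (s.hi20 << 12) + (uint32_t)s.lo12;  /* wraps modulo 2^32 */
}
```

For 0x12345678 the split is simply 0x12345 and 0x678, as in the sequence above; for something like 0x12345FFF it becomes 0x12346 and -1.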
Though, for jumbo-prefixes, I did end up adding a special case in the compiler: it tries to figure out whether a constant will be used multiple times in a basic block and, if so, loads it into a register rather than using a jumbo-prefix form.
It could maybe make sense to have function-scale static-assigned constants, but have not done so yet.
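
The per-basic-block check described above might look roughly like the following. This is only a sketch: the operand representation, the function names, and the threshold of two uses (my reading of "multiple times") are all assumptions, not how BGBCC actually does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IR operand: either a constant or something else. */
typedef struct {
    bool    is_const;
    int64_t value;
} operand_t;

#define REUSE_THRESHOLD 2  /* assumed: "multiple times" means two or more */

/* Count how often a given constant appears among a block's operands. */
static size_t count_const_uses(const operand_t *ops, size_t n, int64_t value) {
    assert(ops != NULL || n == 0);
    size_t uses = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].is_const && ops[i].value == value)
            uses++;
    }
    return uses;
}

/* True: load the constant into a register once for the block.
   False: encode it inline (e.g. via a jumbo-prefix form). */
static bool should_load_into_register(const operand_t *ops, size_t n,
                                      int64_t value) {
    return count_const_uses(ops, n, value) >= REUSE_THRESHOLD;
}
```

A function-scale version would run the same count over all blocks of a function, at the cost of tying up a register for longer.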
Though, it appears as if one of the "top contenders" here would be 0, mostly because things like:
   foo->x=0;
And:
   bar[i]=0;
Are semi-common, and as-is end up needing to load 0 into a register each time they appear.
Had already ended up with a similar sort of special case to optimize "return 0;" and similar, mostly because this was common enough that it made more sense to have a special case:
   BRA .lbl_ret  //if function does not end with "return 0;"
   .lbl_ret_zero:
   MOV 0, R2
   .lbl_ret:
   ... epilog ...
For many functions, this allowed "return 0;" to be emitted as:
   BRA .lbl_ret_zero
Rather than:
   MOV 0, R2
   BRA .lbl_ret
Which on average ended up as a net-win when there are more than around 3 of them per function.
Though, another possibility could be to allow constants to be included in the "statically assign variables to registers" logic (as-is, they are excluded except in "tiny leaf" functions).

>> Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store,
>
> Absolutely
>
>> predicated ops (with a designated flag bit),
>
> Predicated then and else clauses which are branch free.
> {{Also good for constant time crypto in need of flow control...}}
 
I have per instruction predication:
   CMPxx ...
   OP?T  //if-true
   OP?F  //if-false
Or:
   OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
Performance gains are modest, but still noticeable (part of why predication ended up as a core ISA feature). Effect on pipeline seems to be small in its current form (it is handled along with register fetch, mostly turning non-executed instructions into NOPs during the EX stages).
For the most part, 1-bit seems sufficient.
More complex schemes generally ran into issues (had experimented with allowing a second predicate bit, or handling predicates as a stack-machine, but these ideas were mostly dead on arrival).
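
To make the 1-bit scheme concrete, here is a small C model of it: a compare sets a single T bit, and each following operation either executes or is squashed into a NOP depending on its ?T / ?F tag. The register count, the CMPGT and MOV operations, and all names are assumptions for illustration, not the actual ISA definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 64  /* assumed register count for this model */

typedef enum { PRED_NONE, PRED_T, PRED_F } pred_t;  /* OP, OP?T, OP?F */

typedef struct {
    int64_t reg[NUM_REGS];
    bool    t;  /* the single predicate bit */
} cpu_t;

/* An op executes if it is unpredicated or its tag matches the T bit. */
static bool pred_enabled(const cpu_t *c, pred_t p) {
    return p == PRED_NONE || (p == PRED_T) == c->t;
}

/* Hypothetical compare: T = (Rs > Rt). */
static void op_cmpgt(cpu_t *c, int rs, int rt) {
    assert(rs >= 0 && rs < NUM_REGS && rt >= 0 && rt < NUM_REGS);
    c->t = c->reg[rs] > c->reg[rt];
}

/* Hypothetical predicated move: Rd = Rs, or a NOP if squashed. */
static void op_mov(cpu_t *c, pred_t p, int rd, int rs) {
    assert(rd >= 0 && rd < NUM_REGS && rs >= 0 && rs < NUM_REGS);
    if (!pred_enabled(c, p))
        return;
    c->reg[rd] = c->reg[rs];
}

/* max(a, b) as a branch-free CMP / OP?T / OP?F sequence. */
static int64_t max_predicated(int64_t a, int64_t b) {
    cpu_t c = {0};
    c.reg[4] = a;
    c.reg[5] = b;
    op_cmpgt(&c, 4, 5);        /* CMPGT R4, R5 (sets T)  */
    op_mov(&c, PRED_T, 2, 4);  /* MOV?T R4, R2           */
    op_mov(&c, PRED_F, 2, 5);  /* MOV?F R5, R2           */
    return c.reg[2];
}
```

Both moves are always issued and exactly one takes effect, which is also why the ?T and ?F forms can be paired in parallel.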

>> and large-immediate encodings,
> Nothing else is so poorly served in typical ISAs.
 
Probably true.

>> help enough with performance (relative to cost)
> +40%
 
I am mostly seeing around 30% or so, for Doom and similar.
   A few other programs still being closer to break-even at present.
Things are a bit more contentious in terms of code density:
   With size-minimizing options to GCC:
     ".text" is slightly larger with BGBCC vs GCC (around 11%);
     However, the GCC output has significantly more ".rodata".
A reasonable chunk of the code-size difference could be attributed to jumbo prefixes making the average instruction size slightly bigger.
More could be possible with more compiler optimization effort. Currently, a few recent optimization cases are disabled as they seem to be causing bugs that I haven't figured out yet.

>> to be worth keeping (though, mostly because the alternatives are not so good in terms of performance).
> Damage to pipeline ability less than -5%.
Yeah.

