Re: Tonights Tradeoff

Subject : Re: Tonights Tradeoff
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 10. Sep 2024, 09:00:00
Organization : A noiseless patient Spider
Message-ID : <vboqpp$2r5v4$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla Thunderbird
On 9/9/2024 10:59 PM, Robert Finch wrote:
On 2024-09-08 2:06 p.m., MitchAlsup1 wrote:
On Sun, 8 Sep 2024 3:22:55 +0000, Robert Finch wrote:
>
On 2024-09-07 10:41 a.m., MitchAlsup1 wrote:
On Sat, 7 Sep 2024 2:27:40 +0000, Robert Finch wrote:
>
Making the scalar register file a subset of the vector register file.
And renaming only vector elements.
>
There are eight elements in a vector register and each element is
128-bits wide. (Corresponding to the size of a GPR). Vector register
file elements are subject to register renaming to allow the full power
of the OoO machine to be used to process vectors. The issue is that with
both the vector and scalar registers present for renaming there are a
lot of registers to rename. It is desirable to keep the number of
renamed registers (including vector elements) <= 256 total. So, the 64
scalar registers are aliased with the first eight vector registers,
leaving only 24 truly available vector registers. Hm. There are 1024
physical registers, so maybe going up to about 300 renamable registers
would not hurt.
>
Why do you think a vector register file is the way to go ??
>
I think vector use is somewhat dubious, but they have some uses. In many
cases data can be processed just fine without vector registers. In the
current project vector instructions use the scalar functional units to
compute, making them no faster than scalar calcs. But vectors offer a lot
of code density where parallel computation on multiple data items using
a single instruction is desirable. I do not know why people use vector
registers in general, but they are present in some modern architectures.
>
There is no doubt that much code can utilize vector arrangements, and
that a processor should be very efficient in performing these work
loads.
>
The problem I see is that CRAY-like vectors vectorize instructions
instead of vectorizing loops. Any kind of flow control within the
loop becomes tedious at best.
>
On the other hand, the Virtual Vector Method vectorizes loops and
can be implemented such that it performs as well as CRAY-like
vector machines without the overhead of a vector register file.
In actuality there are only 6 bits of HW flip-flops governing
VVM--compared to 4 KBytes for CRAY-1.
>
Qupls vector registers are 512 bits wide (8 64-bit elements). Bigfoot’s
vector registers are 1024 bits wide (8 128-bit elements).
>
When properly abstracted, one can dedicate as many or few HW
flip-flops as staging buffers for vector work loads to suit
the implementation at hand. A GBOoO core may utilize that 4KB
file of the CRAY-1 while the little low-power core uses 3 cache lines.
Both run the same ASM code and both are efficient in their own
sense of "efficient".
>
So, instead of having ~500 vector instructions and ~1000 SIMD
instructions one has 2 instructions and a medium scale state
machine.
>
  Still trying to grasp the virtual vector method. Been wondering if it can be implemented using renamed registers.
 
I haven't really understood how it could be implemented.
But, granted, my pipeline design is relatively simplistic, and my priority had usually been trying to make a "fast but cheap and simple" pipeline, rather than a "clever" pipeline.
Still not as cheap or simple as I would want.

Qupls has RISC-V style vector / SIMD registers. For Q+ every instruction can be a vector instruction, as there are bits indicating which registers are vector registers in the instruction. All the scalar instructions become vector. This cuts down on some of the bloat in the ISA. There is only a handful of vector-specific instructions (about eight I think). The drawback is that the ISA is 48 bits wide. However, the code bloat is less than 50% as some instructions have dual operations. Branches can increment or decrement and loop. Bigfoot uses a postfix word to indicate use of the vector form of the instruction. Bigfoot’s code density is a lot better, being variable length, but I suspect it will not run as fast. Bigfoot and Q+ share a lot of the same code. Trying to make the guts of the cores generic.
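(Presumably the decoding looks something like the following; the field positions and names here are made up, just to illustrate the per-register vector-bit idea:)

   /* Each register specifier in a hypothetical 48-bit instruction word
      carries an extra bit marking it as a vector register, so any
      scalar opcode can act as a vector op. Layout is illustrative.    */
   #include <stdint.h>
   #include <stdbool.h>

   typedef struct {
       unsigned reg;     /* architectural register number          */
       bool     is_vec;  /* vector-register flag from the encoding */
   } regspec_t;

   /* Hypothetical layout: bits [5:0] = register, bit [6] = vector flag. */
   static regspec_t decode_regspec(uint64_t insn48, unsigned lsb)
   {
       regspec_t r;
       r.reg    = (unsigned)((insn48 >> lsb) & 0x3F);
       r.is_vec = ((insn48 >> (lsb + 6)) & 1) != 0;
       return r;
   }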
 
In my case, the core ended up generic enough that it can support both BJX2 and RISC-V. Could almost make sense to lean more heavily into this (trying to consolidate more things and better optimize costs).
Did also recently get around to more-or-less implementing support for the 'C' extension, even as much as it is kinda dog-chewed and does not efficiently utilize the encoding space.
It burns a lot of encoding space on 6- and 8-bit immediate fields (with 11-bit branch displacements), more 5-bit register fields than ideal, ... so, it has relatively few unique instructions, but:
Many of the instructions it does have are left with 3-bit register fields;
Has way too many immediate-field layouts, as it just sort of shoe-horns immediate fields into whatever bits are left.
Though, turns out I could skip a few things due to them being N/E in RV64 (RV32, RV64, and RV128 get a slightly different selection of ops in the C extension).
Like, many things in RV land make "annoying and kinda poor" design choices.
Then again, if one assumes that the role of 'C' is mostly:
   Does SP-relative loads/stores and MOV-RR.
Well, it does do this at least...
Never mind if you want to use any of the ALU ops (besides ADD) or non-stack-relative Load/Store; well then, enjoy the 3-bit register fields.
And, still way too many immediate-field encodings for what is effectively load/store and a few ALU ops.
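(For what it is worth, those 3-bit fields can only name the x8..x15 window, so anything outside s0/s1 and a0..a5 falls back to the 32-bit forms. Roughly:)

   /* Compressed rd'/rs1'/rs2' fields select only x8..x15. */
   static inline unsigned rvc_expand_reg(unsigned rprime)
   {
       return 8u + (rprime & 7u);   /* 0..7 -> x8..x15 (s0/s1, a0..a5) */
   }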
I am not as much of a fan of RISC-V's 'V' extension, mostly in that it would require essentially doubling the size of the register file.
And, if I were to do something like 'V' I would likely do some things differently:
Rather than having an instruction to load vector control state into CSRs, it would make more sense IMO to use bigger 64-bit instructions and encode the vector state directly into these instructions.
While this would be worse for code density, it would avoid needing to burn instructions setting up vector state, and would have less penalty (in terms of clock-cycles) if working with heterogeneous vectors.
Say, one possibility could be a combo-SIMD op with a control field:
   2b vector size
     64 / 128 / resv / resv
   2b element size
      8 / 16 / 32 / 64
   2b category
     wrap / modulo
     float
     signed saturate
     unsigned saturate
   6b operator
     add, sub, mul, mac, mulhi, ...
Though, with not every combination necessarily being allowed.
Say, for example, if the implementation limits FP-SIMD to 4 or 8 vector elements.
Though, it may make sense to be asymmetric as well:
   2-wide vectors can support Binary64
   4-wide can support Binary32
   8-wide can support Binary16 ( + 4x FP16 units)
   16-wide can support FP8 ( + 8x FP8 units)
Whereas, say, 16x Binary32 capable units would be infeasible.
Well, as opposed to defining encodings one-at-a-time in the 32-bit encoding space.
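As a rough sketch (the field order and names here are made up, just illustrating the 2+2+2+6 bit split and the asymmetric FP limits above):

   #include <stdint.h>
   #include <stdbool.h>

   enum { VSZ_64 = 0, VSZ_128 = 1 };                     /* 2b vector size  */
   enum { ESZ_8 = 0, ESZ_16, ESZ_32, ESZ_64 };           /* 2b element size */
   enum { CAT_WRAP = 0, CAT_FLOAT, CAT_SSAT, CAT_USAT }; /* 2b category     */

   /* Pack the 12-bit control field: vsz | esz | cat | 6b operator. */
   static inline uint16_t pack_ctl(unsigned vsz, unsigned esz,
                                   unsigned cat, unsigned op)
   {
       return (uint16_t)((vsz & 3u) | ((esz & 3u) << 2) |
                         ((cat & 3u) << 4) | ((op & 63u) << 6));
   }

   /* Elements per vector for a given vector size / element size. */
   static inline unsigned ctl_lanes(unsigned vsz, unsigned esz)
   {
       return (vsz == VSZ_64 ? 8u : 16u) >> esz;
   }

   /* One possible "allowed combination" check following the asymmetric
      widths above (2x Binary64, 4x Binary32, 8x Binary16, 16x FP8).   */
   static bool ctl_supported(unsigned vsz, unsigned esz, unsigned cat)
   {
       unsigned lanes = ctl_lanes(vsz, esz);
       if (cat != CAT_FLOAT)
           return true;                          /* integer combos all OK */
       if (esz == ESZ_64) return lanes <= 2;
       if (esz == ESZ_32) return lanes <= 4;
       if (esz == ESZ_16) return lanes <= 8;
       return lanes <= 16;                       /* FP8 */
   }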
It could be tempting to consider using pipelining and multi-stage decoding to allow wider ops as well. Say, possibly handling 8-wide vectors internally as 2x 4-wide operations, or maybe allowing 256-bit vector ops in the absence of native 256-bit vector hardware.
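Something like the following, as a sketch of the cracking idea (names and fields purely illustrative):

   /* One 8-wide (or 256-bit) vector op is split in decode into two
      half-width micro-ops, each tagged with which half it covers.  */
   #include <stdint.h>

   typedef struct {
       uint16_t op;            /* operation code                       */
       uint8_t  dst, s1, s2;   /* register specifiers                  */
       uint8_t  half;          /* which half of the vector this covers */
   } uop_t;

   static int crack_wide_op(uint16_t op, uint8_t dst, uint8_t s1, uint8_t s2,
                            uop_t out[2])
   {
       for (int h = 0; h < 2; h++) {
           out[h].op   = op;
           out[h].dst  = dst;
           out[h].s1   = s1;
           out[h].s2   = s2;
           out[h].half = (uint8_t)h;
       }
       return 2;   /* two micro-ops issued back-to-back */
   }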
...

