MitchAlsup1 wrote:
Terje Mathisen wrote:
MitchAlsup1 wrote:

In the non-OoO (i.e., Pentium) days, I would have inverted the loop in order to hide the latencies as much as possible, resulting in an inner loop something like this:

next:
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle
mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle
inc ecx
jnz next ; Third cycle

Terje

As opposed to:
.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i
VEC R7,{}
LDD R8,[R2,Ri<<3] // Load 128-to-512 bits
LDD R9,[R3,Ri<<3] // Load 128-to-512 bits
CARRY R5,{{IO}}
ADD R10,R8,R9 // Add pair to add octal
STD R10,[R1,Ri<<3] // Store 128-to-512 bits
LOOP LT,R6,#1,R4 // increment 2-to-8 times
RET
--------------------------------------------------------
LDD R8,[R2,Ri<<3] // AGEN cycle 1
LDD R9,[R3,Ri<<3] // AGEN cycle 2 data cycle 4
CARRY R5,{{IO}}
ADD R10,R8,R9 // cycle 4
STD R10,[R1,Ri<<3] // AGEN cycle 3 write cycle 5
LOOP LT,R6,#1,R4 // cycle 3
OR
LDD LDd
LDD LDd ADD
ST STd
LOOP
LDD LDd
LDD LDd ADD
ST STd
LOOP
10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.
40 instructions (8 iterations) in 4 clocks on a 512-bit-wide SIMD VVM machine !!
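
For readers following along, a plain C reference for what both loops compute may help. This is only a sketch under stated assumptions: the name add_n_ref is hypothetical, limbs are assumed to be 64 bits wide and stored least significant first (the x86 loop above works on 32-bit limbs), and it is not taken from any library's implementation of mpn_add_n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (name and signature are assumptions):
 * rp[0..n-1] = ap[0..n-1] + bp[0..n-1], least significant limb first.
 * Returns the carry out of the most significant limb (0 or 1). */
uint64_t add_n_ref(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                   size_t n)
{
    assert(n == 0 || (rp != NULL && ap != NULL && bp != NULL));

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = ap[i] + bp[i];   /* wraps modulo 2^64 */
        uint64_t c1 = s < ap[i];       /* 1 if the limb add wrapped */
        uint64_t t  = s + carry;       /* fold in the incoming carry */
        uint64_t c2 = t < s;           /* 1 if adding the carry wrapped */
        rp[i] = t;
        carry = c1 | c2;               /* both can never be 1 at once */
    }
    return carry;
}
```

Each iteration depends on the previous one only through the single carry value, which is the dependence that ADC in the x86 loop and the CARRY modifier in the My 66000 loop are there to carry along.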
It all comes down to the carry propagation, right?
The way I understood the original code, you are doing a very wide unsigned add, so you need a carry to propagate from each and every block to the next, right?

Most ST pipelines have an align stage to align the data to be stored to where it needs to be stored; one can extend the carry into this stage if needed.

If you can do that at half a clock cycle per 64-bit ADD, then consider me very impressed!
Terje
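
To make the carry question concrete, here is a sketch of one well-known way to add a block of limbs without rippling a full-width add from limb to limb: do all limb adds independently, record one "generate" and one "propagate" bit per limb, then resolve the carries over those bits and apply them as a late +1. This is purely an illustration and an assumption on my part, not a description of how VVM or any My 66000 implementation handles it; add_block_gp and MAX_LIMBS are made-up names, and the block size of eight 64-bit limbs is chosen only to match the 512-bit case mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_LIMBS = 8 };  /* assumption: one 512-bit block of 64-bit limbs */

/* Hypothetical sketch: add up to MAX_LIMBS limbs with a carry-in.
 * Returns the carry out of the block (0 or 1). */
uint64_t add_block_gp(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                      size_t n, uint64_t carry_in)
{
    assert(n <= MAX_LIMBS && carry_in <= 1);

    uint64_t sum[MAX_LIMBS];
    unsigned g = 0;  /* bit i set: limb i produces a carry by itself */
    unsigned p = 0;  /* bit i set: limb i passes an incoming carry on */

    /* Step 1: limb adds with no dependence between limbs. */
    for (size_t i = 0; i < n; i++) {
        sum[i] = ap[i] + bp[i];
        g |= (unsigned)(sum[i] < ap[i]) << i;
        p |= (unsigned)(sum[i] == UINT64_MAX) << i;
    }

    /* Step 2: resolve carries. The serial chain is one bit per limb,
     * and the late +1 never changes g or p. */
    uint64_t cin = carry_in;
    for (size_t i = 0; i < n; i++) {
        rp[i] = sum[i] + cin;
        cin = ((g >> i) & 1u) | (((p >> i) & 1u) & cin);
    }
    return cin;
}
```

The serial part shrinks from a 64-bit add per limb to a single AND/OR per limb, which is the kind of late fix-up that could plausibly sit in a store-align stage; whether that gets to half a clock per 64-bit add is exactly the open question above.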
The messages displayed come from Usenet.