Subject: Re: "Mini" tags to reduce the number of op codes
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 15 Apr 2024, 21:55:53
Organization: Rocksolid Light
Message-ID: <983c789e7c6d9f3ca4ffe40fdc3aa709@www.novabbs.org>
References: 1 2 3 4
User-Agent: Rocksolid Light
Terje Mathisen wrote:
MitchAlsup1 wrote:
In the non-OoO (i.e., Pentium) days, I would have inverted the loop in order to hide the latencies as much as possible, resulting in an inner loop something like this:
next:
        adc     eax,ebx
        mov     ebx,[edx+ecx*4]         ; First cycle
        mov     [edi+ecx*4],eax
        mov     eax,[esi+ecx*4]         ; Second cycle
        inc     ecx
        jnz     next                    ; Third cycle
Terje
As opposed to::
        .global mpn_add_n
mpn_add_n:
        MOV     R5,#0                   // c
        MOV     R6,#0                   // i
        VEC     R7,{}
        LDD     R8,[R2,Ri<<3]           // Load 128-to-512 bits
        LDD     R9,[R3,Ri<<3]           // Load 128-to-512 bits
        CARRY   R5,{{IO}}
        ADD     R10,R8,R9               // Add pair to add octal
        STD     R10,[R1,Ri<<3]          // Store 128-to-512 bits
        LOOP    LT,R6,#1,R4             // increment 2-to-8 times
        RET
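For readers who want the arithmetic spelled out, here is a minimal C sketch of what both loops compute: a limb-by-limb add with the carry propagated from one limb to the next. The names `limb_t` and `ref_add_n` are hypothetical, and the 64-bit limb width is an assumption taken from the `<<3` index scaling in the listing above; this is not the author's code and not GMP's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model; limb_t and ref_add_n are illustrative names. */
typedef uint64_t limb_t;

/*
 * rp[i] = s1[i] + s2[i] + carry-in, for n limbs, least significant limb first.
 * Returns the carry out of the most significant limb (0 or 1).
 * Operands are read before rp[i] is written, so rp may alias s1 or s2.
 */
limb_t ref_add_n(limb_t *rp, const limb_t *s1, const limb_t *s2, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t a   = s1[i];
        limb_t sum = a + s2[i];
        limb_t c1  = sum < a;        /* carry out of a + b */
        limb_t res = sum + carry;
        limb_t c2  = res < sum;      /* carry out of adding the carry-in */
        rp[i] = res;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

The two explicit carry tests per limb are the part that `adc` in the x86 loop, or the `CARRY`-modified `ADD` in the listing above, folds into a single add.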
--------------------------------------------------------
        LDD     R8,[R2,Ri<<3]           // AGEN cycle 1
        LDD     R9,[R3,Ri<<3]           // AGEN cycle 2 data cycle 4
        CARRY   R5,{{IO}}
        ADD     R10,R8,R9               // cycle 4
        STD     R10,[R1,Ri<<3]          // AGEN cycle 3 write cycle 5
        LOOP    LT,R6,#1,R4             // cycle 3
OR
LDD LDd
LDD LDd ADD
ST STd
LOOP
LDD LDd
LDD LDd ADD
ST STd
LOOP
10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.
40 instructions (8 iterations) in 4 clocks on a 512-bit-wide SIMD VVM machine !!
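To put the figures above side by side, here is a small C sketch of the throughput arithmetic. The helper names are hypothetical. The instruction, iteration and clock counts are the ones quoted in this thread; the 64-bit limb width for the VVM code is an assumption based on the `<<3` index scaling, and the 32-bit limb width for the Pentium loop follows from its `*4` scaling.

```c
/* Hypothetical helpers for comparing the loops quoted in this post. */

/* Instructions completed per clock. */
double insts_per_clock(int instructions, int clocks)
{
    return (double)instructions / clocks;
}

/* Operand bits added per clock, given limbs processed and limb width. */
double bits_per_clock(int limbs, int limb_bits, int clocks)
{
    return (double)(limbs * limb_bits) / clocks;
}
```

With the numbers from the thread, `insts_per_clock(10, 4)` gives 2.5 and `bits_per_clock(2, 64, 4)` gives 32 for the 64-bit 1-wide case; `insts_per_clock(40, 4)` gives 10 and `bits_per_clock(8, 64, 4)` gives 128 for the 512-bit case. The hand-scheduled Pentium loop, at 6 instructions and one 32-bit limb per 3 cycles, comes out at 2 instructions and roughly 10.7 bits per clock.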