In the same bad old (pre-OoO) days, the standard way to speed this up would have been unrolling, but until we got more registers that approach would have hit its limit very quickly. With AVX2 we could use four 64-bit slots in a 32-byte register, but then we would have had to handle the carry propagation manually, and that would take longer than a series of ADC/ADX instructions.

; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n   (register use for the x86 loop further down)

MitchAlsup1 wrote:
Anton Ertl wrote:
I have a similar problem for the carry and overflow bits in
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
My 66000 ISA can encode the mpn_add_n() inner loop in 5 instructions,
whereas RISC-V encodes the inner loop in 11 instructions.
>
Source code:
>
void mpn_add_n( uint64_t sum[], uint64_t a[], uint64_t b[], int n )
{
   uint64_t c = 0;
   for( int i = 0; i < n; i++ )
   {
        {c, sum[i]} = a[i] + b[i] + c;   // pseudocode: carry out goes to c, low 64 bits to sum[i]
   }
   return;
}
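The `{c, sum[i]} = ...` line above is not valid C. A minimal portable sketch of the same loop might look like the following. The name `mpn_add_n_portable`, the `const`/`size_t` parameter types, and returning the final carry are my assumptions, not part of the quoted code.

```c
#include <stdint.h>
#include <stddef.h>

/* Adds the n-limb numbers a and b into sum, least significant limb first.
   Returns the final carry (0 or 1); returning it is an assumption of this
   sketch, the quoted pseudocode discards it. */
uint64_t mpn_add_n_portable(uint64_t *sum, const uint64_t *a,
                            const uint64_t *b, size_t n)
{
    uint64_t c = 0;                 /* carry in, always 0 or 1 */
    for (size_t i = 0; i < n; i++) {
        uint64_t t  = a[i] + c;     /* wraps only if a[i] is all ones and c == 1 */
        uint64_t c1 = t < c;        /* carry out of a[i] + c */
        uint64_t s  = t + b[i];
        uint64_t c2 = s < t;        /* carry out of t + b[i] */
        sum[i] = s;
        c = c1 | c2;                /* c1 and c2 cannot both be 1 */
    }
    return c;
}
```

The two unsigned comparisons per limb are the bookkeeping that a hardware carry flag (or My 66000's CARRY) makes unnecessary.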
>
Assembly code:
>
   .global mpn_add_n
mpn_add_n:
   MOV  R5,#0    // c
   MOV  R6,#0    // i
>
   VEC  R7,{}
   LDD  R8,[R2,R6<<3]
   LDD  R9,[R3,R6<<3]
   CARRY R5,{{IO}}
   ADD  R10,R8,R9
   STD  R10,[R1,R6<<3]
   LOOP LT,R6,#1,R4
   RET
>
So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well-designed ISA gives you a
performance gain of 2.00× !! {{moral: don't stop too early}}
>
Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.
>
As I count executing instructions, VEC does not execute, nor does
CARRY; CARRY causes the subsequent ADD to take C as its carry input, and
the carry produced by the ADD goes back into C. LOOP performs the
ADD-CMP-BC sequence in a single instruction and in a single clock.
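To make that description concrete, here is a rough C model of what the CARRY-shadowed ADD and the LOOP instruction do, as I read the paragraph above. The helper names `add_carry_io` and `loop_lt` are hypothetical, and the `n <= 0` guard is my assumption; the quoted text does not say how VEC/LOOP treat a zero trip count.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of ADD under CARRY R5,{{IO}}: *c is both the carry
   input and the carry output of the addition. */
uint64_t add_carry_io(uint64_t x, uint64_t y, uint64_t *c)
{
    uint64_t s = x + y;
    uint64_t carry = s < x;      /* carry out of x + y */
    uint64_t r = s + *c;
    carry |= r < s;              /* carry out of adding the carry in */
    *c = carry;
    return r;
}

/* Hypothetical model of LOOP LT,i,#inc,n: add, compare and branch
   decision folded into one step. Returns true if the loop continues. */
bool loop_lt(int64_t *i, int64_t inc, int64_t n)
{
    *i += inc;
    return *i < n;
}

void mpn_add_n_model(uint64_t *sum, const uint64_t *a,
                     const uint64_t *b, int64_t n)
{
    uint64_t c = 0;              /* R5 in the listing */
    int64_t i = 0;               /* R6 in the listing */
    if (n <= 0)
        return;                  /* assumption: nothing to do for empty input */
    do {
        sum[i] = add_carry_io(a[i], b[i], &c);   /* LDD, LDD, ADD, STD */
    } while (loop_lt(&i, 1, n));                 /* LOOP */
}
```

The loop body maps onto the five executing instructions counted in the post: two loads, one add, one store, one loop step.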
   xor rax,rax          ;; Clear carry
next:
   mov rax,[rsi+rcx*8]
   adc rax,[rdx+rcx*8]
   mov [rdi+rcx*8],rax
   inc rcx
   jnz next
The loop above is 5 instructions, or 6 if we avoid the load-op form of ADC. It does two loads and one store per iteration, so it should be limited only by the latency of the ADC, i.e. one or two cycles per iteration.
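The x86 loop gets away without a separate compare because the index runs from -n up to zero against pointers to the ends of the arrays, and INC leaves the carry flag alone. A C sketch of just the indexing idiom, under my own naming (`add_n_negidx` is hypothetical), could be:

```c
#include <stdint.h>
#include <stddef.h>

/* Same addition, but indexed the way the x86 loop does it: base pointers
   point one past the end of each array and the index counts up from -n
   to 0, so "index reached zero" doubles as the loop exit test. */
uint64_t add_n_negidx(uint64_t *sum, const uint64_t *a,
                      const uint64_t *b, size_t n)
{
    const uint64_t *a_end = a + n;       /* RSI in the listing */
    const uint64_t *b_end = b + n;       /* RDX */
    uint64_t *sum_end = sum + n;         /* RDI */
    uint64_t c = 0;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++) {   /* RCX = -n, inc, jnz */
        uint64_t x = a_end[i], y = b_end[i];
        uint64_t s = x + y;
        uint64_t c_out = s < x;
        uint64_t r = s + c;
        c_out |= r < s;
        sum_end[i] = r;
        c = c_out;
    }
    return c;
}
```

Unlike this C version, the assembly loop is bottom-tested, so as written it assumes n is at least 1.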
In the non-OoO (i.e. Pentium) days, I would have inverted the loop in order to hide the latencies as much as possible, resulting in an inner loop something like this:
next:
   adc eax,ebx
   mov ebx,[edx+ecx*4]  ; First cycle
   mov [edi+ecx*4],eax
   mov eax,[esi+ecx*4]  ; Second cycle
   inc ecx
   jnz next             ; Third cycle
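On the remark that AVX2 would need the carry propagation handled manually: a scalar C sketch of the lane-wise approach shows where the extra work comes from. This is only an illustration in plain C, not AVX2 code, and `add4_lanes` is a hypothetical name.

```c
#include <stdint.h>

/* Adds two 4-limb numbers the way a SIMD unit would have to: each 64-bit
   lane is added independently, then the carries are fixed up in a second,
   inherently serial pass. Returns the carry out of the top limb. */
uint64_t add4_lanes(uint64_t sum[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry[4];

    /* Pass 1: four independent adds (the part one vector add would cover). */
    for (int i = 0; i < 4; i++) {
        sum[i] = a[i] + b[i];
        carry[i] = sum[i] < a[i];        /* per-lane carry out */
    }

    /* Pass 2: ripple the carries upward. A carry into an all-ones lane
       generates a further carry, so this pass cannot be done lane-parallel
       without extra logic. */
    uint64_t c = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = sum[i] + c;
        uint64_t c_next = carry[i] | (s < sum[i]);
        sum[i] = s;
        c = c_next;
    }
    return c;
}
```

The second pass is essentially the ADC chain again, done in software, which is consistent with the point that a series of ADC/ADX instructions ends up faster.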