Terje Mathisen wrote:
EricP wrote:
Codecs likely have to deal with double-width straddles a lot, whatever
the register word size. So for them it likely happens at 64-bits already.
Nothing likely about it: LZ4 is pretty much the only compression
algorithm/lossless codec that never straddles, all the rest tend to
treat the source data as a single bitstream of arbitrary length, except
for some built-in chunking mechanism which makes faster scanning simpler.
The core of the algorithm always starts with knowing the endianness,
then picking up 32 or 64-bit chunks of input data (byte-flipping if
needed) and then extracting the next N bits either from the top or bottom
of the buffer register.
Almost by definition, this is not code that a compiler is set up to help
you get correct.
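A minimal sketch of such a bit reader, for illustration only. It assumes an MSB-first bitstream and extracts from the top of a 64-bit buffer register. It refills one byte at a time rather than loading 32 or 64-bit chunks and byte-flipping, which trades speed for not having to know the host endianness. The names BitReader, br_refill and br_get are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit reader; all names are illustrative only. */
typedef struct {
    const uint8_t *src;  /* input bytes */
    size_t len;          /* total input length in bytes */
    size_t pos;          /* index of the next byte to load */
    uint64_t buf;        /* valid bits are left-aligned at the top */
    unsigned count;      /* number of valid bits in buf */
} BitReader;

/* Top up the buffer one byte at a time (no endianness dependence). */
void br_refill(BitReader *br) {
    while (br->count <= 56 && br->pos < br->len) {
        br->buf |= (uint64_t)br->src[br->pos++] << (56 - br->count);
        br->count += 8;
    }
}

/* Extract the next n bits (1..32) from the top of the buffer register. */
uint32_t br_get(BitReader *br, unsigned n) {
    assert(n >= 1 && n <= 32);
    if (br->count < n)
        br_refill(br);
    assert(br->count >= n);  /* caller must not read past the end */
    uint32_t v = (uint32_t)(br->buf >> (64 - n));
    br->buf <<= n;
    br->count -= n;
    return v;
}
```

A field that straddles a byte or word boundary needs no special case here, because the refill has already merged the neighbouring bytes into the buffer before the extract.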
I added a bunch of instructions for dealing with double-width operations.
The main ISA design decision is whether to have register pair specifiers,
R0, R2, R4,... or two separate {r_high,r_low} registers.
In either case the main uArch issue is that now instructions have an extra
source register and two dest registers, which has a number of consequences.
But once you bite the bullet on that it simplifies a lot of things,
like how to deal with carry or overflow without flags,
full width multiplies, divide producing both quotient and remainder.
Very nice!
This means that you can do integer IMAC(), right?
(hi, lo) = imac(a, b, c); // == a*b+c
The only thing even nicer from the perspective of writing arbitrary
precision library code would be IMAA, i.e. a*b+c+d since that is the
largest combination which is guaranteed to never overflow the double
register target field.
Terje
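The no-overflow claim for a*b+c+d can be checked with a scaled-down model. The sketch below assumes 32-bit narrow operands and a 64-bit wide result so that plain uint64_t arithmetic is enough; imaa32 is a hypothetical name, not an instruction from either poster's ISA. The worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, which exactly fills the double-width target.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a hypothetical IMAA with 32-bit narrow and 64-bit wide operands:
   returns a*b + c + d, which always fits in 64 bits. */
uint64_t imaa32(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return (uint64_t)a * b + c + d;
}

/* Worst case: all four operands maximal gives exactly 2^64 - 1,
   so there is no carry out of the wide result. */
void imaa32_worst_case(void) {
    assert(imaa32(UINT32_MAX, UINT32_MAX, UINT32_MAX, UINT32_MAX) == UINT64_MAX);
}
```

The same identity holds at any width n, since (2^n-1)^2 + 2*(2^n-1) = 2^(2n) - 1.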
I thought about IMAC but it was a bit too much.
And unlike FMA there is no precision gain in IMAC, just convenience.
IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
care about overflow/carry on the accumulate.
2-wide = 2-wide + narrow * narrow
It needs 7 registers, 3 dest and 4 source if you want overflow/carry
on the accumulate.
3-wide = 2-wide + narrow * narrow
I wanted to support checked arithmetic which means full width multiplies.
And I was always bothered by the RISC approach of MULL (low part) and
MULH (high part) where they do most of the multiply then toss half away
just because they won't have 2 dest registers.
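As a rough illustration of why a full-width multiply helps checked arithmetic, here is a sketch that models a two-destination unsigned multiply at 32-bit register width. Wide32, muluw32 and mul_checked_u32 are hypothetical names chosen for this example; the 32-bit width is an assumption so that uint64_t can hold the whole product.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Full-width product as a (hi, lo) pair of 32-bit "registers". */
typedef struct { uint32_t hi, lo; } Wide32;

/* One multiply produces both halves, instead of separate MULL and MULH. */
Wide32 muluw32(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)a * b;
    Wide32 w = { (uint32_t)(p >> 32), (uint32_t)p };
    return w;
}

/* Checked unsigned multiply: the result overflowed iff hi is nonzero. */
bool mul_checked_u32(uint32_t a, uint32_t b, uint32_t *out) {
    assert(out != NULL);
    Wide32 w = muluw32(a, b);
    *out = w.lo;
    return w.hi == 0;
}
```

With only MULL and MULH the same check costs two multiplies, or one multiply plus a more awkward test on the operands.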
So what else can I do with 2 dest registers? Wide add and sub.
The various wide Add and Sub forms solve the missing carry/overflow flags problem.
FMA already requires 3 source registers.
Besides Add, Sub, Mul, what else can one do with 3 source and 2 dest registers?
Wide shifts and wide bit-field extract and insert.
I went with two (r_hi,r_lo) register specifiers because it gave programmers
more flexibility. I played a bit with even register pairs (R0, R2, R4...)
and found one had to do extra MOVs just to form a pair.
(r_hi,r_lo) costs a longer instruction format, but I have variable-length
instructions, so it's mostly wider fetch and decode pathways to handle
the worst-case instruction size.
W = Wide = (hi,lo) register pair, N = Narrow = one register.
Add forms:
Add N = N + N // No carry out
Add3 N = N + N + N // No carry out
Addw2 W = N + N // Generate carry
Addw3 W = N + N + N // Generate + propagate carry
Addw1 W = W + N // Propagate carry
Same for subtract wide.
The three wide Add forms are chosen to make multi-precision integer
multiply easier. See below.
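One possible reading of the wide Add forms, sketched as a C model. It assumes 32-bit narrow registers so that uint64_t can stand in for the wide adder, and it assumes the carry lands in the hi half of the destination pair. Addw1 simply wraps on overflow of the wide value in this model; the post does not say what the real instruction does in that case.

```c
#include <assert.h>
#include <stdint.h>

/* W = (hi, lo) pair of 32-bit registers (an assumption for this model). */
typedef struct { uint32_t hi, lo; } Wide;

static Wide wide_from(uint64_t v) {
    Wide w = { (uint32_t)(v >> 32), (uint32_t)v };
    return w;
}

/* Addw2: W = N + N, hi receives the carry (0 or 1). */
Wide addw2(uint32_t a, uint32_t b) {
    return wide_from((uint64_t)a + b);
}

/* Addw3: W = N + N + N, hi receives the carry (0, 1 or 2). */
Wide addw3(uint32_t a, uint32_t b, uint32_t c) {
    return wide_from((uint64_t)a + b + c);
}

/* Addw1: W = W + N, the carry out of lo propagates into hi. */
Wide addw1(Wide w, uint32_t n) {
    return wide_from((((uint64_t)w.hi << 32) | w.lo) + n);
}

/* Three maximal narrow operands give a carry of exactly 2. */
void addw_demo(void) {
    Wide s = addw3(UINT32_MAX, UINT32_MAX, UINT32_MAX);
    assert(s.hi == 2 && s.lo == 0xFFFFFFFDu);
}
```

The carry comes back as an ordinary register value, so it can be fed into a later add like any other operand, with no flags register involved.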
Muluw W = N * N
Mulsw W = N * N
Divuw (quo,rem) = N / N
Divsw (quo,rem) = N / N
Shllw W = W << size // Shift left logical
Shlaw W = W << size // Shift left arithmetic, fault on signed overflow
Shrlw W = W >> size // Shift right logical
Shraw W = W >> size // Shift right arithmetic, sign extend
Shrnw W = W >> size // Shift right numeric, round -1 to zero
Bfextu N = extract (W, size, position) // Bit-field extract, zero extend
Bfexts N = extract (W, size, position) // Bit-field extract, sign extend
Bfins W = insert (W, N, size, position) // Bit-field insert
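To tie the wide bit-field extract back to the codec straddle case, here is a sketch of Bfextu and Bfexts over a (hi,lo) pair of 64-bit registers. The operand convention is an assumption, since the list above does not define it: position counts from bit 0 of lo, and size is 1..64.

```c
#include <assert.h>
#include <stdint.h>

/* Model of Bfextu on a (hi, lo) pair of 64-bit registers.
   Assumed convention: position counts from bit 0 of lo, size is 1..64. */
uint64_t bfextu(uint64_t hi, uint64_t lo, unsigned size, unsigned position) {
    assert(size >= 1 && size <= 64 && position + size <= 128);
    uint64_t v;
    if (position == 0)
        v = lo;                                   /* avoid a shift by 64 */
    else if (position < 64)
        v = (lo >> position) | (hi << (64 - position));  /* may straddle */
    else
        v = hi >> (position - 64);
    /* Mask to size bits; size == 64 needs no mask (1 << 64 is undefined). */
    return size == 64 ? v : v & ((UINT64_C(1) << size) - 1);
}

/* Bfexts: same field, sign-extended from its top bit. The final cast
   relies on the usual two's complement conversion of the compiler. */
int64_t bfexts(uint64_t hi, uint64_t lo, unsigned size, unsigned position) {
    uint64_t v = bfextu(hi, lo, size, position);
    uint64_t sign = UINT64_C(1) << (size - 1);
    return (int64_t)((v ^ sign) - sign);
}
```

In software the straddling case costs two shifts, an OR and a mask, plus the special cases needed to stay clear of undefined shift counts. A single instruction with a wide source would hide all of that.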
=====================================
Example unsigned 128 * 128 => 256 multiply:
// Unsigned Multiply 128*128 => 256
// (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
// Uses r4,r5,r6,r7,r8 as temp registers
//
muluw r5,r4 = r3*r0
muluw r6,r0 = r2*r0
muluw r8,r7 = r2*r1
muluw r3,r2 = r3*r1
addw3 r4,r1 = r4+r6+r7
addw3 r5,r2 = r5+r8+r2
addw2 r4,r2 = r2+r4
add3 r3 = r3+r5+r4
The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
than the even-numbered register pairs R0,R2,R4... is that the above
sequence would require extra moves to form the even-numbered pairs.
With separate pairs one can select registers so that everything lands
in the right dest at the right time.
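The sequence above can be checked with a scaled-down C model. This sketch assumes 32-bit narrow registers, so it computes 64*64 => 128 rather than 128*128 => 256, but the register allocation and instruction order are copied one for one. The helper names split and mul_wide are hypothetical; the local muluw, addw2 and addw3 functions are models of the instructions, not the real thing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t hi, lo; } Pair;   /* (hi, lo) result of a wide op */

static Pair split(uint64_t v) {
    Pair p = { (uint32_t)(v >> 32), (uint32_t)v };
    return p;
}
static Pair muluw(uint32_t a, uint32_t b) { return split((uint64_t)a * b); }
static Pair addw2(uint32_t a, uint32_t b) { return split((uint64_t)a + b); }
static Pair addw3(uint32_t a, uint32_t b, uint32_t c) {
    return split((uint64_t)a + b + c);
}

/* r[0..3] hold (r3,r2)*(r1,r0) on entry, the product (r3,r2,r1,r0) on exit. */
void mul_wide(uint32_t r[4]) {
    uint32_t r0 = r[0], r1 = r[1], r2 = r[2], r3 = r[3];
    uint32_t r4, r5, r6, r7, r8;
    Pair p;
    p = muluw(r3, r0);      r5 = p.hi; r4 = p.lo;   /* muluw r5,r4 = r3*r0 */
    p = muluw(r2, r0);      r6 = p.hi; r0 = p.lo;   /* muluw r6,r0 = r2*r0 */
    p = muluw(r2, r1);      r8 = p.hi; r7 = p.lo;   /* muluw r8,r7 = r2*r1 */
    p = muluw(r3, r1);      r3 = p.hi; r2 = p.lo;   /* muluw r3,r2 = r3*r1 */
    p = addw3(r4, r6, r7);  r4 = p.hi; r1 = p.lo;   /* addw3 r4,r1 = r4+r6+r7 */
    p = addw3(r5, r8, r2);  r5 = p.hi; r2 = p.lo;   /* addw3 r5,r2 = r5+r8+r2 */
    p = addw2(r2, r4);      r4 = p.hi; r2 = p.lo;   /* addw2 r4,r2 = r2+r4 */
    assert(r4 <= 1);        /* carry out of a two-input add is 0 or 1 */
    r3 = r3 + r5 + r4;                              /* add3  r3 = r3+r5+r4 */
    r[0] = r0; r[1] = r1; r[2] = r2; r[3] = r3;
}
```

Each model instruction reads its sources before writing its destinations, which is what lets the sequence reuse r0..r3 for the result without any extra moves.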