Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
I also think code would be a bit more efficient if there were more
registers available for parameter passing and as scratch registers -
perhaps 6 would make more sense.
>
Basically, here, there is competing pressure between the compiler
needing a handful of preserved registers, and the compiler being
more efficient if there were more argument/result passing registers.
>
My 66000 ABI has 8 argument registers, 7 temporary registers, 14
preserved registers, a FP, and a SP. IP is not part of the register
file. My ABI has a note indicating that the aggregations can be
altered, just that I need a good reason to change.
>
I looked high and low for codes using more than 8 arguments and
returning aggregates larger than 8 double words, and about the
only things I found were a handful of []print[]() calls.
I meet such code with reasonable frequency. I peeked semi-randomly
into Lapack. The first routine that I looked at had 8 arguments,
so within your limit. The second is:
SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
$ WORK, INFO )
which has 13 arguments.
A large number of arguments is typical in old-style Fortran numeric
code.
While there has been much discussion down thread relating to Waldek's
other points, there hasn't been much about these.
So, some questions. Has Lapack (and the other old style Fortran numeric
code that Waldek mentioned) lost its/their importance as a major user of
CPU cycles? Or do these subroutines consume so many CPU cycles that the
overhead of the large number of parameters is lost in the noise? Or is
there some other explanation for Mitch not considering their importance?
Some comments on this:
You are implicitly assuming that passing a large number of
arguments is expensive. Of course, if you can do the job with
a smaller number of arguments, then there may be some saving.
However, a large number of arguments is partially there to increase
performance. Let me illustrate this with an example having a
smaller number of arguments. I have a routine which is briefly
described below:
++ vector_combination(v1, c1, v2, c2, n, delta, p) replaces
++ first n + 1 entries of v1 by corresponding entries of
++ c1*v1+c2*x^delta*v2 mod p.
There are 7 arguments here and it only deals with vectors (one
dimensional arrays). Instead of the routine above I could use
5 separate routines: one to extract a subvector, one shifting
entries, one multiplying a vector by a scalar, one for addition
and one for replacing a subvector. Using separate routines
would take roughly 3-5 times more time and require
intermediate storage. Dynamically allocating this storage
would decrease performance, and reusing statically allocated
work vectors would significantly complicate the code. And
of course having 5 calls instead of a single one would also
complicate the code. So basically, I can use a routine with a
large number of arguments which does more work and get
simpler and faster code, or I could use "simpler" routines with a
small number of arguments and get more complicated and slower
code.
My routine above was for vectors; a similar routine for arrays
would have a larger number of parameters, exceeding 8. Actually,
an already more general routine for vectors would have an extra
parameter to specify the starting index (which currently is
assumed to be 0 and is the only case that I need).
In the case of Lapack, a reasonably typical case is a routine operating
on a subblock of an array, which means that an array (subblock) is
described by 4 arguments: a pointer to the first element, the leading
dimension (that is, the corresponding dimension of the containing array)
and the 2 dimensions of the subblock. Some dimensions may be shared, but
clearly even in the simplest case there are several parameters.
There may be additional numeric parameters, work areas, parameters
specifying if an array is transposed or not (otherwise there would
be need for separate routines, or the user would be forced into a
separate call to matrix transposition). And there is a convention of
returning information about possible errors in the 'INFO' variable.
Lapack has an inefficiency due to Fortran conventions. Namely,
in a natural C interface most arguments would be passed by value,
but Fortran compilers pass arguments by reference. So
even if all machine-level arguments were passed in registers,
the values would still need to be saved to memory by the caller and
read back by the called routine.
Modern languages have support for records/structures, so at the source
code level the number of arguments may be smaller. However, passing
structures by address is efficient when structures are only
passed down (quite a typical case in modern code, where data goes
through several layers before doing real work), but incurs a cost
when there is actual access. Passing structures by value means
that nominally the number of parameters is smaller, but there is
still a need to pass several values.
Concerning what machine architects do: for a long time the goal was
high _average_ performance, based on some prediction of load.
A large number of arguments is reasonably frequent in scientific
codes. The modern tendency is to pass addresses of aggregates, as
that is better behaved in OO contexts. I am not aware of any
publicly available substantial body of realistic COBOL code,
but a reasonable guess is that COBOL routines do quite a lot of
work between calls. In a non-OO, non-functional context the compiler
can inline small routines, effectively leading to a case where
calls are relatively rare. AFAIR, in the initial AMD-64 gcc port
the Suse team that did it claimed about 2-3% better performance due
to a complicated calling convention trying to optimize use
of registers. In particular, they measured object code size
of a large body of Linux programs (they had no real hardware,
so were unable to measure code speed) and optimized the convention
based on this. Later, an Intel team claimed that due to improved
inlining calls were rare and the effect of the calling convention was
of the order of a fraction of a percent. Of course, AMD-64 is limited
by its 16 general purpose registers; on a machine with more registers
one can pass more arguments in registers, but I doubt that it
pays to go above 10-12. OTOH I think that having more
return registers (I mean a number comparable to the argument-passing
registers) would improve performance, but probably
code returning many values is so rare that architects do not
care much.
-- Waldek Hebisch