On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
On Wed, 28 Aug 2024 3:33:40 +0000, BGB wrote:
On 8/27/2024 6:50 PM, MitchAlsup1 wrote:
On Tue, 27 Aug 2024 22:39:02 +0000, BGB wrote:
>
On 8/27/2024 2:59 PM, John Dallman wrote:
In article <vajo7i$2s028$1@dont-email.me>, tkoenig@netcologne.de (Thomas
Koenig) wrote:
>
Just read that some architects are leaving Intel and doing their own
startup, apparently aiming to develop RISC-V cores of all things.
>
They're presumably intending to develop high-performance cores, since
they have substantial experience in doing that for x86-64. The question
is if demand for those will develop.
>
>
Making RISC-V "not suck" in terms of performance will probably at least
be easier than making x86-64 "not suck".
>
Yet, these people have decades of experience building complex things
that made x86 (also) not suck. They should have the "drawing power" to
get more people with similar experience.
>
The drawback is that they are competing with "everyone else in
RISC-V-land", and starting several years late.
>
Though, if anything, they probably have the experience to know how to
make things like the fabled "opcode fusion" work without burning too
many resources.
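To make the "opcode fusion" point concrete, here is a minimal C sketch of the pairing check a fusing decoder might perform on a LUI+ADDI sequence (the canonical fusion candidate for 32-bit constant loads). The instruction encodings follow the RISC-V base spec; the function itself and its fusion policy are an illustration, not any shipping core's logic:

```c
#include <stdint.h>

/* Sketch: detect a LUI+ADDI pair that a fusing decoder could treat as a
   single 32-bit constant load. Fusion requires the ADDI to read and
   write the LUI's destination register. */
static int fuse_lui_addi(uint32_t i0, uint32_t i1, int *rd, int32_t *value)
{
    if ((i0 & 0x7F) != 0x37)            /* i0 must be LUI (opcode 0110111) */
        return 0;
    if ((i1 & 0x707F) != 0x0013)        /* i1 must be ADDI (opcode 0010011, funct3=000) */
        return 0;
    int rd0 = (i0 >> 7)  & 0x1F;
    int rd1 = (i1 >> 7)  & 0x1F;
    int rs1 = (i1 >> 15) & 0x1F;
    if (rd0 != rd1 || rd0 != rs1)       /* same dest, ADDI consumes LUI result */
        return 0;
    int32_t hi = (int32_t)(i0 & 0xFFFFF000);  /* LUI's upper-20 immediate */
    int32_t lo = ((int32_t)i1) >> 20;         /* sign-extended 12-bit immediate */
    *rd = rd0;
    *value = hi + lo;
    return 1;
}
```

A real implementation would do this in decode hardware across an aligned fetch pair, which is where the "burning resources" question comes in: the comparators and the cross-instruction immediate mux are not free.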
>
>
>
Android is apparently waiting for a new RISC-V instruction set
extension; you can run various Linuxes, but I have not heard
about anyone wanting to do so on a large scale.
>
>
My thoughts on the "major missing features" are still:
Needs register-indexed load;
Needs an intermediate-size constant load (such as 17-bit sign-extended)
in a 32-bit op.
>
Full access to constants.
>
>
That would be better, but is unlikely within the existing encoding
constraints.
>
But, say, if one burned one of the remaining unused "OP Rd, Rs, Imm12s"
encodings as an Imm17s, well then...
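To show what the Imm12s-vs-Imm17s tradeoff buys, here is a small C sketch of the fit test a compiler back-end might use when deciding whether a constant fits a single instruction or must be split into a multi-op sequence; the helper name is made up, the 12/17-bit widths match the discussion above:

```c
#include <stdint.h>

/* Does v fit in an n-bit sign-extended immediate field?
   (Hypothetical helper for a compiler's constant-materialization pass.) */
static int fits_simm(int64_t v, int bits)
{
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi =  ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}
```

A 12-bit field covers -2048..2047; widening to 17 bits covers -65536..65535, which catches a much larger share of the constants real code actually uses in a single 32-bit op.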
Dropping compressed instructions gives enough OpCode room to put the
entire My 66000 ISA in what remains.
Probably true.
Looks like Qualcomm thought similar:
Apparently they dropped the 'C' extension and added a bunch of stuff carried over from ARM, putting it in the same encoding space.
Apparently, in their version of the RISC-V ISA, there are mostly 32- and 64-bit instructions (with 48-bit being skipped over), with a mandatory 32-bit alignment for the instruction stream.
Though, it also appears this decision has been... controversial...
Some stuff also implies there has been sort of a 3 way fight between Qualcomm, SiFive, and Google, over a lot of this. But, I don't know the specifics (beyond, well, SiFive wanting RV64GC to be the standard).
>
With the OpCode space already 98% filled there does not need to
be such a list.
>
>
One would still need it if multiple parties want to be able to define an
extension independently of each other and not step on the same
encodings.
>
And what kind of code compatibility would you have between different
designs...
If people can agree as to the encodings, then implementations are more free to pick which extensions they want or don't want.
If the encodings conflict with each other, no such free choice is possible.
Well, short of making the CPU modal.
Then one need not care as much about encoding issues, they can glue multiple ISA's together if they want.
Or, like, the newer RasPi Pico core:
Rather than having both Cortex-M33 and Hazard3 RV32IMC cores, they could in principle have implemented a single core with both Thumb2 and RV32IMC decoders (say, with the CPU pattern-matching the entry-point instruction at power-up, and requiring the entry point to contain a branch instruction in the respective ISA).
The closest we have on the latter point is the "Composable Extensions"
extension by Jan Gray, which seems to be mostly that part of the ISA's
encoding space can be banked out based on a CSR or similar.
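The "pattern-match the entry point" idea can be sketched in a few lines of C. This is purely an illustration of the scheme described above, not how the RP2350 actually selects its architecture; the probe assumes the reset vector holds a RISC-V JAL when in RV mode (opcode bits 0x6F), and falls back to Thumb2 otherwise:

```c
#include <stdint.h>

enum isa_mode { ISA_RV32, ISA_THUMB2 };

/* Hypothetical reset-time mode probe: examine the first word at the
   entry point. If its low 7 bits decode as a RISC-V JAL opcode,
   select the RV32 decoder; otherwise assume a Thumb2 branch. */
static enum isa_mode probe_entry(uint32_t first_word)
{
    if ((first_word & 0x7F) == 0x6F)    /* RISC-V JAL opcode (1101111) */
        return ISA_RV32;
    return ISA_THUMB2;
}
```

The obvious weakness (and why it is only a sketch) is aliasing: the scheme only works if the platform mandates that the entry word be a branch whose encoding is unambiguous between the two ISAs.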
>
>
Though, bigger immediate values and register-indexed loads do arguably
better belong in the base ISA encoding space.
>
Agreed, but there is so much more.
>
FCMP Rt,#14,R19 // 32-bit instruction
ENTER R16,R0,#400 // 32-bit instruction
..
>
>
These are likely a bit further down the priority list.
>
>
Prolog/Epilog happens once per function, and often may be skipped for
small leaf functions, so this seems like a lower priority. More so if one
lacks a good way to optimize it much beyond the sequence of load/store
ops it would be replacing (and maybe no way to do it much
faster than whatever can be moved in a single clock cycle with the
available register ports).
My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
It likely isn't going to happen because a 1-wide machine isn't going to have the needed register ports.
But, if one doesn't have the register ports, there is likely no viable way to move 4 registers/cycle to/from memory (and it wouldn't make sense for the register file to have a path to memory that is wider than what the pipeline has).
If one settles on 2-wide as the minimum, they can load/store 2 registers per clock-cycle.
I am leaving out the possibility of 2-way (inout) ports, mostly because tristate logic is generally only allowed on external IO pins or as an optimization hint to logic synthesis.
As noted, I was mostly going with a Load/Store instruction that does a register pair.
Could maybe do a register triple, but, no... Also, it would likely add cost to widen the L1 D$ front-end memory interface to 192 bits (and not likely to be worth the wonk this would add).
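The cycle math being argued here reduces to a ceiling division: save/restore cost is the register count over the per-cycle transfer width. A deliberately simplified C model (the helper is hypothetical, and it ignores pipelining and cache effects):

```c
/* Cycles to spill or fill n registers when the machine can move
   k registers to/from memory per cycle (naive model: no overlap,
   no cache misses). */
static int spill_cycles(int n, int k)
{
    return (n + k - 1) / k;     /* ceil(n / k) */
}
```

So saving 16 registers costs 16 cycles at 1 register/cycle, 8 at 2 (the register-pair case), and 4 at 4 (the ENTER/EXIT case), which is the gap the thread is debating, and why the answer hinges on register-file ports and L1 interface width rather than on the instruction encoding.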
>
>
>
At present, I am still on the fence about whether or not to support the
C extension in RISC-V mode in the BJX2 Core, mostly because the encoding
scheme just sucks bad enough that I don't really want to deal with it.
>
>
Realistically, can't likely expect anyone else to adopt BJX2 though.
>
Captain Obvious strikes again.
>
>
This is likely the fate of nearly every hobby class ISA.
>
Time to up your game to an industrial quality ISA.
Open question of what an "industrial quality" ISA has that BJX2 lacks...
Limiting the scope to things that RISC-V and ARM have.
Well, besides things like PCIe lanes...
And USB support that actually works.
But, a higher priority is trying to figure out the whole "RV64G isn't working with virtual memory" issue, where I have yet to find anything that causes a difference *other* than running with virtual memory enabled.
Basically, at this point, it has turned more into an ongoing slog of debugging, which is personally less interesting. And, seemingly, the more debugged it gets, the more elusive the remaining bugs become (either disappearing again as soon as one tries to start tracking them down, or otherwise refusing to give up the secret of where exactly they are happening or what their cause is).
Well, OK, I did now find something at least:
The program does start running if I load the ELF image into physically-backed RAM rather than pagefile-backed virtual memory (implying it may be related either to the pagefile, or to running above the 4GB mark; but the PEL loader also loads programs above the 4GB mark without much issue, using the same sorts of flags when allocating virtual memory).
Then again, it might also have to do with the "mprotect()" call (maybe something is going amiss with setting the memory to "RWX usermode"?...).
Either way, the general behavior was seemingly that the program was crashing nearly as soon as a branch was made into the RISC-V code, but only in this particular context.
...
As can be noted, in terms of ISA features, BJX2 is a superset of RV64G (and there are relatively few things that RISC-V can express that BJX2 can't).
Main differences then are:
Lack of software and toolchain support;
The core ISA listing is bigger than the RISC-V listing.
Granted, it currently being a 3-way hydra of Baseline, XG2, and RV64G, probably doesn't help matters.
Might make sense to choose "one of them" as the direction forward.
Or, say:
Baseline: Smaller binaries;
XG2: Faster;
RV64G: Binary compatibility with other stuff.
But, slightly worse code density and slower (*1).
*1: For a while, RV64 was slightly ahead in terms of code density, but it turned out a lot of this was due to BGBCC including runtime library functionality that wasn't being used (along with "runtime library build lag", where code that had been added to the runtime library was reflected in the BJX2 builds but not in the RV builds).
Though, between XG2 and RV64G, it is pretty close to break even for non-relocatable / bare-metal binaries.
For PIE binaries, both code density and performance take a fairly significant hit with RV64G (so, it is "the worse option" on this front).
And, to make RV more competitive, one needs 1-cycle ALU ops and 2-cycle memory loads, because RV also takes a (comparably larger) hit with 2-cycle ALU ops and 3-cycle memory loads.
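As a rough way to see why the latency hit is "comparably larger" for RV: with fewer big-immediate and indexed forms, an RV instruction stream tends to carry more dependent ops per unit of work, so per-op latency multiplies through more instructions. A deliberately naive serial-chain model in C (hypothetical helper, no overlap or dual-issue assumed):

```c
/* Naive cycle estimate for a fully dependent chain of ALU ops and
   loads, given per-op latencies. Ignores issue width and forwarding
   beyond what the latencies already encode. */
static int chain_cycles(int alu_ops, int loads, int alu_lat, int load_lat)
{
    return alu_ops * alu_lat + loads * load_lat;
}
```

For, say, 10 dependent ALU ops and 4 loads, the 1-cycle-ALU/2-cycle-load machine takes 18 cycles where the 2-cycle-ALU/3-cycle-load machine takes 32, and the gap grows with the instruction count.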
The difference may well be bigger if BGBCC's code generation didn't "kinda suck"...
...