Subject : Re: Efficiency of in-order vs. OoO
From : cr88192 (at) *nospam* gmail.com (BGB)
Groups : comp.arch
Date : 25. Mar 2024, 20:33:44
Organization : A noiseless patient Spider
Message-ID : <utsjj0$19bs4$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13
User-Agent : Mozilla Thunderbird
On 3/25/2024 1:35 PM, Anton Ertl wrote:
> scott@slp53.sl.home (Scott Lurndal) writes:
>> There is a significant demand for performance monitoring. Note
>> that in addition to the standard performance monitoring registers,
>> AArch64 also (optionally) supports statistical profiling and
>> out-of-band instruction tracing (ETF). The demand from users
>> is such that all those features are present in most designs.
> Interesting. I would have expected that the likes of me are few and
> far between, and easy to ignore for a big company like ARM, Intel or AMD.
> My theory was that the CPU manufacturers put performance monitoring
> counters in CPUs in order to understand the performance of real-world
> programs themselves, and how they should tweak the successor core to
> relieve it of bottlenecks.
Odd...
I had mostly skipped any performance counters in hardware, and was instead using an emulator to model performance (among other things). For performance tuning, though, this only works insofar as the emulator remains accurate in terms of cycle costs (I make an effort, but accuracy seems to vary).
One annoyance is that trying to model some newer or more advanced features may bog down the emulator enough that it can't maintain real-time performance.
Though, I guess it is likely that for a "not painfully slow" processor (like an A55 or similar), cycle-accurate emulation in real time at the native clock speed may not be viable (one would burn way too many cycles trying to model things like the cache hierarchy and branch predictor, ...).
Some amount of debugging and performance measurement is possible via "LED" outputs, which show the status of pipeline stalls and the signals that feed into these stalls (and, indirectly, the percentage of time spent running instructions, via the absence of stalls, ...), ...
Had generated a cycle-use ranking for the full simulation by having the testbench code run checks mostly on these LED outputs (vs looking at them directly).
Runs on an actual FPGA are admittedly comparably infrequent.
Though, ironically, have noted that things like shell commands, etc., can still be fairly responsive even for Verilog simulations effectively running in kHz territory (whereas good responsiveness is sometimes a struggle even for modern PCs running Windows).
Or, having recently been working on a tool: due to some combination of factors, at one point in testing, process creation kept taking around 20 seconds each time, which was rather annoying (because seemingly Windows would upload the whole binary to the internet, then wait for a response, before letting it run).
Seemingly, something about the tool was triggering "Windows Defender SmartScreen" or similar; it never gave any warnings/messages about it, merely caused a fairly long/annoying delay whenever relaunching the tool. Then it just magically went away (after one of my secondary UPSs had "let the smoke out" and the ISP had also partly gone down for a while; could see ISP-local IPs, but access to the wider internet was seemingly disrupted, ... Like, seemingly, a "the ghosts in the machine are not happy right now" type of event).
The tool itself was mostly implementing something sorta like SFTP, but for working with disk images instead. Starting to want to revisit the filesystem question, but looking back at NTFS, I still don't really want to try to implement an NTFS driver.
Possibly EXT2/3/4 would be an option, apart from the annoyance that Windows can't access it, so I would still be little better off than just rolling my own (and trying to have the core design be hopefully not needlessly complicated).
...