Newsportal USENET - Re: On my AMD FX-8370 I don't benefit from a compact code area.

albert@spenarnc.xs4all.nl writes:

> I test lina64 on my AMD FX-8370 (8 cores, 4 GHz).
>
> The genuine Byte benchmark sieve takes 1.5 ms on my unmodified lina.
> That is an indirect-threaded Forth with no optimisation and all the
> machine code scattered throughout the dictionary.
>
> I built a version where there is actually a code segment and all code is
> collected there. There was no significant difference in speed.
>
> All the code of the Forth fits comfortably in the L1 cache.
> Is this to be expected?
> An L1 cache hit is an L1 cache hit?

Not at all. Since the Pentium and the K5 (I think) there have been a
separate instruction cache and data cache (and later uop caches, which
can be seen as a kind of instruction cache). However, apart from the
early ones (Pentium, K6, and probably K5), the same grain (cache line,
typically 64 bytes these days) can reside in both the I-cache and the
D-cache, as long as that grain is not written to.

So if your complete Forth system including the primitives and the
sieve program fits into the D-cache and fits into the I-cache, and you
have no writes close to code, you will indeed only see compulsory
misses.
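
To make "no writes close to code" concrete, here is a minimal sketch
that lists the grains holding executed code that are also written to.
The 64-byte line size is the typical figure mentioned above; the
addresses, sizes, and function names are hypothetical and only serve
as illustration, they are not taken from lina or any other system.

```python
LINE_SIZE = 64  # bytes; assumed granularity (the "typically 64-byte" figure)

def lines_of(start, length, line_size=LINE_SIZE):
    """Set of cache-line indices touched by the byte range [start, start+length)."""
    if length <= 0:
        return set()
    first = start // line_size
    last = (start + length - 1) // line_size
    return set(range(first, last + 1))

def conflicting_lines(code_ranges, written_ranges, line_size=LINE_SIZE):
    """Cache lines that contain executed code and are also written to."""
    code = set()
    for start, length in code_ranges:
        code |= lines_of(start, length, line_size)
    written = set()
    for start, length in written_ranges:
        written |= lines_of(start, length, line_size)
    return sorted(code & written)

# Hypothetical layout 1: a 20-byte primitive at 0x1000, immediately
# followed by an 8-byte variable that the program writes to.
mixed = conflicting_lines([(0x1000, 20)], [(0x1014, 8)])

# Hypothetical layout 2: the same primitive, with the variable moved
# to a separate data area.
separated = conflicting_lines([(0x1000, 20)], [(0x8000, 8)])

print(mixed, separated)
```

An empty result corresponds to the favourable case described above; a
non-empty result points at grains where code and written data share a
line.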

I have posted here about the performance pitfalls of keeping code
close to data since 1995, and Forth system implementors typically have
taken measures only when I presented benchmark results where their
system looks bad. But they usually only did the minimum necessary for
that particular benchmark, so over the years the issue has come up
again and again.

One interesting aspect is that small benchmarks like the sieve are
often not affected, but larger application benchmarks are. E.g., in
my recent work [ertl24] all the small benchmarks are unaffected by the
problem, whereas several of the larger benchmarks were affected in
SwiftForth-4.0.0-RC87 and saw significant speedups from a fix in RC89.

So I applaud that you have done the right thing and completely
separated code from data. You may not see a benefit on Sieve, but
there may be a difference in a different program (and you may not even
notice until you measure both variants).
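
Measuring both variants can be as simple as the following sketch. It
is a generic timing harness, not the setup used for the results in
[ertl24]; the two workload functions are stand-ins, and in practice
each would run the benchmark on one of the two builds (e.g., through
subprocess.run with your own command lines).

```python
import time

def best_of(fn, runs=5):
    """Smallest wall-clock time in seconds over several runs of fn()."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

def compare(variant_a, variant_b, runs=5):
    """Time both variants; return (time_a, time_b, time_a / time_b)."""
    a = best_of(variant_a, runs)
    b = best_of(variant_b, runs)
    ratio = a / b if b > 0 else float("inf")
    return a, b, ratio

# Stand-in workloads (hypothetical); replace them with calls that run
# the same benchmark on the mixed and on the separated build.
def workload_mixed():
    return sum(i * i for i in range(200_000))

def workload_separated():
    return sum(i * i for i in range(200_000))

a, b, ratio = compare(workload_mixed, workload_separated)
print(f"mixed: {a:.6f} s  separated: {b:.6f} s  ratio: {ratio:.2f}")
```

Taking the minimum over several runs reduces the influence of
interference from other processes; a ratio close to 1 on a small
benchmark says little about larger applications.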

@InProceedings{ertl24,
  author =       {M. Anton Ertl},
  title =        {How to Implement Words (Efficiently)},
  crossref =     {euroforth24},
  pages =        {43--52},
  url =          {http://www.euroforth.org/ef24/papers/ertl.pdf},
  url-slides =   {http://www.euroforth.org/ef24/papers/ertl-slides.pdf},
  video =        {https://www.youtube.com/watch?v=bAq4760h5ZQ},
  OPTnote =      {not refereed},
  abstract =     {The implementation of Forth words has to satisfy the
                  following requirements: 1) A word must be
                  represented by a single cell (for
                  \code{execute}). 2) A word may represent a
                  combination of code and data (for, e.g.,
                  \code{does>}). In addition, on some hardware,
                  keeping executed native code and (written) data
                  close together results in slowness and therefore
                  should be avoided; moreover, failing to pair up
                  calls with returns results in (slow) branch
                  mispredictions. The present work describes how
                  various Forth systems over the decades have
                  satisfied the requirements, and how many systems run
                  into performance pitfalls in various situations.
                  This paper also discusses how to avoid this
                  slowness, including in native-code systems.}
}

@Proceedings{euroforth24,
  title =        {40th EuroForth Conference},
  booktitle =    {40th EuroForth Conference},
  year =         {2024},
  key =          {EuroForth'24},
  url =          {http://www.euroforth.org/ef24/papers/proceedings.pdf}
}

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
   New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/

Date	Subject	#	Author
27 Feb 25	On my AMD FX-8370 I don't benefit from a compact code area.	3	albert
27 Feb 25	Re: On my AMD FX-8370 I don't benefit from a compact code area.	2	Anton Ertl
28 Feb 25	Re: On my AMD FX-8370 I don't benefit from a compact code area.	1	albert