On 10/09/2024 01:58, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 09/09/2024 16:36, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 08/09/2024 23:34, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
And while microcontrollers sometimes have a limited form of branch
prediction (such as prefetching the target from cache), the more
numerous and smaller devices don't even have instruction caches.
Certainly none of them have register renaming or speculative execution.
IIUC the STM32F4 series has a cache, and some of those parts are not so
big. There are now several Chinese variants of the STM32F103, and some
of them have caches (some very small, like 32 words; IIRC one has 8
words and it is hard to decide if this is a very small cache or a big
prefetch buffer).
There are different kinds of cache here. Some of the Cortex-M cores
have optional caches (i.e., the microcontroller manufacturer can choose
to have them or not).
<https://en.wikipedia.org/wiki/ARM_Cortex-M#Silicon_customization>
I do not see relevant information at that link.
There is a table of the Cortex-M cores, with the sizes of the optional
caches.
Flash memory, flash controller peripherals, and external memory
interfaces (including things like QSPI) are all specific to the
manufacturer, rather than part of the Cortex-M cores from ARM.
Manufacturers can do whatever they want there.
AFAIK a typical Cortex-M design has the core connected to a "bus matrix".
It is up to the chip vendor to decide what else is connected to the bus
matrix.
Yes.
However, there are other things connected before these crossbar
switches, such as tightly-coupled memory (if any).
TCM is _not_ a cache.
Correct. (I did not suggest or imply that it was.)
And the cpu caches
(if any) are on the cpu side of the switches.
Caches are attached where the system designer thinks they are useful
(and possible). The word "cache" has a well-established meaning and
ARM (or you) has no right to redefine it.
I am using it in the manner ARM uses it when talking about ARM
processors and microcontroller cores. I think that is the most relevant
way to use the term here. The term "cache" has many meanings in many
contexts - there is no single precise "well-established" or "official"
meaning.
Context is everything. That is why I have been using the term
"cpu cache" for the cache tied tightly to the cpu itself, which comes as
part of the core that ARM designs and delivers, along with parts such as
the NVIC.
And I have tried to use terms such as "buffer" or "flash
controller cache" for the memory buffers often provided as part of flash
controllers and memory interfaces on microcontrollers, because those are
terms used by the microcontroller manufacturers.
Manufacturers also have a
certain amount of freedom regarding the TCMs and caches, depending on
which core they are using and which licenses they have.
There is a convenient diagram here:
<https://www.electronicdesign.com/technologies/embedded/digital-ics/processors/microcontrollers/article/21800516/cortex-m7-contains-configurable-tightly-coupled-memory>
For me it does not matter if it is ARM design or vendor specific.
Normal internal RAM is accessed via the bus matrix, and in the MCUs that
I know about it is fast enough that a cache is not needed. So caches
come into play only for flash (and possibly external memory, but a
design with external memory probably will be rather large).
Typically you see data caches on faster Cortex-M4 microcontrollers with
external DRAM, and it is also standard on Cortex-M7 devices. For the
faster chips, internal SRAM on the AXI bus is not fast enough. For
example, the NXP i.MX RT106x family typically runs at 528 MHz core clock,
but the AXI bus and cross-switch are at 133 MHz (a quarter of the
speed). The tightly-coupled memories and the caches run at full core speed.
OK, if you run the core at a faster clock than the bus matrix, then a
cache attached on the core side makes a lot of sense. And since the
cache has to compensate for the lower bus speed it must be reasonably
large.
Yes.
But
if you look at devices where the bus matrix runs at the same clock
as the core, then it makes sense to put the cache on the other side.
No.
You put caches as close as possible to the prime user of the cache. If
the prime user is the cpu and you want to cache data from flash,
external memory, and other sources, you put the cache tight up against
the cpu - then you can have dedicated, wide, fast buses to the cpu.
But it can also make sense to put small buffers as part of memory
interface controllers. These are not organized like data or instruction
caches, but are specific for the type of memory and the characteristics
of it.
How this is done depends on details of the interface, details of
the internal buses, and how the manufacturer wants to implement it. For
example, on one microcontroller I am using there are queues to let it
accept multiple flash read/write commands from the AHB bus and the IPS
bus, but read-ahead is controlled by the burst length of read requests
from the cross-switch (which in turn will come from cache line fill
requests from the cpu caches). On a different microcontroller, the
read-ahead logic is in the flash controller itself as that chip has a
simpler internal bus where all read requests will be for 32 bits (it has
no cpu caches). An external DRAM controller, on the other hand, will
have queues and buffers optimised for multiple smaller transactions and
be able to hold writes in queues that get lower priority than read requests.
These sorts of queues and buffers are not generally referred to as
"caches", because they are specialised queues and buffers. Sometimes
you might have something that is in effect perhaps a two-way
single-entry 16 byte wide read-only cache, but using the term "cache"
here is often confusing. At best it is a "flash controller cache", and
very distinct from a "cpu cache".
It seems that vendors do not like to say that they use a cache; instead
they use misleading terms like "flash accelerator".
That all depends on the vendor, and on how the flash interface
controller is designed. Vendors do like to use terms that sound good, of
course!
So a "cache" of 32 words is going to be part of the flash interface, not
a cpu cache
Well, caches were never part of the CPU proper; they were part of the
memory interface. They could act for the whole memory or only for the
part that needs it (like flash). So I do not understand what "not a cpu
cache" is supposed to mean. More relevant is whether such a thing acts
as a cache: a 32-word thing almost surely will act as a cache, an
8-word thing may be a simple FIFO buffer (or may act smarter, showing
behaviour typical of caches).
Look at the diagram in the link I gave above, as an example. CPU caches
are part of the block provided by ARM and are tightly connected to the
processor. Control of the caches (such as for enabling them) is done by
hardware registers provided by ARM, alongside the NVIC interrupt
controller, SysTick, MPU, and other units (depending on the exact
Cortex-M model).
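To make the point concrete: on a Cortex-M7, the ARM-defined cache controls live in the System Control Block, and CMSIS-Core exposes them as helper functions. This is a hardware-targeted sketch, not host-runnable code; "device.h" is a stand-in for the vendor's CMSIS device header, which varies by vendor.

```c
/* Sketch for a Cortex-M7 part.  "device.h" stands in for the vendor's
   CMSIS device header, which pulls in core_cm7.h and the SCB cache
   helpers - the exact header name is vendor-specific. */
#include "device.h"

void enable_cpu_caches(void)
{
    SCB_EnableICache();   /* CMSIS-Core: invalidate, then enable I-cache */
    SCB_EnableDCache();   /* CMSIS-Core: invalidate, then enable D-cache */
}
```

Note how none of this involves the vendor's flash controller registers: the cpu caches are controlled through the ARM-provided block, exactly as described above.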
This is completely different from the small buffers that are often
included in flash controllers or external memory interfaces as
read-ahead buffers or write queues (for RAM), which are as external to the
processor core as SPI, UART, PWM, ADC, and other common blocks provided
by the microcontroller manufacturer.
The discussion started about the possible interaction of caches
and virtual function dispatch.
OK - I admit to having lost track of the earlier discussion, so that is
helpful.
This interaction does not depend
on you calling it a cache. It depends on cache hits/misses,
their cost, and possible eviction. And actually small caches
can give "interesting" behaviour: with a small code footprint there
may be a 100% hit ratio, but one extra memory reference may lead
to significant misses. And even small caches behave differently
than simple buffers.
I agree that behaviour can vary significantly.
When you have a "flash controller cache" - or read-ahead buffers - you
typically have something like a 60-80% hit ratio for sequential code and
nearly 100% for very short loops (like you'd have for a memcpy() loop).
You have close to 0% hit ratio for branches or calls, regardless of
whether they are virtual or not (with virtual function dispatch
generally having one extra indirection at 0% hit rate). This is the
kind of "cache" you often see in microcontrollers with internal flash
and clock speeds of up to perhaps 150 MHz, where the flash might be at a
quarter of the main cpu clock.
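Those hit ratios can be illustrated with a toy model. This is a minimal simulation under stated assumptions (two ways, one 16-byte line per way, LRU replacement), not any vendor's actual design:

```c
#include <stdint.h>

/* Toy model, not any vendor's actual design: a read-ahead buffer with
   two ways, one 16-byte line per way, and LRU replacement. */
typedef struct { uint32_t tag[2]; int valid[2]; int lru; } rab_t;

static int rab_access(rab_t *b, uint32_t addr) {
    uint32_t tag = addr >> 4;                  /* 16-byte lines */
    for (int w = 0; w < 2; w++)
        if (b->valid[w] && b->tag[w] == tag) {
            b->lru = !w;                       /* other way becomes LRU */
            return 1;                          /* hit */
        }
    int v = b->lru;                            /* miss: refill the LRU way */
    b->tag[v] = tag;
    b->valid[v] = 1;
    b->lru = !v;
    return 0;
}

/* Sequential 4-byte fetches miss once per new 16-byte line, so 3 of
   every 4 accesses hit: a 75% ratio, in the 60-80% range given above. */
static double seq_hit_ratio(void) {
    rab_t b = {0};
    int hits = 0, n = 400;
    for (int i = 0; i < n; i++)
        hits += rab_access(&b, (uint32_t)(i * 4));
    return (double)hits / n;
}
```

A short loop that fits entirely in the two lines would hit nearly 100%, while a branch to a new line always misses, matching the behaviour described above.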
(which are typically 16KB - 64KB,
I wonder where you found this figure. Such a size is typical for
systems bigger than MCUs. It could be useful for MCUs with
flash on a separate die, but with flash on the same die as the CPU
a much smaller cache is adequate.
Look at the Wikipedia link I gave. Those are common sizes for the
Cortex-M7 (which is pretty high-end), and for the newer generation of
Cortex-M35P and Cortex-M5x parts. I have on my desk an RT1062 with a
600 MHz Cortex-M7, 1 MB internal SRAM, 32 KB I and D caches, and
external QSPI flash.
OK, as I wrote it makes sense for them. But for smaller machines
much smaller caches may be adequate.
As I have said, they are not really caches in the same sense as you have
for a cpu cache.
But certainly a "flash controller cache" or read-ahead
buffer (especially if there are two of them) can make a big difference
to the throughput of a microcontroller, and equally certainly a cpu
cache would be an unreasonable cost in die area, power, and licensing
fees for most microcontrollers. Thus these small buffers - or very
small, very specialised caches in the flash controller - are a good idea.
and only found on bigger
microcontrollers with speeds of perhaps 120 MHz or above). And yes, it
is often fair to call these flash caches "prefetch buffers" or
read-ahead buffers.
Typical code has enough branches that simple read-ahead beyond 8
words is unlikely to give good results. OTOH delivering things
that were accessed in the past and still present in the cache
gives good results even with very small caches.
There are no processors with caches smaller than perhaps 4 KB - it is
simply not worth it.
Historically there were processors with small caches, 256 B in
Motorola chips and I think smaller too. It depends on the whole
design.
For a general cpu data cache on a modern cpu, the cache control logic is
probably going to require the same die area as a few KB of cache
storage, as a minimum - so it makes no sense to have such small cpu
caches. The logic for instruction caches is simpler. In days gone by,
balances were different and smaller caches could be useful. The 68020
had a 256 byte instruction cache, and the 68030 added a 256 byte data
cache alongside it; both were single way. (The 68040 moved up to 4 KB
caches.)
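A toy model shows why even such a small single-way cache paid off. The 256-byte size and 4-byte line come from the 68020 description above; the rest is a generic direct-mapped cache, not the actual 68020 logic:

```c
#include <stdint.h>

/* Toy model of a 256-byte, single-way (direct-mapped) instruction
   cache with 4-byte lines.  Only the sizes come from the text above;
   the rest is a generic direct-mapped cache. */
typedef struct { uint32_t tag[64]; int valid[64]; } icache_t;

static int icache_fetch(icache_t *c, uint32_t addr) {
    uint32_t line = addr >> 2;       /* 4-byte lines */
    uint32_t idx  = line & 63;       /* 64 lines, direct-mapped */
    uint32_t tag  = line >> 6;
    if (c->valid[idx] && c->tag[idx] == tag)
        return 1;                    /* hit */
    c->tag[idx] = tag;               /* miss: refill this line */
    c->valid[idx] = 1;
    return 0;
}

/* A 64-byte loop executed 10 times: only the first pass misses, so
   the hit ratio is 144/160 = 90% despite the tiny cache. */
static double loop_hit_ratio(void) {
    icache_t c = {0};
    int hits = 0, total = 0;
    for (int iter = 0; iter < 10; iter++)
        for (uint32_t pc = 0; pc < 64; pc += 4) {
            hits += icache_fetch(&c, pc);
            total++;
        }
    return (double)hits / total;
}
```

Any loop under 256 bytes hits 100% after its first pass; code larger than the cache, or two routines whose addresses alias the same lines, degrades quickly - which is the balance the posters are debating.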
Currently, for "big" processors, really small caches seem
to make no sense. Microcontrollers have their own constraints.
A manufacturer may decide that a cache giving a 10% average improvement
is not worth the uncertainty of execution time. Or may decide that a
small cache is the cheapest way to get better benchmark figures.
You are correct that microcontrollers have different constraints, and
that jitter and variation of timing is far more of a cost in
microcontrollers than it is on "big" processors, where throughput is
key. The other factor here is latency. On earlier designs such as the
aforementioned M68k family, you could often add a fair bit of logic
without requiring extra clock cycles. Thus the cache was "free". That
is different now, even on microcontrollers. Adding a cpu cache on even
the slowest of modern microcontrollers will mean at least a clock cycle
extra on cache misses compared to no cache - for medium devices (say,
120 MHz Cortex-M4) it would mean 2 or 3 extra cycles. So unless you are
getting a significant hit ratio, it is not worth it.
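That "worth it" threshold can be made concrete with the usual average-access-time arithmetic. The cycle counts here are illustrative assumptions, not figures for any particular chip:

```c
/* Illustrative break-even calculation.  Parameters (all assumptions,
   not datasheet figures):
     t_flash    - cycles for an uncached flash access
     t_hit      - cycles for a cache hit
     miss_extra - cycles a miss adds on top of the flash access  */
static double break_even_hit_ratio(double t_flash, double t_hit,
                                   double miss_extra) {
    /* Solve  h*t_hit + (1-h)*(t_flash + miss_extra) == t_flash  for h. */
    return miss_extra / (t_flash + miss_extra - t_hit);
}
```

With, say, a 5-cycle flash access, 1-cycle hits, and 3 extra miss cycles, the cache only breaks even at about a 43% hit ratio; below that it actively slows the part down, which is exactly why the extra miss latency matters.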
Putting read-ahead buffers and a "micro-cache", if that term suits you,
at the flash controller and other memory interfaces is, however, free in
terms of clock cycles and latency - these parts run at a higher clock
rate than the flash itself.
Read-ahead buffers on flash accesses are helpful,
however, because most code is sequential most of the time. It is common
for such buffers to be two-way, and to have between 16 and 64 bytes per
way.
If you read carefully the description of the STM "flash accelerator", it
is clear that this is a classic cache, with line size matched to the
flash, something like 2-way set associativity, conflicts, and eviction.
Historically there were variations; some caches only cache targets
of jumps and use a prefetch buffer for linear code. Such caches
can be effective at very small size.
I don't know the STM "flash accelerator" specifically - there are many
ARM microcontrollers and I have not used them all. But while it is true
that some of these are organised in a similar way to extremely small and
restricted caches, I think using the word "cache" alone here is
misleading. That's why I have tried to distinguish and qualify the term.
And in the context of virtual function dispatch, a two-way single line
micro-cache is pretty much guaranteed to have a cache miss when doing
such indirect calls as you need the current code, the virtual method
table, and the virtual method itself to be in cache simultaneously to
avoid a miss. But these flash accelerators still make a big difference
to the speed of code in general.
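The guaranteed-miss pattern for virtual dispatch can be demonstrated with a toy two-way, one-line-per-way LRU micro-cache. The three addresses below are arbitrary stand-ins for the call site, the vtable entry, and the method body; the point is simply that three distinct lines cannot fit in two ways:

```c
#include <stdint.h>

/* Toy two-way micro-cache, one 16-byte line per way, LRU replacement -
   a stand-in for the "flash accelerator" discussed above, not any
   vendor's actual design. */
typedef struct { uint32_t tag[2]; int valid[2]; int lru; } mc_t;

static int mc_access(mc_t *b, uint32_t addr) {
    uint32_t tag = addr >> 4;
    for (int w = 0; w < 2; w++)
        if (b->valid[w] && b->tag[w] == tag) {
            b->lru = !w;                  /* other way becomes LRU */
            return 1;                     /* hit */
        }
    int v = b->lru;                       /* miss: refill the LRU way */
    b->tag[v] = tag;
    b->valid[v] = 1;
    b->lru = !v;
    return 0;
}

/* Each virtual call touches three distinct lines: call site, vtable
   entry, method body.  With two ways and LRU, the cyclic pattern
   evicts each line just before it is reused, so every access misses. */
static int cyclic_hits(void) {
    mc_t b = {0};
    uint32_t lines[3] = { 0x000, 0x100, 0x200 };  /* arbitrary addresses */
    int hits = 0;
    for (int i = 0; i < 300; i++)
        hits += mc_access(&b, lines[i % 3]);
    return hits;
}
```

This is the classic LRU pathology of cycling over N+1 working lines in an N-way cache, and it is why a two-way single-line accelerator cannot help virtual dispatch even though it helps sequential code a great deal.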