On 4/18/2024 7:38 AM, pozz wrote:
The requirement is very common: when certain events of interest occur in the system, they must be saved in a log together with their timestamps. The log must be saved in an external SPI Flash connected to an MCU. The log holds a maximum number of events; after it is completely filled, each new event overwrites the oldest one.
A circular buffer.
I tried to implement such a library, but I eventually found it's not a simple task, especially if you want a reliable library that works even when errors occur (for example, when a write fails).
That depends on the sorts of failures that can occur and that are likely
to occur. If, for example, power fails during a write... (do you know
how your hardware will handle such an event? even if you THINK you
have some guarantee that it can't happen -- because you have an early
warning indication from the power supply?)
Can a write APPEAR successful and yet the data degrade later? (So your
notion of where the next message should start changes based on whether
or not you currently KNOW where it should start vs. having to DISCOVER
where it should start.)
I started by assigning 5 sectors of SPI Flash to the log. Each sector holds 256 events (each event is a fixed-size struct).
In this scenario, the maximum log size is 1024 events, because the 5th sector must be kept free so it can be erased when the write pointer reaches the end of a sector.
You're being overly conservative, expecting new events to occur WHILE you
are erasing the "extra" sector. If you can cache the events elsewhere
while the erase is in progress, then you can use the extra sector, too
(and defer erasing until ALL sectors are full AND another event occurs)
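For concreteness, the fixed-size record and the sector arithmetic described above might look like the following. All field names and widths are my assumptions, not pozz's actual struct; a 4096-byte sector is typical of SPI NOR flash:

```c
#include <stdint.h>

/* Hypothetical fixed-size event record: 16 bytes, so that a 4096-byte
 * SPI-flash sector holds exactly 256 of them. */
typedef struct {
    uint16_t id;        /* rolling sequence number (0xFFFF = erased) */
    uint16_t code;      /* what happened                             */
    uint32_t timestamp; /* when it happened                          */
    uint32_t data;      /* event-specific payload                    */
    uint16_t reserved;  /* pad to a power-of-two size                */
    uint16_t crc;       /* CRC over the preceding 14 bytes           */
} log_event_t;

#define SECTOR_SIZE       4096u
#define EVENTS_PER_SECTOR (SECTOR_SIZE / sizeof(log_event_t))      /* 256  */
#define NUM_SECTORS       5u
/* One sector is kept free for erasing, so the usable log is: */
#define MAX_EVENTS        ((NUM_SECTORS - 1u) * EVENTS_PER_SECTOR) /* 1024 */
```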
The first challenge is how the system can determine, at startup, which is the most recent (newest) event in the log. I solved this by saving a 16-bit counter ID with each event. The initialization routine reads all the IDs and takes the greatest as the latest event.
However, the log is initially empty, so all the IDs read as 0xFFFF, the maximum. One solution is to stop reading events when 0xFFFF is encountered, and to wrap the ID around at 0xFFFE instead of 0xFFFF.
In your scheme, one sector will be erased or in the process of being erased.
So, you can (almost) rely on its contents (unless the erase can be interrupted
by a power event)
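A minimal sketch of the startup scan pozz describes, assuming the per-slot IDs have already been read from flash into RAM (find_newest and next_id are hypothetical names). It treats 0xFFFF as "erased" and wraps the counter at 0xFFFE; as pozz notes below, the "greatest ID wins" rule breaks once the counter wraps:

```c
#include <stdint.h>
#include <stddef.h>

#define ID_ERASED 0xFFFFu   /* value read back from erased flash */

/* Return the slot index holding the greatest (assumed newest) ID,
 * or -1 if every slot is erased. Breaks down after ID wrap-around. */
int find_newest(const uint16_t ids[], size_t count)
{
    int newest = -1;
    uint16_t best = 0;

    for (size_t i = 0; i < count; i++) {
        if (ids[i] == ID_ERASED)
            continue;                    /* empty slot: skip it */
        if (newest < 0 || ids[i] > best) {
            best = ids[i];
            newest = (int)i;
        }
    }
    return newest;
}

/* IDs wrap at 0xFFFE so that 0xFFFF stays reserved for "erased". */
uint16_t next_id(uint16_t id)
{
    return (id >= 0xFFFEu) ? 0u : (uint16_t)(id + 1u);
}
```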
However, there's another problem. What happens after writing 65535 events to the log? The ID restarts from 0, so the latest event no longer has the greatest ID.
This is the same as handling a circular buffer: how can an address N<M actually
represent something that occurs AFTER the data at M? Why do you see it as
anything different?
These are the saved IDs after 65536 events:
1st SECT      2nd SECT      3rd SECT      4th SECT      5th SECT ---------->
0xFB00 ...    0xFC00 ...    0xFD00 ...    0xFE00 ...    0xFF00 ... 0xFFFF
The rule "newest event has greatest ID" still holds. Now a new event is written:
1st SECT ------->    2nd SECT     3rd SECT     4th SECT     5th SECT --------->
0x0000 0xFB01 ..     0xFC00 ..    0xFD00 ..    0xFE00 ..    0xFF00 .. 0xFFFF
Now the rule no longer works. The solution I found is to detect the discontinuity. The IDs are consecutive, so the initialization routine keeps reading while ID(n+1) = ID(n)+1. When there's a gap, the init function stops, having found the ID and position of the newest event.
Unless, of course, the entry at n or n+1 is corrupt...
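pozz's discontinuity rule can be sketched like this, assuming whole sectors are erased at once and slots are written in storage order, so the consecutive-ID chain always starts at slot 0. Function names are mine, and corruption is deliberately ignored, which is exactly the weakness just pointed out:

```c
#include <stdint.h>
#include <stddef.h>

#define ID_ERASED 0xFFFFu

/* Successor of an ID that wraps at 0xFFFE (0xFFFF means "erased"). */
static uint16_t id_succ(uint16_t id)
{
    return (id == 0xFFFEu) ? 0u : (uint16_t)(id + 1u);
}

/* Walk the slots in storage order and stop at the first break in the
 * consecutive-ID chain (a gap or an erased slot): the slot just before
 * the break holds the newest event. Returns its index, -1 if empty. */
int find_newest_by_gap(const uint16_t ids[], size_t count)
{
    if (count == 0 || ids[0] == ID_ERASED)
        return -1;

    size_t i = 0;
    while (i + 1 < count &&
           ids[i + 1] != ID_ERASED &&
           ids[i + 1] == id_succ(ids[i]))
        i++;
    return (int)i;
}
```

On the wrapped layout shown above (0x0000, 0xFB01, 0xFC00, ...) the chain breaks immediately after slot 0, so the newest event is found correctly even though its ID is the smallest.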
But the problems don't stop here. What happens if an event write fails? Should I verify what was actually written, by reading it back and comparing with the original data? In a reliable system, yes, I should.
So, you've answered your own question. If you want the log to have
value, then you have to put effort into ensuring its integrity.
Note my earlier comment "Can a write APPEAR successful and yet..."
I was thinking of protecting each event with a CRC, so as to be able to tell at any time whether an event is corrupted (badly written). However, the possibility of having isolated corrupted events increases the complexity of the task a lot.
You can't assume ANYTHING about a corrupted event.
Suppose I write the 4th event with ID=3 at position 4 in the sector, and the write fails. I can try to re-write the same event at position 5. Should I use ID=4 or ID=5? At first, I think it's better to use ID=5. The CRC should reveal that a corrupted event is stored at position 4.
After that, two more events are written successfully, ID=6 and ID=7.
Now the application above wants to read the 4th event of the log (counting from the newest). We know that the newest is ID=7, so the 4th event is ID=7-4+1=4. However, the event with ID=4 is corrupted, so we should check the previous one, ID=3... and so on.
So we can't have a simple mathematical function that returns the position of an event given its ordinal number counted from the newest event.
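That consequence can be made concrete: once corrupt slots are possible, "the Nth event from the newest" is no longer position arithmetic but a backwards walk that skips bad entries. A sketch, where is_valid() stands in for the per-event CRC check (all names here are assumptions for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Walk backwards (circularly) from the newest slot, skipping corrupt
 * entries, until the n-th valid event is reached. Returns its slot
 * index, or -1 if fewer than n valid events exist. In a real system
 * is_valid() would read the slot and verify its CRC. */
int nth_from_newest(size_t newest, size_t count, unsigned n,
                    bool (*is_valid)(size_t slot))
{
    size_t slot = newest;
    unsigned seen = 0;

    for (size_t steps = 0; steps < count; steps++) {
        if (is_valid(slot) && ++seen == n)
            return (int)slot;
        slot = (slot == 0) ? count - 1 : slot - 1;   /* circular step */
    }
    return -1;
}

/* Illustration only: pretend slot 4 holds a corrupt event. */
static bool demo_valid(size_t slot) { return slot != 4; }
```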
In the end, I think it's better to use a public-domain library, if one exists.
You want an existing library to be able to address arbitrary log entry formats
on arbitrary media with arbitrary failure modes (hardware and software)?
The only time you really care about recovering data from the log is when you
actually *want* (need) to recover data. (??) Can the log automatically
restart with each device reset? (and, if you want to preserve the log,
have a mechanism that inhibits logging after a "diagnostic reset")
Assuming there is no external control data for the log (i.e., that you have
to infer the structure of the log from the log's contents), a generic algorithm
is to start at the first memory location that *can* hold the start of a
message. Then, look at that piece of memory (taking into account any "wrap"
caused by the circular nature of the buffer) and determine if it represents
a valid log message (e.g., if you store a hash of the message in the message,
then the hash must be correct in addition to the format/content of the
message).
If a valid message is encountered, return SUCCESS along with the length of
the message (so you will know where to start looking for the NEXT message).
[You already know where THIS message starts]
If an invalid message is encountered, return FAIL along with a number of
bytes that can safely be skipped over to begin searching for the next
message. In the degenerate case, this is '1'.
Based on YOUR expected failure modes (and your write strategy), you can then
walk through the log in search of "good" messages and extract their ID's
(sequence numbers).
[Note that this scheme allows messages to be of variable length]
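One step of that scan might be sketched as follows, for an assumed frame format [len][payload...][sum] where sum is an 8-bit additive checksum over the length byte and payload. A real log would use a stronger hash, and the circular-wrap handling mentioned above is omitted for brevity:

```c
#include <stdint.h>
#include <stddef.h>

typedef enum { MSG_OK, MSG_BAD } msg_status_t;

/* Try to parse one message at buf. On MSG_OK, *advance is the total
 * frame length, i.e. where the NEXT message starts. On MSG_BAD,
 * *advance is the number of bytes that can safely be skipped before
 * trying again (the degenerate case: 1). */
msg_status_t parse_message(const uint8_t *buf, size_t avail,
                           size_t *advance)
{
    *advance = 1;                       /* pessimistic default    */
    if (avail < 3)
        return MSG_BAD;                 /* too short for a frame  */

    size_t len = buf[0];
    if (len == 0 || 2 + len > avail)
        return MSG_BAD;                 /* implausible length     */

    uint8_t sum = buf[0];               /* checksum covers len... */
    for (size_t i = 0; i < len; i++)    /* ...and the payload     */
        sum += buf[1 + i];
    if (sum != buf[1 + len])
        return MSG_BAD;                 /* corrupt frame          */

    *advance = 2 + len;
    return MSG_OK;
}
```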
If you rely on sequence numbers or "short" timestamps that can wrap within
the span of the log buffer, then you need to use a numbering scheme whose
period is relatively prime to the buffer length; otherwise you could
encounter sets of messages that appear able to start "anywhere".
Note that your IDs are just abbreviated timestamps, with arbitrary time
intervals between them. Using a "system time" that always increases
monotonically often offers more USEFUL information, as it lets
you decide how events are temporally related (did event #5 happen
immediately after event #4? Or two weeks later??)
Having a system time that you always reinitialize with the last noted
value (assuming you can't track time when powered off) means that
chronology/sequence is always evident. (you can add events like "User
set Wall Clock Time to XX:YY:ZZ" so you can map the logged events to
"real" time for convenience -- is something happening at some particular
time of day that may be pertinent??)