Newsportal USENET - Re: C23 thoughts and opinions

On 14/06/2024 23:30, Keith Thompson wrote:

David Brown <david.brown@hesbynett.no> writes:
On 28/05/2024 22:21, Keith Thompson wrote:
David Brown <david.brown@hesbynett.no> writes:
On 28/05/2024 02:33, Keith Thompson wrote:
[...]
Without some kind of programmer control, I'm concerned that the rules
for defining an array so #embed will be correctly optimized will be
spread as lore rather than being specified anywhere.
>
They might, but I really do not think that is so important, since they
will not affect the generated results.
Right, it won't affect the generated results (assuming I use it
correctly). Unless I use `#embed optimize(true)` to initialize
a struct with varying member sizes, but that's my fault because I
asked for it.
>
I am still not understanding your point. (I am confident that you
have a point, even if I don't get it.)
>
I cannot see why there would be any need or use of manually adding
optimisation hints or controls in the source code. I cannot see why
there is any possibility of getting incorrect results in any way.
>
The point is compile-time performance, and perhaps even the ability
to compile at all.
I'm thinking about hypothetical cases where I want to embed a
*very* large file and parsing the comma-delimited sequence could
have unacceptable compile-time performance, perhaps even causing
a compile-time stack overflow depending on how the parser works.
Every time the compiler sees #embed, it has to decide whether to
optimize it or not, and the decision criteria are not specified
anywhere (not at all in the standard, perhaps not clearly in the
compiler's documentation).
>
>
Yes, I agree with that. And this is how it should be - this is not
something that should be specified. The C standards give minimum
requirements for things like the number of identifiers or the length
of lines. But pretty much all compilers, for most of the "translation
limits", say they are "limited by the memory of the host computer".
The same will apply to #embed. And some compilers will cope better
than others with huge #embed's, some will be faster, some more memory
efficient. Some will change from version to version. This is not
something that can sensibly be specified or formalized - like pretty
much everything in regard to compilation time, each compiler does the
best it can without any specifications. I'd expect compiler reference
manuals might have hints, such as saying #embed is fastest with
unsigned char arrays (or whatever), but no more than that.
>
But again - I see no reason for manual optimisation hints, and no
reason for any possible errors.
>
Let me outline a possible strategy for a compiler like gcc. (I have
not looked at the prototype implementations from thephd, nor any gcc
developer discussions.)
>
gcc splits the C pre-processor and the compiler itself, and
(currently) communicates dataflow in only one direction, via a
temporary file or a pipe. But the "gcc" (or "g++", according to
preference) driver program calls and coordinates the two programs.
>
If the pre-processor is called stand-alone, then it will generate a
comma-separated list of integers, helpfully split over multiple lines
of reasonable size. This will clearly always be correct, and always
work, within the compiler's translation limits.
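
As a rough illustration of that stand-alone fallback, the sketch below writes a byte buffer as a comma-separated list of decimal integers with a fixed number of values per line. The function name `emit_embed_list` and the `per_line` parameter are made up for this example; it is not taken from gcc or any other real pre-processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes `len` bytes as a comma-separated list of
   decimal integers, starting a new line after every `per_line` values.
   Returns 0 on success, -1 on a write error. */
int emit_embed_list(FILE *out, const unsigned char *data, size_t len,
                    size_t per_line)
{
    assert(out != NULL && per_line > 0);
    for (size_t i = 0; i < len; i++) {
        const char *sep = "";
        if (i > 0)
            sep = (i % per_line == 0) ? ",\n" : ", ";
        if (fprintf(out, "%s%u", sep, (unsigned)data[i]) < 0)
            return -1;
    }
    /* Finish the last line so the list ends cleanly. */
    if (len > 0 && fputc('\n', out) == EOF)
        return -1;
    return 0;
}
```

Calling `emit_embed_list(stdout, buf, n, 16)` would print sixteen values per line; the output can be pasted between the braces of any initializer, which is why this form is always correct even if it is slow to parse.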
>
But when the gcc driver calls it, it will have a flag indicating that
the target compiler is gcc and supports an extended pre-processed
syntax (and also that the source is C23 - after all, the C
pre-processor can be used as a macro processor for other files with no
relation to C). Now the pre-processor has a lot more freedom.
Whenever it meets an #embed directive, it can generate a line:
>
#embed_data 123456
>
followed in the file by 123456 (or whatever) bytes of binary data.
The C compiler, when parsing this file, will pull that in as a single
blob. Then it is up to the C compiler - which knows how the #embed
data will be used - to tell if these bytes should be used as
parameters to a macro, initialisation for a char array, or whatever.
And it can use them as efficiently as practically possible. (It is
probably only worth using this for #embed data over a certain size -
smaller #embed's could just generate the integer sequences.)
>
Nowhere in this is there any call for manual optimisation hints, nor
any risk of incorrect results.
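
The compiler side of the hypothetical `#embed_data` scheme described above could look roughly like the sketch below: read the marker line, then pull in the stated number of raw bytes as one blob. The marker name and layout come from the outline in this post; the function `read_embed_blob` and its error handling are assumptions for illustration, and nothing here describes what gcc actually does.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads one hypothetical "#embed_data <count>" line followed by exactly
   <count> raw bytes from `in`. On success returns a malloc'd buffer and
   stores the count in *len; on any error returns NULL. */
unsigned char *read_embed_blob(FILE *in, size_t *len)
{
    char line[64];
    unsigned long count;

    assert(in != NULL && len != NULL);
    if (fgets(line, sizeof line, in) == NULL)
        return NULL;
    if (sscanf(line, "#embed_data %lu", &count) != 1 || count == 0)
        return NULL;

    unsigned char *blob = malloc(count);
    if (blob == NULL)
        return NULL;
    if (fread(blob, 1, count, in) != count) {   /* truncated input */
        free(blob);
        return NULL;
    }
    *len = count;
    return blob;
}
```

The cost is one header parse plus one bulk read, independent of how the bytes would look as tokens, which is where the compile-time saving would come from. The stream must be opened in binary mode, since the payload can contain any byte value.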
I've kept this on the back burner for a couple of weeks. I'm finally
getting around to posting a followup.
I'm not particularly concerned about compilers processing #embed
incorrectly. It's conceivable that a compiler could incorrectly decide
that it can optimize a particular #embed directive, but I expect
compilers to be conservative, falling back to the specified behavior if
they can't *prove* that an optimization is safe.

I'd expect that too. (Of course, there's always the risk of bugs with weird use cases.)

I see two conceptual problems with #embed as it's currently defined in
N3220.
First, there's a possible compile-time performance issue for very large
embedded files. The (draft) standard calls for #embed to expand to a
comma-separated list of integer constant expressions. (I'm not sure why
it didn't specify integer constants.)
My objection is based on the possibility that #embed for a *very* large
file might result in unacceptable time and memory usage during compile
time. I haven't looked into how existing compilers handle large
initializers, but I can imagine that parsing such a list might consume
more than O(N) time and/or memory, or at least O(N) with a large
constant. (If parsing long lists of integer constants is expensive for
some compiler, this could be a motivation to optimize that particular
case.)

The point of #embed is to get O(N) scaling - or at least, to get much closer to it than compilers manage today with an #include of a list of numbers (or even a string literal). There is little doubt that a big enough #embed file will consume time and memory that is unacceptable, at least for some people - all you need is to pick a file bigger than your computer's memory, and you can be reasonably confident that it will be problematic. But it also seems reasonable to expect that if a file is big enough to cause trouble for #embed, then any other method of including it in a C file will be at least as bad and probably /much/ worse.
At worst, #embed is going to be no less efficient than today's solution, and at best it will be significantly more efficient. I don't think it is fair to object to it because a given implementation might not reach theoretical optimum efficiencies.

The intent of #embed is to copy the contents of a file at compile time
into an array of unsigned char -- but it's specified in a roundabout way
that requires bizarre usages to work "correctly".

That is one expected use, and will probably be the biggest use by a fair way, but it is not the only possible use. The specification lets you have more flexibility. For example, I have a project where I include a number of files in a structure with a number of unsigned char arrays, amongst other data - a simpler #embed solution that forced you to have an unsigned char array might not work with that. (The project predates #embed and uses a Python script to generate the data.)
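
A minimal sketch of that kind of layout follows. The struct, the file names mentioned in the comments and the byte values are all invented for illustration, and macros stand in for the `#embed` expansions so the example compiles without C23 support.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for what #embed of two hypothetical files, "logo.bin" and
   "font.bin", would expand to. The contents are made up. */
#define LOGO_BYTES 0x89, 0x50, 0x4E, 0x47
#define FONT_BYTES 0x00, 0x01, 0x00, 0x00, 0x00, 0x0C

struct resources {
    unsigned      version;     /* other data alongside the blobs */
    unsigned char logo[4];
    unsigned char font[6];
};

/* With C23 each macro would be replaced by an #embed directive on its own
   line between the braces. */
const struct resources res = {
    .version = 1,
    .logo = { LOGO_BYTES },
    .font = { FONT_BYTES },
};

/* Catch a size mismatch between a blob and its slot at compile time. */
static_assert(sizeof (unsigned char[]){ LOGO_BYTES } == sizeof res.logo,
              "logo data must fill the logo member exactly");
```

Because the expansion is just a list of initializer elements, it drops into a nested initializer without any special support, which a design restricted to a stand-alone unsigned char array object might not allow.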

I expect at least
some compilers to optimize #embed for better compile-time performance,
but that requires them to determine when optimization is permitted with
no advice from the standard about how to do that. That's going to be
moderately difficult for compiler implementers; I'm not too concerned
about that. But it also imposes a burden on programmers, who will have
to use trial and error to determine how to ensure a #embed is optimized.

I am entirely confident that major compiler vendors will optimise the case of initialising char arrays. For anything else, who cares? It is unlikely that you'd use #embed for other purposes with files that are big enough for unoptimised implementations to be unreasonably slow. And if that does turn out to be a problem in practice, then you /know/ you have huge files and are doing something weird, and you can use something other than #embed for the purpose in the same way you do today.
Of prime importance is /correctness/ - #embed should give the results you expect, and I can't see that being a problem. Outside that, #embed is always going to be at least as efficient as existing solutions, and usually much faster for cases that matter.

This all assumes that a naive #embed implementation is going to
cause real problems for very large embedded files (compile-time
stack overflows, unreasonably long compile times, or just using so
much memory that system performance is affected). If it turns out
that this isn't the case, then that objection is mostly addressed.

I don't believe "very large" embedded files are of any real-world use in the first place.
And I don't believe there will be any naïve implementations of any significance. gcc and clang are the only two C compilers with a realistic future for serious C work with newer standards. Even MS expect people to use clang for C, as far as I understand it. A number of other toolchains in the embedded world have switched over, or plan to do so - maintaining their own compiler is simply not worth the development effort. Niche C compilers will continue to exist, but it's unlikely they will bother with C23.

My other objection is that it's conceptually messy. The expected use
case is in an initializer for an array of unsigned char, but there are
no restrictions on where it can be used.

That is the point.

As a programmer, I want to
copy a file verbatim into an unsigned char array, but at least
conceptually #embed translates the file contents into a long sequence of
expressions which are then processed as C code to recreate the raw data.
There are bizarre cases (like my previous example initializing a struct
with members of various types) that are required to work. #embed is a
preprocessor directive, but determining whether it can be optimized
requires feedback from later compiler phases. It's doable, but it's
*ugly*.
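
To make the "bizarre case" concrete, here is a small sketch in which a macro stands in for the expansion of `#embed` on a hypothetical four-byte file (bytes 01 02 03 04). The struct and object names are invented for the example.

```c
#include <assert.h>

/* Stand-in for the comma-separated list #embed would produce. */
#define EMBED_EXPANSION 1, 2, 3, 4

struct mixed {
    unsigned char  a;
    unsigned short b;
    unsigned int   c;
    double         d;
};

/* Each list element initializes one array element: a byte-for-byte copy
   of the file. */
const unsigned char as_bytes[] = { EMBED_EXPANSION };

/* Each list element initializes one *member*, whatever its width, so this
   is not a byte copy of the file: as_struct.d ends up as 4.0. */
const struct mixed as_struct = { EMBED_EXPANSION };

static_assert(sizeof as_bytes == 4, "one array element per embedded byte");
```

A compiler that wants to fast-path `#embed` has to tell these two uses apart, because a raw memory copy is only equivalent to the specified behaviour in the first one.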

I have discussed in previous posts why I don't think there is an issue there.
And I think alternative ways to achieve the effect would have their own problems and complications. (I believe there is a proposal for C++ that includes a std::embed() function that can use a constexpr string.)

Now that it's too late to change the definition, I've thought of
something that I think would have been a better way to specify #embed.
Define a new kind of string literal, with a "uc" prefix. `uc"foo"` is
of type `unsigned char[3]`. (Or `const unsigned char[3]`, if that's not
too radical.) Unlike other string literals, there is no implicit
terminating '\0'. Arbitrary byte values can of course be specified in
hexadecimal: uc"\x01\x02\x03\x04". Since there's no terminating null
character and C doesn't support zero-sized objects, uc"" is a syntax
error.
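
For comparison, the nearest thing existing C offers is initializing an explicitly sized array from an ordinary string literal, which drops the terminating null when the array has no room for it. The sketch below shows that existing rule only; it is not the proposed uc"..." literal, and the array names are arbitrary.

```c
#include <assert.h>

/* C (unlike C++) allows a character array of exactly the right size to be
   initialized from a string literal; the terminating '\0' is then omitted. */
const unsigned char foo[3] = "foo";
const unsigned char raw[4] = "\x01\x02\x03\x04";

static_assert(sizeof foo == 3, "no room for a terminating null");
static_assert(sizeof raw == 4, "four bytes, no terminator");
```

The difference from the proposal is that the length has to be written by hand here, and the literal itself still has type `char[4]` or `char[5]` wherever it is used outside such an initializer.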

If you are worried about ugly, few things are uglier than a C string literal with escaped hex characters. Well, escaped octal characters are worse.

uc"..." string literals might be made even simpler, for example allowing
only hex digits and not requiring \x (uc"01020304" rather than
uc"\x01\x02\x03\x04"). That's probably overkill. uc"..." literals
could be useful in other contexts, and programmers will want
flexibility. Maybe something like hex"01020304" (embedded spaces could
be ignored) could be defined in addition to uc"\x01\x02\x03\x04".
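
The decoding a compiler would have to do for such a hex"..." literal is small. The sketch below does it at run time on an ordinary string, ignoring embedded spaces; the function names and the choice to reject an odd number of digits are assumptions for this example, since the proposal does not spell those details out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Decodes text such as "0102 0304" into `out`, skipping spaces. Returns
   the number of bytes written, or (size_t)-1 on a bad character, an odd
   number of digits, or more than `cap` bytes of output. */
size_t decode_hex(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;
    int high = -1;                 /* pending high nibble, or -1 */

    assert(text != NULL && out != NULL);
    for (; *text != '\0'; text++) {
        if (*text == ' ')
            continue;
        int v = hex_value((unsigned char)*text);
        if (v < 0)
            return (size_t)-1;
        if (high < 0) {
            high = v;
        } else {
            if (n == cap)
                return (size_t)-1;
            out[n++] = (unsigned char)(high << 4 | v);
            high = -1;
        }
    }
    return high < 0 ? n : (size_t)-1;
}
```

For instance, `decode_hex("0102 0304", buf, sizeof buf)` stores the bytes 1, 2, 3, 4 and returns 4.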
Specify that #embed expands to a sequence of one or more uc string
literals (or hex string literals if that's added), separated by
whitespace. If the embedded file might be empty, use the existing
is_empty() embed parameter. Without is_empty, #embed of an empty file
will expand to uc"", a syntax error.
Since a string literal is a single token, parsing it is likely to be
more efficient than parsing a sequence of integer constant expressions,
even with concatenation of multiple literals. Since a uc"..." string
literal is specifically of type unsigned char[], it can *only* be used
to initialize an unsigned char[] or unsigned char* object, addressing
the conceptual mess. If you want to use #embed to initialize an
array of some other type, you can use a union or some other form of
type-punning.
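
The union approach mentioned above could be sketched as follows. A macro stands in for the embedded data (eight made-up bytes), and the union and member names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the embedded contents of a hypothetical eight-byte file. */
#define EMBED_EXPANSION 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08

union words_from_bytes {
    unsigned char bytes[8];   /* first member: the one the initializer fills */
    uint32_t      words[2];   /* the same storage viewed as 32-bit values */
};

const union words_from_bytes table = { { EMBED_EXPANSION } };

static_assert(sizeof table.bytes == sizeof table.words,
              "both views must cover the same storage");
```

Reading `table.words` reinterprets the stored bytes, so the values seen depend on the host's byte order; that is inherent in copying a file verbatim into a wider type, whichever syntax is used to get the bytes there.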
A conforming C23 implementation could even implement this by providing
uc"..." (and perhaps hex"...") literals as an extension and adding an
implementation-defined embed parameter that generates them.

I am at a loss to see how this would be any improvement.
The efficiency gains of #embed are not because a list of integers is inherently less efficient than a string literal of some kind. It is because existing compilers store more information about each element, and do more checking on each of them (such as for range). With #embed-generated integer lists the compiler would not need to store this extra information or do the extra checks. Even for "non-optimised" #embed, I cannot see it being beaten by any kind of string literal solution by any non-negligible degree.
