Newsportal USENET - Re: Tcl9: source files are interpreted as utf-8 by default

In article <20250108172312.253b829c@lud1.home>, Luc <luc@sep.invalid> wrote:

On Wed, 8 Jan 2025 19:32:24 -0000 (UTC), Rich wrote:
>
Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
'encoding' command then sources main.tcl. Basically, a wrapper.
>
Yes, that works. But then Uwe has to go and "wrapperize" all the
various scripts, on all the various client systems. So he's back in
the same boat of "major modifications need be made now" as changing all
the launching instances to launch with "-encoding iso-8859".
>
True, but he has considered that kind of effort. His words:
>
>
"That means we have to add "-encoding iso8859-1"
to ALL source and ALL tclsh calls in ALL scripts.
So far, so good(or bad?)."
>
"What initially seems quite doable, looks more and more scary
to me. First, if we ever may switch encoding to utf-8 we
have to alter all those lines again."
>
>
So in my mind, the "customer" accepts (though grudgingly) making
large scale changes, but is concerned with possible new changes
in the future. A wrapper can handle the future quite gracefully.
>
>
I've resisted pointing this one out, but long term, yes, updating all
the scripts to be utf-8 encoded is the right, long term, answer. But
that belies all the current, short term effort, involved in doing so.
>
Actually, when I mentioned my migration case, I was also thinking that
I could afford to do it because I was migrating to Linux and utf-8 was
not even the future anymore, it was pretty much the present. But maybe
running iconv wouldn't be acceptable because Uwe is (I assume) on
Windows. Does a Windows user want to convert his files to utf-8?
Won't that cause problems if the system is iso-8859-1? Windows still
uses iso-8859-1, right?
>
So yes, I guess Tcl9 causes trouble to 8859-1 users. Yes, sounds like
it needs some fixing.
>
More suggestions: how about not using Tcl9 just yet? I'm stil on 8.6
and the water is fine. Early adopters tend to pay a price. In my case,
absent packages.
>
I have my own special case, I use Debian 9 which only ships 8.6.6 so
I had to build 8.6.15 from source because I really need Unicode.
But for some time I used Freewrap as a single-file batteries included
Tcl/Tk interpreter. So maybe Uwe should just use a different interpreter,
likely just a slightly older version of Tcl/Tk and embrace Tcl9 later.
>
I wonder if one can hack the encoding issue on the Tcl9 source and
rebuild it.
>
>
--
Luc
>
>

FWIW, could check if a source file is utf-8 easily enough. I wrote
a command to do that based on some code from the web a while ago and
it seemed to work OK for what I needed it for.

So read your suspect file in binary mode, call "string_is_utf" on it and
if it is, you're good to source it.

(If it isn't you can probably apply some more heuristics on the
string to guess what it actually is).

==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <tcl.h>

#ifdef WIN32
#include <io.h>
#define TCL_API __declspec(dllexport)
#else
#include <unistd.h>
#define TCL_API
#endif

#ifdef WIN32
#define dup _dup
#define fileno _fileno
#define fdopen _fdopen
#define close _close
#endif

static char rcsid[] = "$Id$ TN";

/*
* Function prototypes
*/
TCL_API int Isutf_Init(Tcl_Interp *interp);

static int isutf_string_is_utf(ClientData clientData, Tcl_Interp *interp,
int objc, Tcl_Obj *CONST objv[]);

/*
* This decoder by Bjoern Hoermann is the simplest I've found. It also works
* by feeding it a single byte, as well as keeping a state. The state is
* very useful for parsing UTF8 coming in in chunks over the network.
*
* http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
*
*/

// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

#if 0
static uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) {

uint32_t type = utf8d[byte];

*codep = (*state != UTF8_ACCEPT) ?
(byte & 0x3fu) | (*codep << 6) :
(0xff >> type) & (byte);

*state = utf8d[256 + *state*16 + type];

return *state;
}
#endif

/*
*
* A simple validator/detector doesn't need the code point,
* so it could be written like this (Initial state is set to UTF8_ACCEPT):
*
*/
static uint32_t validate_utf8(uint32_t *state, unsigned char *str, size_t len) {

size_t i;
uint32_t type;

for (i = 0; i < len; i++) {
// We don't care about the codepoint, so this is
// a simplified version of the decode function.
type = utf8d[(uint8_t)str[i]];
*state = utf8d[256 + (*state) * 16 + type];

if (*state == UTF8_REJECT)
break;
}

return *state;
}

/*
* If the text is valid utf8 UTF8_ACCEPT is returned. If it's
* invalid UTF8_REJECT. If more data is needed, some other integer is returned.
*
*/

/*
* Init everything
*/

int Isutf_Init(Tcl_Interp *interp){

#ifdef USE_TCL_STUBS
Tcl_InitStubs(interp, "8.6", 0);
#endif
Tcl_CreateObjCommand(interp, "string_is_utf",
isutf_string_is_utf, (ClientData)NULL,
(Tcl_CmdDeleteProc *) NULL);

Tcl_PkgProvide(interp,"Isutf", "1.0");

return 0;
}

static int isutf_string_is_utf(ClientData clientData,
Tcl_Interp *interp, int objc, Tcl_Obj *CONST objv[])
{

unsigned char *bytes;
int bytelen;
char buf[1024];
Tcl_Obj *resultPtr;
uint32_t state = UTF8_ACCEPT;

resultPtr = Tcl_GetObjResult(interp);

if(objc != 2){
Tcl_WrongNumArgs(interp, 1, objv,
"Usage: string_is_utf binary_string");
return TCL_ERROR;
}

bytes = Tcl_GetByteArrayFromObj(objv[1], &bytelen);

#if 0
if(bytelen == 0) {

sprintf(buf, "0 bytes in object!");
Tcl_SetStringObj(resultPtr, buf, -1);
return TCL_ERROR;
}
#endif

if (validate_utf8(&state, bytes, bytelen) == UTF8_REJECT) {
sprintf(buf, "Invalid UTF8 data!");
Tcl_SetStringObj(resultPtr, buf, -1);
Tcl_SetIntObj(resultPtr, 0);
return TCL_OK;
}

Tcl_SetIntObj(resultPtr, 1);
return TCL_OK;
}
--
columbiaclosings.com
What's not in Columbia anymore..

Date	Sujet	#	Auteur
13 Dec 24	Tcl9: source files are interpreted as utf-8 by default	40	Uwe Schmitz
13 Dec 24	Re: Tcl9: source files are interpreted as utf-8 by default	2	Harald Oehlmann
13 Dec 24	Re: Tcl9: source files are interpreted as utf-8 by default	1	Uwe Schmitz
7 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	37	Uwe Schmitz
7 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	5	Harald Oehlmann
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	4	Uwe Schmitz
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	3	Harald Oehlmann
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	2	Harald Oehlmann
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	Uwe Schmitz
7 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	31	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	30	Uwe Schmitz
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	29	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	28	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	27	Uwe Schmitz
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	26	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	25	Rich
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	24	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	23	Rich
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	12	Luc
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	ted@loft.tnolan.com (Ted Nolan
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	10	Rich
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	9	Uwe Schmitz
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	8	Rich
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	6	Luc
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	5	Rich
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	4	Luc
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	3	eric
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	2	Luc
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	Rich
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	Uwe Schmitz
8 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	10	saito
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	9	Rich
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	8	Luc
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	7	Rich
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	6	Luc
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	4	Uwe Schmitz
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	3	Harald Oehlmann
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	2	Uwe Schmitz
10 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	Harald Oehlmann
9 Jan 25	Re: Tcl9: source files are interpreted as utf-8 by default	1	Rich