Re: Tcl9: source files are interpreted as utf-8 by default

Liste des GroupesRevenir à cl tcl 
Sujet : Re: Tcl9: source files are interpreted as utf-8 by default
De : (at) *nospam* ednolan (ted@loft.tnolan.com (Ted Nolan)
Groupes : comp.lang.tcl
Date : 08. Jan 2025, 23:06:24
Autres entêtes
Organisation : loft
Message-ID : <lu8b6vFhjecU1@mid.individual.net>
References : 1 2 3 4
User-Agent : trn 4.0-test76 (Apr 2, 2001)
In article <20250108172312.253b829c@lud1.home>, Luc  <luc@sep.invalid> wrote:
On Wed, 8 Jan 2025 19:32:24 -0000 (UTC), Rich wrote:
>
Instead of main.tcl sourcing set_encoding.tcl, starter.tcl runs some
'encoding' command then sources main.tcl. Basically, a wrapper. 
>
Yes, that works.  But then Uwe has to go and "wrapperize" all the
various scripts, on all the various client systems.  So he's back in
the same boat of "major modifications need be made now" as changing all
the launching instances to launch with "-encoding iso-8859".
>
True, but he has considered that kind of effort. His words:
>
>
"That means we have to add "-encoding iso8859-1"
to ALL source and ALL tclsh calls in ALL scripts.
So far, so good(or bad?)."
>
"What initially seems quite doable, looks more and more scary
to me. First, if we ever may switch encoding to utf-8 we
have to alter all those lines again."
>
>
So in my mind, the "customer" accepts (though grudgingly) making
large scale changes, but is concerned with possible new changes
in the future. A wrapper can handle the future quite gracefully.
>
>
I've resisted pointing this one out, but long term, yes, updating all
the scripts to be utf-8 encoded is the right, long term, answer.  But
that belies all the current, short term effort, involved in doing so.
>
Actually, when I mentioned my migration case, I was also thinking that
I could afford to do it because I was migrating to Linux and utf-8 was
not even the future anymore, it was pretty much the present. But maybe
running iconv wouldn't be acceptable because Uwe is (I assume) on
Windows. Does a Windows user want to convert his files to utf-8?
Won't that cause problems if the system is iso-8859-1? Windows still
uses iso-8859-1, right?
>
So yes, I guess Tcl9 causes trouble to 8859-1 users. Yes, sounds like
it needs some fixing.
>
More suggestions: how about not using Tcl9 just yet? I'm stil on 8.6
and the water is fine. Early adopters tend to pay a price. In my case,
absent packages.
>
I have my own special case, I use Debian 9 which only ships 8.6.6 so
I had to build 8.6.15 from source because I really need Unicode.
But for some time I used Freewrap as a single-file batteries included
Tcl/Tk interpreter. So maybe Uwe should just use a different interpreter,
likely just a slightly older version of Tcl/Tk and embrace Tcl9 later.
>
I wonder if one can hack the encoding issue on the Tcl9 source and
rebuild it.
>
>
--
Luc
>
>

FWIW, could check if a source file is utf-8 easily enough.  I wrote
a command to do that based on some code from the web a while ago and
it seemed to work OK for what I needed it for.

So read your suspect file in binary mode, call "string_is_utf" on it and
if it is, you're good to source it.

(If it isn't you can probably apply some more heuristics on the
string to guess what it actually is).

==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <tcl.h>

#ifdef WIN32
#include <io.h>
#define TCL_API __declspec(dllexport)
#else
#include <unistd.h>
#define TCL_API
#endif

#ifdef WIN32
#define dup _dup
#define fileno _fileno
#define fdopen _fdopen
#define close _close
#endif

static char rcsid[] = "$Id$ TN";


/*
 * Function prototypes
 */
TCL_API int Isutf_Init(Tcl_Interp *interp);

static int isutf_string_is_utf(ClientData clientData, Tcl_Interp *interp,
int objc, Tcl_Obj *CONST objv[]);


/*
 * This decoder by Bjoern Hoermann is the simplest I've found. It also works
 * by feeding it a single byte, as well as keeping a state. The state is
 * very useful for parsing UTF8 coming in in chunks over the network.
 *
 * http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
 *
 */


// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

#define UTF8_ACCEPT 0
#define UTF8_REJECT 1

static const uint8_t utf8d[] = {

0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

#if 0
static uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) {

uint32_t type = utf8d[byte];

*codep = (*state != UTF8_ACCEPT) ?
(byte & 0x3fu) | (*codep << 6) :
(0xff >> type) & (byte);

*state = utf8d[256 + *state*16 + type];

return *state;
}
#endif

/*
 *
 * A simple validator/detector doesn't need the code point,
 * so it could be written like this (Initial state is set to UTF8_ACCEPT):
 *
 */
static uint32_t validate_utf8(uint32_t *state, unsigned char *str, size_t len) {

size_t i;
uint32_t type;

for (i = 0; i < len; i++) {
// We don't care about the codepoint, so this is
// a simplified version of the decode function.
type = utf8d[(uint8_t)str[i]];
*state = utf8d[256 + (*state) * 16 + type];

if (*state == UTF8_REJECT)
break;
}

return *state;
}

/*
 * If the text is valid utf8 UTF8_ACCEPT is returned. If it's
 * invalid UTF8_REJECT. If more data is needed, some other integer is returned.
 *
 */


/*
 * Init everything
 */

int Isutf_Init(Tcl_Interp *interp){

#ifdef USE_TCL_STUBS
Tcl_InitStubs(interp, "8.6", 0);
#endif
Tcl_CreateObjCommand(interp, "string_is_utf",
isutf_string_is_utf, (ClientData)NULL,
(Tcl_CmdDeleteProc *) NULL);

Tcl_PkgProvide(interp,"Isutf", "1.0");

return 0;
}





static int isutf_string_is_utf(ClientData clientData,
Tcl_Interp *interp, int objc, Tcl_Obj *CONST objv[])
{

unsigned char *bytes;
int bytelen;
char buf[1024];
Tcl_Obj *resultPtr;
uint32_t state = UTF8_ACCEPT;

resultPtr = Tcl_GetObjResult(interp);

if(objc != 2){
Tcl_WrongNumArgs(interp, 1, objv,
"Usage: string_is_utf binary_string");
return TCL_ERROR;
}

bytes = Tcl_GetByteArrayFromObj(objv[1], &bytelen);

#if 0
if(bytelen == 0) {

sprintf(buf, "0 bytes in object!");
Tcl_SetStringObj(resultPtr, buf, -1);
return TCL_ERROR;
}
#endif

if (validate_utf8(&state, bytes, bytelen) == UTF8_REJECT) {
sprintf(buf, "Invalid UTF8 data!");
Tcl_SetStringObj(resultPtr, buf, -1);
Tcl_SetIntObj(resultPtr, 0);
return TCL_OK;
}

Tcl_SetIntObj(resultPtr, 1);
return TCL_OK;
}
--
columbiaclosings.com
What's not in Columbia anymore..

Date Sujet#  Auteur
13 Dec 24 * Tcl9: source files are interpreted as utf-8 by default40Uwe Schmitz
13 Dec 24 +* Re: Tcl9: source files are interpreted as utf-8 by default2Harald Oehlmann
13 Dec 24 i`- Re: Tcl9: source files are interpreted as utf-8 by default1Uwe Schmitz
7 Jan 25 `* Re: Tcl9: source files are interpreted as utf-8 by default37Uwe Schmitz
7 Jan 25  +* Re: Tcl9: source files are interpreted as utf-8 by default5Harald Oehlmann
8 Jan 25  i`* Re: Tcl9: source files are interpreted as utf-8 by default4Uwe Schmitz
8 Jan 25  i `* Re: Tcl9: source files are interpreted as utf-8 by default3Harald Oehlmann
8 Jan 25  i  `* Re: Tcl9: source files are interpreted as utf-8 by default2Harald Oehlmann
8 Jan 25  i   `- Re: Tcl9: source files are interpreted as utf-8 by default1Uwe Schmitz
7 Jan 25  `* Re: Tcl9: source files are interpreted as utf-8 by default31Luc
8 Jan 25   `* Re: Tcl9: source files are interpreted as utf-8 by default30Uwe Schmitz
8 Jan 25    `* Re: Tcl9: source files are interpreted as utf-8 by default29Luc
8 Jan 25     `* Re: Tcl9: source files are interpreted as utf-8 by default28Luc
8 Jan 25      `* Re: Tcl9: source files are interpreted as utf-8 by default27Uwe Schmitz
8 Jan 25       `* Re: Tcl9: source files are interpreted as utf-8 by default26Luc
8 Jan 25        `* Re: Tcl9: source files are interpreted as utf-8 by default25Rich
8 Jan 25         `* Re: Tcl9: source files are interpreted as utf-8 by default24Luc
8 Jan 25          `* Re: Tcl9: source files are interpreted as utf-8 by default23Rich
8 Jan 25           +* Re: Tcl9: source files are interpreted as utf-8 by default12Luc
8 Jan 25           i+- Re: Tcl9: source files are interpreted as utf-8 by default1ted@loft.tnolan.com (Ted Nolan
8 Jan 25           i`* Re: Tcl9: source files are interpreted as utf-8 by default10Rich
9 Jan 25           i `* Re: Tcl9: source files are interpreted as utf-8 by default9Uwe Schmitz
9 Jan 25           i  `* Re: Tcl9: source files are interpreted as utf-8 by default8Rich
9 Jan 25           i   +* Re: Tcl9: source files are interpreted as utf-8 by default6Luc
10 Jan 25           i   i`* Re: Tcl9: source files are interpreted as utf-8 by default5Rich
10 Jan 25           i   i `* Re: Tcl9: source files are interpreted as utf-8 by default4Luc
10 Jan 25           i   i  `* Re: Tcl9: source files are interpreted as utf-8 by default3eric
10 Jan 25           i   i   `* Re: Tcl9: source files are interpreted as utf-8 by default2Luc
10 Jan 25           i   i    `- Re: Tcl9: source files are interpreted as utf-8 by default1Rich
10 Jan 25           i   `- Re: Tcl9: source files are interpreted as utf-8 by default1Uwe Schmitz
8 Jan 25           `* Re: Tcl9: source files are interpreted as utf-8 by default10saito
9 Jan 25            `* Re: Tcl9: source files are interpreted as utf-8 by default9Rich
9 Jan 25             `* Re: Tcl9: source files are interpreted as utf-8 by default8Luc
9 Jan 25              `* Re: Tcl9: source files are interpreted as utf-8 by default7Rich
9 Jan 25               `* Re: Tcl9: source files are interpreted as utf-8 by default6Luc
9 Jan 25                +* Re: Tcl9: source files are interpreted as utf-8 by default4Uwe Schmitz
9 Jan 25                i`* Re: Tcl9: source files are interpreted as utf-8 by default3Harald Oehlmann
10 Jan 25                i `* Re: Tcl9: source files are interpreted as utf-8 by default2Uwe Schmitz
10 Jan 25                i  `- Re: Tcl9: source files are interpreted as utf-8 by default1Harald Oehlmann
9 Jan 25                `- Re: Tcl9: source files are interpreted as utf-8 by default1Rich

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal