Re: ASCII to ASCII compression.

Subject: Re: ASCII to ASCII compression.
From: bohannonindustriesllc (at) *nospam* gmail.com (BGB-Alt)
Newsgroups: comp.lang.c
Date: 10 Jun 2024, 22:26:32
Organization: A noiseless patient Spider
Message-ID: <v47r29$kqit$1@dont-email.me>
User-Agent: Mozilla Thunderbird
On 6/7/2024 3:57 PM, Paul wrote:
On 6/7/2024 8:43 AM, Malcolm McLean wrote:
On 07/06/2024 10:36, David Brown wrote:
On 06/06/2024 21:02, Malcolm McLean wrote:
On 06/06/2024 17:55, bart wrote:
On 06/06/2024 17:25, Malcolm McLean wrote:
>
Not strictly a C programming question, but smart people will see the relevance to the topicality, which is portability.
>
Is there a compression algorithm which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?
>
So "Mary had a little lamb, its fleece was white as snow".
>
Would become
>
QWE£$543GtT£$"||x|VVBB?
>
What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?
>
Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can embed compressed text in programs, and if it is corrupted, only a single line will become unreadable.
>
Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while leaving most of it readable?  :-)
>
Pass the data through a compressor and then add forward error checking mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64 or similar.
>
Yes, exactly.
>
I want a system for compression which is robust to corruption, can be stored as text, and with a compressor / decompressor which can be written by a child hobby programmer with only a very little bit of experience of programming.
>
That's what I need for Baby X. The FileSystem XML files can get very large, and of course Baby X programmers are going to ask about compression. And I don't think there is an existing system, and so I shall devise one.
>
 "XML Compression"
 https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7993-3_783-2
     "The size increase incurred by publishing data in XML format is
     estimated to be as much as 400 % [14], making it a prime target for compression.
      While standard general-purpose compressors, such as
     zip, gzip or bzip, typically compress XML data reasonably well...
    "
 Show us a "dir" or an "ls -al" so we can better understand
the magnitude of what you're working on.
 Lots of things have used ZIP, implicitly or explicitly, mainly
because it is a kind of standard and does not form a barrier to access.
 In addition, if a structure is voluminous (a thousand control files
representing one project), users appreciate having them stored in
a container, rather than filling the file system with fluff. A ZIP
can do that too. And if the ZIP has a convenient library you can
get from FOSS-land, that could save time on building a standards
based container.
 
One downside of ZIP is that it is a moderately heavyweight format to work with.
For some of my own uses, I had created "WAD2A" and "WAD4" formats which can address similar use cases, but without some of the implied overhead of processing the ZIP central directory.
WAD2A is a tweaked version of the WAD2 format (from Quake and Half-Life) which adds support for directory trees, and actually uses the data compression parts. Downside of WAD2A is that non-root lump names are effectively limited to 12 characters (vs the 16-char name limit for root lumps).
The WAD4 format was similar, but expanded the dirent size, and had 32-character lump names, also organized into a directory tree.
Also generally, I had used LZ4 and my own RP2 compression, rather than Deflate, because Deflate is also fairly expensive (particularly on a 50MHz CPU); mostly due to the relatively high cost of setting up Huffman tables, and also decoding data with them.
RP2 is also a byte-oriented LZ compressor (like LZ4), but it generally gets slightly better compression for general-purpose data at a similar decode speed. I have noted, though, that LZ4 does better for some other types of data, such as machine code, so I ended up mostly using LZ4 for compressing things like program binaries.
Curiously, LZ4 seems to do better with both my own ISA and with RISC-V, so there is something in the typical compiler output that favors LZ4.
I had also implemented a few simpler Huffman-based formats, but couldn't really get them up to similar speeds.
Had also come up with a sort of "pseudo entropic" encoding, which managed to still gain some compression in past tests (while also being faster than an "actual" entropy coding scheme, and was still byte-oriented).
IIRC:
   Rank symbols based on probability, encode as indices into table.
   00..7F: Encode a symbol, 0..127
   80..F8: Encode a symbol Pair (0..11)
   FF: Escape code a symbol (byte)
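A decoder for the byte-oriented layout above might look roughly like the following. This is only a sketch reconstructed from the ranges listed: the range 80..F8 holds 121 = 11*11 codes, so each half of a pair is taken to be an index into the 11 most probable symbols. The function name pe_decode, the rank[] table argument, the order of the two indices within a pair, and treating F9..FE as errors are all assumptions for illustration, not details from the original scheme.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a decoder for the "pseudo entropic" byte codes.
 * rank[] maps a probability rank (0 = most common) to a symbol.
 * Returns the number of bytes written, or -1 on malformed input
 * or if dst is too small. Names and pair ordering are assumed. */
long pe_decode(const uint8_t *src, size_t srclen,
               uint8_t *dst, size_t dstcap,
               const uint8_t rank[128])
{
    size_t ip = 0, op = 0;

    while (ip < srclen) {
        uint8_t c = src[ip++];

        if (c < 0x80) {                 /* 00..7F: one ranked symbol */
            if (dstcap - op < 1) return -1;
            dst[op++] = rank[c];
        } else if (c <= 0xF8) {         /* 80..F8: pair of top-11 symbols */
            unsigned v = c - 0x80u;     /* 0..120 */
            if (dstcap - op < 2) return -1;
            dst[op++] = rank[v / 11];   /* first index (assumed order) */
            dst[op++] = rank[v % 11];   /* second index */
        } else if (c == 0xFF) {         /* FF: escape, raw byte follows */
            if (ip >= srclen || dstcap - op < 1) return -1;
            dst[op++] = src[ip++];
        } else {
            return -1;                  /* F9..FE: not assigned above */
        }
    }
    return (long)op;
}
```

The division and modulo by 11 in the pair case are the "division-by-constant" cost; a 256-entry lookup table would avoid them for this 8-bit form.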
Had considered another possibility:
   0000..3FFF: Symbol Pair (0..127)
   4000..7D08: Symbol Triple (0..25)
   7F00..7FFF: Single Symbol (0..255)
   8000..FFFF: Symbol Quad (0..13)
But, didn't get around to experimenting with this.
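For what it is worth, unpacking one 16-bit code under that layout could be sketched as below. The range sizes work out as 128*128 = 0x4000 pairs, 25*25*25 = 15625 triples (4000..7D08), and 13^4 = 28561 quads (8000..EF90), which is how the per-index limits are inferred here. The function name pe16_unpack, the digit order within a group, and treating the leftover code ranges as invalid are assumptions; since the scheme was never tested, this only illustrates the arithmetic.

```c
#include <stdint.h>

/* Sketch: split one 16-bit code into 1..4 rank indices.
 * Returns the number of entries stored in idx[], or 0 if the
 * code falls in a range the layout leaves unassigned.
 * *literal is set to 1 when idx[0] is a raw byte, not a rank. */
int pe16_unpack(uint16_t code, uint8_t idx[4], int *literal)
{
    unsigned v;

    *literal = 0;
    if (code <= 0x3FFF) {                   /* pair, 128 symbols each */
        idx[0] = (uint8_t)(code >> 7);
        idx[1] = (uint8_t)(code & 0x7F);
        return 2;
    }
    if (code <= 0x7D08) {                   /* triple, 25 symbols each */
        v = code - 0x4000u;
        idx[0] = (uint8_t)(v / 625);
        idx[1] = (uint8_t)(v / 25 % 25);
        idx[2] = (uint8_t)(v % 25);
        return 3;
    }
    if (code >= 0x7F00 && code <= 0x7FFF) { /* single raw symbol */
        idx[0] = (uint8_t)(code & 0xFF);
        *literal = 1;
        return 1;
    }
    if (code >= 0x8000) {                   /* quad, 13 symbols each */
        v = code - 0x8000u;
        if (v >= 28561u)                    /* 13^4; the rest is unused */
            return 0;
        idx[0] = (uint8_t)(v / 2197);
        idx[1] = (uint8_t)(v / 169 % 13);
        idx[2] = (uint8_t)(v / 13 % 13);
        idx[3] = (uint8_t)(v % 13);
        return 4;
    }
    return 0;                               /* 7D09..7EFF: unassigned */
}
```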
Downside of these schemes is that division-by-constant eats a lot of the potential speed gains over Huffman (it can be turned into multiply by reciprocal, but this is still "not very fast"; if it were doing general-purpose division, it would be a dead loss).
Similarly, for the latter form, it would be too large to use a lookup table (trying to do so would likely also eat most of its performance), though since each table lookup potentially does multiple symbols, it would not necessarily be slower than Huffman in this case.
Some alternate bit-twiddling could be possible if one were assuming the use of specialized CPU helper instructions to pack/unpack the indices (doing tricks similar to Decimal / DPD encoding, rather than using multiply-by-reciprocal trickery). But, probably not worthwhile (and would likely make it slower for a pure software decoder, except in the 8-bit case which could use a lookup table).
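To make the multiply-by-reciprocal point concrete for the 8-bit pair case: splitting a value in 0..120 into quotient and remainder by 11 can be done with one multiply, one shift, and one multiply-subtract. The constant 187 is ceil(2^11 / 11); the helper names div11_small and split_pair are made up for this sketch, and the trick is only claimed here for the 0..120 range that the pair codes need (a compiler will typically do a similar transformation on its own for x / 11).

```c
#include <stdint.h>

/* x / 11 for 0 <= x <= 120, using a fixed-point reciprocal.
 * 187 / 2048 is slightly larger than 1/11; over this small range
 * the accumulated error never reaches the next integer. */
unsigned div11_small(unsigned x)
{
    return (x * 187u) >> 11;
}

/* Split a pair code (byte 80..F8) into its two indices, each 0..10. */
void split_pair(uint8_t c, unsigned *first, unsigned *second)
{
    unsigned v = c - 0x80u;
    unsigned q = div11_small(v);

    *first  = q;
    *second = v - q * 11u;   /* remainder without a second division */
}
```

Even in this form the pair path costs two multiplies per byte, which is the overhead being weighed against a Huffman table lookup.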

But what's more important than any techie adventure, is not
annoying your users. What do the users want most ? The ability
to edit the files in question, on a moment's notice ? Or would
the files, 99.999% of the time, comfortably remain hidden from view ?
 If the "blob" involved was 100GB, then yes, I'd compress it :-)
If it is 4KB, well, those little files are a nuisance no matter
what you do. I would leave that uncompressed, unless I could
containerize it perhaps.
 As an example, Mozilla has used .jsonlz4 as a file format solution.
I have no idea what problem they thought they were solving,
but I can tell you I consider the solution obnoxious and inconsiderate
of the user. LZ4 decompressors are not a stockroom item. I had
to write a very short program, so I could deal with that. Mozilla
has made a perfect example of what not to do, by doing that.
 
LZ4 is a fairly simple format though, so a person can implement it in a few hundred lines of C if needed.
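As a rough illustration of how little is involved, here is a minimal decoder for a single raw LZ4 block. It is a sketch, not a replacement for the reference library: it handles only the block format (no frame header or checksums, and none of the wrapper that a particular application may put in front of the block), copies byte by byte rather than optimizing, and the function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the 255-terminated length extension used when a 4-bit field is 15. */
static int lz4_ext(const uint8_t *src, size_t srclen, size_t *ip, size_t *len)
{
    uint8_t b;
    do {
        if (*ip >= srclen) return -1;
        b = src[(*ip)++];
        *len += b;
    } while (b == 255);
    return 0;
}

/* Decode one raw LZ4 block. Returns bytes written, or -1 on error. */
long lz4_block_decode(const uint8_t *src, size_t srclen,
                      uint8_t *dst, size_t dstcap)
{
    size_t ip = 0, op = 0, i;

    while (ip < srclen) {
        unsigned token = src[ip++];
        size_t len = token >> 4;              /* literal run length */
        size_t off;

        if (len == 15 && lz4_ext(src, srclen, &ip, &len) < 0) return -1;
        if (len > srclen - ip || len > dstcap - op) return -1;
        for (i = 0; i < len; i++) dst[op++] = src[ip++];
        if (ip == srclen) break;              /* last sequence: literals only */

        if (srclen - ip < 2) return -1;
        off = src[ip] | ((size_t)src[ip + 1] << 8);   /* little-endian */
        ip += 2;
        if (off == 0 || off > op) return -1;  /* must point into output */

        len = token & 15;                     /* match length, minus 4 */
        if (len == 15 && lz4_ext(src, srclen, &ip, &len) < 0) return -1;
        len += 4;
        if (len > dstcap - op) return -1;
        for (i = 0; i < len; i++, op++)       /* may overlap, so copy bytewise */
            dst[op] = dst[op - off];
    }
    return (long)op;
}
```

A compressor takes more code than this, mostly for the hash-table match search, which is where the "few hundred lines" would go.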

    Paul
