Subject: AWK As A Major Systems Programming Language
From: bencollver (at) *nospam* tilde.pink (Ben Collver)
Newsgroups: comp.misc
Date: 18 Aug 2024, 01:28:21
Organization: A noiseless patient Spider
Message-ID: <slrnvc2fsg.gg1.bencollver@svadhyaya.localdomain>
User-Agent: slrn/1.0.3 (Linux)

AWK As A Major Systems Programming Language
===========================================
by Arnold Robbins, June, 2024

Preface
=======
I started this paper in 2013, and in 2015 sent it out for review to
the people listed later on. After incorporating comments, I sent it
to Rik Farrow, the editor of the USENIX magazine ;login: to see if he
would publish it. He declined to do so, for reasonably good reasons.

The paper languished, forgotten, until early 2018 when I came across
it and decided to polish it off, put it up on GitHub, and make it
available from my home page in HTML.

In 2024, I took a fresh look at it, and decided to polish it a little
bit more.

If you are interested in language design and evolution in general,
and in Awk in particular, I hope you will enjoy reading this paper.
If not, then why are you bothering looking at it now?

Arnold Robbins
Nof Ayalon, ISRAEL
June, 2024

1 Introduction
==============
At the March 1991 USENIX conference, Henry Spencer presented a paper
entitled AWK As A Major Systems Programming Language. In it, he
described his experiences using the original version of awk to write
two significant "systems" programs--a clone for a reasonable subset
of the nroff formatter [1], and a simple parser generator.

He described what awk did well, as well as what it didn't, and
presented a list of things that awk would need to acquire in order to
take the position of a reasonable alternative to C for systems
programming tasks on Unix systems.

In particular, awk lies roughly in the middle of the spectrum between
C, which is "close to the metal," and the shell, which is quite
high-level. A language at this level that is useful for doing systems
programming is very desirable.

This paper reviews Henry's wish list, and describes some of the
events that have occurred in the Unix/Linux world since 1991. It
presents a case that gawk--GNU Awk--fills most of the major needs
Henry listed way back in 1991, and then describes the author's
opinion as to why other languages have successfully filled the
systems programming role which awk did not. It discusses how the
current version of gawk may finally be able to join the ranks of
other popular, powerful, scripting languages in common use today, and
ends with some counter-arguments and the author's responses to
them.

Acknowledgements
----------------
Thanks to Andrew Schorr, Henry Spencer, Nelson H.F. Beebe, and Brian
Kernighan for reviewing an earlier draft of this paper.

2 That Was Then ...
===================
In this section we review the state of the Unix world in 1991, as
well as the state of awk, and then list what Henry Spencer saw as
missing for awk.

* The Unix World in 1991
* What Awk Lacked In 1991

2.1 The Unix World in 1991
==========================
Undoubtedly, many readers of this paper were not using computers in
1991, so this section provides the context in which Henry's paper was
written. In March of 1991:

* Commercial Unix systems were the norm, with offerings from AT&T,
  Digital Equipment Corporation, Hewlett Packard, IBM, Sun
  Microsystems, and many others, all vying for market share.
  Microsoft Windows existed, but was primarily a layer on top of
  MS-DOS and was not taken seriously.
* Very few sites still ran the original Bell Labs or direct-from-UCB
  variants of Unix; those did not keep up with the available hardware
  and AT&T was itself trying to succeed in the Unix hardware market.
* GNU/Linux did not exist! Some unencumbered BSD variants were
  available, but they were still under the cloud of the AT&T/UCB
  lawsuit. [2]
* So-called "new" awk was about 2.5 years old. The book by Aho,
  Weinberger and Kernighan was published in October of 1987, so most
  people knew about new awk, but they just couldn't get it.

  Who could? New awk was available to educational institutions from
  the Bell Labs research group, and to those who had Unix source
  licenses for System V Releases 3.1, 3.2, and 4. By this time,
  source licensees were an extremely rare breed, since the cost for
  commercial licenses had skyrocketed, and even for educational
  licensees it had increased greatly. [3] If I recall correctly, an
  educational license cost around US $1,000, considerably more than
  the earlier Unix licenses.

* PERL [4] existed and was starting to gain in popularity. In 1991,
  "PERL" most likely meant PERL 3 or a very early version of PERL 4.
  The World Wide Web, which was one of the major reasons for PERL's
  growth in popularity, had not yet really taken off.
* Other implementations of new awk were available:
  + MKS Awk for PC systems (MS-DOS).
  + GNU Awk was available and relatively stable, but could not be
    called "solid."

  The problem with the first of these is that source code was not
  available. And the latter came with (to quote Henry) "troublesome
  licenses." (Actually, Henry no longer remembers whether his
  statement about "troublesome licenses" referred to the GPL, or to
  the Bell Labs source licenses.)

* Michael Brennan's mawk (also GPL'ed) was not yet available. Version
  1.0 was accepted for posting in comp.sources.reviewed on September
  30, 1991, half a year after Henry's paper was published.

2.2 What Awk Lacked In 1991
===========================
Here is a summary of what was wrong with the awk picture in 1991.
These are in the same order as presented in Henry's paper. We
qualify each issue in order to later discuss how it has been
addressed over time.

* New awk was not widely available. Most Unix vendors still shipped
  only old awk. (Here is where he mentions that "the
  independently-available implementations either cost substantial
  amounts of money or come with troublesome [sic] licenses.") His
  point then was that for portability, awk programs had to be
  restricted to old awk.

  This could be considered a quality of implementation issue,
  although it's really a "lack of available implementation"
  issue.

* There is no way to tell awk to start matching all its patterns over
  again against the existing $0. This is a language design issue.
* There is no array assignment. (Language design issue.)
* Getting an error message out to standard error is difficult.
  (Implementation issue.)
* There is no precise language specification for awk. This leads to
  gratuitous portability problems. This too is thus a quality of
  implementation issue, in that without a specification, it's
  difficult to produce uniform, high quality implementations.
* The existing widely available implementation is slow; a much faster
  implementation is needed and the best thing of all would be an
  optimizing compiler. (Implementation issue.)
* There is no awk-level debugger. (Support tool or quality of
  implementation issue.)
* There is no awk-level profiler. (Support tool or quality of
  implementation issue.)

In private email, Henry added the following items, saying "there are
a couple more things I'd add now, in hindsight." These are direct
quotes:

* [I can't believe I didn't discuss this in the paper, because I was
  certainly aware of it then!] Lack of any convenient mechanism for
  adding libraries. When awk is being invoked from a shell file, the
  shell file can do substitutions or use multiple -f options, but
  those are mechanisms outside the language, and not very convenient
  ones. What's really wanted is something like you get in Python
  etc., where one little statement up near the top says "arrange for
  this program to have the xyz library available when it runs."
* I think it was Rob Pike who later said (roughly): "It says
  something bad about Awk that in a language with integrated regular
  expressions, you end up using substr() so often." My paper did
  allude to the difficulty of finding out where something matched in
  old-awk programs, but even in new awk, what you get is a number
  that you then have to feed to substr(). The language could really
  use some more convenient way of dissecting a string using regexp
  matching. [Caveat: I have not looked lately at Gawk to see if it
  has one.]

The first of these is somewhere between a language design and a
language implementation issue. The latter is a language design issue.

3 ... And This Is Now
=====================
Fast forward to 2024. Where do things stand?

* What Awk Has Today
* And What GNU Awk Has Today
* So Where Does Awk Stand?

3.1 What Awk Has Today
======================
The state of the awk world is much better now. In the same order:

* New awk is the standard version of awk today on GNU/Linux, BSD, and
  commercial Unix systems. The one notable exception is Solaris,
  where /usr/bin/awk is still the old one; on all other systems,
  plain awk is some version of new awk.
* There remains no way to tell awk to start matching all its patterns
  over again against the existing $0. Furthermore, this is a feature
  that has not been called for by the awk community, except in
  Henry's paper. (We do acknowledge that this might be a useful
  feature.)
* There continues to be no array assignment. However, this function
  in gawk, which has arrays of arrays, can do the trick nicely. It is
  also efficient, since gawk uses reference counted strings
  internally:

    function copy_array(dest, source,   i, count)
    {
        delete dest

        for (i in source) {
            if (typeof(source[i]) == "array")
                count += copy_array(dest[i], source[i])
            else {
                dest[i] = source[i]
                count++
            }
        }

        return count
    }

* Getting error messages out is easier. All modern systems have a
  /dev/stderr special file to which error messages may be sent
  directly. gawk, mawk and Brian Kernighan's awk all have
  "/dev/stderr" built in for I/O redirections, so even on systems
  without a real /dev/stderr special file, you can still send error
  messages to standard error.
* Perhaps most important of all, with the POSIX standard, there is a
  formal standard specification for awk. As with all formal
  standards, it isn't perfect. But it provides an excellent starting
  point, as well as chapter and verse to cite when explaining the
  behavior of a standards-compliant version of awk.
  <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html>

  Additionally, the second edition of The AWK Programming Language is
  now available.

* There are a number of freely available implementations, with
  different licenses, such that everyone ought to be able to find a
  suitable one:

  * Brian Kernighan's awk is the direct lineal descendant of Unix
    awk. He calls it the "One True Awk" (sic). It is available from
    GitHub:

        $ git clone https://github.com/onetrueawk/awk bwkawk

  * GNU Awk, gawk, is available from the Free Software Foundation.
    You may use an HTTPS downloader:
    <https://ftp.gnu.org/gnu/gawk/gawk-5.3.0.tar.gz> is the current
    version. There may be a newer one.

  * Michael Brennan's awk, known as mawk. In 2009, Thomas Dickey took
    on mawk maintenance. Basic information is available on the
    project's web page. The download URL is
    <https://invisible-island.net/datafiles/release/mawk.tar.gz>

    In 2017 Michael published a beta of mawk 2.0. It's available from
    the project's GitHub page.
    <https://github.com/mikebrennan000/mawk-2>

  * MKS Awk was used for Solaris's /usr/xpg4/bin/awk, which is their
    standards-compliant version of new awk. For a while it was
    available as part of Open Solaris, but is no longer so. Some
    years ago, we were able to make this version compile and run on
    GNU/Linux after just a few hours' work.

    Although Open Solaris is now history, the Illumos project does
    make the MKS Awk available. You can view the files one at a time
    from
    <https://github.com/joyent/illumos-joyent/blob/master/usr/src/
     cmd/awk_xpg4>
    <https://illumos.org/>

  * Other, more esoteric versions as well. See the Wikipedia article,
    and also the gawk documentation.
    <https://en.wikipedia.org/wiki/Awk_language
     #Versions_and_implementations>
    <https://www.gnu.org/software/gawk/manual/html_node/
     Other-Versions.html#Other-Versions>
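
As a minimal sketch of the /dev/stderr item in the list above: the
shell function name warn_demo and the message text are made up for
illustration. The redirection to "/dev/stderr" is the mechanism that
item describes, and it should behave the same way in gawk, mawk and
Brian Kernighan's awk.

```sh
# warn_demo is a hypothetical name, used only for this illustration.
warn_demo() {
    awk 'BEGIN {
        # Diagnostics go to standard error ...
        print "demo: something went wrong" > "/dev/stderr"
        # ... while regular results stay on standard output.
        print "normal output"
    }'
}
```

Running "warn_demo 2>/dev/null" should leave only the regular output
line, since the diagnostic travels on standard error.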

3.2 And What GNU Awk Has Today
==============================
The more difficult of the quality of implementation issues are
addressed by gawk. In particular:

* Beginning with version 4.0 in 2011, gawk provides an awk-level
  debugger which is modeled after GDB. This is a full debugger, with
  breakpoints, watchpoints, single statement stepping and expression
  evaluation capabilities. (Older versions had a separate executable
  named dgawk. Today it's built into regular gawk.)
* gawk has provided an awk-level statement profiler for many years
  (pgawk). Although there is no direct correlation with CPU time
  used, the statement level profiler remains a powerful tool for
  understanding program behavior.
* Since version 4.0, gawk has had an '@include' facility whereby gawk
  goes and finds the named awk source program. For much longer it has
  searched for files specified with -f along the path named by the
  AWKPATH environment variable. The '@include' mechanism also uses
  AWKPATH.
* In terms of getting at the pieces of text matched by a regular
  expression, gawk provides an optional third argument to the match()
  function. This argument is an array which gawk fills in with both
  the matched text for the full regexp and subexpressions, and index
  and length information for use with substr(). gawk also provides
  the gensub() general substitution function, an enhanced version of
  the split() function, and the patsplit() function for specifying
  contents instead of separators using a regexp.
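
To make the regexp item above concrete, here is a small sketch of
match() with the third (array) argument, plus gensub(). It assumes
gawk is installed; the function name split_kv and the "name=number"
input format are invented for the example.

```sh
# split_kv is a hypothetical helper name; it needs gawk, not plain awk.
split_kv() {
    gawk '{
        # m[1] and m[2] receive the text of the parenthesized groups;
        # m[2, "start"] and m[2, "length"] give the position data
        # that would otherwise be fed to substr().
        if (match($0, /^([[:alpha:]_]+)=([0-9]+)$/, m))
            print m[1], m[2], m[2, "start"], m[2, "length"]

        # gensub() returns a new string; \\1 and \\2 name the groups.
        print gensub(/^(.+)=(.+)$/, "\\2=\\1", 1)
    }'
}
```

Feeding it the line "retries=42" should print "retries 42 9 2"
followed by "42=retries", with no call to substr() in sight.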

While gawk has almost always been faster than Brian Kernighan's awk,
performance improvements (a byte-code based execution engine and
internal improvements in array indexing) have brought it closer to
mawk's performance level.

And gawk clearly has the most features of any version, many of which
considerably increase the power of the language.

3.3 So Where Does Awk Stand?
============================
Despite all of the above, gawk is not as popular as other scripting
languages. Since 1991, we can point to four major scripting languages
which have enjoyed, or currently enjoy, differing levels of
popularity: PERL, tcl/tk, Python, and Ruby. We think it is fair to
say that Python is the most popular scripting language in the third
decade of the 21st century.

Is awk, as we've described it up to this point, now ready to compete
with the other languages? Not quite yet.

4 Key Reasons Why Other Languages Have Gained Popularity
========================================================
In retrospect, it seems clear (at least to us!) that there are two
major reasons that all of the previously mentioned languages have
enjoyed significant popularity. The first is their extensibility. The
second is namespace management.

One certainly cannot attribute their popularity to improved syntax.
In the opinion of many, PERL and Ruby both suffer from terrible
syntax. Tcl's syntax is readable but nothing special. Python's syntax
is elegant, although slightly unusual. The point here is that they
all differ greatly in syntax, and none really offers the clean
pattern–action paradigm that is awk's trademark, yet they are all
popular languages.

If not syntax, then what? We believe that their popularity stems from
the fact that all of these languages are easily extensible. This is
true with both "modules" in the scripting language, and more
importantly, with access to C level facilities via dynamic library
loading.

Furthermore, these languages allow you to group related functions and
variables into packages or modules: they let you manage the namespace.

awk, on the other hand, has always been closed. An awk program cannot
even change its working directory, much less open a connection to an
SQL database or a socket to a server on the Internet somewhere
(although gawk can do the latter).

If one examines the number of extensions available for PERL on CPAN,
or for Python such as PyQt or the Python tk bindings, it becomes
clear that extensibility is the real key to power (and from there to
popularity).

Furthermore, in awk, all global variables and functions share a
single namespace. This prevents many good software development
practices based on the principle of information hiding.

To summarize: A reasonable language definition, efficient
implementations, debuggers and profilers are necessary but not
sufficient for true power. The final ingredients are extensibility
and namespaces.

5 Filling The Extensibility Gap
===============================
With version 4.1, gawk (finally) provides a defined C API for
extending the core language.

* API Overview
* Discussion
* Future Work

5.1 API Overview
================
The API makes it possible to write functions in C or C++ that are
callable from an awk program as if the function were written in awk.
The most straightforward way to think of these functions is as
user-defined functions that happen to be implemented in a different
language.

The API provides the following facilities:

* Structures that map awk string, numeric, and undefined values into
  C types that can be worked with.
* Management of function parameters, including the ability to convert
  a parameter whose original type is undefined, into an array. That
  is, there is full call-by-reference for arrays. Scalars are passed
  by value, of course.
* Access to the symbol table. Extension functions can read all awk
  variables, and create and update new variables. As an initial,
  relatively arbitrary design decision, extensions cannot update
  special variables such as NR or NF, with the single exception of
  PROCINFO.
* Full array management, including the ability to create arrays, and
  arrays of arrays, and the ability to add and delete elements from
  an array. It is also possible to "flatten" an array into a data
  structure that makes it simple for C code to loop over all the
  elements of an array.
* The ability to run a procedure when gawk exits. This is
  conceptually the same as the C atexit() function.
* Hooks into the built-in I/O redirection mechanisms in gawk. In
  particular, there are separate facilities for input redirections
  with getline and '<', output redirections with print or printf and
  '>' or '>>', and two-way pipelines with gawk's '|&' operator.

5.2 Discussion
==============
Considerable thought went into the design of the API. The gawk
documentation provides a full description of the API itself, with
examples (over 50 pages worth!), as well as some discussion of the
goals and design decisions behind the API (in an appendix). The
development was done over the course of about a year and a half,
together with the developers of xgawk, a fork of gawk that added
features that made using extensions easier, and included an extension
for processing XML files in a way that fit naturally with the
pattern–action paradigm. While it may not be perfect, the gawk
developers feel that it is a good start.

<https://www.gnu.org/software/gawk/manual/html_node/
Dynamic-Extensions.html#Dynamic-Extensions>

<https://www.gnu.org/software/gawk/manual/html_node/
Extension-Design.html#Extension-Design>

FIXME: Henry Spencer suggests adding more info on the API and on the
design decisions. I think this paper is long enough, and the full doc
is quite big. It'd be hard to pull API doc into this paper in a
reasonable fashion, although it would be possible to review some of
the design decisions. Comments?

The major xgawk additions to the C code base have been merged into
gawk, and the extensions from that project have been rewritten to use
the new API. As a result, the xgawk project developers renamed their
project gawkextlib, and the project now provides only extensions. [5]

It is notable that functions written in awk can do a number of things
that extension functions cannot, such as modify any variables, do
I/O, call awk built-in functions, and call other user-defined
functions.

While it would certainly be possible to provide APIs for all of these
features for extension functions, this seemed to be overkill.
Instead, the gawk developers took the view that extension functions
should provide access to external facilities, and provide
communication to the awk level via function parameters and/or global
variables, including associative arrays, which are the only real data
structure.

Consider a simple example. The standard du program can recursively
walk one or more arbitrary file hierarchies, call stat() to retrieve
file information, and then sum up the blocks used. In the process, du
must track hard links, so that no file is accounted for or reported
more than once.

The 'filefuncs' extension shipped with gawk provides a stat()
function that takes a pathname and fills in an associative array with
the information retrieved from stat(). The array elements have names
like "size", "mtime" and so on, with corresponding appropriate
values. (Compare this to PERL's stat() function that returns a
linearly-indexed array!)
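
A minimal sketch of calling that stat() function follows. It assumes
a gawk installation that includes the filefuncs extension; the
wrapper name file_info is hypothetical, and only the "type" and
"size" elements are shown.

```sh
# file_info is a hypothetical wrapper; it assumes a gawk whose
# filefuncs extension is installed.
file_info() {
    gawk -l filefuncs 'BEGIN {
        # stat() fills the array and returns a negative value on failure.
        if (stat(ARGV[1], info) < 0) {
            print "file_info: " ARGV[1] ": " ERRNO > "/dev/stderr"
            exit 1
        }
        # Fields are looked up by name rather than by position.
        print info["type"], info["size"]
    }' "$1"
}
```

For a regular file holding five bytes, "file_info path" should print
"file 5".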

The fts() function in the 'filefuncs' extension builds on stat() to
create a multidimensional array of arrays that describes the
requested file hierarchies, with each element being an array filled
in by stat(). Directories are arrays containing elements for each
directory entry, with an element named "." for the array
itself.

Given that fts() does the heavy lifting, du can be written quite
nicely, and quite portably [6], in awk. See Appendix B, Awk Code For
du, for the code, which weighs in at under 250 lines. Much of this is
comments and argument parsing.

<http://www.skeeve.com/awk-sys-prog.html#du-in-awk>

5.3 Future Work
===============
The extension facility is relatively new, and undoubtedly has
introduced new "dark corners" into gawk. These remain to be uncovered
and any new bugs need to be shaken out and removed.

Some issues are known and may not be resolvable. For example, 64-bit
integer values such as the timestamps in stat() data on modern
systems don't fit into awk's 64-bit double-precision numbers which
only have 53 bits of significand. This is also a problem for the
bit-manipulation functions.
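
The 53-bit limit is easy to observe. The sketch below assumes an awk
that stores numbers as IEEE 754 doubles (and, for gawk, that the -M
arbitrary-precision option is not in use); the function name
precision_demo is made up for the example.

```sh
# precision_demo is a hypothetical name used only for this sketch.
precision_demo() {
    awk 'BEGIN {
        big = 9007199254740992       # 2^53
        # 2^53 + 1 has no exact double representation, so the sum
        # rounds back to 2^53 and the comparison yields 1 (true).
        print (big + 1 == big)
        # 2^53 - 1 is still exactly representable: prints 0 (false).
        print (big - 1 == big)
    }'
}
```

A 64-bit timestamp or inode number above 2^53 can be silently rounded
in the same way once it becomes an awk number.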

With respect to namespaces, in 2017 I (finally) figured out how
namespaces in awk ought to work to provide the needed functionality
while retaining backwards compatibility. This was released with gawk
5.0.
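
As a rough sketch of what this looks like in practice, assuming gawk
5.0 or later: the namespace name "counter", its add() function and
the wrapper name ns_demo are all invented for the illustration.

```sh
# ns_demo is a hypothetical name; the awk program needs gawk 5.0 or later.
ns_demo() {
    gawk '
    # Unqualified names below this directive become counter::name.
    @namespace "counter"

    # This assigns counter::total.
    BEGIN { total = 0 }

    function add(n) { total += n; return total }

    # Switch back to the default namespace.
    @namespace "awk"

    BEGIN {
        # A different variable: plain total lives in the awk namespace.
        total = 100
        counter::add(5)
        counter::add(7)
        print total, counter::total
    }'
}
```

Calling ns_demo should print "100 12": the two variables named total
no longer collide, which is the information hiding that a single
global namespace rules out.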

One or two of the sample extensions shipped with gawk and in
gawkextlib have been modified to take advantage of namespaces.

6 Counterpoints
===============
Brian Kernighan raised several counterpoints in response to an
earlier draft of the paper. They are worth addressing (or at least
trying to):

    I'm not 100% convinced by your basic premise, that the lack of an
    extension mechanism is the main / a big reason why Awk isn't used
    for the kinds of system programming tasks that Perl, Python, etc.,
    are. It's absolutely a factor--without such a mechanism, there's
    just no way to do a lot of important computations. But how does
    that trade off against just having built-in mechanisms for the core
    system programming facilities (as Perl does) or a handful of core
    libraries like sys, os, regex, etc., for Python?

I think that Perl's original inclusion of most of the Unix system
calls was, from a language design standpoint, ultimately a mistake.
At the time it was first done, there was no other choice: dynamic
loading of libraries didn't exist on Unix systems in the early and
mid-1980s (nor did shared libraries, for that matter). But having all
those built-in functions bloats the language, making it harder to
learn, document, and maintain, and I definitely did not wish to go
down that path for gawk.

With respect to Python, the question is: how are those libraries
implemented? Are they built-in to the interpreter and separated from
the "core" language simply by the language design? Or are they
dynamically loaded modules?

If the latter, that sounds like an argument for the case of having
extensions, not against it. And indeed, this merely emphasizes the
point made at the end of the previous section, which is that to make
an extension facility really scalable, you also need some sort of
namespace / module capability.

Thus, Brian is correct: an extension facility is needed, but the last
part of the puzzle would be a module facility in the language. I
think that I have solved this, and invite the curious reader to
check out the current versions of gawk.

    I'm also not convinced that Awk is the right language for writing
    things that need extensions. It was originally designed for
    1-liners, and a lot of its constructs don't scale up to bigger
    programs. The notation for function locals is appalling (all my
    fault too, which makes it worse). There's little chance to recover
    from random spelling mistakes and typos; the use of mere adjacency
    for concatenation looks ever more like a bad idea.

This is hard to argue with. Nonetheless, gawk's --lint option may be
of help here, as well as the --dump-variables option which produces a
list of all variables used in the program.

    Awk is fine for its original purpose, but I find myself writing
    Python for anything that's going to be bigger than say 10-20 lines
    unless the lines are basically just longer pattern-action
    sequences. (That notation is a win, of course, which you point
    out.)

I have worked for several years in Python. For string manipulation
and processing records, you still have to write all the manual stuff:
open the file, read lines in a loop, split them, etc. Awk does all
this stuff for me.
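
A short sketch of that point, using only POSIX awk features: the
function name sum_by_key and the two-column input layout (a key
followed by a number) are assumptions made for the example.

```sh
# sum_by_key is a hypothetical name; the two-column input format
# (key, number) is an assumption made for this illustration.
sum_by_key() {
    awk '
    # No open/read/split code: awk reads each record and sets $1, $2.
    { total[$1] += $2 }
    END {
        for (key in total)
            print key, total[key]
    }' "$@" | sort
}
```

The output is piped through sort only because the order of a
"for (key in total)" loop is unspecified. Given the three input lines
"alice 10", "bob 5" and "alice 7", it should print "alice 17" and
then "bob 5".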

Additionally, I think that with discipline, it's possible to write
fairly good-sized, understandable and maintainable awk programs; in
my experience awk does scale up well beyond the one-liner range.

Not to mention that Brian published (twice now!) a whole book of awk
programs larger than one line. :-) (See the Resources section.)

Some of my own, good-sized awk programs are available from GitHub:

The TexiWeb Jr. literate programming system
-------------------------------------------
See <https://github.com/arnoldrobbins/texiwebjr>. The suite has two
programs that total over 1,300 lines of awk. (They share some code.)

Prepinfo
--------
See <https://github.com/arnoldrobbins/prepinfo>. This script
processes Texinfo files, updating menus as needed. This version is
rewritten in TexiWeb Jr.; it's about 350 lines of awk.

Sortmail
--------
See <https://github.com/arnoldrobbins/sortmail>. This script sorts a
Unix mbox format mailbox by thread. I use it daily. It's also written
in TexiWeb Jr. and is about 330 lines of awk.

Brian continues:

    The du example is awfully big, though it does show off some of the
    language features. Could you get the same mileage with something
    quite a bit shorter?

My definition of "small" and "big" has changed over time. 250 lines
may be big for a script, but the du.awk program is much smaller than
a full implementation in C: GNU du is over 1,100 lines of C, plus all
the libraries it relies upon in the GNU Coreutils.

With respect to shorter examples, nothing springs to mind
immediately. However, gawk comes with several useful extensions that
are worth exploring, much more than we've covered here.

For example, the readdir extension in the gawk distribution causes
gawk to read directories and return one record per directory entry in
an easy-to-parse format:

$ gawk -lreaddir '{ print }' .
-| 2109292/mail.mbx/f
-| 2109295/awk-sys-prog.texi/f
-| 2100007/./d
-| 2100056/texinfo.tex/f
-| 2100055/cleanit/f
-| 2109282/awk-sys-prog.pdf/f
-| 2100009/du.awk/f
-| 2100010/.git/d
-| 2098025/../d
-| 2109294/ChangeLog/f

How cool is that?!? :-)

Also, the gawkextlib project provides some very interesting
extensions. Of particular interest are the XML and JSON extensions,
but there are a number of others, and it's worth checking out.

In 2018 I wrote here:

    In short, it's too early to really tell. This is the beginning of
    an experiment. I hope it will be a fun journey for me, the other
    gawk maintainers, and the larger community of awk users.

In 2024, I have to say that extensions haven't particularly caught
on. This saddens me, but it seems to be typical of awk users that
they use what's in the language and aren't interested in extending
it, or they don't know that they can. Sigh.

7 Conclusion
============
It has taken much longer than any awk fan would like, but finally,
GNU Awk fills in almost all the gaps listed by Henry Spencer for awk
to be really useful as a systems programming language.

In addition, experience from other popular languages has shown that
extensibility and namespaces are the keys to true power, usability,
and popularity.

With the release of gawk 4.1, we feel that gawk (and thus the Awk
language) are now almost on par with the basic capabilities of other
popular languages. With gawk 5.0, we hope(d) to truly reach par.

Is it too late in the game? In 2024, sadly, it does seem to be. But
at least I had fun adding the new features to gawk.

I hope that this paper will have piqued your curiosity, and that you
will take the time to give gawk a fresh look.

Appendix A Resources
====================
1. The AWK Programming Language Paperback, second edition,
   Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.
   Addison-Wesley, 2023. ISBN-13: 978-0138269722,
   ISBN-10: 0138269726.
2. Effective awk Programming, fourth edition. Arnold Robbins.
   O'Reilly Media, 2015. ISBN-13: 978-1491904619,
   ISBN-10: 1491904615.
3. Online version of the gawk documentation:
   <https://www.gnu.org/software/gawk/manual/>
4. The gawkextlib project:
   <https://sourceforge.net/projects/gawkextlib/>

Appendix B Awk Code For du
==========================
Here is the du program, written in Awk. Besides demonstrating the power
of the stat() and fts() extensions and gawk's multidimensional
arrays, it also shows the switch statement and the built-in bit
manipulation functions and(), or(), and compl().

The output is not identical to GNU du's, since filenames are not
sorted. However, gawk's built-in sorting facilities should make
sorting the output straightforward; we leave that as the traditional
"exercise for the reader."

#! /usr/local/bin/gawk -f

# du.awk --- write POSIX du utility in awk.
# See https://pubs.opengroup.org/onlinepubs/9699919799/utilities/du.html
#
# Most of the heavy lifting is done by the fts() function in the "filefuncs"
# extension.
#
# We think this conforms to POSIX, except for the default block size, which
# is set to 1024. Following GNU standards, set POSIXLY_CORRECT in the
# environment to force 512-byte blocks.
#
# Arnold Robbins
# arnold@skeeve.com

@include "getopt"
@load "filefuncs"

BEGIN {
    FALSE = 0
    TRUE = 1

    BLOCK_SIZE = 1024   # Sane default for the past 30 years
    if ("POSIXLY_CORRECT" in ENVIRON)
        BLOCK_SIZE = 512        # POSIX default

    compute_scale()

    fts_flags = FTS_PHYSICAL
    sum_only = FALSE
    all_files = FALSE

    while ((c = getopt(ARGC, ARGV, "aHkLsx")) != -1) {
        switch (c) {
        case "a":
            # report size of all files
            all_files = TRUE;
            break
        case "H":
            # follow symbolic links named on the command line
            fts_flags = or(fts_flags, FTS_COMFOLLOW)
            break
        case "k":
            BLOCK_SIZE = 1024       # 1K block size
            break
        case "L":
            # follow all symbolic links

            # fts_flags &= ~FTS_PHYSICAL
            fts_flags = and(fts_flags, compl(FTS_PHYSICAL))

            # fts_flags |= FTS_LOGICAL
            fts_flags = or(fts_flags, FTS_LOGICAL)
            break
        case "s":
            # do sums only
            sum_only = TRUE
            break
        case "x":
            # don't cross filesystems
            fts_flags = or(fts_flags, FTS_XDEV)
            break
        case "?":
        default:
            usage()
            break
        }
    }

    # if both -a and -s
    if (all_files && sum_only)
        usage()

    for (i = 0; i < Optind; i++)
        delete ARGV[i]

    if (Optind >= ARGC) {
        delete ARGV     # clear all, just to be safe
        ARGV[1] = "."   # default to current directory
    }

    fts(ARGV, fts_flags, filedata)  # all the magic happens here

    # now walk the trees
    if (sum_only)
        sum_walk(filedata)
    else if (all_files)
        all_walk(filedata)
    else
        top_walk(filedata)
}

# usage --- print a message and die

function usage()
{
    print "usage: du [-a|-s] [-kx] [-H|-L] [file] ..." > "/dev/stderr"
    exit 1
}

# compute_scale --- compute the scale factor for block size calculations

function compute_scale(     stat_info, blocksize)
{
    stat(".", stat_info)

    if (! ("devbsize" in stat_info)) {
        printf("du.awk: you must be using the filefuncs extension " \
            "from gawk 4.1.1 or later\n") > "/dev/stderr"
        exit 1
    }

    # Use "devbsize", which is the units for the count of blocks
    # in "blocks".
    blocksize = stat_info["devbsize"]
    if (blocksize > BLOCK_SIZE)
        SCALE = blocksize / BLOCK_SIZE
    else    # I can't really imagine this would be true
        SCALE = BLOCK_SIZE / blocksize
}

# islinked --- return true if a file has been seen already

function islinked(stat_info,        device, inode, ret)
{
    device = stat_info["dev"]
    inode = stat_info["ino"]

    ret = ((device, inode) in Files_seen)

    return ret
}

# file_blocks --- return number of blocks if a file has not been seen yet

function file_blocks(stat_info,     device, inode)
{
    if (islinked(stat_info))
        return 0

    device = stat_info["dev"]
    inode = stat_info["ino"]

    Files_seen[device, inode]++

    return block_count(stat_info)   # delegate actual counting
}

# block_count --- return number of blocks from a stat() result array

function block_count(stat_info,     result)
{
    if ("blocks" in stat_info)
        result = int(stat_info["blocks"] / SCALE)
    else
        # otherwise round up from size
        result = int((stat_info["size"] + (BLOCK_SIZE - 1)) / BLOCK_SIZE)

    return result
}

# sum_dir --- data on a single directory

function sum_dir(directory, do_print,   i, sum, count)
{
    for (i in directory) {
        if ("." in directory[i]) {  # directory
            count = sum_dir(directory[i], do_print)
            count += file_blocks(directory[i]["."])
            if (do_print)
                printf("%d\t%s\n", count, directory[i]["."]["path"])
        } else {            # regular file
            count = file_blocks(directory[i]["stat"])
        }
        sum += count
    }

    return sum
}

# simple_walk --- summarize directories --- print info per parameter

function simple_walk(filedata, do_print,    i, sum, path)
{
    for (i in filedata) {
        if ("." in filedata[i]) {   # directory
            sum = sum_dir(filedata[i], do_print)
            path = filedata[i]["."]["path"]
        } else {            # regular file
            sum = file_blocks(filedata[i]["stat"])
            path = filedata[i]["path"]
        }
        printf("%d\t%s\n", sum, path)
    }
}

# sum_walk --- summarize directories ---
# print info only for the top set of directories

function sum_walk(filedata)
{
    simple_walk(filedata, FALSE)
}

# top_walk --- data on every directory (the default behavior)

function top_walk(filedata)
{
    simple_walk(filedata, TRUE)
}

# all_walk --- data on every file

function all_walk(filedata,     i, sum, count)
{
    for (i in filedata) {
        if ("." in filedata[i]) {   # directory
            count = all_walk(filedata[i])
            sum += count
            printf("%d\t%s\n", count, filedata[i]["."]["path"])
        } else {            # regular file
            if (! islinked(filedata[i]["stat"])) {
                count = file_blocks(filedata[i]["stat"])
                sum += count
                if (i != ".")
                    printf("%d\t%s\n", count, filedata[i]["path"])
            }
        }
    }
    return sum
}
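
Regarding the sorting exercise mentioned before the listing: one
possible approach, offered only as a sketch, is to collect the block
counts in an array indexed by pathname instead of printing them
immediately, and then print them in index order using gawk's
asorti(). The Sizes array, the report() function, and the sample
data are hypothetical and not part of du.awk.

```awk
# sorted.awk --- sketch: produce du-style output sorted by pathname.
# Sizes[] and the sample entries are hypothetical; du.awk itself
# prints as it walks instead of collecting results.

function report(sizes,      n, i, paths, out)
{
    # asorti() stores the indices of sizes[] in paths[1..n],
    # sorted as strings, and leaves sizes[] itself untouched.
    n = asorti(sizes, paths)

    for (i = 1; i <= n; i++)
        out = out sprintf("%d\t%s\n", sizes[paths[i]], paths[i])

    return out
}

BEGIN {
    Sizes["./src"] = 120
    Sizes["./doc"] = 48
    Sizes["."] = 176
    Sizes["./README"] = 8

    printf("%s", report(Sizes))
}
```

Adapting du.awk along these lines would mean replacing each printf()
in the walk functions with an assignment into such an array, then
calling something like report() once the walk is finished.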

Footnotes
=========
[1]
The Amazingly Workable Formatter, awf, is available from
<ftp://ftp.freefriends.org/arnold/Awkstuff/awf.tgz>

[2]
See the Wikipedia article, and some notes at the late Dennis
Ritchie's website. There are undoubtedly other sources of information
as well.
<https://en.wikipedia.org/wiki/USL_v._BSDi>
<https://www.bell-labs.com/usr/dmr/www/bsdi/bsdisuit.html>

[3]
Especially for budget-strapped educational institutions, source
licences were increasingly an expensive luxury, since SVR4 rarely ran
on hardware that they had.

[4]
I've been told that one of the reasons Larry Wall created PERL is
that he either didn't know about new awk, or he couldn't get it.

[5]
For more information, see the gawkextlib project page.
<https://sourceforge.net/projects/gawkextlib/>

[6]
The awk version of du works on Unix, GNU/Linux, Mac OS X, and MS
Windows. On Windows only Cygwin is currently supported. We hope to
one day support MinGW also.

From: <http://www.skeeve.com/awk-sys-prog.html>
