The Use of the U+200D "ZERO WIDTH JOINER" (ZWJ) Character in reStructuredText Input for "Sphinx" (Technical Report SLRTR 000000)
(This technical report was prepared by the author during his spare time.)
Stefan Ram 2025
Abstract - The character U+200D "ZERO WIDTH JOINER" (ZWJ) may be employed in inputs written in the "reStructuredText" (rst) markup notation for the software documentation tool "Sphinx" in order to permit the inclusion of special characters within embedded code segments. To ensure that Sphinx's automatic line breaking continues to function correctly, two minor adjustments to Sphinx are required.
I. Introduction
The software documentation tool "Sphinx" accepts texts composed in the "reStructuredText" (rst) notation. Within paragraphs, code segments are denoted by enclosing the relevant text between pairs of grave accents (``) as illustrated in Figure 1.
Figure 1: A Code Segment within a Paragraph
|... the expression ``x[ 2 ]`` may be used ...
Such segments are, however, subject to two restrictions:
- They must not begin or end with a space character (" ").
- They must not contain pairs of grave accents.
II. Versions of the Software Considered
This report pertains to Sphinx, version 8.2.3.
III. The U+200D ZERO WIDTH JOINER (ZWJ) Character as a Workaround
It is nevertheless possible to include a space at the beginning of an embedded code segment by prefixing it with the invisible character U+200D "ZERO WIDTH JOINER" (ZWJ). Similarly, a space may be appended to the end of such a segment by suffixing it with a ZWJ. Furthermore, a sequence of multiple grave accents within an embedded code segment can be achieved by interposing a ZWJ between the grave accents.
The ZWJ character is invisible in Sphinx's output. If desired, it may additionally be removed from the output by means of post-processing.
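The three uses of ZWJ described in this section, together with the optional post-processing step, can be sketched in Python as follows. The function names protect_literal and strip_zwj are hypothetical helpers introduced only for this illustration; they are not part of Sphinx or Docutils, and cases the report does not discuss (for instance a segment that ends with a grave accent) are not handled.

```python
import re

ZWJ = '\u200d'  # U+200D ZERO WIDTH JOINER


def protect_literal(code: str) -> str:
    """Wrap `code` as an rst inline literal, inserting ZWJ where needed.

    Hypothetical helper: a sketch of the workaround, not a Sphinx API.
    """
    # Interpose a ZWJ between every two adjacent grave accents.
    body = re.sub(r'`(?=`)', '`' + ZWJ, code)
    # Shield a leading space with a ZWJ prefix.
    if body.startswith(' '):
        body = ZWJ + body
    # Shield a trailing space with a ZWJ suffix.
    if body.endswith(' '):
        body = body + ZWJ
    return '``' + body + '``'


def strip_zwj(text: str) -> str:
    """Optional post-processing: remove every ZWJ from generated output."""
    return text.replace(ZWJ, '')


if __name__ == '__main__':
    # repr() makes the otherwise invisible ZWJ visible as '\u200d'.
    print(repr(protect_literal(' x[ 2 ] ')))
```

Whether post-processing with strip_zwj is appropriate depends on the output: it would also remove any ZWJ that is intentionally present in running text.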
IV. Consideration of ZWJ in Line Breaking and Word Division
Sphinx interprets a ZWJ as a character of width one and regards it as a potential break point within words. Consequently, the formatting of output text may be affected. This behavior can be modified by two changes to the Sphinx source code.
A. Adjustment of Character Width
Within the source file "docutils\utils\__init__.py" (which belongs to the Docutils package used by Sphinx, not to Sphinx itself), one column should be subtracted from the total text width for each ZWJ, so that a ZWJ is no longer counted as a character of width one. This is accomplished by inserting the following line immediately before the "return width" statement in the definition of the column_width function:
Figure 2: The line to be inserted
|width -= text.count('\u200d')
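The effect of this line can be illustrated with a simplified stand-in for the width computation. The function column_width_sketch below is an assumption made for this illustration: it approximates a width calculation based on East Asian width classes and combining characters, and it is not a copy of the library's code. The parameter "patched" is likewise hypothetical and merely switches the line from Figure 2 on or off.

```python
import unicodedata

ZWJ = '\u200d'  # U+200D ZERO WIDTH JOINER

# Columns per East Asian width class (wide and fullwidth count as two).
_WIDTHS = {'W': 2, 'F': 2, 'Na': 1, 'H': 1, 'N': 1, 'A': 1}


def column_width_sketch(text: str, patched: bool = True) -> int:
    """Approximate the column width of `text` (illustrative stand-in)."""
    width = sum(_WIDTHS[unicodedata.east_asian_width(c)] for c in text)
    # Combining characters occupy no column of their own.
    width -= sum(1 for c in text if unicodedata.combining(c))
    if patched:
        # The line from Figure 2: do not count ZWJ as a column.
        width -= text.count(ZWJ)
    return width


if __name__ == '__main__':
    sample = 'a' + ZWJ + 'b'
    print(column_width_sketch(sample, patched=False))
    print(column_width_sketch(sample, patched=True))
```

In this sketch, the ZWJ contributes one column without the adjustment because it is neither a wide nor a combining character, which matches the behavior described above.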
B. Adjustment of Break Point Determination
(This adjustment is likely unnecessary for ZWJ within embedded code segments, but may be required if ZWJ is used within words of running text for any reason.)
In the Sphinx source file "sphinx\writers\text.py", words should not be split at the occurrence of ZWJ within a word. To this end, the definition shown in Figure 3 may be inserted below the definition of the split function (which itself is within the definition of the _split function in the TextWrapper class). The indentation of the new col_width function should match that of the preceding split function.
Figure 3: The definition to be inserted
|def col_width(t: str) -> int:
|    '''for the purpose of word splitting, treat
|    zero-width characters just as characters
|    of width one.'''
|    width = column_width(t)
|    if width == 0: width = 1
|    return width
The source code should further be modified such that this new col_width function is invoked in the call to "groupby" three lines below, replacing the previous use of column_width.
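Why the replacement matters can be seen from a small, self-contained sketch of the grouping step. It assumes that width adjustment A has been applied, so that a ZWJ has width zero; the column_width function below is a deliberately crude stand-in (every character other than ZWJ counts as one column), and the helper "groups" is hypothetical. Only the use of groupby with a width function as key is taken from the description above.

```python
from itertools import groupby

ZWJ = '\u200d'  # U+200D ZERO WIDTH JOINER


def column_width(t: str) -> int:
    """Crude stand-in for the adjusted width function: ZWJ counts as 0."""
    return len(t) - t.count(ZWJ)


def col_width(t: str) -> int:
    """As in Figure 3: treat zero-width characters as width one."""
    width = column_width(t)
    if width == 0:
        width = 1
    return width


def groups(chunk: str, key) -> list[str]:
    """Group consecutive characters of `chunk` that share the same key."""
    return [''.join(g) for _, g in groupby(chunk, key)]


word = 'ab' + ZWJ + 'cd'
# Keyed by column_width, the ZWJ forms a group of its own,
# so the word falls apart into three pieces.
split_at_zwj = groups(word, column_width)
# Keyed by col_width, all characters share width one
# and the word stays in a single piece.
kept_whole = groups(word, col_width)

if __name__ == '__main__':
    print([ascii(s) for s in split_at_zwj])
    print([ascii(s) for s in kept_whole])
```

Because groupby starts a new group whenever the key changes, a character of width zero between characters of width one creates a boundary inside the word; mapping width zero to one removes that boundary.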