Gettext Generalized

Chusslove Illich

Revision History
2011-08-10
First draft.
2013-05-17
Strict split between XML-like and plain text generalized Gettext calls.
2015-02-08
Section on other approaches to translation scripting.

Abstract

GNU Gettext is a translation system for user interface text in computer programs. It is the most used translation system in free software. It encompasses the runtime library interface, the message extraction system, the translation file format, and the workflow and tools which connect these elements. This document proposes a number of extensions to Gettext which would generalize and broaden its capabilities, for both programmers and translators. No backward incompatible changes are required, and generalized Gettext calls can be used alongside classic calls even in the same source file. The extensions are as follows: argument capturing, uniform placeholders, customizable markup, interface contexts, and translation scripting. For each extension it is explained how it works and what it improves on its own, and how it fits together with other extensions.


Table of Contents

1. What More Could Gettext Do?
1.1. A Quick Taste of Generalization
2. Argument Capturing
2.1. Implicit vs. Explicit Conversion to Native String
2.2. Generalized Plural Call
2.3. Other Special Calls
3. Uniform Placeholders
3.1. Selection of Plural-Deciding Argument
3.2. Formatting Directives vs. Placeholders
4. Customizable Markup
4.1. Tag Set and Target Markups
4.2. No Custom Entities
4.3. Validating Markup
4.4. Accelerator Markers
4.5. Customizable Markup is Optional
5. Interface Contexts
5.1. Deciding Target Markup by Interface Context
6. Translation Scripting
6.1. Translation Scripts in Messages ("PO Shell")
6.2. Why Keep Ordinary Translation?
6.3. Sources of Scripting Calls
6.4. Translation Object Model
6.5. Modularity and Scoping
6.6. Filtering Messages and Runtime Contexts
6.7. Setting Properties on Non-Native Messages
7. Remarks
7.1. Converting Existing Sources
7.2. Generalization As a Layer Above Gettext
7.3. Translating Non-Native Sources
7.4. History and Acknowledgments
7.5. Other Approaches to Translation Scripting

1. What More Could Gettext Do?

GNU Gettext is today the most capable system for translating user interface text in computer programs. It defines a programming interface for fetching translated pieces of text ("messages") at run time, from installed translation files ("catalogs") in various languages. It provides tools to extract messages to be translated from the source code, and to update previously translated catalogs to the current code state. It defines the translation file format (the PO format) with which translators directly work. It provides workflow guidelines and build system integration, so that programmers and translators can coordinate their efforts. In short, GNU Gettext is an all-encompassing solution, covering all elements of the translation chain.

GNU Gettext is today by far the most widely used translation system in the free software world. It was introduced in the mid-90s, based on the then-current efforts to standardize the translation system component in POSIX. This is described in more detail in the "History of GNU gettext" section (and a few other places) of its manual. Standardization never came about, and GNU Gettext was left to evolve freely, based on the experiences and needs of the free software ecosystem. The core system remained unchanged and backward compatible, and improvements were added incrementally: to the runtime calls, to the file format, to the accompanying tools. A great many third-party tools, both for interactive and batch use, were written along the way.

This text will propose some principal extensions to Gettext, which go beyond incremental improvements and detail tweaks. What has changed in translation needs, for example during the last decade, to lead to proposing these extensions? In fact, nothing: all the issues that will be addressed were present ten years ago just as they are now. However, what back then could have been considered technically heavyweight -- in performance, memory footprint, dependencies -- for a runtime translation system, should not be today. This opens the door for practical realization.

For better or for worse, the proposed extensions are not entirely mutually independent. Together they form a self-consistent generalization of Gettext, such that the use of Gettext across coding frameworks (programming languages and sub-language environments) becomes more uniform and more powerful, for both programmers and translators.

Where necessary, coding snippets will be given in C++ and Python, as two (for the present purpose) opposite ends of the spectrum of widely-known programming languages. Conspicuously excluded is C, because the proposed extensions have some object-orientation, and I am not quite at home adapting such things to C.[1]

Fully spelled Gettext calls will be used throughout for clarity. The possibility of using ad-hoc shortcut calls in particular programs, libraries, or environments (such as _, tr, i18n) remains, being orthogonal to the proposed extensions. For brevity, comments for translators and disambiguation contexts will be omitted from some example messages which should have them in real use.

1.1. A Quick Taste of Generalization

Imagine a program which provides a "fuzzy" clock. A fuzzy clock displays the time in prose and rounded, like "four o'clock", "quarter to nine", or "five past midnight". This program has a graphical display on the desktop background or in a desktop panel. The day is divided into several periods, and the current period is shown alongside the time, like "quarter to nine (evening)". Additionally, the period text is shown in a dimmer font, like grayish as opposed to black for time.

We have two implementations of this fuzzy clock. One is written in C++ with Qt as the UI toolkit, and the other in Python with GTK+. Translation-related snippets in the C++/Qt code[2], using the current Gettext interface, would look like this:

// C++/Qt

#define gettext(text) QString::fromUtf8(gettext(text))
// ...to make gettext call immediately result in Qt's QString object.

QString getPeriodString (int hour)
{
    if (hour > 22 || hour <= 4)
        return gettext("night");
    else if (hour > 4 && hour <= 9)
        return gettext("morning");
    ...
}

QString getHourString (int hour)
{
    switch (hour) {
    case 1: return gettext("one");
    case 2: return gettext("two");
    ...
    }
}

QString getTimeString (int hour, int minute)
{
    QString hourString = getHourString(minute <= 32 ? hour : hour + 1);
    if (minute <= 2)
        return gettext("%1 o'clock").arg(hourString);
    ...
    else if (minute > 27 && minute <= 32)
        return gettext("half past %1").arg(hourString);
    ...
    else if (minute > 42 && minute <= 47)
        return gettext("quarter to %1").arg(hourString);
    ...
}

QString getDisplayString (int hour, int minute)
{
    QString timeString = getTimeString(hour, minute);
    QString periodString = getPeriodString(hour);
    return gettext("%1 (<font color='gray'>%2</font>)")
                  .arg(timeString).arg(periodString);
}

The string which puts together the time and period substrings has to be wrapped for translation too, because some languages may need different spacing and ordering of parts. In Python/GTK+, the implementation would be:

# Python/GTK+

def get_period_string (hour):

    if hour > 22 or hour <= 4:
        return gettext("night")
    elif hour > 4 and hour <= 9:
        return gettext("morning")
    ...


def get_hour_string (hour):

    if hour == 1:
        return gettext("one")
    elif hour == 2:
        return gettext("two")
    ...


def get_time_string (hour, minute):

    hour_string = get_hour_string(hour if minute <= 32 else hour + 1)
    if minute <= 2:
        return gettext("%s o'clock") % hour_string
    ...
    elif minute > 27 and minute <= 32:
        return gettext("half past %s") % hour_string
    ...
    elif minute > 42 and minute <= 47:
        return gettext("quarter to %s") % hour_string
    ...


def get_display_string (hour, minute):

    time_string = get_time_string(hour, minute)
    period_string = get_period_string(hour)
    return gettext("%(a1)s (<span foreground='gray'>%(a2)s</span>)") \
                  % dict(a1=time_string, a2=period_string)

The last string uses named formatting directives (%(a1)s, %(a2)s) to enable the translator to reorder them if necessary[3]; in the C++/Qt version above, this was implicitly possible.

Let us now extract a few exemplary translation-related lines from the two versions:

// C++/Qt
gettext("one")
gettext("half past %1").arg(hourString)
gettext("%1 (<font color='gray'>%2</font>)")
       .arg(timeString).arg(periodString)

and:

# Python/GTK+
gettext("one")
gettext("half past %s") % hour_string
gettext("%(a1)s (<span foreground='gray'>%(a2)s<s/pan>)") \
       % dict(a1=time_string, a2=period_string)

The first observation to make is that, from the translator's viewpoint, there is no real difference between these two sets of strings, as the user-visible output will be the same. But the strings are technically different, because the underlying coding framework shows through them. This difference is a mere hindrance, as it adds nothing to the efficiency or quality of translation. In fact, it detracts: sometimes the technical parts of the message (as opposed to the user-visible text) should be changed in translation as well, and a profusion of formats will cause the translator to hesitate or err.

The second observation is that this code is, in fact, non-translatable. What is "half past four" in English, will be the equivalent of "half to five" in many other languages; whereas it is always "number o'clock" in English, in many languages the equivalent of "o'clock" will have to change according to the number; and so on.[4] This forces the translator into the notorious "least bad" translation.

The third observation is that if the programmer changes something in the visual aspects of the markup, for example the display color for a certain type of entity (such as for the day period above), all previously translated messages become fuzzy in PO files. This not only leads to unnecessary effort for all translators, but a translator may also fail to spot the change[5] and unfuzzy the message without properly adjusting the translation. This problem also applies to extended formatting directives such as %.1f or {num:05d}.

Now we convert the fuzzy clock code to use generalized Gettext calls, i.e. to make use of Gettext extensions which will be proposed.

The basic generalized call is named xgettext, where the x prefix stands for "XML-like". Here is how it would be employed in the new C++/Qt version:

// C++/Qt
xgettext("one")
xgettext("half past {hour}").subs("hour", hourString)
xgettext("{time} (<dimmed>{period}</dimmed>)")
        .subs("time", timeString).subs("period", periodString)

and in the new Python/GTK+ version:

# Python/GTK+
xgettext("one")
xgettext("half past {hour}", hour=hour_string)
xgettext("{time} (<dimmed>{period}</dimmed>)",
         time=time_string, period=period_string)

For translators, there is no longer any difference between the two sets of strings. Coding framework specific formatting directives have been replaced with uniform placeholders. Visual markup has been replaced with customizable markup, which enables the programmer to change the resulting visual appearance without perturbing the string seen by the translator. The xgettext function does not return a simple string, but a special object which captures the arguments and substitutes them internally after the translation has been fetched from the catalog. This enables translators to script the translation, such that it depends on the arguments supplied at run time.

In the PO file of the C++/Qt version, the translation looks like this:

#: fuzzyclock.cpp:87
#, ggx-format
msgid "morning"
msgstr "jutro"

...

#: fuzzyclock.cpp:123
#, ggx-format
msgid "one"
msgstr "jedan"
msgscr "$[set-property next-hour dva]"

#: fuzzyclock.cpp:124
#, ggx-format
msgid "two"
msgstr "dva"
msgscr "$[set-property next-hour tri]"

...

#: fuzzyclock.cpp:256
#, ggx-format
msgid "half past {hour}"
msgstr "{hour} i trideset"
msgscr "pola $[get-property next-hour {hour}]"

...

#: fuzzyclock.cpp:312
#, ggx-format
msgid "{time} (<dimmed>{period}</dimmed>)"
msgstr "{time} (<dimmed>{period}</dimmed>)"

The PO file of the Python/GTK+ version would look exactly the same, only the source references would show fuzzyclock.py:... instead. Hence, the translator does not feel the underlying coding framework at all. Messages have a new format flag[6], ggx-format, which shows that these are generalized Gettext strings with XML-like markup, rather than coding framework specific strings.

What did the translator do here? The first message is quite ordinary. The second message, which starts the hours list, has a translation script in addition to the plain translation. The translator added the script by adding the new msgscr field. Without going into syntax, this script sets the next-hour property with the value "dva" ("two") on the string "jedan" ("one") given by msgstr; the plain meaning being "the next hour after one is two". The hour "dva" ("two") has next-hour set to "tri" ("three"), and so on. Then come the time composition messages. The composition "half past hour" is translated in msgstr as "hour i trideset" ("hour and thirty"), but the user would rather expect the formulation "pola hour+1" ("half hour+1"). The translator therefore adds a script which overrides the msgstr translation, unless the script fails for some reason. This script fetches the value of the next-hour property of the {hour} argument once it is substituted, where the property was set through the previously fetched corresponding hour message, and inserts the value into the surrounding text. The result is the desired "pola hour+1" translation.
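
To make the mechanics a bit more concrete, here is a minimal, purely illustrative sketch of how the two property calls from the example could be modeled at run time; the data structure and helper names are assumptions, not the proposed implementation.

# Python
# Properties are attached to the resolved translation text of a message;
# get-property then looks them up by whatever text was substituted for
# the placeholder. This only models the example above.
properties = {}  # (owner text, property name) -> value

def set_property (owner_text, name, value):
    properties[(owner_text, name)] = value

def get_property (owner_text, name):
    return properties[(owner_text, name)]  # a missing key makes the script fail

# Fetching "one" as "jedan" runs its script:
set_property("jedan", "next-hour", "dva")
# Later, "half past {hour}" with {hour} resolved to "jedan" runs its script:
scripted = "pola " + get_property("jedan", "next-hour")  # -> "pola dva"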

The rest of this document describes in detail the features demonstrated in this example, hopefully covering many of the questions that may have been raised.

2. Argument Capturing

Argument capturing translation calls are not a feature as such, but a necessary foundation for all the actual features introduced by other proposed extensions. For this reason it is important to cover argument capturing first. However, some consequences of argument capturing will be omitted in this section, and covered in other sections where they apply directly.

Current Gettext calls are a "thin" wrapper around the coding framework's native strings. They take a native string, find (or not) its translation in the message catalog, and return the translated (or original) native string. This makes it possible to quickly prepare for translation a piece of code which was not made translatable from the start. For example:

// C++/Qt
QString msg = QString("File '%1' is missing.").arg(filePath);
QErrorMessage *dlg = new QErrorMessage();
dlg->showMessage(msg);
# Python/GTK+
msg = "File '%s' is missing." % file_path
dlg = gtk.MessageDialog(None, gtk.DIALOG_MODAL, gtk.MESSAGE_ERROR,
                        gtk.BUTTONS_CLOSE, msg)

can at a later point be straightforwardly modified to:

// C++/Qt
QString msg = gettext("File '%1' is missing.").arg(filePath);
QErrorMessage *dlg = new QErrorMessage();
dlg->showMessage(msg);
# Python/GTK+
msg = gettext("File '%s' is missing.") % file_path
dlg = gtk.MessageDialog(None, gtk.DIALOG_MODAL, gtk.MESSAGE_ERROR,
                        gtk.BUTTONS_CLOSE, msg)

The important thing to note is that, after this transformation, arguments and argument substitution remain in the scope of the coding framework. Other than wrapping the original strings with Gettext calls, nothing else needs to be changed in the code.

With generalized Gettext, arguments are captured and their substitution into the string is done by the translation system. The native string delivered at the end is a complete message, requiring no further intervention. Here is what the most verbose variant of this would look like:

// C++/Qt
Gettext::Translator gtxmsg = xgettext("File '{path}' is missing.")
                                     .subs("path", filePath);
QString msg = gtxmsg.toString();
QErrorMessage *dlg = new QErrorMessage();
dlg->showMessage(msg);
# Python/GTK+
gtxmsg = xgettext("File '{path}' is missing.", path=file_path)
    # ...of type xgettext.Translator
msg = gtxmsg.to_string()
dlg = gtk.MessageDialog(None, gtk.DIALOG_MODAL, gtk.MESSAGE_ERROR,
                        gtk.BUTTONS_CLOSE, msg)

The xgettext call returns an object of special type, defined in the "generalized Gettext library" (precise description follows shortly). In general, this object has methods for substituting arguments one by one (C++ example), although the programming language may allow shorthand syntax where some or all the arguments are substituted within the xgettext call itself (Python example). Finally, to be usable at the destination -- a GUI widget label, shell output, log file -- this object has methods for conversion to a native string.

The translation object is exposed to the client code not only to be able to call argument substitution methods on it, but also in order to be able to defer argument substitution:

// C++/Qt
class ProgressReporter
{
    ...
    /**
     * ...
     * @param msgBase ... must have {file} placeholder ...
     * ...
     */
    ProgressReporter (..., const Gettext::Translator &msgBase);
    ...
    Gettext::Translator m_msgBase;
    ...
};


void ProgressReporter::update ()
{
    ...
    QString msg = m_msgBase.subs("file", currentFilePath).toString();
    ...
}

void validateFilesOp (...)
{
    ...
    Gettext::Translator msgBase = xgettext("Validating '{file}'...");
    ProgressReporter *reporter = new ProgressReporter(..., msgBase);
    ...
}

Since the need to defer argument substitution is universal, even the coding frameworks capable of variadic-with-keyword calls (like Python or Common Lisp) would still have .subs methods available:

# Python
msg_base = xgettext("Processing {file}, {ratio}% complete...",
                    file=current_file_path)
...
rcomplete = float(current_line) / total_lines
msg = msg_base.subs("ratio", int(round(rcomplete * 100))).to_string()

The actual translation (catalog lookup) and argument substitution would happen at the last possible moment, when the conversion method to native string is called. At this point all of the arguments should have been provided to the translation object and stored[7], and they are formatted into strings and substituted into the translation fetched from the catalog. One convenience of this lazy resolution can be recognized immediately: the code can start making translation objects before Gettext has been initialized[8], instead of resorting to the technique of noop-wrapper with deferred translation call. For example:

# Python
# -*- coding: utf8 -*-

from xgettext import bindtextdomain
from xgettext import xgettext, pxgettext

g_card_ranks = {
    "king": pxgettext("card rank", "King"), # context call
    "queen": pxgettext("card rank", "Queen"),
    "jack": pxgettext("card rank", "Jack"),
    ...
}

def main ():

    ...
    bindtextdomain("pycardlib", locale_dir);
    ...
    for card_rank in g_card_ranks.values():
        print(card_rank.to_string())


if __name__ == "__main__":
    main()

One reason why lazy resolution is a necessity rather than a convenience is to avoid recursive argument substitution, in case the substituted argument contains something that looks like a placeholder. Lazy resolution is also crucial for translation scripting, as will be shown later.
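
As a contrived illustration (using only the calls proposed above), an argument may itself contain brace characters; with lazy resolution, all placeholders of the original translation are replaced in a single pass, so substituted text is never rescanned:

# Python
# The first argument happens to contain something that looks like a
# placeholder; it must not be picked up when the second argument is
# substituted, and single-pass lazy substitution guarantees that.
msg = xgettext("Copy {src} to {dest}?",
               src="/backups/{dest}-mirror",  # literal braces in the value
               dest="/home/user").to_string()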

It is clear that the "generalized Gettext library" has to provide per-programming language support, for appropriate *xgettext calls and translation objects. However, due to the conversion methods which return native strings, and due to argument substitution methods which can take native data types, there also has to exist specialized support for different frameworks within the same programming language:

// C++/Qt
#include <libintl_qt.h>
...
QDate date;
...
QString msg = xgettext("Events on {date}:").subs("date", date).toString();
// C++/GTK+
#include <libintl_glibmm.h>
...
Glib::DateTime date;
...
Glib::ustring msg = xgettext("Events on {date}:").subs("date", date).to_string();

This generalized Gettext library does not have to be an actual single software package, i.e. a part of the Gettext distribution. Instead, the Gettext distribution could contain only a single implementation (for C), and let particular coding frameworks (languages, foundation libraries) create their own bindings.

One particular substitutable argument type would be translation objects themselves. Instead of doing:

// C++/Qt
QString profileName;
...
if (...)
    profileName = xgettext("Power Saving").toString();
...
QString notification = xgettext("{profile} profile activated.")
                               .subs("profile", profileName).toString();

the programmer should do:

// C++/Qt
Gettext::Translator profileName;
...
if (...)
    profileName = xgettext("Power Saving");
...
QString notification = xgettext("{profile} profile activated.")
                               .subs("profile", profileName).toString();

Lazy resolution (translation and argument substitution) would mean here that profileName gets converted into a native string during the call to .toString of the notification message, when the time comes to substitute it in place of {profile}. This again is not just a convenience, but a crucial point when working with customizable markup.

How should the translation class Gettext::Translator be placed in the object hierarchy of the coding framework? In general, it should interact with the native object hierarchy only through its substitution and conversion methods. In particular, it should not derive from the framework's native string class, because it is in fact not a string at all, but a translated text composer.

The previous formulation suggests introducing a more basic text composer class, called Gettext::Composer, from which Gettext::Translator would derive. Gettext::Composer would actually do everything related to argument substitution and markup resolution, while Gettext::Translator would be tasked with fetching the translation and evaluating translation scripts. Where possible, Gettext::Translator should not be directly constructible (e.g. through private constructors in C++), but only through *xgettext functions. On the other hand, Gettext::Composer would be directly constructible, enabling its use in, for example, structuring of longer texts:

// C++
Gettext::Translator title = xgettext("...");
Gettext::Translator para1 = xgettext("...");
Gettext::Translator para2 = xgettext("...");
Gettext::Composer helpText("{t}\n\n{p1}\n\n{p2}\n");
helpText = helpText.subs("t", title).subs("p1", para1).subs("p2", para2);

It would not be safe to use Gettext::Translator in this way, because it would try to translate its own text, which could lead (even if unlikely) to unexpectedly fetching a translation from the catalog. Explicit use of Gettext::Composer comes even more into play when customizable markup is considered.

2.1. Implicit vs. Explicit Conversion to Native String

For the moment putting aside the verbosity due to keyword-like argument placeholders (the section about placeholders will discuss that), the only real inconvenience in the proposed generalized Gettext interface is the need to explicitly call a conversion method to get the resolved (translated and arguments substituted) native string. There are several possibilities to get around this.

The ideal world variant is that different destinations (GUI widgets, shell writers...) are aware of Gettext::Composer; that is, that Gettext::Composer can be an argument wherever the native string can be. This would be all the more important when customizable markup is taken into account, because, in principle, it could resolve into different destination markups. For example, a "<emph>...</emph>" segment would resolve into *...* for plain text destinations, and into "<i>...</i>" for HTML-like destinations (e.g. Qt Rich Text or GTK+ Pango Markup tooltips and whatsthis texts). This is presented in much more detail in the section on markup. Since the destination knows which type of markup it can process, it would internally call the appropriate conversion method of Gettext::Composer.
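
As a rough sketch of what an aware destination could look like in Python (the wrapper function and the choice of GTK+ label methods are illustrative assumptions, not part of any toolkit):

# Python/GTK+
from xgettext import Composer

def set_widget_text (widget, text):
    # A Composer is resolved internally, into the markup this destination
    # can actually render; plain native strings pass through unchanged.
    if isinstance(text, Composer):
        widget.set_markup(text.to_rich())  # e.g. a Pango-capable GTK+ label
    else:
        widget.set_text(text)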

In the absence of aware destinations, the second thing that comes to mind is implicit conversion, through e.g. conversion operators in C++ or duck typing in Python. Keeping in mind the previous paragraph, you may observe that one problem with implicit conversion is how to specify the target markup; but this will be discussed in the section on markup. A more pressing problem is inadvertent conversions. Consider this:

// C++/Qt
QStringList quips;
...
quips.append(xgettext("It's full of <emph>what</emph>?!"));
    // ...implicit conversion kicks in, leading to early resolution;
    // <emph> has to be conservatively converted to plain text.
...
QString note = xgettext("A new quip has just been uttered: {quip}")
                       .subs("quip", quips[quipIndex])
                       .toRichString();
    // ...the final message should be rich text, but all substitutes
    // already got converted to plain text earlier.

There can be several layers of code between the composite message and the translated substitutions, so if the problem is detected long after the code has been written, it may be difficult to refactor it.

There can also be a middle ground. It follows from two observations: most of the time deferred argument substitution is not necessary (e.g. a text label is directly written into or precedes a widget constructor call), and most of the arguments are either not translations themselves or are short texts without need for markup (e.g. names of things). There could be a parallel set of immediate-resolution calls, which would take all the arguments at once, call a conversion method internally, and return a native string. Both of the following two lines would set msg to a Python string object:

# Python
msg1 = xgettext("Processing {file}...", file=a_path).to_string()
msg2 = ixgettext("Processing {file}...", file=a_path)

The i prefix in ixgettext stands for "immediate".

In current C++, an ixgettext call that could both work with keyword placeholders and require no conversion call would be longer than xgettext itself, defeating the purpose. However, a stripped-down variant is still possible, by limiting oneself to ordinal placeholders:

// C++/Qt
QString msg1 = xgettext("Processing {1}...").subs(aPath).toString();
QString msg2 = ixgettext("Processing {1}...", aPath);

Ordinal placeholders index arguments rather than name them, and can be used alongside keyword placeholders; the section on placeholders provides more details. ixgettext can now be implemented as a set of templates[9] like this:

// C++/Qt

inline QString ixgettext (const char *text)
{
    return xgettext(text).toString();
}

template <typename A1>
inline QString ixgettext (const char *text, const A1 &a1)
{
    return xgettext(text).subs(a1).toString();
}

template <typename A1, typename A2>
inline QString ixgettext (const char *text, const A1 &a1, const A2 &a2)
{
    return xgettext(text).subs(a1).subs(a2).toString();
}

...

This puts an arbitrary limit on the number of arguments, but 9 is a nice number: it includes all single-digit placeholders, and the number of messages with more than 9 arguments is insignificant[10] in practice. Also, the C++11 standard introduces variadic templates, which remove both the limit and the verbosity of the definition.

It depends on the programmer's sensibility which of these three variants -- explicit conversion, implicit conversion, immediate resolution -- is preferable. Some programmers may not mind using verbose explicit conversion everywhere, to keep things uniform and minimize surprise. Others may opt for the combination of immediate-resolution calls most of the time and explicit conversion when necessary. Yet others will have no grudge against implicit conversion (especially when they do not use any markup in text). Therefore the generalized Gettext library may very well provide all three variants (with implicit conversion that can be disabled), subject to applicability in different programming languages.

2.2. Generalized Plural Call

In current Gettext, the translation call for plural forms, ngettext, takes an explicit integer argument to decide the plural form. Normally this number is also shown in the string, so it needs to be repeated twice in the logical composition of the call:

# Python
msg = ngettext("%d file selected", "%d files selected", nselfiles) \
              % nselfiles

This is necessary because ordinary Gettext calls have no knowledge of arguments which will later be substituted. With argument capturing, however, this is no longer the case, and the generalized plural call nxgettext can look like this:

# Python
msg = nxgettext("{numsel} file selected", "{numsel} files selected",
                numsel=nselfiles)

At first sight, this may look like a small syntactic gain for no real benefit. It may even look like trouble ahead: what happens when the message has several arguments and more than one of them is an integer? Like in this example:

# Python
msg = ngettext("%(numsel)d of %(numtot)d file selected"
               "%(numsel)d of %(numtot)d files selected", nselfiles) \
              % dict(numsel=nselfiles, numtot=ntotfiles)

In fact, this is where the generalized approach shines. The programmer can always explicitly set which argument decides the plural by adding the !n extension to the appropriate placeholder:

# Python
msg = nxgettext("{numsel} of {numtot!n} file selected",
                "{numsel} of {numtot!n} files selected",
                numsel=nselfiles, numtot=ntotfiles)

This means that the translator can change which argument decides the plural, and that is exactly what could be needed for some languages, as in this example:

#: file_selector.py:429
#, ggx-format
msgid "{numsel} of {numtot!n} file selected"
msgid_plural "{numsel} of {numtot!n} files selected"
msgstr[0] "Izabrana {numsel!n} datoteka od {numtot}"
msgstr[1] "Izabrane {numsel!n} datoteke od {numtot}"
msgstr[2] "Izabrano {numsel!n} datoteka od {numtot}"

Precise rules for plural resolution based on the placeholders and argument types are given in the section on placeholders.

Another benefit derives from lazy translation resolution, which means that deferred plural resolution is possible:

# Python

def process_files (...):
    ...
    msgbase = nxgettext("Processing, {num} files remaining...")
    prgup = ProgressUpdater(..., msgbase)
    ...

class ProgressUpdater:

    def __init__ (..., msgbase):
        """
        ...
        @param msgbase: ... must be plural with {num} placeholder...
        ...
        """
        self._msgbase = msgbase
        ...

    def update (self):
        ...
        msg = self._msgbase.subs("num", len(remaining_items)).to_string()
        ...

In practice, in some plural messages the singular string does not contain the plural-deciding argument, for stylistic reasons. This would still be supported:

# Python
msg = nxgettext("Move this file to {dest}?",
                "Move these {num} files to {dest}?",
                num=nfiles, dest=dirpath)

The difference is that now the translation call knows which placeholder is linked to the argument that decides the plural, so it can complain at run time (e.g. in debug mode) if there is any other difference in placeholder sets between the singular and the plural string.[11]
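
Schematically, such a run-time check could amount to something like the following (an assumed helper, shown only to illustrate the rule):

# Python
import warnings

def check_plural_placeholders (singular_phs, plural_phs, plural_deciding):
    # Aside from the plural-deciding placeholder, which may be absent from
    # the singular string, the two placeholder sets must coincide.
    mismatch = (set(singular_phs) ^ set(plural_phs)) - {plural_deciding}
    if mismatch:
        warnings.warn("placeholder mismatch between singular and plural "
                      "strings: %s" % ", ".join(sorted(mismatch)))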

2.3. Other Special Calls

The disambiguation context call pgettext remains the same in the generalized variant:

// C++
pxgettext("default action", "Default")
pxgettext("default shortcut", "Default")

The context string, however, may serve some additional purposes beyond its basic disambiguation purpose, as will be described in the section on interface contexts in relation to customizable markup.

The explicit domain selection call dgettext, which is for example necessary in libraries, also remains the same:

// C++
dxgettext("pycardlib", "Available decks:")

In practice, explicit domain selection calls are wrapped in a macro definition, in order not to have to repeat the domain name argument in every call:

// C++
#define _(msg) dxgettext("pycardlib", msg) // in a header file
...
_("Available decks:") // in a source file

This still works with generalized calls, but since generalized calls will have specialized bindings for each coding framework, it would be nice to always define a domain selection mechanism which feels "the most native" to that coding framework.
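
To illustrate what a "native-feeling" mechanism might be in Python, the binding could, hypothetically, offer a factory which fixes the PO domain once per module (the factory name and its existence are assumptions, not part of the proposal):

# Python
from xgettext import make_domain_calls  # hypothetical factory

# Fix the domain once, so that individual calls need not repeat it,
# much like the C macro above.
xgettext, nxgettext, pxgettext = make_domain_calls("pycardlib")

msg = xgettext("Available decks:").to_string()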

3. Uniform Placeholders

In current Gettext, formatting directives in strings are always specific to the coding framework:

// C++/cstdlib
snprintf(buf, size, gettext("Stopped at %d:%d due to: %s"),
         line_num, column_num, message);
// C++/Qt
QString(gettext("Stopped at %1:%2 due to: %3"))
       .arg(lineNum).arg(columnNum).arg(message);
// C++/Boost
format(gettext("Stopped at %1%:%2% due to: %3%"))
      % lineNum % columnNum % message;
# Python %-operator
gettext("Stopped at %(lin)d:%(col)d due to: %(msg)s") \
       % dict(lin=line_num, col=column_num, msg=message)
# Python .format()
gettext("Stopped at {lin}:{col} due to: {msg}")
       .format(lin=line_num, col=column_num, msg=message)

From the programmers' viewpoint, this may be considered beneficial, since there is no difference between handling the substitution of arguments into translated strings and into any other strings, within the given piece of code. It is also beneficial when translation calls are added at a later point, if the code was not prepared for translation from the start.

Translators, however, frequently work on messages originating from different coding frameworks, so they are faced with a multiplicity of formatting directive types. From the translators' viewpoint, this multiplicity has no advantage, and can only be confusing; note, for example, the small difference between the Qt and Boost directives above. It is also problematic for translation support tools, which frequently must recognize formatting directives for proper operation (e.g. for syntax highlighting, spell-checking, word counting, or placeables detection). Even Gettext's own tools, which can recognize (xgettext) and validate (msgfmt) many formatting directive types, do not recognize all of them[12], and have to be periodically updated in that respect.

The proposition is therefore to introduce uniform argument placeholders, which remain constant across coding frameworks. One rationale for this is convenience for translators and translation support tools. It should not be a significant inconvenience for programmers; in fact, the placeholders may be helpful as an overt indicator of the special rules which apply when writing translatable strings, as opposed to any other strings.

Another rationale for having uniform placeholders is that they are actually a consequence of the requirement that translation calls/objects themselves perform argument capturing and substitution. But this looks like circular reasoning, since it was said before that argument capturing was necessary for other proposed extensions. In fact, it is the customizable markup and translation scripting extensions which require argument capturing, and then argument capturing in turn requires uniform placeholders.

What should such uniform placeholders look like?

First and foremost, placeholders should be obvious, for both programmers and translators. It should be easy to spot where a placeholder starts and where it ends, no matter what it contains in between. To this end, curly bracket placeholders from Python's string.format (in turn taken over from .NET) are an excellent candidate:

"Stopped at {lin}:{col} due to: {msg}"

A placeholder starts and ends with a pair of balanced characters, and that they are curly brackets (braces) immediately indicates that something special is inside. At first sight, square brackets look like another good candidate, but a search through several large bodies of translation (Fedora, Gnome, KDE, Mozilla, OpenOffice) reveals that literal square brackets in strings are about four times as frequent as literal braces.

When a literal brace is needed in the string, it would be escaped by doubling it. This holds for both the opening and the closing brace, even when there is no opening brace before the closing brace. This leads to a bit more escaping than technically necessary, but makes up for it by being simple for both people (programmers and translators) and external parsers (translation support tools).
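
For example, with the escaping rule above (the message itself is made up for illustration):

# Python
key_name = "F2"
# Doubled braces produce literal braces in the output; single braces
# delimit placeholders. Both {{ and }} are escaped, even when unpaired.
msg = ixgettext("Press {key} or type {{command}} at the prompt.",
                key=key_name)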

The other requirement is that placeholders by themselves provide some context about arguments to translators. This leads to keyword placeholders. For example, these messages:

"Notification from {appname}"
"Allow access to {service} by {username}?"

already provide sufficient context for the translator, whereas these:

"Notification from %1"
"Allow access to %1 by %2?"

would require that the programmer adds comments describing what the arguments are[13], or that the translator looks them up in sources and tries to make that out.

Keyword placeholders may be all nice for translators, but some programmers may find them too verbose to use all the time, especially if the coding framework does not allow a short syntax for them. For example, C++'s method syntax:

xgettext("Allow access to {service} by {username}?")
        .subs("service", checkService).subs("username", currentUser);

as compared to Python's variadic-keyword syntax:

xgettext("Allow access to {service} by {username}?",
         service=checkService, username=currentUser)

To alleviate the verbosity, ordinal placeholders could be used too, which in C++ would result in:

xgettext("Allow access to {1} by {2}?")
        .subs(checkService).subs(currentUser);

or, with the template play described in the section on argument capturing, even in:

ixgettext("Allow access to {1} by {2}?", checkService, currentUser);

Ordinals must start from 1 and there must be no gaps in the sequence up to the highest ordinal used in the string, or else an error is signaled.

Why should ordinals start from 1 instead of 0? One reason is traditional, since this is how it is done by long-time users of ordinal placeholders, such as Qt and Boost; though Python's more recent string.format starts counting from 0. Another reason has to do with translation scripting: the {0} placeholder would be automatically available in the scripted part of the translation (the msgscr field), where it would denote the full text of the message. More about this in the section on translation scripting.

Ordinal placeholders unfortunately throw us back to non-descriptive argument substitutions for translators, which then require explicit comments or source lookups. Fortunately, this can be easily fixed by allowing the argument to be named using the ~name extension:

xgettext("Allow access to {1~service} by {2~username}?",
         checkService, currentUser);

Unlike in a keyword placeholder, this name has no technical function, only descriptive. It can even be omitted in translation:

#, ggx-format
msgid "Allow access to {1~service} by {2~username}?"
msgstr "Dozvoliti pristup za servis {1} korisniku {2}?"

Finally, if the programmer is of the really lazy sort (even for programmers), she can choose to use empty placeholders:

xgettext("Allow access to {} by {}?", checkService, currentUser);

When arguments are substituted, empty placeholders automatically get assigned numbers as if they were ordinal placeholders. This is important because it allows translators to treat empty placeholders as ordinal, when different ordering is needed in translation:

#, ggx-format
msgid "Allow access to {} by {}?"
msgstr "Sme li {2} da pristupi {1}?"

Empty placeholders too can use the naming extension, that is:

xgettext("Allow access to {~service} by {~username}?",
         checkService, currentUser)

Admittedly, there is a danger here that the translator may mistake named empty placeholders for keyword placeholders, and try to reorder them thus:

#, ggx-format
msgid "Allow access to {~service} by {~username}?"
msgstr "Sme li {~username} da pristupi {~service}?"

which would not work, since the names have no technical function. The PO file compilation tool (msgfmt) could help here by warning of such use, and instruct the translator to add explicit ordinals to cancel the warning.

There is no problem with allowing any combination of keyword, ordinal and empty placeholders to appear in a single message. This may even be a reasonable style in some cases, for example if the programmer decided to use keyword placeholders only in the rare cases where translators need context on arguments, and ordinal placeholders in the majority of messages. Keyword placeholders and ordinal placeholders, or keyword placeholders and empty placeholders, mix obviously:

// C++
xgettext("Operation {1} on file {2} aborted with failure ({errcode}).")
        .subs(actionName).subs(filePath).subs("errcode", exitCode)
xgettext("Operation {} on file {} aborted with failure ({errcode}).")
        .subs(actionName).subs(filePath).subs("errcode", exitCode)

If ordinal and empty placeholders are mixed:

// C++
xgettext("Operation {} on file {} aborted with failure ({1}).")
        .subs(exitCode).subs(actionName).subs(filePath)

then, at argument substitution time, empty placeholders get assigned ordinals as available, taking into account the existing explicit ordinal placeholders; in this example, the two empty placeholders get ordinals 2 and 3 assigned, respectively. Although well-defined in this way, mixing ordinal and empty placeholders in a single message is likely poor style in any case.

A few additional notes on the placeholder format. The placeholder keyword would be limited in composition to letters and digits (in Unicode classification), underscores and hyphens, and it would have to start with a letter. The same holds for the ~name extension. The order of extensions would be free, except for the format sequence extension (described later), which would have to be the last. A fixed order is not technically necessary; it would only annoy the programmer when more than one extension is used.

When generalized Gettext calls are used, call specifications as provided with the -k option to the xgettext command would contain the additional component ,x. This would tell xgettext to ignore any native formatting directives of the programming language, and simply add the ggx-format flag to all messages extracted from these calls. The reason not to add ggx-format only to strings with uniform placeholders lies in the fact that, due to argument capturing, every string is now checked for consistency between placeholders and supplied arguments; so the translator should also not be able to accidentally add a placeholder-looking segment when the original string has none.

3.1. Selection of Plural-Deciding Argument

In generalized plural calls, there is no explicit plural-deciding number given as the call argument. Instead, the translation object (returned by the generalized translation call) selects one of the arguments as plural-deciding, at the moment when it gets converted into a native string (after all arguments have been supplied). How the plural-deciding argument gets selected was partly covered in the section on argument capturing; what remains here is to define how this works with different types of placeholders.

When only one argument is substituted, then obviously that argument is the plural-deciding one independently of the placeholder type:

# Python
nxgettext("{numsel} file selected", "{numsel} files selected",
          numsel=nselfiles)
nxgettext("{1} file selected", "{1} files selected", nselfiles)
nxgettext("{} file selected", "{} files selected", nselfiles)

When more than one argument is substituted, the selection depends on the placeholder types. If there are any ordinal placeholders in the string, an integer type argument which corresponds to the lowest ordinal placeholder is selected. Here the ntotfiles argument decides the plural:

# Python
nxgettext("{2} of {1} file selected", "{2} of {1} files selected",
          ntotfiles, nselfiles)

If there are empty placeholders in the string, then after they get assigned ordinals, the behavior is the same as with ordinal placeholders. If there are no ordinal or empty placeholders which correspond to an integer argument, but there is exactly one keyword placeholder which does correspond to an integer argument, that argument is used to decide the plural:

# Python
nxgettext("Move {num} files to {destdir}?",
          "Move {num} files to {destdir}?",
          num=num_files, destdir=dest_dirpath)

In any other case, the plural-deciding argument must be explicitly marked using the !n placeholder extension:

# Python
msg = nxgettext("{numsel} of {numtot!n} file selected",
                "{numsel} of {numtot!n} files selected",
                numsel=nselfiles, numtot=ntotfiles)

These rules mean that, in practice, plural selection "just works" as long as there is exactly one number argument in the message; and as soon as there are two numbers, even if implicit selection would do, it is best to explicitly mark the plural-deciding argument.

The translator can always use explicit marking, whether it was used in the original string or not. This means that if there are two or more numbers in the message, the translator can also change which argument decides the plural (cf. the example in argument capturing section).

If the plural-deciding argument cannot be automatically selected, or the explicitly marked argument is not of an integer type, an arbitrary number (at best random) is used to decide the plural form and (possibly in debug mode only) a problem is signaled.
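
Put schematically, the selection rules of this section could be modeled as follows (an illustrative sketch with an assumed data representation, not the proposed implementation):

# Python
def select_plural_argument (placeholders, args):
    # placeholders: list of (kind, key, marked) tuples, with kind one of
    # "ordinal" or "keyword" (empty placeholders are assumed to have been
    # assigned ordinals already) and marked=True if the !n extension is set;
    # args: dict mapping ordinals and keywords to the supplied arguments.
    marked = [key for kind, key, m in placeholders if m]
    if marked:
        return args.get(marked[0])                # explicit !n marking
    if len(args) == 1:
        return next(iter(args.values()))          # single argument decides
    ordinals = sorted(key for kind, key, m in placeholders if kind == "ordinal")
    for o in ordinals:
        if isinstance(args.get(o), int):
            return args[o]                        # lowest integer-valued ordinal
    int_keys = [key for kind, key, m in placeholders
                if kind == "keyword" and isinstance(args.get(key), int)]
    if len(int_keys) == 1:
        return args[int_keys[0]]                  # the single integer keyword
    return None                                   # cannot decide; signal a problem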

3.2. Formatting Directives vs. Placeholders

Up until now, the terms "formatting directive" and "placeholder" have been used more or less interchangeably, but let us now define a concrete difference between them. A placeholder only indicates the position at which an argument is substituted, and which argument that is. A formatting directive additionally provides type conversion and formatting sequences for certain types of arguments, such as %+12.4f or ~12,4@f for floating point numbers.

Should there then really be only uniform placeholders, or actually uniform formatting directives? The answer is definitely -- only placeholders. There should be no argument formatting sequences within translatable strings. There are several reasons for this strong position.

The first reason is based on a survey of existing translations. The Gnome Translation Project contains a large number of user interface catalogs with printf style formatting directives, to a greater or lesser extent translated into over 160 languages. It is therefore a good single place to examine how frequently translators modified formatting directives[14] found in the original text, and what the changes were about. The examination of Gnome 3.0 (Gnome core modules) translations yielded the following results:

  • In 2,141,000 translated messages there were 265,000 formatting directives, out of which 151 were modified in translation (0.06% or 1 in 1700).

  • Out of the 151 modified formatting directives, 3 seemed intentionally modified, 20 unnecessarily modified, and 128 erroneously modified.

  • An erroneous modification is one which, while syntactically correct and matching the original formatting directive (it passes msgfmt run with the --check option), is obviously semantically wrong in some sense, likely due to a typo or quick unfuzzing. Among the more serious examples of this, in this translation the reported quantity was rounded too much to be of use:

    #. TRANSLATOR: This is pressure in atmospheres
    #: ../libgweather/weather.c:981
    #, c-format
    msgid "%.3f atm"
    msgstr "%.1f atmosfera"
    

    while in this translation the intention was to reorder the arguments but that was not done:

    #. Translators: ...
    #: ../src/ui/theme-parser.c:202
    #, c-format
    msgid "No \"%s\" attribute on element <%s>"
    msgstr "Elemendil <%2s> pole \"%1s\" atribuuti"
    

    (%1s/%2s instead of %1$s/%2$s).

  • An unnecessary modification is one which looks like an arbitrary deviation from the original, so it is likely also unintentional. For example, in:

    #. TRANSLATOR: This is pressure in millimeters of mercury
    #: ../libgweather/weather.c:965
    #, c-format
    msgid "%.1f mmHg"
    msgstr "%.2f мм рт. ст."
    

    can it really be that the target language speakers expect atmospheric pressure readout as 756.23 mmHg, while others are fine with 756.2?[15]

  • The 3 (out of 265,000 total) likely intentionally modified formatting directives were all time formats like this one:

    #. Translators: This is %2i minutes %02i seconds
    #: ../src/gpm-graph-widget.c:449
    #, c-format
    msgid "%2im%02i"
    msgstr "%02i min %02i"
    

These findings make it clear that formatting sequences are extremely rarely modified by translators, and when they are, it is almost exclusively unintentional. Furthermore, modifications which result in lost information or reduced comprehensibility far outnumber the intentional ones. It is much better to handle language-specific argument formatting by relying on the locale library of the coding framework (which provides all of the number, time, date, etc. formats in various verbosity levels), or, in the rare cases when that is not sufficient, by using a well-commented meta-message which asks translators only for the necessary formatting details.
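
As an illustration of the meta-message technique just mentioned (the message, its context and the surrounding code are hypothetical), the translator is asked only for the needed detail, while the actual formatting goes through the coding framework's locale facilities:

# Python
import locale
from xgettext import pxgettext

pressure = 756.234  # example value

# TRANSLATORS: Number of decimals to show in pressure readouts.
# Write only a single digit, e.g. "0", "1" or "2".
ndecimals = int(pxgettext("pressure readout decimals", "1").to_string())
pressure_text = locale.format_string("%%.%df" % ndecimals, pressure)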

The second reason why formatting sequences in translatable strings are a bad idea was demonstrated by the same survey. A set of around 60 modified formatting directives (out of 151 total) went like this, in various languages:

#. TRANSLATORS: tell the user how much time they have got
#: ../src/gpm-manager.c:1505
#, c-format
msgid "%s of battery power remaining (%.0f%%)"
msgstr "Απομένουν %s λειτουργίας της μπαταρίας (%.1f%%)"
#. TRANSLATORS: tell user more details
#: ../src/gpm-manager.c:1637
#, c-format
msgid "Wireless mouse is low in power (%.0f%%)"
msgstr "Bezvadu peles baterija ir gandrīz tukša (%.1f%%)"
#. TRANSLATORS: tell user more details
#: ../src/gpm-manager.c:1644
#, c-format
msgid "Wireless keyboard is low in power (%.0f%%)"
msgstr "Haririk gabeko teklatuak energia baxua du (%% %.1f)"

What had happened here is that the programmer decided, between Gnome 2.32 and Gnome 3.0, that displaying charge percentages for various devices using one decimal was needlessly verbose, so he switched to rounding to the nearest integer. He went through all the affected messages in the code changing %.1f into %.0f, making them fuzzy in all translations. Some translators didn't notice the change when unfuzzing those messages, and hence didn't propagate the readability improvement of the original text into the translation. Had the code been written to have messages like this:

#. TRANSLATORS: tell user more details
#: ../src/gpm-manager.c:1644
#, ggx-format
msgid "Wireless keyboard is low in power ({charge}%)"
msgstr ""

it would have been possible to change the formatting in a single place (more on this shortly), and there would have been no extra work at all for translators.

The third and final reason to keep formatting sequences out of messages has to do with this question: what kind of formatting sequences would be available? If the choice were, say, a subset of printf formatting sequences, many programmers in non-C coding frameworks would feel overly constrained. For example, in Python's string.format, formatting sequences are completely free, in that each class can define a method to parse its own formatting directive syntax. A similar question is that of which locale provider to use for formatting. For example, a GTK+-based program on a GNU system would use glibc functionality and locale settings, but a Qt-based program would use Qt's locale functionality and settings. These questions are impossible to answer to everyone's satisfaction, and would both complicate the implementation and impede the acceptance of generalized Gettext.

Instead of in-string formatting sequences, argument formatting would be delegated to different bindings of generalized Gettext. In Python, for example, .subs methods would take the formatting sequence as the third argument:

# Python
charge_fmt = ".1f"
...
if device == DeviceType.Mouse:
    msg = xgettext("Wireless mouse is low in power ({charge}%)")
elif device == DeviceType.Keyboard:
    msg = xgettext("Wireless keyboard is low in power ({charge}%)")
...
msg = msg.subs("charge", device_charge, charge_fmt)

Variadic-keyword calls could treat two-tuples as pairs of argument and its formatting sequence:

# Python
xgettext("Wireless mouse is low in power ({charge}%)",
         charge=(mouse_charge, charge_fmt))

In C++/Qt, .subs methods would take over the argument-based formatting of QString::arg methods:

// C++/Qt
xgettext("Wireless mouse is low in power ({charge}%)")
         .subs("charge", mouseCharge, 0, 'f', 1)

.subs methods would be implemented as thin wrappers around the coding framework's native argument formatting facilities. In the Python example above they would internally execute ("{:%s}" % fmtseq).format(arg) to get the formatted argument, whereas in the C++/Qt example they would execute QString("%L1").arg(arg, fieldWidth, format, precision). In this way, the coding framework's locale functionality would be automatically applied as well.
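
A minimal sketch of that wrapping in a Python binding (assumed internals, shown only to connect lazy resolution with native formatting):

# Python
class Translator:

    def __init__ (self):
        self._args = {}

    def subs (self, key, arg, fmtseq=None):
        # Arguments are only stored here; formatting and substitution
        # happen later, when a conversion method is called (lazy resolution).
        self._args[key] = (arg, fmtseq)
        return self

    def _format_arg (self, arg, fmtseq):
        # Thin wrapper around Python's own string formatting.
        return ("{:%s}" % fmtseq).format(arg) if fmtseq else format(arg)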

Even though the position that formatting should be kept out of messages has been defended, corner cases may turn up where in-string formatting is the least bad of all available options. For this reason, in-string formatting could actually be allowed, though strongly advised against in the Gettext documentation (including the rationale), and made to require explicit activation, per PO domain. Activation could go something like this:

// C++
bindtextdomain("fooapp", LOCALE_DIR);
textdomain("fooapp");
Gettext::Composer::setFormatting("fooapp", true);

After that, formatting sequences could be specified as :fmtseq placeholder extensions (like in Python's string.format):

// C++
xgettext("Wireless mouse is low in power ({charge:.1f}%)"))
        .subs("charge", mouseCharge)

If the formatting extension is used on a placeholder but in-string formatting is not activated for the PO domain to which the call belongs, the formatting sequence would simply be ignored. Formatting sequences could contain any characters whatsoever (with braces escaped by doubling), and would be terminated by the placeholder's closing brace. This means that the formatting extension must be the last extension in a placeholder.

Another important thing in presence of in-string formatting is how message extraction and translation validation would behave. Since formatting sequences can be arbitrary, the only possibility would be for xgettext to simply accept everything between the colon and the closing brace as a valid formatting sequence, and for msgfmt to perform no checks on it. This may seem like an uncomfortable gap in coverage, but pure placeholder correspondence (keywords, ordinals) would still be checked, and, as we have seen before, formatting sequences can be logically wrong even when syntactically correct.

4. Customizable Markup

In most user interface toolkits, at least in some places the text can be rendered with typographic elements. These include cursive or bold emphasis, highlighting with colors, and so on. For example, if the destination accepts a subset of HTML tags, it may be decided to represent file paths in bold face:

// C++/Qt
QString msg = gettext("File <b>%1</b> not found.").arg(filePath)
QErrorMessage *dlg = new QErrorMessage();
dlg->showMessage(msg);

Even plain text contains typographic elements in the sense of using certain character sequences to delimit certain substrings. In this message:

// C++/Qt
gettext("File \"%1\" not found.").arg(filePath)

it has been decided to delimit file paths with double quotes. To avoid arbitrary style variations, a convention should be picked for representing certain types of substrings and adhered to throughout the code. This may not be easy even for a single programmer, and much less so when several people work together.

Sometimes it is not known up front what capabilities the output destination has, or the output destination is dynamically selected among several with different capabilities. For example, both displaying and logging the error:

# Python/GTK+
msg = gettext("File <b>%s</b> not found.") % file_path
dlg = gtk.MessageDialog(None, gtk.DIALOG_MODAL, gtk.MESSAGE_ERROR,
                        gtk.BUTTONS_CLOSE, msg)
logfile.write("%s: %s\n" % (current_time(), msg))

This would cause the log file (plain text) to contain verbatim tags. The alternative is to make the logging function heuristically strip tags, which may strip too much, or not to use markup at all, which is a loss for the graphical dialog. The message may also be sent to standard output, where if it is connected to a color-capable terminal, it would be nice to highlight parts of the message with colors.

Not all markups are XML-like, and some markups are restricted to a certain domain or project. In this message, only the second egg is user-visible text, while the first one is an internal link ID:

#. [topic]: id=breeding_cycle
#: data/core/encyclopedia/drakes.cfg:26
msgid ""
"The time that passes between one <ref>dst=egg text='egg'</ref> "
"laying to the next."
msgstr ""

Different markup types present a problem both for translators and for translation support tools (validation, syntax highlighting, spell-checking, etc.). This holds to a lesser extent even among different XML-like markups: since it is not certain which tags and attributes are available, translators take a risk if they try to improve on the markup in translation, or the validation tool may not even allow it.

Finally, what looks like markup may not be markup at all, and should be translated with the rest of the text:

#: main.cpp:107
msgid "Start timer for task <taskid>"
msgstr ""

While cases such as this may be easy for a human translator to recognize (at least most of the time), they represent a problem for translation support tools.

For these reasons, it would be good if generalized Gettext offered its own markup. Due to widespread familiarity among both programmers and translators, this markup should be XML-based. It makes sense for the markup to be semantic rather than visual. Then, it could be transformed into the appropriate target markup (usually a visual one) for the output destination. With generalized Gettext markup, the above example could look like this:

# Python/GTK+
from gengettext import xgettext, Composer
Composer.markup_set_class_rich("fooapp", "pango")
...
msg = xgettext("File <filename>{path}</filename> not found.",
               path=file_path)
dlg = gtk.MessageDialog(None, gtk.DIALOG_MODAL, gtk.MESSAGE_ERROR,
                        gtk.BUTTONS_CLOSE, msg.to_rich())
logfile.write("%s: %s\n" % (current_time(), msg.to_plain()))

The translatable string itself is now neutral both with respect to the coding framework and to the different output destinations provided by that coding framework. When msg.to_plain is executed, the <filename>...</filename> segment will be resolved into a native string with "..." around the file name, and on msg.to_rich into <b>...</b>. There would be a few to_target methods, corresponding to typical target markup classes, and the code would set the exact target markup for the given class, per PO domain. This is what the Composer.markup_set_class_rich call above did.
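For illustration only, with the class assignments above and assuming file_path is "report.pdf", the two conversions might resolve roughly as follows (the exact quoting and tagging would depend on the built-in resolution patterns):

# Python

msg.to_rich()   # e.g. 'File <b>report.pdf</b> not found.'
msg.to_plain()  # e.g. 'File "report.pdf" not found.'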

Unless the target markup was explicitly set on the given translation object (more on this later), the basic to_string method will resolve it into the default target markup. This would normally be plain text, but could be explicitly set, per PO domain:

Composer.markup_set_default("fooapp", "pango")

It would also be possible to supply the target markup keyword as an argument, i.e. to_string(markup="pango"), overriding any other prior markup setting.

Presence of markup implies that all generalized Gettext messages must be valid with respect to it. This means that in apparently plain text messages, XML-special characters would have to be properly escaped:

xgettext("Start timer for task &lt;taskid&gt;")
xgettext("Date &amp; Time")

Strictly speaking, the only necessary escapes are &amp; (&) and &lt; (<), while &gt; (>) would follow for stylistic reasons. There is no need for &quot; (") and &apos; (') outside of attribute values, when the message would no longer look like plain text anyway.

When directly using the coding framework specific markup with ordinary Gettext, to repeat a previous example:

# Python
msg = gettext("File <b>%s</b> not found.") % file_path

consider what will happen if the file path argument itself contains something that looks like a tag. The output destination may do various things with this, such as simply removing the "unrecognized tag" and thereby causing wrong information to be shown to the user. This means that the programmer has to remember to escape arguments as necessary before substituting them into strings with markup. With generalized Gettext markup, this is no longer the case: arguments would be automatically escaped. But then you may wonder about another earlier example, slightly reworked:

// C++/Qt
Gettext::Composer::markupSetClassRich("fooapp", "qtrich");
...
QList<Gettext::Translator> quips;
quips.append(xgettext("It's full of <emph>what</emph>?!"));
...
QString note = xgettext("A new quip has just been uttered: {quip}")
                        .subs("quip", quips[indSelected]).toRich();
# Python/GTK+
Composer.markup_set_class_rich("fooapp", "pango")
...
quips = []
quips.append(xgettext("It's full of <emph>what</emph>?!"))
...
note = xgettext("A new quip has just been uttered: {quip}",
                quip=quips[ind_selected]).to_rich()

Since xgettext returns a translation object, automatic escaping can be suppressed when the argument is another translation object. This ensures that argument substitution never invalidates markup, that translations can be freely substituted one into another, and that conversion to native string results in the expected target markup throughout the composed text.
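A minimal sketch of this substitution rule follows; the generic XML escape stands in for whatever per-target escaping function markup_set_escape would install, and the names are illustrative:

# Python

from xml.sax.saxutils import escape

def resolve_argument(value, target_markup):
    if hasattr(value, "to_string"):
        # Another translation object: resolve it to the same target
        # markup and do not escape, so its own markup survives.
        return value.to_string(markup=target_markup)
    # A plain value: escape it so it cannot invalidate the markup.
    return escape(str(value))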

Having to call the appropriate conversion method for the desired target markup all the time is certainly not pretty. As was mentioned in the section on argument capturing, the most elegant solution is for output destinations to be aware of translation objects and perform the conversion internally. For example, in a batch processing program, there can be one central output function which converts the translation object and dispatches the native string to a given file descriptor:

# Python

def report_to_file (file, msg):
    if file.isatty():
        msg = msg.to_term()
    else:
        msg = msg.to_plain()
    msg += "\n"
    msg = msg.encode(get_locale_encoding())
    file.write(msg)

def report_to_stdout (msg):
    report_to_file(sys.stdout, msg)

...

msg = xgettext("Stopped with error {code} (<emph>{text}</emph>).",
               code=error_code, text=error_desc)
report_to_file(logfile, msg)
report_to_stdout(msg)

When output destinations can only accept native strings, two less verbose conversion variants were proposed, implicit conversion and immediate resolution:

// C++/Qt
QString msg1 = xgettext("Processing {1}...").subs(aPath).toString();
// ...is equivalent to...
QString msg2 = xgettext("Processing {1}...").subs(aPath); // implicit
QString msg3 = ixgettext("Processing {1}...", aPath); // immediate

With generalized markup in the picture, the question is how to specify the target markup here, since there is no explicit conversion method call.

In the implicit conversion variant, the target markup could be specified by setter methods corresponding to conversion methods:

// C++/Qt
QString msg1 = xgettext("Processing {1}...").subs(aPath).toRich();
QString msg2 = xgettext("Processing {1}...").setRich().subs(aPath);

It is irrelevant when precisely the setter method is called, whether before, after or in between substitution of arguments. But what happens in this case:

// C++/Qt
Gettext::Translator q = xgettext("It's <emph>what</emph>?!").setRich();
QString msg = xgettext("Someone protests: {1}").subs(q).setPlain();

When translation objects that have different target markups set are substituted one into another, the target markup of the outermost object wins. This is the object whose implicit conversion method gets called, so its target markup can be recursively propagated onto all substituted translation objects.

In the immediate conversion variant, target markup must be supplied as an argument to the call. In C++, one way would be to have an enumeration of target markup classes, which could be given as a call argument:

// C++/Qt
const Gettext::Composer::MarkupClass RT = Gettext::Composer::Rich;
...
QString msg = ixgettext("Processing <fn>{1}</fn>...", RT, aPath);

Recall that ixgettext in C++ would be implemented as a set of templates, so this would imply a dummy .subs method which takes the markup class enumeration argument. A similar approach would work in Python, where the gengettext module would provide an enumeration-like class and its instances to represent target markup classes.

A third, semi-automatic way to set target markup, which does not rely on output destinations, does not require explicit setting of target markup, and works for all conversion variants, is presented in the section on interface contexts.

4.1. Tag Set and Target Markups

If it were a usual static markup language, the hardest part about defining generalized Gettext markup would be to come up with the appropriate set of tags and attributes. Even if the set of tags were extensive, someone would still miss something; worse yet, people would feel bad when forced to pick one of several tags none of which is quite what they want.[16] Some people would feel that tag names such as <filename> are too verbose, and would rather have such a frequently used tag be called just <file>, or even <fn>. Some would consider semantic markup an unnecessary complication and be satisfied with visual markup (especially when there is a single known output destination).

A related problem is what resolution pattern should be used for a particular tag when converted into a given target markup. In an HTML-like target markup, some would want <filename>...</filename> to become <b>...</b>, while others would want <tt>...</tt>. Taking one step back, what target markups would be available at all? What about problem-specific, ad-hoc markups?

The proposal for handling these issues is twofold.

Firstly, generalized Gettext markup would provide a built-in extensive set of tags, even synonymous ones, such as all of <filename>, <file>, and <fn>. It would provide built-in resolution patterns for those tags for a number of widely used user interface markups, such as Qt Rich Text and Pango Text Attributes. New tags and new target markups would be periodically added, as people contribute them. Criteria for accepting tags would not be too strict, e.g. adding a <player> tag would be fine, while an <albatross> tag probably would not.

Secondly, there should be a facility for programmers to define custom tags and target markups in their code, as well as to override resolution patterns for known target markups. The programmer would first search the list of built-in tags for an appropriate tag, trying out a few synonyms; if there is none, the programmer would simply define a new tag. Or, built-in tags could be ignored altogether. Custom tags would be grouped under the same name as the chosen PO domain. Custom resolution patterns would be given per tag (and possibly attribute combination) and arbitrary target markup name.

Assuming that one does not want to rely on any of the predefined tags and target markups, here is how a complete definition of a small set of custom tags and resolution patterns might look[17]:

# Python/GTK+

def setup_markup ():

    domain = "fooapp"

    from gengettext import markup_def
    markup_def(domain, "plain", "place", [], "'{_text_}'")
    markup_def(domain, "pango", "place", [], "<i>{_text_}</i>")
    markup_def(domain, "plain", "unit", [], "'{_text_}'")
    markup_def(domain, "pango", "unit", [], "<u>{_text_}</u>")
    markup_def(domain, "pango", "unit", ["color"],
        "<u><span foreground='{color}'>{_text_}</font></u>")

    from gengettext import markup_set_escape, escape_xml
    markup_set_escape(domain, "pango", escape_xml)

    from gengettext import markup_set_class_plain, markup_set_class_rich
    markup_set_class_plain(domain, "plain")
    markup_set_class_rich(domain, "pango")


def report_event (...):
    ...
    elif ...:
        msg = xgettext("<unit color='{co}'>{un}</unit> enters "
                       "<place>{pl}</place>.",
                       co=unit1.owner().color(), un=unit1.name(),
                       pl=place.name())
    ...
    write_to_event_view(msg.to_rich())
    write_to_log(msg.to_plain())

The markup_def(domain, markup, tag, attributes, pattern) call defines a valid combination of a tag and its attributes, together with its resolution pattern, for the given domain and target markup. In this example, a <place>-tagged segment resolves into a single-quoted segment in plain text, and into an italic segment in rich text provided by Pango markup. A <unit>-tagged segment also resolves into a single-quoted segment in plain text, but into an underlined segment in rich text. If the <unit> tag also carries the color= attribute, the resulting rich text segment is additionally colored. But there is no definition of <unit> with color= for plain text -- this means that resolution to plain text falls back to the attributeless <unit> definition, effectively ignoring the color= attribute. Resolution patterns use the familiar braced placeholder syntax for substituting element text and attribute values, where {_text_} always denotes the element text. The substituted values may need to be escaped in the target markup, so an escaping function can be set with the markup_set_escape(domain, markup, escapefunc) call. After the custom markup has been defined, the markup_set_class_*(domain, markup) calls are used to link markup classes to target markups, so that the to_* methods of translation objects produce the expected native strings.
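To make the effect of these definitions concrete, here is roughly what the two conversions of the report_event message could produce, for made-up unit and place names:

# Python

# Hypothetical resolutions for unit "Longbowman" (owner color "#c00000")
# entering place "Elensefar":
msg.to_rich()
# e.g. "<u><span foreground='#c00000'>Longbowman</span></u> enters <i>Elensefar</i>."
msg.to_plain()
# e.g. "'Longbowman' enters 'Elensefar'."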

A simple replacement resolution pattern may not always be sufficient. It would therefore be possible to use an arbitrary function instead. This function would receive as arguments the complete processing context: the current tag, its attributes, the element text, and the path of parent tags from the top. It would even be possible to have a single markup_def call per target markup:

# Python/GTK+

def setup_markup ():

    ...
    markup_def(domain, "plain", resolve_to_plain)
    markup_def(domain, "pango", resolve_to_pango)
    ...


def resolve_to_pango (tag, attrs, text, path):

    if tag == "place":
        return "<i>%s</i>" % text
    elif tag == "unit":
        if "color" in attrs:
            return ("<u><span foreground='%s'>%s</font></u>"
                    % (attrs["color"], text))
        else:
            return "<u>%s</u>" % text
    return text

Availability of the parent path makes it possible to resolve a tag contextually. A typical example of when this is needed is an emphasis within an emphasis.
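As a sketch, a resolution function could use the parent path to render an emphasis inside another emphasis as upright text again (assuming a generic <emph> tag and Pango output):

# Python

def resolve_emph_to_pango (tag, attrs, text, path):
    if tag == "emph":
        # path is the list of ancestor tag names, from the top down.
        depth = path.count("emph")
        # An emphasis nested inside an emphasis toggles back to upright.
        return text if depth % 2 else "<i>%s</i>" % text
    return text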

Resolution patterns may contain parts which are language dependent. In the previous example, these are (at least) the patterns containing single quotes, since quoting standards vary across languages. This is handled simply by exposing the resolution pattern itself for translation, albeit with a small twist:

markup_def(domain, "plain", "place", [],
           ptgettext("a place name in plain text",
                     "'{_text_}'").to_string())

Since this message is a meta-message with respect to markup, anything that looks like markup in it must be ignored on resolution to native string. This is why the ptgettext call is used, which is the context variant of the basic tgettext call. *tgettext calls are also generalized Gettext calls, where t stands for "plain text": unlike *xgettext calls, *tgettext calls do not process markup. In order for the translation of this message to be properly validated (see the section on XML vs. plain text calls), the message will get the ggt-format flag instead of ggx-format.

Instead of defining a custom markup from the ground up, it would be possible to import the markup defined by another PO domain and then extend and adapt it. This implies that the code which uses that other domain (i.e. an underlying library) has already been loaded and has executed its own markup setup. This is how the built-in markup of generalized Gettext would be used in the first place:

# Python/GTK+

def setup_markup ():

    domain = "fooapp"

    from gengettext import markup_import, markup_def
    # ...the Gettext library is loaded and built-in markup created.

    # Declare built-in markup to be in use in this domain.
    # "gengettext" is the PO domain of generalized Gettext library.
    markup_import(domain, "gengettext")

    # Define additions and overrides.
    markup_def(domain, ...)
    markup_def(domain, ...)
    ...

4.2. No Custom Entities

A usual element of XML-based markups is custom entities. The default set of entities, needed for escaping, is &amp;, &lt;, &gt;, &quot;, &apos;, but the author can define arbitrary custom entities. It would not be hard to support this in generalized Gettext markup:

# Python/GTK+

def setup_markup ():
    ...
    markup_set_entity(domain, "thisprog",
                      xgettext("FooProcessor").to_string())
    ...

def some_func ():
    ...
    msg = xgettext("If you enable frobaz, &thisprog; will [...]")
    ...

This example looks attractive. If the &thisprog; entity is used in place of the program name everywhere, then it becomes "easy" to later change the name, or change the exact spelling, by changing a single line of code. The program name is exposed to translators for adaptation (translation, transliteration); if the name changes, only this message becomes fuzzy in the PO file, and translators can adapt it again.

Unfortunately, there is a glaring problem with this scheme. In many languages, nouns change their form depending on the function in the sentence. For example, the name "Okteta" has case declensions "Oktete", "Okteti", "Oktetu", "Oktetom" in my native language. Then, other parts of the sentence may depend on properties of the inserted name. For example, "Program started" would be "Okular pokrenut" but "Okteta pokrenuta", because the grammatical gender of "Okular" is masculine, and of "Okteta" feminine.

This means that custom entities are too dangerous to be allowed, even if in some particular cases they could be properly used. When such a case really happens, it is possible to use ordinary argument substitution instead.

Generalized Gettext can, however, provide a fixed set of entities beyond the default XML set. They would be added when judged generally useful and not problematic for translations. Good example candidates are entities for various whitespace characters, such as the non-breaking space (&nbsp;), which are hard to spot and may be unintentionally replaced if written out literally.

4.3. Validating Markup

In current Gettext, there is no formalized way to validate coding framework specific markup. msgfmt with the --check option will simply treat anything that is not a formatting directive as literal text. Invalid markup in translation is normally not as big an issue as invalid formatting directives, since markup processors will try to do their best to render the text or simply strip invalid markup. But it is an issue nonetheless, and it would be a pity not to do anything about it when Gettext has its own markup.

Markup validation splits into two sources and two forms. The two sources are the translation and the original text. Either can have invalid markup, and there should be a way to validate both. The two forms are validation at run time and static validation. Markup can obviously be checked at the moment of resolution, at run time (when the translation object's conversion method is called), and warnings can be produced if it is not valid. But run time validation may come too late, especially for translations, which get less testing during development than the original text, so static validation would be nice to have.

What are the options for static markup validation? For XML-based fixed formats, the simplest widely used validation option is a DTD (document type definition). But it has two problems. The first is that, since here we talk about customizable markup, it would mean that everyone customizing something would need to write an accompanying DTD file. The second problem is that the complete markup text may not exist statically at all, since it is frequently composed at run time from separate pieces (e.g. paragraphs in a help tooltip). This precludes application of a DTD, or any other standard XML validation scheme.

Luckily, there are two exploitable specificities of user interface text markup when compared to typical text document markup.

The first specificity is that the biggest "document" in user interface is a title and a few paragraphs, possibly with a table or a figure. If one such piece of markup text is invalid, it will not interfere with all the other text in the user interface. And when it is invalid, the markup transformation engine can be forgiving and work around problems. There is no need to simply display an error and leave the user without any text at all. (This is also how current coding framework specific lightweight markups work.)

The second specificity is related to translation. An examination of markup-related errors (that could be detected) in the KDE Translation Project shows that over 95% of errors were about lack of well-formedness. Such errors are usually introduced by omissions, typos, and reformulations of the text, and also on unfuzzying, when the markup in the original text was corrected. For example, in the following message the closing </a> was omitted:

msgid ""
"<b>Note: [...] See <a href='[...]'>LensFun project web site</a> "
"for more information.</b>"
msgstr ""
"<b>Hinweis: [...] Für mehr Informationen besuchen Sie "
"<a href=\"[...]\">die Internetseite des LensFun-Projekts</b>."

and in this one the first opening <b> tag name was mistyped as <bb>:

msgid ""
"[...] For example, the <b>uncrustify</b>, <b>astyle</b> or "
"<b>indent</b> formatters can be [...]"
msgstr ""
"[...] Par exemple, les outils de formatage <bb>uncrustify</b>, "
"<b>astyle</b> ou <b>indent</b> peuvent [...]"

In the remaining 5% of detected cases, an unknown tag or an unknown attribute was used while the markup remained well-formed. Although no other types of cases could be detected by the validation tool used (such as invalid tag nesting), experience puts their upper limit at trace amounts.

These two points lead to a lightweight possibility for "almost validating" markup.

First, it would be required that every message is well-formed on its own, so that well-formedness can be statically checked. This means that the following splitting would result in two invalid messages, although the final composition would be valid:

from gengettext import xgettext, Composer
...
start = xgettext("<para>If you click [...] get the desired effect. ")
if something == "foo":
    insert = xgettext("This will work only [...] is enabled. ")
else:
    insert = ""
end = xgettext("After that, click [...].</para>")
msg = Composer("{}{}{}", start, insert, end).to_string()

The splitting should instead be done as[18]

para = xgettext("<para>If you click [...] get the desired effect. "
                "{optional-sentence} After that, click [...].</para>")
if something == "foo":
    insert = xgettext("This will work only [...] is enabled.")
else:
    insert = ""
msg = para.subs("optional-sentence", insert).to_string()

A non-well-formed message would have either its tags stripped (in release mode) or replaced with conspicuous error indicators (in debug mode), so that the programmer is nudged to fix it.

Then, if the original text in messages is always well-formed, the same can be required of the translation. That alone would catch over 95% of all markup-related translation errors made in practice. Taking into account that markup transformation engine is forgiving, it may even be fine to stop there and declare checking well-formedness to be sufficient validation.
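Since each message is only an XML fragment, its well-formedness could be checked by wrapping it in a dummy root element and feeding it to any XML parser. A rough sketch, ignoring the fixed entity set and accelerator markers:

# Python

import xml.etree.ElementTree as ET

def is_well_formed (msgtext):
    try:
        # The dummy root allows mixed content and multiple top-level tags.
        ET.fromstring("<_msg_>%s</_msg_>" % msgtext)
        return True
    except ET.ParseError:
        return False

# is_well_formed("See <a href='x'>the site</a> for more.")  -> True
# is_well_formed("See <a href='x'>the site</b> for more.")  -> False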

To additionally catch unknown tags and attributes in translation is already much harder, at least organizationally. This is because the validation tool would have to know which tags and attributes are usable in the given PO domain. One way to provide this information -- perhaps too crude -- could be to list all tags and attributes in a simple format, in a special comment that xgettext would extract into POT file header:

def setup_markup ():

    ...

    # Functional markup definition...
    markup_import(domain, "gengettext")
    markup_def(domain, "pango", "action", [], ...)
    markup_def(domain, "pango", "place", [], ...)
    markup_def(domain, "pango", "place", ["link"], ...)
    markup_def(domain, "pango", "unit", [], ...)
    markup_def(domain, "pango", "unit", ["link"], ...)
    markup_def(domain, "pango", "unit", ["color"], ...)
    markup_def(domain, "pango", "unit", ["link", "color"], ...)

    # ...and specification for extraction.
    # xgettext: markup-spec type1
    #   >gengettext;
    #   action:;
    #   place: link;
    #   unit: link color;

The # xgettext: markup-spec type1 comment tells xgettext that the rest of the comment is a "type-1" markup specification; this leaves open the possibility for more sophisticated markup specifications in the future. A type-1 specification is a list of semicolon-terminated entries, where each entry may be an inclusion of markup from another PO domain (>domain;) or a tag and its attributes (tag: attr1 attr2 ...;). This would appear in the POT header as:

msgid ""
msgstr ""
"Project-Id-Version: fooapp 1.3.5\n"
"..."
"Customizable-Markup: type1; >xgettext; action:; place: link; "
"unit: link color;\n"

When the validator sees a domain inclusion in the specification, it would in turn parse the header of that other domain for its markup specification. This leads to a problem of how to locate the appropriate PO/POT file for that domain.
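Parsing a type-1 specification itself would be straightforward; a sketch of what a validator could do with the header value (locating and recursing into included domains is left out, as noted above):

# Python

# Parse a type-1 markup specification, e.g. the header value
# "type1; >gengettext; action:; place: link; unit: link color;"
def parse_markup_spec (spec):
    entries = [e.strip() for e in spec.split(";") if e.strip()]
    if not entries or entries[0] != "type1":
        raise ValueError("not a type-1 markup specification")
    included_domains, tags = [], {}
    for entry in entries[1:]:
        if entry.startswith(">"):
            included_domains.append(entry[1:].strip())
        else:
            tag, _, attrs = entry.partition(":")
            tags[tag.strip()] = attrs.split()   # allowed attributes per tag
    return included_domains, tags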

The following question may spring to mind, especially if the original text is always well-formed: instead of going to lengths specifying markup, why not simply require that the translation contains the same tagging as the original? The translator would be able to change relative positions of elements, but in the end the translation would have to contain exactly the same elements, and to nest them in the same way, as in the original text. The problem is that this would be overly constraining:

  • If the same tagged substring is repeated twice in the original text, in translation it may instead be needed to mention it once, with a helper word in place of the second mention; or vice-versa. An example is a name of something, where the second mention may be carried out with a pronoun.

  • If the original text uses visual markup (which is possible, since customizable markup can be anything), in translation some of the tags may need to be removed or replaced with other visual tags. An example is bold face in CJK languages, which may make densely-packed ideograms hard to read.

  • If markup is somewhat inconsistent in the original text, with different tags (or no tags) being used for same things, the translator should not be prevented from improving on that in the translation. This can happen when many programmers are working on different parts of the code, whereas a single translator has the total overview.

4.4. Accelerator Markers

In user interfaces it is frequently possible to activate a visible item, such as a button or a menu entry, by pressing a modifier key and a letter assigned to that item, instead of clicking on it. (This is not the same concept as command shortcuts, where pressing one or several modifier keys and a character key will activate an action regardless of the UI element that can be clicked on for the same purpose.) This activation letter, called the accelerator, is usually assigned to the item by placing a special character in the item's text label.[19] In this example:

msgid "&Roll Dice"
msgstr "&Baci kockice"

the accelerator in the original text is the letter R, and in the translation the letter B. The character used to mark the accelerator, here ampersand (&), is called the accelerator marker.

There are two problems with accelerator markers when it comes to translation processing (there are more problems in general, but these are beyond the scope of this document). The first problem is that accelerator markers differ between coding frameworks: in Qt it is the ampersand (&), in GTK+ the underscore (_), in OpenOffice the tilde (~), etc. This means that, for example, a search tool cannot know which character to ignore, and a spell-checking tool cannot with certainty eliminate the accelerator marker from the middle of a word. The second problem is that it is not known whether the text can contain the accelerator marker literally at all, and, if a literal character is needed in the translation, how to escape it.

For these reasons, generalized Gettext must also introduce uniform accelerator markers. This includes selecting the accelerator marker character, the way to escape it, and making this work in every message. This discussion naturally falls within the area of markup, because accelerator marking is nothing but markup in disguise, a shorthand for something like this:

msgid "<acc>R</acc>oll Dice"
msgstr "<acc>B</acc>aci kockice"

Among the accelerator markers used so far, probably the best choice is tilde, ~ (used by OpenOffice). By examining various translation projects, it can be seen that tilde appears as non-accelerator[20] two orders of magnitude less frequently than ampersand or underscore. A literal tilde would be written by doubling it, which makes it consistent with escaping of placeholder braces. Escaping would be processed in any ggx-format (or ggt-format) message, so that translators and translation support tools can always consider a lone tilde to be an accelerator marker. The above example would look like:

#, ggx-format
msgid "~Roll Dice"
msgstr "~Baci kockice"

and escaped literal tilde in a non-accelerated text:

#, ggx-format
msgid "Sample emoticon :) :( :~~( :0 :/ :P :| (* message."
msgstr "Primer poruke sa emotikonima :) :( :~~( :0 :/ :P :| (*."

Just like with tags, what the uniform accelerator marker resolves into would depend on the output destination. This would be defined when setting up the markup, per PO domain and target markup:

# Python/GTK+

def setup_markup ():

    ...

    markup_def(domain, "plain", "emph", ...)
    markup_def(domain, "pango", "emph", ...)
    markup_def(domain, "plain", "path", ...)
    markup_def(domain, "pango", "path", ...)
    ...

    markup_set_accel(domain, "plain", "_")
    markup_set_accel(domain, "pango", "")

    ...

To actually use normal tags for marking accelerators would not be good, because most strings which need an accelerator marker otherwise need no markup. A dedicated marker is also better for consistency, lest different codebases pick different tags, and better for translation processing tools, since they can with certainty throw out the accelerator marker without touching anything else. Finally, accelerator markers must also be processed in non-markup aware, ggt-format messages.

4.5. Customizable Markup is Optional

Of all the proposed extensions in this document that make up the generalized Gettext, customizable markup is by far the most demanding on programmers. For comparison, once argument capturing is swallowed, all other extensions feel conceptually the same as current Gettext. They may only differ slightly in appearance and provide more options, and some are even invisible to programmers (translation scripting).

In current Gettext, when a message is not translated in any of loaded language catalogs, one gets out exactly the string which was put in; the gettext call is effectively a no-op. With xgettext call this is no longer so, as the original string itself will be transformed according to markup. Programmers have to keep in mind to escape XML-special characters in plain-text looking messages. They have to make sure that messages are resolved into the appropriate (or at least acceptable) target markup for the given output destination. They should take care to substitute translation objects as arguments directly, rather than after converting them to native string.

To my taste, this is a small price to pay for being able to use semantic-like markup with different output destinations, for not worrying whether my obscure coding framework specific markup will confuse translators, for not thinking about escaping arguments when substituting them into strings which will be interpreted for markup, and for having ready-made means to validate markup in both source and translated messages. But tastes differ, and certainly there will be many programmers who dislike the added constraints that customizable markup brings with it.

The solution to keep everyone happy has already been mentioned: there would also exist the tgettext series of calls, which behave like xgettext calls in all aspects other than the markup, which they ignore. Messages in the PO file would have either ggx-format or ggt-format flag, depending on the call from which they were extracted. This removes any ambiguity for translators and translation processing tools. Consider the message:

msgid "Replace 'foo' with 'bar'."
msgstr ""

and imagine that the translator would like to use angled brackets around the "foo" and "bar" words. In ggx-format the valid translation would be:

#, ggx-format
msgid "Replace 'foo' with 'bar'."
msgstr "Zameni &lt;foo&gt; sa &lt;bar&gt;."

In ggt-format, < and > are not special characters, and the translation should be:

#, ggt-format
msgid "Replace 'foo' with 'bar'."
msgstr "Zameni <foo> sa <bar>."

The question that remains is how the xgettext command will decide which flag to put on a message. This is done by adding extra elements to the call specification in the -k option: ,x for ggx-format calls, and ,t for ggt-format calls.

5. Interface Contexts

While translating, the translator needs to know the context in which the message is used, or risk making a bad translation. The shorter the message, the higher the need for context. Also, different languages may need more or less context, depending on their dissimilarity with the source language (usually English).

But what counts as context? Obviously, the most important context is the semantic context, the meaning which the programmer wanted to convey to the user with the given message. When not entirely clear from the text alone, the programmer may supply this kind of context manually, as an extracted comment:

#. TRANSLATORS: The zoom factor per press when moving
#. to the right in compass mode.
#: ../Src/DasherCore/CompassMode.cpp:30
msgid "Right zoom"
msgstr ""

#. TRANSLATORS: this is the name of a new file which
#. has not yet been saved
#: ../meld/filediff.py:780
msgid "<unnamed>"
msgstr ""

#. TRANSLATORS: first letter of "Space", a serial port parity setting
#: ../src/Settings.vala:85
msgid "S"
msgstr ""

This is written like this in the code:

// TRANSLATORS: The zoom factor per press when moving
// to the right in compass mode.
gettext("Right Zoom")

The TRANSLATORS: keyword is what signals xgettext to extract the comment into the PO file.

Another, more technical type of context is the disambiguation context. It is used to split two instances of the same original text into two different messages, when in some language different translations may be required for those two instances. This context is expressed as the msgctxt message field (which forms a part of the message key in the catalog). In this example:

#: ../src/gdu-gtk/gdu-gtk.c:220
msgctxt "application name"
msgid "Unknown"
msgstr ""

#: ../src/gdu/gdu-util.c:940
msgctxt "connection name"
msgid "Unknown"
msgstr ""

the adjective "unknown" in translation may need to conform to the gender of the noun it refers to, and the gender of "application" and "connection" may not be the same. In the code, this is done using pgettext calls:

pgettext("application name", "Unknown")
pgettext("connection name", "Unknown")

Usually, when necessary, the disambiguation context suffices as the semantic context as well.

Could there be another conceptual type of context next to the semantic and disambiguation context? Consider the following messages equipped with extracted comments (i.e. semantic contexts):

#. TRANSLATORS: this is a verb (command), not a noun (things)
#: ../src/Clients/Nereid/Nereid/PlayerInterface.cs:472
msgid "Search"
msgstr ""

#. TRANSLATORS: this is a noun, referring to the harddisk
#: ../src/Core/Banshee.Services/Banshee.Sources/PrimarySource.cs:214
msgid "Drive"
msgstr ""

#. TRANSLATORS: this is the title of the linking dialogue [...]
#. "Link" in this title is a verb.
#: ../libempathy-gtk/empathy-linking-dialog.c:115
msgid "Link Contacts"
msgstr ""

The problem with these comments is that all mentions of "noun" and "verb" are fluff. You may be surprised at this claim, since it is often stated that such comments are exactly what is needed. I will say it outright: if the translator needs to know if something in the original text is a noun or verb, then he is translating it wrongly. The reason is that what is expressed with one grammar form in one language places no obligations onto the grammar form used in another language. In the examples above, I still wouldn't know how to translate "Search", and the same holds for "Drive" (although it was certainly useful to confirm that it is a storage device). However, the "Link Contacts" message I would instantly know how to translate, because although it contains the useless "verb" note, it contains the right clue as well: it is the title of the [...] dialogue; and there would be no verb in my translation of this message, only nouns.

Think a bit about how the programmer chooses (or at least should choose) the wording of the original text: by looking into the "human interface guidelines" (HIG) document. This document is usually prepared for larger projects or coding frameworks, but even when there is no formal document, there are at least established conventions in the given body of code. The HIG states things of the sort "Button text should be written in imperative voice, and with title capitalization, for example 'Close All Files'...". What was that bit about "title capitalization"? This particular HIG excerpt was written with English in mind, and many translators will quickly realize that title capitalization is not used in their language, and use sentence capitalization instead ("Close all files") in translation. However, many translators fail to make out the full implication: it is necessary to adapt the HIG to the target language. And then, just as the source language HIG states what grammar and style forms are to be used per UI element context, so should the target language HIG.

For example, in my language, titles normally do not contain verbs (unless they are complete sentences). Instead they contain "nounized" verbs, because it is very natural to form a noun out of a verb. Therefore, the HIG for my language should state "Button text should be written in singular imperative voice..." and then "Title labels (of windows, dialogs, tabs) should be in noun form...". This means that for the "Search" message, what I really needed to know as a translator was whether it was a button or a title text; for the former the translation would have been "Traži" and for the latter "Pretraga".

This consideration leads to proposing interface contexts: formalized short strings that programmers would add to reveal to translators where in the user interface the message is used. An interface context string would be composed of major and minor component, as @major:minor. Here is one possible set of components, tailored for graphical UI programs:

  @action -- Labels of clickable widgets which cause an action to be performed.
      :button -- Push buttons in windows and dialogs.
      :inmenu -- Menu entries that perform an action (as opposed e.g. to being checked).
      :intoolbar -- Toolbar buttons.

  @title -- Text that forms a title of a major interface element or a widget container.
      :window -- Title of a window or a (dockable) view/pane.
      :menu -- Menu title.
      :tab -- Tab name.
      :group -- Title to a group of widgets, like a group of checkboxes or radio buttons.
      :column -- Column name in a table header, e.g. in a table view widget.
      :row -- Row name in a table.

  @option -- Labels of option selection widgets, which can be enabled/disabled or selected between.
      :check -- Checkbox label, also a checkable menu entry.
      :radio -- Radio button label.

  @label -- Various widget labels which are not covered by any of @action, @title, or @option.
      :slider -- Slider label.
      :spinbox -- Spinbox label.
      :listbox -- Label to a list box or combo box.
      :textbox -- Label to a text box or text edit field.
      :chooser -- Label to any special chooser widget, like a color chooser, font chooser, etc.

  @item -- Strings that are items from a range of possibilities or properties of the same general type.
      :inmenu -- An item presented in a menu (e.g. a sort ordering, or an encoding).
      :inlistbox -- An item presented in a list box.
      :intable -- An item presented in a table cell.
      :inrange -- End range labels, e.g. on sliders.
      :intext -- Words and short phrases which are inserted into a larger piece of text.

  @info -- Any transient information for the user.
      :tooltip -- Expanded formulation of a widget's label, usually appearing automatically when the pointer hovers over the widget.
      :toolhelp -- Longer description of a widget's purpose and behavior, usually manually called up by the user.
      :status -- A piece of text displayed in the program's status view, like in a status bar.
      :progress -- Text showing the current step or state of an operation, possibly periodically updating.
      :usagetip -- A tip that comes up to inform the user about a certain possibility in the current context, e.g. a "tip of the day" on program startup.
      :credit -- Contributor names and their contributions, e.g. in the about dialog.
      :shell -- A note, warning or error sent to the program's text output stream (standard output, standard error) rather than shown in the UI.

When none of the minor components apply, a major component can be used alone. An example would be a library-provided list of items without any immediate UI context (e.g. language names, country names, etc.), which would be given plain @item context.

Unlike markup, the set of interface context components would not be free for customization, but fixed by Gettext and periodically updated. This is because the point of formalized interface contexts is that translators can map them to HIG-like specifications, which are general in scope, so per-domain interface contexts would serve no purpose.

Where should interface contexts be placed in the PO message? They could be added into extracted comments or into the msgctxt field. The proposal is to put them into the msgctxt field. One reason is that if an interface context is modified, it should really make translators revisit the message (by making it fuzzy). Another reason is that if the same label is shown in two different interface contexts, it is not unlikely that in some languages it will need different translations. Such conflicts are even more likely when many PO files are combined in some way, for example when making a PO compendium (i.e. a translation memory). For example, in my language:

msgctxt "@action:button"
msgid "Confirm Delete"
msgstr "Potvrdi brisanje"

msgctxt "@title:window"
msgid "Confirm Delete"
msgstr "Potvrda brisanja"

When an interface context occupies the start of msgctxt field, any necessary disambiguation context can simply be appended separated by space:

msgctxt "@item:inlistbox Text width"
msgid "Small"
msgstr "uzak"

msgctxt "@item:inlistbox Grid spacing"
msgid "Small"
msgstr "mali"

A third reason why interface context should be in msgctxt will be provided shortly.

How much would interface contexts inflate the amount of text to translate? First, an increase happens only when there are two messages with the same text but different interface contexts within the same PO file. Second, when that does happen, it is improbable that such a message will have more than a few words. This means that the increase is likely to be small, especially in the usual metric for estimating translation effort, the word count. Interface contexts have been used for a number of years within the KDE Translation Project, and inspection shows that, in PO files heavily equipped with interface contexts, the increase in word count is in the range of 1% to 6%. The average increase for all PO files which have over 80% of messages equipped with interface contexts is 2.7%.

5.1. Deciding Target Markup by Interface Context

You may have noticed that, unlike all other proposed extensions, interface contexts as described so far are pure convention. They are in no way taken into account at run time, and the only "implementation" would be documenting available major and minor components. However, interface contexts can be optionally given one technical function.

In the section on customizable markup, it was discussed how to select the desired target markup when converting the translation object into a native string. Two ways were offered. The ideal way was automatic selection, by having smart output destinations (widgets, output functions) which themselves convert to native string and set the appropriate target markup. Since smart output destinations are more likely not to be available, the other proposed way was manual selection. To recollect, for the three possible variants of conversion to native string, manual target markup selection went like this:

// C++/Qt

// Explicit conversion: target markup by the conversion method.
QString msg1 = xgettext("Processing <fn>{1}</fn>...")
                       .subs(filePath).toRich();

// Implicit conversion: target markup by a method before conversion.
QString msg2 = xgettext("Processing <fn>{1}</fn>...")
                       .setRich().subs(filePath);

// Immediate resolution: target markup by a dummy argument.
QString msg3 = ixgettext("Processing <fn>{1}</fn>...",
                         Gettext::Composer::Rich, filePath);

The third way to select target markup is by assigning the default target markup for each interface context. This makes sense because, for example, all @action:button texts in a program will be rendered by the same button widget, all @item:inlistbox by the same list box item widget, and so on. Target markups and interface contexts would be linked at the same place where the markup was set up:

// C++/Qt

void setupMarkup ()
{
    typedef Gettext::Composer GTC;

    const char *domain = "fooapp";

    GTC::markupDef(domain, "plain", "emph", ...);
    GTC::markupDef(domain, "qtrich", "emph", ...);
    GTC::markupDef(domain, "plain", "path", ...);
    GTC::markupDef(domain, "qtrich", "path", ...);
    ...

    GTC::markupSetClassPlain(domain, "plain");
    GTC::markupSetClassRich(domain, "qtrich");

    const char *plainUics[] = {"@action", "@option", "@label",
                               "@title", "@item",
                               "@info:status", "@info:shell", 0};
    GTC::markupSetUicPlain(domain, plainUics);
    const char *richUics[] = {"@info", 0};
    GTC::markupSetUicRich(domain, richUics);
}

Linking is done by the markupSetUicClass(domain, uicontexts) function for the particular target markup class. When an interface context is given only by its major component, the target markup applies to all interface contexts with that major component and any minor component, except for major-minor combinations that were explicitly given. In this example, all major interface contexts except @info, as well as a few particular @info:... contexts, were set to the plain text class, and all other @info contexts were set to the rich text class (which is here declared to be Qt Rich Text).
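The described lookup rule (an exact major:minor link first, then the major component alone, then the domain default) could be sketched as follows; the link table values are made up to mirror the example above:

# Python

UIC_LINKS = {
    "@action": "plain", "@option": "plain", "@label": "plain",
    "@title": "plain", "@item": "plain",
    "@info": "rich", "@info:status": "plain", "@info:shell": "plain",
}

def markup_class_for_context (msgctxt, default="plain"):
    # The interface context occupies the start of msgctxt,
    # e.g. "@info:progress" or "@item:inlistbox Grid spacing".
    uic = msgctxt.split()[0] if msgctxt else ""
    if uic in UIC_LINKS:
        return UIC_LINKS[uic]                         # exact major:minor link
    return UIC_LINKS.get(uic.split(":")[0], default)  # major component alone

# markup_class_for_context("@info:progress")  -> "rich"
# markup_class_for_context("@info:shell")     -> "plain"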

For this to work, the interface context must be part of the msgctxt message field, or else it could not be seen at run time. The translation object will check whether its context argument contains an interface context, and if so, it will try to set the target markup accordingly. If the given interface context was not linked to any target markup, the target markup remains at its global (domain) default. The above conversion example would now look like this:

// C++/Qt

// Explicit conversion.
QString msg1 = pxgettext("@info:progress",
                         "Processing <fn>{1}</fn>...")
                        .subs(filePath).toString();

// Implicit conversion.
QString msg2 = pxgettext("@info:progress",
                         "Processing <fn>{1}</fn>...")
                        .subs(filePath);

// Immediate resolution.
QString msg3 = pixgettext("@info:progress",
                          "Processing <fn>{1}</fn>...",
                          filePath);

Other than the added context, all conversion variants now look exactly the same as they would without markup in the picture (explicit conversion is done by the general .toString method, which respects any previously set target markup). Of course, target markup selected based on context can later be overridden by any of the manual selection methods.

When the markupImport function is used to import markup from another domain, any context-markup links defined by that other domain would also be imported. So, by calling markupImport(domain, "gengettext"), one would at the same time get access both to the generalized Gettext library's built-in set of tags and to its built-in context-markup links:

// C++/Qt
void setupMarkup ()
{
    typedef Gettext::Composer GTC;
    const char *domain = "fooapp";
    GTC::markupImport(domain, "gengettext");
    GTC::markupSetClassPlain(domain, "plain");
    GTC::markupSetClassRich(domain, "qtrich");
}

Given that the most appropriate defaults for target markup per target markup class would be defined for each coding framework binding (e.g. rich text would be set to "qtrich" in Qt bindings and to "pango" in GTK+ bindings), the absolute minimal example would be a single markupImport call:

// C++/Qt
#include <libintl_qt.h>
// ...coding framework specific bindings in use.

void setupMarkup ()
{
    Gettext::Composer::markupImport("fooapp", "gengettext");
}

Just like any of the tags can be overridden after an import from another domain, so can any of the context-to-markup links.

6. Translation Scripting

If programmers stick to a few base rules when exposing user interface text for translation, the result is quite workable material for translation into any language. Among the most important of these rules are to avoid "word puzzles" (concatenating or interpolating full sentences from pieces) and to supply context as necessary (of the appropriate type -- semantic, disambiguation, etc.). The fact that translators can suggest improvements (e.g. advise adding context), and frequently do so, further smooths out the current Gettext application.

However, there are cases where keeping up with the base rules of coding for translatability would make the code much more cumbersome, reduce its modularity, and lead to a huge increase in workload without benefit for many languages. In fact, the increase in workload can be prohibitive even for the very languages which would quality-wise benefit from it. On the other hand, in many of these cases, it is clear what the programmer could do to make the added effort for translation into a given language as small as possible. The problem is that such an adaptation would be superfluous for other languages, and that other languages would need other adaptations, leading to a convoluted patchwork most of which would be dead weight to everyone involved.

Consider an example of a base UI widget library and a program that uses it. The UI library has a standard dialog to show information about the program, and the title of this dialog is composed through this message:

# TRANSLATORS: %s is a program name
#: uilib-about-dialog.cpp:130
#, c-format
msgid "About %s"
msgstr ""

This message violates the base rule of avoiding word puzzles, because it splices in the program name. Depending on the program name and language, the program name may need to slightly change form (e.g. noun declension) or the surrounding parts may need to conform to the name (e.g. to its gender or number). But the program name argument comes from the program that uses the library, where it is exposed for translation with this message:

#: fooproc.cpp:130
msgid "FooProcessor"
msgstr ""

To resolve this word puzzle, it would be necessary to do away with automatic assembly of the dialog title by the UI library, and require that all programs specify the full title themselves:

#: fooproc.cpp:130
msgid "FooProcessor"
msgstr ""

#: fooproc.cpp:135
msgid "About FooProcessor"
msgstr ""

This leads to a loss of modularity. Another solution is to ask programs to explicitly specify the form of the application name appropriate for "About..." messages (which may appear elsewhere too, e.g. as a menu item):

#: fooproc.cpp:130
msgid "FooProcessor"
msgstr ""

#: fooproc.cpp:135
msgctxt "used in 'About <appname>' phrase"
msgid "FooProcessor"
msgstr ""

This is effectively language-specific patchwork, because many languages do not need it and it still does not provide for some languages (which may need to match their equivalent of "About..." to a grammar property of the program name). Note that both solutions require modifications to the public API of the UI library, which alone would probably get them both rejected. Translators of affected languages therefore have to resort to a "least bad" translation -- usually something correct strictly grammar-wise, but clearly of bad style.

As the second example, recall the fuzzy clock case from the introduction. It had 12 hour name messages, and 12 fuzzy time composition messages into which hour names are put at run time. Replacing runtime composition with literal combinations would have been feasible there, since it would result in only 144 messages. The fuzzy clock example was based on a real-life application, only abridged for presentation. Here is one entirely real-life example of the same kind, from a card game:

#. Both %s are card names
#. The first %s is a card name, the 2nd %s a sentence fragment.
#. * Yes, we know this is bad for i18n.
#.
#: ../src/game.c:2103 ../src/game.c:2129
#, c-format
msgid "Move %s onto %s."
msgstr ""

The first argument is a card name, like "queen" or "jack of spades". The second argument can also be a card name, or a place reference like "an empty top slot" or "the foundation pile". How many messages would it take to avoid this composition? There are 52 cards, so card-onto-card variant would result in at least 2704 messages (more if ranks without suits are also taken into account); there are 16 places giving further 832 messages, but the set of places can in principle be expanded in the future. Hence the programmer's surrendering comment.

As the final motivational example, consider practically unbounded sets of arguments:

#. TRANSLATORS: %s is a date
#, c-format
msgid "Events on %s"
msgstr ""

The date can be something with numbers only, with month name (full or shortened), and the weekday may be in there too. This date is composed based on the user's locale (names and formats), and possibly even user's custom settings. Just like any other noun insertion, words in the date may need to change form based on the rest of the sentence. With current Gettext, there is no way to "fix" this message.

All these examples, however, are trivial to translate properly instead of "least badly", without any modification to the code and without affecting other languages, if the translator can make the translation depend on arguments supplied at run time. In other words, if the translator can script the translation. In my native language, and switching now to uniform placeholders, the first example would be handled like this:

#: uilib-about-dialog.cpp:130
#, ggx-format
msgid "About {appname}"
msgstr "Podaci o {appname}"
msgscr "Podaci o $[get-property 'relative-case' {appname}]"
#: fooproc.cpp:130
#, ggx-format
msgid "FooProcessor"
msgstr "FuProcesor"
msgscr "$[set-property 'relative-case' 'FuProcesoru']"

This is more verbose than it would be in reality, so that you can intuitively understand what is going on. Similarly, the second example would be translated as:

#, ggx-format
msgid "Move {card} onto {card-or-place}."
msgstr "Stavi {card} na {card-or-place}."
msgscr ""
"Stavi $[get-property 'object-case' {card}] na "
"$[get-property 'object-case' {card-or-place}]"

with corresponding $[set-property 'object-case' ...] scripting calls on card and place name messages. The third case would look like this:

#, ggx-format
msgid "Events on {date}"
msgstr "Događaji na {date}"
msgscr "Događaji $[on-date {date}]"

The $[on-date] call here would be completely specific to my language, made to put parts of the date into a proper declension and select the appropriate preposition (which depends on whether the date starts with a weekday or not).

The rest of this section discusses the syntax and semantics of the new msgscr field (the translation script), where scripting calls ($[callname]) would come from, and a few other bits that arise from the scripting capability.

6.1. Translation Scripts in Messages ("PO Shell")

Translation scripting would introduce new syntax to the PO format: the msgscr field. This field would be manually added by the translator when necessary, to provide a scripted translation in addition to the ordinary translation in the msgstr field.

msgscr would contain the text of the translation just as msgstr does, but segments of the form $[callname arg1 arg2 ...] would be interpreted as scripting calls. This is similar to the Unix shell command substitution syntax $(...), hence the approach is called the "PO shell". The scripting call consists of a call name followed by a whitespace-separated list of arguments. Arguments can be literal strings, or message argument placeholders, which are substituted before dispatching the call. In the msgscr of this message:

#, ggx-format
msgid "No configuration available for {applet}."
msgstr "{applet} nema ništa za podešavanje."
msgscr "{applet} $[po-broju {applet} nema nemaju] ništa za podešavanje."

po-broju is the call name, {applet} is a placeholder which will be substituted to become a string argument, and nema and nemaju are two literal string arguments[21].

If a literal string argument needs to have whitespace in it, it can be wrapped in single quotes. If a placeholder is placed next to a literal string, the substituted value and the literal string will form a single argument. Everything in the substituted value is treated as literal text. It will not happen, for example, that a single placeholder substitution produces several call arguments because it contained some whitespace. To put it together, this call:

msgscr "... $[acall foo{bar}'baz qwyx'fum] ..."

has only one argument, no matter the value substituted for {bar}. Single-quoting does not prevent placeholder substitution. These two calls have the same single argument:

msgscr "... $[acall 'foo {bar} baz'] ..."
msgscr "... $[acall 'foo '{bar}' baz'] ..."

Scripting calls can be nested within scripting calls, for example:

msgscr "... $[if $[check-something ...] then-value else-value] ..."

A scripting call can be nested even at the call name position of the outer call, so that the string it returns is treated as the call name of the outer call.

A scripting call is considered properly executed if it returns a string (even an empty one). In that case its return value is concatenated to previously parsed segments, and parsing of msgscr continues. Once msgscr is parsed to the end, all collected segments (literals and call return values) are concatenated to produce the final translation of the message. If a scripting call returns something that is not a string or if it aborts, interpretation of the whole msgscr is aborted, and ordinary translation (msgstr) is taken as the final translation of the message.

Escaping is done as follows. The call start sequence $[ can only be escaped by inserting an empty call $[], giving $$[][. An empty call always returns an empty string. This is preferable to introducing special escaping syntax because the probability of a literal $[ in a message multiplied by the probability of scripting being needed for that message is extremely low. Within the call, the closing square bracket and whitespace are escaped by putting them in single quotes, and a single quote is escaped by doubling it. Braces, being placeholder delimiters, are escaped by doubling them (just like outside of a call).[22]
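
For illustration, here is a hypothetical message whose text happens to contain a literal $[ sequence (the loc-case property name is likewise only illustrative); in msgscr the sequence is broken with an empty call, while in msgstr it needs no escaping:

#, ggx-format
msgid "Scripted segments begin with $[ in {editor}."
msgstr "Skriptovani segmenti počinju sa $[ u {editor}."
msgscr "Skriptovani segmenti počinju sa $$[][ u $[get-property 'loc-case' {editor}]."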

Within msgscr, another argument placeholder is automatically made available: {0}. It evaluates to the ordinary translation, which may come in handy in some situations. For example, the set-property scripting call mentioned earlier is a bit magical, in that it is nowhere specified on which string to set the property (unlike get-property, which is given both the string and the property name); set-property will automatically pick the ordinary translation as the string to set the property on. An explicit, non-magical variant set-property-of would be applied like this:

#, ggx-format
msgid "FooProcessor"
msgstr "FuProcesor"
msgscr "$[set-property-of {0} relative-case FuProcesoru]"

Sometimes it might be necessary that a placeholder substitution does not result in a string value for a scripting call argument, but in a value of the type "nearest" to the actual type of the substituted message argument. This would mostly apply to substitution of numbers, since as a string the number may have a locale thousands separator and decimal comma, padding, and whatever else the coding framework provides, making it hard to convert back to an integer within the scripting call definition. Nearest-type substitution is requested by adding the # extension to the placeholder. For example, when the programmer knew that the substituted number would always be greater than 1 and therefore assumed that a plural call was not needed, the translation could be scripted[23] like this:

#, ggx-format
msgid "{num} bytes"
msgstr "{num} bajtova"
msgscr "{num} $[plural-form {num#} bajt bajta bajtova]"

For precisely defining what "nearest type" means it must be known which types a scripting call can take, and that depends on how scripting calls would be defined. More on that in the section on sources of scripting calls.

When a plural message needs to be scripted, each msgstr[i] can be paired with a corresponding msgscr[i]. They would be ordered in the PO file as msgstr[0], msgscr[0], msgstr[1], msgscr[1]... Pairing would not be mandatory for all plural forms in the given message, since it may happen that not all of them need to be scripted.
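
Schematically, a paired plural message would then be laid out like this (content elided; here only the first and third forms happen to need scripting):

#, ggx-format
msgid "{num} file in {folder}"
msgid_plural "{num} files in {folder}"
msgstr[0] "..."
msgscr[0] "..."
msgstr[1] "..."
msgstr[2] "..."
msgscr[2] "..."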

6.2. Why Keep Ordinary Translation?

There are several reasons why the msgstr field would always be kept, rather than being replaced by msgscr. Or, seen from the other side, why msgstr would not be made scripting-sensitive itself instead of introducing msgscr. One reason was already mentioned: in case the evaluation of msgscr fails, msgstr is used as the fallback translation.

The second reason is that a failure of msgscr does not have to be due to an error on the part of the translator, but may be by design. For example, in:

msgscr "... $[get-property fooprop {bararg}] ..."

it may be that only some arguments have the requested property set, and the scripted translation applies to them, where for other arguments the ordinary translation is fine. A more extreme example is property setting itself:

#, ggx-format
msgid "FooProcessor"
msgstr "FuProcesor"
msgscr "$[set-property relative-case FuProcesoru]"

where the scripted translation always fails, as the call is executed for the side effect of setting the property on the ordinary translation.

The third reason is configurability. As will be shown later, translation scripting requires a scripting language interpreter running within the process, which in some special environments may be too heavyweight or considered a security risk. The generalized Gettext library may then be compiled without scripting support, or the interpreter may be globally disabled, and ordinary translations will still be available.

The fourth and final reason for always keeping msgstr is backward compatibility: if needed in some context, a compiled PO file (MO file) with scripted elements can be used with classic Gettext calls. This would be done by making two MO file entries for each scripted message. One would be an ordinary entry, composed of the msgid value and the msgstr value, as if msgscr weren't present. The other entry would be the msgid value with a special tail (e.g. one '\x04' character[24]) and the msgscr value. This would make a classic Gettext call succeed on the ordinary entry, while a generalized Gettext call would look for both entries.
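
Schematically, for the "Events on {date}" example from before, the MO file would then contain two entries (the separator byte written here as \x04):

"Events on {date}"       -> "Događaji na {date}"
"Events on {date}\x04"   -> "Događaji $[on-date {date}]"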

These reasons notwithstanding, it could actually be allowed to remove msgstr and leave only msgscr on a message. This is because the underlying assumption so far was that only a very small number of messages (say less than 1%) will need to be scripted. If this assumption did not hold, if a significant portion of messages, especially longer messages, needed to be scripted, then the great duplication of text would probably trump all the proposed advantages of keeping the msgstr. One of Gettext's tools would provide a way to convert all msgstr fields to msgscr in a given PO file; msgmerge would add msgscr instead of msgstr on new messages if it detected that the majority of old messages have only msgscr. If a scripting call aborts, the scripting language interpreter is disabled, or the MO file is used for classic Gettext calls, messages that had only msgscr would simply appear as untranslated.

6.3. Sources of Scripting Calls

Scripting calls useful for many languages, such as set-property and get-property, could be implemented within the generalized Gettext library and in its core programming language, i.e. C (or C++). This could also be done for calls which are specific to one language, but widely useful within it (in many PO domains). Language-specific calls would live in separate namespaces, so that the same call name could have different behavior in different languages.

Stopping at this, however, would lead to two problems. The first problem is that C/C++ may be tedious for adding and testing new calls, requiring building and running a development version of the library. This is especially problematic for language-specific calls, where the maintainer of the generalized Gettext library may not understand the exact language-specific functionality needed. The second problem is more important: it would not be possible to define scripting calls specific to a certain PO domain. This is something that a translator with basic programming skills might want to do, and there is a non-negligible number of such translators in the free software translation environment. Given the usual amount of cooperation in free software projects, even a translator without any programming skill might find someone to implement a special scripting call.

It should therefore be technically and organizationally simple to add and modify scripting calls, both within the generalized Gettext library and per PO domain.

To achieve the technical simplicity, generalized Gettext library would provide a scripting language for defining scripting calls, and only those calls that need high performance and controlled memory use would be implemented in C/C++. This scripting language should be a well-known general-purpose language, rather than a custom one. It should be simple to program in, lightweight in terms of memory footprint and interpreter runtime dependencies, and designed for sandbox operation. Two candidates come to mind: ECMAScript (better known through JavaScript as its main dialect) and Lua.[25]

Lua is the smaller of the two, with an MIT-licensed implementation designed for embedding. It is particularly often used for scripting in computer games. Unfortunately, Lua is too small. The biggest problem is that it does not have a Unicode string type, but treats strings as raw byte arrays. An external string type would have to be provided instead, which would look rather dirty. Another problem is that it defines its own pattern-matching microlanguage, rather than using POSIX/Perl regular expressions.

This leaves us with JavaScript, which is quite popular for scripting web pages, and is used more and more as the embedded scripting language in larger programs and environments in general. There are several free implementations of the basic JavaScript interpreter, like SpiderMonkey (MPL/GPL/LGPL-licensed), QtScript (LGPL-licensed), or KJS (LGPL-licensed).

For scripting calls provided by the generalized Gettext library, it is not very important where their definitions would reside. There would obviously exist one directory for general calls, and one directory per language for language-specific calls; everything beyond that is code layout detail. It is much more important to decide where calls per PO domain would be placed, because that determines the organizational simplicity of implementing domain-specific calls.

From the installation tree point of view, the proposition is to introduce a separate root directory for scripting call definition files at the same level as LC_MESSAGES of the given language, called LC_SCRIPTS. This root directory would contain one directory for each PO domain that has some custom scripting calls defined, and within that directory a file named main.js. For example, if the PO domain is fooapp, the installation tree would look like this:

$PREFIX/share/locale
    aa/
        LC_MESSAGES/
            fooapp.mo
        LC_SCRIPTS/
            fooapp/
                main.js
    bb/
        LC_MESSAGES/
            fooapp.mo
        LC_SCRIPTS/
            fooapp/
                main.js
    ...

The reason for having a domain-named subdirectory instead of a single file (fooapp.js) is that, in general, more than one scripting file, or files other than scripting files, may be necessary for a single PO domain (an example of this will be shown later). This subdirectory would be called a scripting module.

As for the source tree of a Gettext-using program, scripting modules could be placed like this:

fooapp/
    doc/
    po/
        LINGUAS
        aa.po
        bb.po
        ...
        scr/
            aa/
                main.js
            bb/
                main.js
            ...
    src/
    ...

Another possibility is to collect scripting modules into another top-level subdirectory, say poscr/.

6.4. Translation Object Model

When JavaScript is used for scripting web pages in a browser, it is extended with a "browser object model" (BOM), a set of global objects for accessing the web page source (HTML code) and browser internals (shortcuts, tabs, etc.) Similarly, we have to define a "translation object model" (TOM) through which JavaScript code in a scripting file can connect with the PO shell.

The proposal is to have the following global objects:

message

If the top entry point for current execution context was a PO message, i.e. a scripting call in its msgscr field, then the message object will provide methods to query details of that message. This includes message text fields (msgctxt, msgid, msgstr), but also the arguments which were supplied as placeholder substitutions. If the top entry point was not a PO message, like when the scripting definition file is loaded, message would be set to undefined.

domain

The domain object would provide information and functionality linked to the current PO domain, i.e. the domain which was the top entry point for the current execution context. For example, property setting and getting calls would use the domain object for storage (or as key) in order to avoid namespace clashes, and PO shell call names would be linked to JavaScript functions through this object. There should be no execution context which does not have a PO domain as the top entry point, but should such a context appear, domain would be undefined in it.

header

Similarly to the message object, the header object would provide information from the PO header of the PO domain of the current execution context, but in a more fine-grained way than simply presenting its msgstr string. This basically means providing the array of header field name-value pairs, and some convenience querying methods (e.g. get the array of field values for the given field name).

gtcore

The gtcore object ("Gettext core") would provide any general functionality needed in addition to JavaScript's core functionality. This would include methods to compose strings, to output debugging information when testing scripting calls, to query the environment (e.g. locale settings), and so on.
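
Before the fuller example below, here is a minimal sketch of touching these objects from a scripting file; the method names header.values() and gtcore.debug() used here are hypothetical, and message.msgid() is assumed by analogy with message.msgstr() seen below:

// A minimal sketch; header.values() and gtcore.debug() are hypothetical
// names, message.msgid() is assumed by analogy with message.msgstr().
if (message != undefined)
    gtcore.debug("Executing a call for message: " + message.msgid());
var translatorFields = header.values("Last-Translator");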

As an example of employing the TOM, recall the set-property call from previous examples, which set a property on the string given by msgstr field. Here is how this call could be defined, to set an arbitrary number of properties:

/* Set properties of the phrase given by msgstr.
 * Always signal fallback.
 */
function set_msgstr_property (/*...*/)
{
    if (arguments.length % 2 != 0)
        throw Error("Property setter given odd number of arguments.");

    var phrase = message.msgstr();

    for (var i = 0; i < arguments.length; i += 2)
    {
        var prop = arguments[i];
        var value = arguments[i + 1];
        domain.set_phrase_property(phrase, prop, value);
    }

    throw gtcore.fallback();
}
domain.set_call("set-property", set_msgstr_property);

Every function in JavaScript is variadic, in the sense that when the function is called, any declared parameters in the header will get arguments assigned to them, and both those assigned and any remaining call arguments will be accessible through the arguments array. Therefore the /*...*/ comment was put into the header here to indicate intentional variadicity. The first thing is to ensure that there is an even number of arguments (property keys followed by values), or else throw an exception. Then, the phrase on which the properties are set is taken as the current message's msgstr field, from the message object. Within the property setting loop that follows, the domain.set_phrase_property method is used to link the phrase and the property key and value in the current PO domain. Since set-property is a side-effect scripting call, at the end an exception is thrown to signal the fallback to msgstr. But if a JavaScript Error object were thrown here, it could register as a warning somewhere, e.g. in a debug mode; instead, a special object produced by the gtcore.fallback method is thrown, which causes the fallback to be silent. After the function body, the domain.set_call method is used to link the set_msgstr_property function to the PO shell call name set-property.

The corresponding get-property method would be defined more simply as:

/* Get value of the phrase property.
 * If the property is not set on the phrase, signal error.
 */
function get_phrase_property (phrase, prop)
{
    var value = domain.get_property(phrase, prop);
    if (value == undefined)
        throw Error(gtcore.compose(
            "Phrase '{1}' does not have the property '{2}' defined.",
            phrase, prop));
    return value;
}
domain.set_call("get-property", get_phrase_property);

The only interesting thing here is the use of the gtcore.compose method to assemble the error message. JavaScript lacks a standard string formatting method, so the gtcore object would provide a reduced version of the generalized Gettext string composition functionality.
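
For illustration, the plural-form call from the earlier byte-count example could be defined along these lines (a sketch only; the plural rules are Serbian-like, and the numeric argument is expected through nearest-type substitution):

/* Select one of three given plural forms based on a number.
 * A sketch with Serbian-like rules, purely for illustration;
 * the number is expected via nearest-type ({num#}) substitution. */
function select_plural_form (num /*, form1, form2, form3 */)
{
    var n = Math.abs(num) % 100;
    var index;
    if (n % 10 == 1 && n != 11)
        index = 0;
    else if (n % 10 >= 2 && n % 10 <= 4 && (n < 12 || n > 14))
        index = 1;
    else
        index = 2;
    return arguments[1 + index];
}
domain.set_call("plural-form", select_plural_form);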

Call names given to domain.set_call can be any Unicode strings, so translators may decide to use language-specific call names (including in non-Latin scripts) to blend more elegantly into the translation in the PO file. There is no standard way to specify the encoding of a JavaScript source file, so it would be required that scripting files be UTF-8 encoded.

6.5. Modularity and Scoping

Since scripting calls and underlying JavaScript functions would be found in the generalized Gettext library and in various PO domains, the questions of modularity and scoping arise.

Scripting file modularity comes into the picture when, in the current scripting file, one wants to use a definition (function, object, variable) from another scripting file. The other scripting file may be within the same PO domain, in another PO domain, or in the generalized Gettext library; it may also belong to another language[26]. JavaScript itself does not define any kind of module system, so there is no convention to follow. The proposition is to have a single gtcore.import method, which would work as follows. The function afunc would be imported from the file afile.js in the current PO domain with:

gtcore.import("afile", "afunc")

This call would make afunc available for calls anywhere in the current scripting file, no matter in which scope it would be executed. gtcore.import would actually take any number of names to import:

gtcore.import("afile", "afunc1", "afunc2", "AnObj3", ...)

To import from a file within a subdirectory of the current domain, a dotted path would be used:

gtcore.import("asubdir.afile", "afunc")
gtcore.import("asubdir.asubsubdir.afile", "afunc")

Importing from another domain would work by prepending the domain name followed by a colon:

gtcore.import("adomain:afile", "afunc")
gtcore.import("adomain:asubdir.afile", "afunc")

To import from another language, the language code and a percent sign would be prepended to the domain or file path:

gtcore.import("alang%afile", "afunc")
gtcore.import("alang%adomain:afile", "afunc")

Importing from the generalized Gettext library would be done in the same way, by supplying its domain name gengettext.

One problem here is that Gettext has no mechanism for searching through various locale directories for domains. Instead, the locale directory of a program is configured at build time, making it hard-coded in the installation. This presents a problem for importing definitions from domains other than the current one and gengettext (since the generalized Gettext library would bind its own locale directory), when programs are installed in different prefixes. One solution could be to introduce something like a GETTEXTPATH variable, which would take precedence over the build-time bound locale directory. In my opinion this would be a welcome addition even to ordinary Gettext[27], but it could be limited to use within generalized Gettext only (it would also be needed for domain resolution in connection with customizable markup).

Scripting call scoping refers to the mechanism by which the underlying function would be selected for a call name parsed from the msgscr string of a PO message. Obviously, if the PO domain to which this message belongs has its own scripting module, then any domain.set_call in it would have top priority:

domain.set_call("a-call", a_call_func)

Technically it would be possible to stop here, and rely on the scripting file modularity for the rest. For example, to set a call to a function from generalized Gettext library, one could add this in domain's scripting module:

gtcore.import("gengettext:main", "a_call_func")
domain.set_call("a-call", a_call_func)

But this would imply that whenever a translator wants to use scripting in the PO file, even if the core scripting calls (those from the generalized Gettext library) were completely sufficient, a scripting module with lines such as the above would have to be added. Since most of the time core calls should be sufficient, having to add domain-specific scripting modules all the time would be too burdensome.

Instead, there would be a gtcore.set_global_call method, which would make the call available in any PO domain within the program process, but limited to the language of the message (i.e. no global calls across languages). Global call setting should be used sparingly, because it would mask dependencies between PO domains. It would most prominently be used in scripting modules of the generalized Gettext library, but it could also be reasonably used in base libraries of various coding frameworks. When a call name is seen in a PO message, first domain-specific calls would be checked for a match, and then global calls.
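
As an illustration of the mechanism only (not of any particular proposed call), a base library's scripting module might register a trivial global call like this:

/* Return the given phrase with its first letter capitalized.
 * The call itself is hypothetical; it only illustrates registration
 * through the proposed gtcore.set_global_call method. */
function first_upper (phrase)
{
    if (phrase.length == 0)
        return phrase;
    return phrase.charAt(0).toUpperCase() + phrase.substr(1);
}
gtcore.set_global_call("first-upper", first_upper);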

Another possibility could be to explicitly import another domain's calls through a PO header field:

msgid ""
msgstr ""
...
"Scripting-Call-Source: foodomain, gengettext;\n"

The order of domains would determine the scoping order, in cases when two domains set a call with the same name. This is much less burdensome than adding a scripting module for the sole purpose of importing calls, but it still requires that the translator remembers to do it. It also has the same issue with locating locale directories as mentioned above.

6.6. Filtering Messages and Runtime Contexts

Once translation scripting is available, two interesting "second-order" possibilities appear: filtering messages and runtime contexts.

Consider the UI scenario where a menu item expands into a submenu, such that the parent menu item and submenu items imply a continued phrase. Two examples:

File
    -> New
    -> Open
    -> Open With
        -> FooViewer
        -> BarShow
        -> ...
(context-menu)
    -> Open
    -> Cut
    -> Create New
        -> Folder
        -> Text File
        -> Application Link
        -> ...

Implied phrases here are e.g. "Open With FooViewer" or "Create New Text File". In languages with case declension, the submenu items would probably look better if put into a non-basic case. If submenu items are also used elsewhere where the basic case is appropriate, it is not possible to simply translate them using the appropriate non-basic case. With scripting, as we have seen, items could be given properties:

msgid "Application Link"
msgstr "Veza do programa"
msgscr "$[set-property object-case 'vezu do programa']"

But, where would that property be fetched? Normally, translated items would be simply inserted into the "Create New" submenu. The solution is to insert items through a filtering message, which consists of a single placeholder and an appropriate context:

// C++/Qt
foreach (const QString &ft, creatableFileTypes) {
    QString ft1 = pxgettext("@item:inmenu Create New",
                            "{filetype}")
                           .subs("filetype", ft).toString();
    createNewFtMenu->addAction(ft1);
}

or in the PO file, with scripted translation:

msgctxt "an item in Create New -> ... menu"
msgid "{filetype}"
msgstr "{filetype}"
msgscr "$[get-property object-case {filetype}]"

Filtering messages would normally be added on a translator's request. It would never be a problem to add a filtering message, even when it helps only one language; from the code viewpoint it would be a no-op, and from the translation viewpoint trivial to translate. Filtering messages can be added even during a message freeze, since they do not introduce any new text.

In the two examples above, the UI context was such that a phrase segment (a program name or a file type) was inserted into a static implied phrase ("Open With ..." or "Create New ..."). In other words, the only variable part of the implied phrase was the segment being inserted. However, some UI contexts can lead to implied phrases whose parts depend on one another. For example, in a file search dialog there can be options such as these:

[ ] Find all files created or modified:
    ( ) between ...
    ( ) during the previous ____n<> [hours|days|...]

Here [ ] denotes a checkbox, ( ) a radio button, ____n<> a number spinbox, and [hours|days|...] a listbox with those items. Already with current Gettext, the programmer can properly pluralize the listbox items, by ensuring that they are retranslated whenever the number in the spinbox changes. The same can be done with the "during the previous" text label in front, because the adjective "previous" may need to match the plurality of the noun in the listbox. But the adjective "previous" may also need to match the grammatical gender of the noun in the listbox. Neglecting pluralization for brevity, if the PO message of this text label is:

msgid "during the previous"
msgstr ""

there is no link with the noun in the listbox, and therefore no way for the translator to script conformance to the noun's gender.

To establish the missing link, the translator can ask the programmer to retranslate this message on every change of the listbox item, and to set on that message a runtime context which indicates the current item in the listbox. A runtime context is simply a pair of two strings, the context keyword and the runtime value. It would be added by the inContext method of the translation object:

// C++/Qt
QString prLabel = pxgettext("runctxt: period=[i|h|d|m|y]",
    // TRANSLATORS: This message has the runtime context 'period'
    // with values: 'i' for minutes, 'h' for hours, 'd' for days,
    // 'm' for months, 'y' for years.
                            "during the previous")
                           .inContext("period", periodKey).toString();

The message has a comment informing translators of the runtime context, its keyword and possible values, but also a semi-formal disambiguation context which succinctly states the context keyword and values. This is necessary because when a runtime context is modified or added to a message, translators must unconditionally review that message (the same as when an ordinary context is added).

The translator can now script the translation as follows:

#. TRANSLATORS: This message has the runtime context 'period'
#. with values: 'i' for minutes, 'h' for hours, 'd' for days,
#. 'm' for months, 'y' for years.
msgctxt "runctxt: period=[i|h|d|m|y]"
msgid "during the previous"
msgstr "tokom poslednja"
msgscr "tokom $[by-context period i poslednja ... y poslednje]"

The by-context call would take the runtime context keyword, followed by pairs of a context value and the string to return for it. This call could be defined as follows (omitting some bells and whistles):

/* Return the string paired with the current value of the given
 * runtime context keyword.
 */
function select_by_message_context (ctxtkey /*...*/)
{
    var ctxtval = message.runctxt(ctxtkey);
    for (var i = 1; i < arguments.length; i += 2) {
        if (ctxtval == arguments[i])
            return arguments[i + 1];
    }
}
domain.set_call("by-context", select_by_message_context);

The new element here is the use of the message.runctxt method to fetch the value of the runtime context with the given keyword. Note that if there is no match, this function will silently "fall off" the end; in JavaScript this means that the return value will be undefined. Since a valid scripting call must return a string, returning undefined will automatically cause fallback to the ordinary translation.

6.7. Setting Properties on Non-Native Messages

Not all text shown in a Gettext-based program comes from PO catalogs at run time. One frequent source of such text is .desktop files, which store basic information about a program, such as its name, version, execution pattern, etc. Information from installed .desktop files is read by other programs, such as program launchers when listing available programs for starting, or file managers when offering programs to open a file. Here is an excerpt from a .desktop file:

[Desktop Entry]
Exec=planner %F
Icon=gnome-planner.png
Type=Application
# ...
Name=Project Management
Name[am]=የዕቅድ ጉባኤ
Name[ar]=إدارة المشروع
Name[be]=Кіраваньне праектам
Name[bg]=Управление на проекти
Name[ca]=Gestió de projectes
Name[cs]=Správa projektů
Name[da]=Værktøj til projektstyring
Name[de]=Planner Projektverwaltung
# ...

You can observe that translated entries are stored within the file itself. When a program reads an entry from a .desktop file, it will automatically fetch the translation into the current user's language (if there is one). But, since these translations do not come from a compiled PO file, there can be no scripted translation to set properties on them (declension forms, gender, etc.). How to work around this?[28] For this we need a few ingredients.

The first ingredient is property maps, or pmaps for short. Somewhat similarly to the previously mentioned domain.set_phrase_property(phrase, prop, value) method, which sets a single property on the given phrase, there would exist a domain.load_phrase_properties(filename) method to read an arbitrary number of phrases and their properties from a text file. This file is called a property map, and it would have a very simple format:

# cities.pmap
=:Athens:Atina:nom=Atina:gen=Atine:dat=Atini:acc=Atinu::
=:Paris:Pariz:nom=Pariz:gen=Pariza:dat=Parizu:acc=Pariz::
...

Each phrase entry would start with two characters. The first character (here =) would be taken as the key-value separator in a property, and the second character as the separator of properties (here :). A property without a key (i.e. without a key-value separator) would be understood as the phrase itself. There could be more than one phrase, e.g. to cover alternative spellings, or, as in this example, in case a non-translated name is received at run time for whatever reason. The entry would be terminated by a double property separator. Newline would have no special meaning, so it could be used within phrases and property values. Comments could be written by placing # before an entry start (within an entry, # would have no special meaning).
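
With the pmap above loaded through domain.load_phrase_properties, a hypothetical message could then pull a declined form from it:

#, ggx-format
msgid "Weather report for {city}"
msgstr "Izveštaj o vremenu za {city}"
msgscr "Izveštaj o vremenu za $[get-property acc {city}]"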

The second ingredient is the realization that, although an installed .desktop file carries its own translations, it was likely translated through a PO file. This works by applying a tool which extracts translatable entries from .desktop files into a POT file, and then takes the translations from PO files and inserts them as entries back into the .desktop files (insertion can happen at various moments, for example at build time). In this way translators can keep using the standard PO workflow and tools, instead of having to adopt another workflow and tools for every special file format that comes along. So, in one PO file or another, translators will see the Name= entry from the .desktop file snippet above as:

#: ../data/planner.desktop.in.in.h:2
msgid "Project Management"
msgstr ""

Putting the two together, the translator can set properties by adding a pmap entry to a translator comment in the PO file:

# pmap: =:gen=Upravljanja projektom:dat=Upravljanju projektom:
# pmap: =:acc=Upravljanje projektom:ins=Upravljanjem projektom:
#: ../data/planner.desktop.in.in.h:2
msgid "Project Management"
msgstr "Upravljanje projektom"

The comment starts with the pmap: keyword, and then an almost normal pmap entry follows. The only difference from an entry in a standalone pmap file is that the phrase (the no-key property) is missing, and that is because the phrase is the translation itself. In order not to have to write a very long comment line, any number of pmap: comments can be given, and all the defined properties will be unified. A tool would be made to go through a PO file, collect all pmap entries, and write them into a pmap file. This generated pmap file would be put into the scripting module of the PO domain, and the main.js file in there would contain a domain.load_phrase_properties call to load the pmap file.[29]
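
The main.js in question could then be as small as this (a sketch; the generated file name is hypothetical):

// main.js of the PO domain's scripting module: load the property map
// generated from the pmap comments in the PO file (file name hypothetical).
domain.load_phrase_properties("generated.pmap");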

You may have noticed that this actually does not solve the problem with .desktop files. It works only when the extracted pmap entries are used in messages found in the same PO domain. There are situations when this is the case, like when a program uses custom files for storing its internal text data together with translations. But the point of .desktop files is that the data in them is used across programs, i.e. in messages in various PO domains. On the technical level, this can be handled by putting generated pmap files into a designated system location and loading them in the scripting module of the generalized Gettext library itself, through a global gtcore.load_phrase_properties call. Then, when a property is looked up by a scripting call in a certain PO domain, first the domain-specific property storage would be checked, followed by the global storage. To handle this on the organizational level (distribution, packaging) should not be too hard either. Programs would simply install generated pmap files into the designated location, which they would obtain from the configuration of the generalized Gettext library (e.g. through pkg-config). There are a few details to work out here (e.g. the designated location could be within the gengettext scripting module or elsewhere), but nothing of conceptual difficulty.
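
Schematically, the scripting module of the generalized Gettext library would do the same for globally installed property maps (how the designated location is enumerated is left open here):

// main.js of the gengettext scripting module: load a pmap installed
// into the designated system location by some program (a sketch; the
// file name and path resolution are hypothetical).
gtcore.load_phrase_properties("planner.pmap");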

7. Remarks

This section contains various remarks about the design, implementation, and use of the proposed Gettext extensions.

7.1. Converting Existing Sources

One concern may be what would happen to existing program sources once generalized Gettext is introduced. Most importantly, there would be no backward incompatibility and no current feature of Gettext would become deprecated. So, simply, all existing programs could continue to use current Gettext, and new programs could continue to be written that way. Due to strict format separation, current and generalized Gettext calls could even coexist in the same program, for example to facilitate incremental conversion.

If the maintainers of an existing program wish to fully and immediately convert to generalized Gettext, that should not be too hard either. A script could be written to handle the majority of cases, which are either no-argument messages, or simple argument substitutions. The minority of special cases, e.g. where arguments are substituted at a later point, can be reported by the script and handled manually. I had done something like this for the KDE 4.0 release, switching all the sources in KDE repository (about three million lines) to a new translation call syntax. It took me several days to make the conversion script and test its behavior, and several more days to manually convert special cases.

7.2. Generalization As a Layer Above Gettext

It has already been mentioned that the generalized Gettext library would have to have well-fitted bindings for various coding frameworks, and that these bindings could be provided by those coding frameworks themselves. Going one step further, this whole document could be understood as proposing a specification rather than implementation. There could be several generalized Gettext providers, implemented as a layer above the Gettext proper. For example, one such implementation could be based on QtCore and QtScript, and another on Glib and SpiderMonkey. This may look like wasted effort, but the implementation should be rather lightweight if the foundation libraries provide all the needed bits: a string type, a few container types, IO streams, locks, an XML parser, regexes, and a JavaScript interpreter. Therefore the advantage of not introducing another low-level dependency chain into the stack may be more important than having a single implementation.

Another reason for implementing generalization as a layer above Gettext could be to test and polish all the features in real-world use, and then introduce the refined result into Gettext itself.

Some support from Gettext side could be provided outright, though. For one, activation of gg*-format flags through extra elements in the -k option of the xgettext command should not be a big problem, since this does not interfere with any other functionality of xgettext. The flags should perhaps be changed somewhat, e.g. to xgg*-format, to leave the original suggestions for the time when the generalization would become part of Gettext itself.

More problematic would be the msgscr message field. It would be strange to introduce a new PO field into Gettext which is not used by Gettext itself. On the other hand, this would make sense if the intention were to make one implementation a part of Gettext at some point. If the msgscr field is not introduced, it would be possible to write it within the msgstr field, with a special separator, here |/|:

#, ggx-format
msgid "Events on {date}"
msgstr "Događaji na {date}|/|Događaji $[on-date {date}]"

While this may be "dirty", it is much less so than the earlier widespread approach of embedding the disambiguation context (before msgctxt was introduced), for example:

msgid "application name|Unknown"
msgstr "nepoznat"

This is because a translator who knows nothing about scripting would have nothing to mess up, unlike with the embedded context, which unaware translators happily translated along with the text. The separator should be chosen to be somewhat verbose and unlikely to appear literally in the text, like |/| above, so that it practically never has to be escaped. If the separator is chosen in this way, it can even be required that the original text does not contain that sequence.

Other than the format flags, the call specification elements, and the msgscr field, there is nothing else that current Gettext would really have to support. Everything else would be part of the generalization layer.[30]

7.3. Translating Non-Native Sources

PO files are used to translate just about everything, not only user interfaces of programs that use Gettext calls at run time. There are two types of non-native sources, dynamic and static. Dynamic sources are programs which use a system other than Gettext to translate their user interface. Examples of these are programs within the Mozilla and OpenOffice[31] suites. Static sources are any kind of text meant to be read standalone and sequentially, such as program documentation or (in some instances) web pages.

Non-native sources are translated over PO files with the help of an extractor-injector tool. This tool first splits all text in the source file into distinct segments, and extracts the segments as PO messages into a POT file. The POT file is used in the PO translation workflow as usual, resulting in translated PO files. The tool then takes the original document and the translated PO file, and injects translations in place of (or in addition to) original text in a copy of the source file. This produces the final translated source file. With some tweaks to the extractor-injector, all non-native sources, dynamic or static, can be represented through generalized Gettext PO messages.

For dynamic sources, the extractor-injector should be made to convert any native argument placeholders (or formatting directives) into uniform placeholders, and any native markup into customizable markup. For example, if PO messages from a non-native source would currently look like this:

#: src/minimap.cpp:90
msgid "Could not get image for terrain: $terrain."
msgstr ""

#: data/core/encyclopedia/drakes.cfg:26
msgid ""
"The time that passes between one <ref>dst=egg text='egg'</ref> "
"laying to the next."
msgstr ""

the extractor-injector could be upgraded to convert and back-convert placeholders and markup, to produce proper generalized Gettext messages:

#: src/minimap.cpp:90
#, ggx-format
msgid "Could not get image for terrain: {terrain}."
msgstr ""

#: data/core/encyclopedia/drakes.cfg:26
#, ggx-format
msgid ""
"The time that passes between one <ref dst='egg'>egg</ref> "
"laying to the next."
msgstr ""

Customizable markup is the key here, since it enables straightforward mapping of native markup. If the generalized Gettext markup were fixed instead, it would be hard, ugly, or even impossible to meaningfully map native markup.

If a native argument placeholder contains a formatting sequence, that sequence can simply be dropped on extraction, and put back on injection. For example, if the native string were "Total ~,2@f kg of iron loaded.", the converted string would be "Total {1} kg of iron loaded.". Of course, in peculiar situations where exposing formatting sequences to translators would be preferable, that too is possible: "Total {1:,2@f} kg of iron loaded.".

Static source files are normally written in some sort of formal markup language, like Docbook, or semi-formal markup, like a wiki. For non-XML based markup, it would be necessary to fully map it to XML-like tags, but that should be straightforward. In fact, this approach is nothing new: I have seen at least two instances where a non-XML markup was converted to Docbook for translation. The tag freedom of customizable markup of generalized Gettext can only make this procedure easier.

XML-based static source formats, like Docbook or Mallard, would simply remain themselves. Only two small tweaks would be introduced.

The first tweak concerns custom XML entities, such as:

#. Tag: phrase
#: index.docbook:121
msgid "A typical &kgoldrunner; game"
msgstr ""

Generalized Gettext customizable markup does not allow custom entities, for the reasons outlined before, so there would be two ways to go about this. The first way is to expand entities on conversion:

#. Tag: phrase
#: index.docbook:121
#, ggx-format
msgid "A typical <application>KGoldRunner</application> game"
msgstr "Tipična partija <application>K-zlatobojca</application>"

Expansion is the right thing to do for most of the entities. This is because translators may need to modify the text behind the entity, either to add some grammar modifications (such as case endings on nouns), transcribe the text into another script (such as Cyrillic), outright translate it, or any combination of those (such as above: translation with a case ending). If expansion is technically difficult or a large chunk of static text is inserted[32], the entity should be replaced with a uniform placeholder:

#. Tag: phrase
#: index.docbook:121
#, ggx-format
msgid "A typical {kgoldrunner} game"
msgstr "Tipična partija {kgoldrunner[K-zlatobojca]}"

The translation of this message introduces another feature of uniform placeholders: the [value] extension. This extension allows the translator to override, with an arbitrary value, the argument that would otherwise have been substituted. Bracketing is necessary because the value could contain a character which starts another extension, such as !, :, etc. Literal square brackets would be escaped as usual, by doubling.

The second tweak is about custom elements that some extractor-injectors insert into text to represent cut-out portions of the text. For example, if a Docbook paragraph containing a footnote:

<para>Looking at [...] a preview image for each individual person
<footnote><para>This of course also applies to places, keywords,
and other [...].</para></footnote> as can be seen in [...].</para>

would be extracted with the xml2po command, two PO messages would appear:

#: browsing.docbook:53(para)
msgid ""
"<para>Looking at [...] a preview image for each individual person"
"<placeholder-1/> as can be seen in [...].</para>"
msgstr ""

#: browsing.docbook:54(para)
msgid ""
"This of course also applies to places, keywords, and other [...]"
msgstr ""

Here <placeholder-1/> is the custom element that represents the place where the footnote was cut out of the main text of the paragraph. Custom elements are not good for validation, since then the validator has to know not only about the normal markup, but also about what a particular extractor-injector could add. This is simple to solve, by using a uniform placeholder instead:

#: browsing.docbook:3(para)
#, ggx-format
msgid ""
"<para>Looking at [...] a preview image for each individual person"
"{placeholder-1} as can be seen in [...].</para>"
msgstr ""

Another requirement is that the extractor-injector produces messages which contain well-formed XML on their own, but that is quite natural, and all the extractor-injectors that I know of already do so.

7.4. History and Acknowledgments

At the end of 2003, several months after I started to work on PO based translations, it occurred to me that the grammar inflection of my native language could be handled with some simple translation scripting. I quickly put up a proof of concept, which can still be seen here: http://nedohodnik.net/misc/cotras-intro.html. It completely ignored Gettext as such; it was something I wanted to see working just for fun, without a clear intention of going that way. But I did show the concept to others, translators on the KDE translations mailing list. At that point I was given the first practical clue (by Federico Cozzi): yes, looks interesting, but it really should be based on Gettext and PO files.

About half a year passed without me doing anything further about this, but I was slowly getting to know how the KDE base libraries used Gettext internally, and how the KDE code was developed and released. I realized that, since KDE had an internal wrapper layer around Gettext calls instead of using them directly, it was indeed possible to add translation scripting as a "post-processing" step, after the Gettext call and before the translation is sent further on. So by late 2004 I had made a real proposal, including a test implementation and performance measurements, to add such a layer in KDE 4 (at that time KDE 3 was the current major version). It was necessary to wait for KDE 4, since argument capturing would break the binary compatibility of the existing translation layer.

Again I ran the proposal by fellow translators, and got more advice. Most importantly, my proposal used s-expression syntax for piecing together the complete message, and I was told (by Krzysztof Lichota) that it would look less menacing if it used string interpolation calls. An interpolation call as described in this document is an s-expression in itself, but localized to only what needs to be scripted, rather than taking over the complete text. At that time I also intended Guile to be the scripting language, but I was given a strong warning (by Stephan Kulow) that a general-purpose language, which can reach into the system, would be unacceptable from at least the security standpoint. The last version of that proposal, after a few updates during 2004 and 2005, remains at http://nedohodnik.net/misc/ktranscript.html (other than being a historical item, there is also the section "Performance Considerations", with test cases and performance measurements showing that the runtime performance of a translation scripting system is quite sufficient).

More time passed, and by the end of 2005 things started to develop. KDE code was branched for what was to become KDE 4.0 two years later, and I had a good grasp of how the KDE translation system worked internally. It was using Gettext in a somewhat custom way, e.g. having its own syntax for plurals and disambiguating contexts, and a patched version of the xgettext command was needed to extract messages. Around that time msgctxt was introduced to Gettext, so I took it upon myself to convert the KDE base library and applications to work with standard Gettext. That enabled me to prepare a slot where a translation scripting layer could be inserted, should it be accepted at some point in the future.

This included switching translation calls to argument capturing. This was nice to have even without scripting, since it made possible automatic locale-formatting[33] of arguments, and translation calls became shorter. The latter was due to the fact that two variants of conversion to native string (QString) were provided, explicit conversion and immediate resolution. Implicit conversion was omitted, just in case. The combination of explicit conversion and immediate resolution calls has worked nicely ever since, without anyone complaining (or at least not loudly enough) about the lack of implicit conversion. I had considered some other approaches too, but various people helped eliminate them (e.g. Nicolas Goutte was particularly involved in that aspect).

By spring 2006, the switch to standard Gettext and the new translation call syntax was completed. With the KDE 4.0 release not nearly in sight, I decided to wait a bit before inquiring again about translation scripting. I did this near the end of 2006, again taking Guile for the interpreter, and trying to back it up with some security features (e.g. ignoring scripted translation when the process is running under root permissions). Core developers were nevertheless strongly opposed but, luckily, there was a workable option. In the meantime, the JavaScript interpreter of the Konqueror web browser, KJS, had become part of the KDE base libraries. Now there was an interpreter which was both a part of the KDE base libraries and of the sort intended to run in a sandbox.

So, JavaScript it was, and in the first half of 2007 I added the translation scripting layer into the prepared slot. At that time I wasn't really thinking of doing this on the level of Gettext, since it was a highly experimental attempt. The scripted translation was written into the msgstr field, after the |/| separator; there was no means of escaping, the text was simply forbidden from containing this sequence. This too has worked nicely ever since. There were no cases of an inexperienced translator messing something up, nor any other ill effects.

Translation scripting as implemented at that time has basically not changed since. Some functions were added as needs arose (e.g. handling of property maps), and that was it. The single biggest change was adding the possibility to have scripting calls within scripting calls. It is interesting how this came about. Originally I intended for the "call within a call" scenario to be handled by defining a new specialized call in the scripting module of the PO domain that needed it. But, in discussion with translators (notably Yukiko Bando), it became apparent that it would be really convenient to have as many scripting calls as possible provided by the KDE base libraries' scripting module (which stood for the generalized Gettext library as proposed in this document), and to introduce domain-specific scripting modules only for truly exceptional cases.

The translation scripting system proposed by this document is generally the same as the one currently in KDE, except that it is much stricter with respect to namespaces. In the current KDE system, all translation-related extensions to JavaScript are methods of a single object (called Ts) instead of being thematically separated into different objects as proposed now (message, domain, etc). Also, all property setting on phrases is global, meaning that properties set in the program's PO domain can clobber those in an underlying library's PO domain. None of this has created a single problem in practice so far, but I considered it prudent to tighten this aspect in the current proposal.

In the latter half of 2007, some programmers expressed a wish for a way of semantically marking up elements in messages. They were fed up with deciding whether to put quotes, bold face, or whatever else around e.g. a file name, and wanted instead to simply say "this is a file name". We considered possibilities of doing this outside of the string, in argument substitution methods, but all of them turned out clunky. They would also have taken markup out of the translator's sight, which would have been a pity since semantic markup also provides some context. Therefore we came to the approach of having XML-like semantic markup directly in strings. Selection of target markup (when converting to native string) was done only through interface contexts, and in no other way.

Unlike translation scripting, however, semantic markup did not work that well. I could see programmers frequently struggling to choose the right tag from the provided fixed set, even though that set had been "carefully chosen" (discussed with translators, etc.). On the technical side, a translation object could not be substituted directly as an argument into another translation object, so markup was always resolved early. This led to problems with wrong selection of target markup, and since automatic escaping could not be done, to artifacts caused by missing escaping. Not that there were big problems in practice, but the details were ugly enough, and the fixed tag set too limited. It was also allowed to use Qt Rich Text markup alongside semantic tags, and most programmers simply kept using only that (which would be just fine in the proposed customizable markup approach).

Although interface contexts were presented as an element of semantic markup, they were used much more than the actual in-string markup. Programmers were never uncomfortable with adding interface contexts on request (or when known "i18n fixers" would commit contexts themselves), and some maintainers have even equipped most of their programs with interface contexts. No translators have complained about the increased workload (which is, as shown earlier, on the order of 3% by word count in PO files highly equipped with interface contexts). The proposed set of major and minor components of interface context in this document was carried over from the current KDE system without modifications. This set has almost never changed since inception (only the :row minor component was added later), indicating its stability.

I have mentioned a few people by name above, but many others have given valuable advice (and reality checks) along the way. To name a few more: Harri Porten and Maksim Orlovich, core developers of KJS, have given me a hand with integrating KJS as the JavaScript interpreter; Bruno Haible, the maintainer of Gettext, has added support for the KDE 4 translation system to the Gettext tools; Oswald Buddenhagen, the maintainer of the Qt Linguist system, has discussed with me at length what we could do for translations in KDE/Qt 5. That discussion prompted me to write this document.

7.5. Other Approaches to Translation Scripting

As can be seen from the history overview, this proposal grew out of the original drive for introducing translation scripting. Other people had similar ideas.

In 1998, Sean M. Burke had a brush with plural handling, for which Gettext had no support at the time. He realized the full implication and proposed a function-based replacement for Gettext, called Maketext, for use in Perl programs. His text can be found at http://search.cpan.org/dist/Locale-Maketext/lib/Locale/Maketext/TPJ13.pod. To my knowledge, this was the first published concept of translation scripting.

Games are especially susceptible to the combinatorial effects of argument insertion. One such game is OpenTTD, whose developers resorted to extending a simple ID-based translation system with facilities to handle plural, case and gender aspects of argument insertion. A document describing this can be found at http://wiki.openttd.org/OpenTTDDevBlackBook/Format_of_langfiles.

Axel Hecht, of the Mozilla Foundation, initiated a new translation system in 2007, called L20n. It uses a cascading dictionary-like selection system for grammar manipulations, has some capability for arithmetic and logical expressions, and defines a custom translation syntax and file format. While L20n comes from a Web-oriented background, in principle it can be used in other contexts too. The current state of L20n, as well as a guide illustrated by examples, can be seen at http://l20n.org.



[1] However, it is essential to support C too. As soon as an interface for C is drafted, it will be included in this document.

[2] Qt provides its own translation system, Qt Linguist. With recent releases, it has become every bit as capable as Gettext on the coding and runtime side, albeit intended for use only in Qt-based programs. But in this example we will ignore Qt Linguist and use Gettext as in any ordinary C++ program.

[3] The printf argument reordering extensions provided by Glibc (%1$s, %2$s) do not work in Python.
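
For illustration, here is a minimal sketch (the strings and names are invented for the example) of how a Glibc-style reordered directive is rejected by Python's % operator, and of Python's own named-placeholder alternative:

    # Glibc-style reordering ("%2$s", "%1$s") is rejected by Python's % operator.
    try:
        print("%2$s was sent to %1$s" % ("Alice", "the report"))
    except ValueError as e:
        print("error:", e)  # e.g. "unsupported format character '$' (0x24) at index 2"

    # Python reorders arguments through named placeholders instead, which a
    # translator can freely reposition within the translated string.
    print("%(item)s was sent to %(user)s" % {"user": "Alice", "item": "the report"})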

[4] You may think the solution is wrapping for translation all literal combinations: "seven o'clock", "five past seven", "ten past seven", ..., "eight o'clock", "five past eight", "ten past eight", .... The 144 short messages it would produce in this case may yet be acceptable. But think of a case where the combined messages would be sentence-long (and especially when that sentence is later changed a bit in the code), or where the number of combinations would be in the thousands. Literal combinations are also highly inconvenient on the code side, and almost always have to be introduced after the fact, by rewriting the code.

[5] Translators typically look for changes in the actual text, and pay less attention to the formatting particularities, since these are usually simply copied when translating from scratch. Of course, it helps greatly to use a capable PO editor which automatically highlights the difference from the previous to the current original text, but there was no real reason for the message to be fuzzied in the first place.

[6] See later for why all messages got the format flag, instead of only those with placeholders.

[7] If an argument has not been provided, some sort of warning or error should happen, telling the programmer to fix the code.

[8] Unless the coding framework disallows this for other reasons, e.g. due to the problems with non-POD file-static data in C++.

[9] This approach has been followed in the KDE 4 translation system (the *i18n calls), and applied to general strings in the C++ bindings for Glib, the glibmm library (as ustring::compose). To my knowledge, it first appeared in 2004 as Ole Laursen's compose library, precisely with the intent of being used in conjunction with Gettext.

[10] At the time of writing, among the 180,000 messages in the KDE Translation Project there were exactly 2 (two) with more than 9 arguments.

[11] Generalized calls will always signal a mismatch in some way, whether between the placeholders and the supplied arguments at the moment of resolution, or between the placeholders of the singular and plural strings. So this actually works the other way around: the only silently accepted mismatch will be that of the plural-deciding argument in plural messages.

[12] Especially formatting directives which are used in a limited, problem-specific environment. One such example would be Wesnoth Markup Language (WML).

[13] This is because prepositions (like "from", "to", "by") heavily depend on what they apply to, and this dependency varies across languages, so the translator always has to know what is substituted to correctly translate the related preposition.

[14] Conversely, the KDE Translation Project is not suitable for this examination, since its user interface catalogs contain Qt-style ordinal placeholders.

[15] For that matter, everyone would be fine with 756 mmHg.

[16] This is, for example, a big problem with the DocBook format. Although it is intended as the end-all format for writing software documentation, many eschew it for simpler, even purely visual, formats.

[17] The markup definition syntax in this Python example could be made more Pythonic in reality. Basic function call syntax is used to make it more obvious how it would work in a syntactically less expressive language, like C++.

[18] Regardless of markup, this is probably not a good splitting for translatability. It is hard to find an example where non-well-formed splitting would make sense in every aspect except for impeding validation.

[19] I personally consider this a flawed solution, on several levels. But being widespread as it is, I have to include it in the discussion.

[20] Tilde is mostly found in references to paths within the user's home directory on Unix-like systems.

[21] This call selects the appropriate form of the verb based on the grammatical number of the applet name, which can be singular or plural. The first argument is the applet name, and the following two arguments are the verb forms for singular and plural.

[22] Backslash is not used for escaping because it would overlap with PO format's escaping in string fields. In fact, all escaping in generalized Gettext is intentionally orthogonal to PO field string escaping.

[23] Of course, a missing plural call should be reported as a bug in code internationalization, rather than silently scripted away. But scripting can be a valid intermediate solution, for example if the code is in a hard message freeze prior to release.

[24] This is similar to how a message with disambiguation context is compiled now: msgctxt value followed by '\x04' followed by msgid value.
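
As a minimal sketch (the catalog contents here are invented for the example), this is how such a glued lookup key would be formed and consulted:

    # The msgctxt and msgid values are glued with the '\x04' (EOT) character,
    # which is how context-disambiguated messages are keyed in compiled catalogs.
    def lookup_with_context(catalog, msgctxt, msgid):
        key = msgctxt + "\x04" + msgid
        translation = catalog.get(key)
        # Fall back to the plain msgid if the glued key is not in the catalog.
        return translation if translation is not None else msgid

    # A hypothetical pre-loaded catalog dictionary:
    catalog = {"menu entry\x04Open": "Otvori"}
    print(lookup_with_context(catalog, "menu entry", "Open"))  # prints "Otvori"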

[25] A Scheme implementation, such as Guile, would have been my own first choice. Unfortunately, the s-expression syntax would likely repel more people than it would attract, and the implementations are usually either not that lightweight or more of special-purpose projects.

[26] When languages are very closely related, such as the Nordic or the South Slavic group, it makes sense for one language to use the scripting functionality of another.

[27] Currently there is a hackish method of placing properly tuned path segments into LANGUAGE, instead of normal language codes.

[28] One reply could be: upgrade .desktop file format such that entries are translated at run time from PO files! Some software distributors have indeed taken that route, although there are some issues with it. Regardless, there will always be some "dumb" translated text, in one format or another, that needs working around.

[29] For automatic use, e.g. in a build system, the pmap extraction tool could do the whole process. It would create the scripting module for the language if it does not exist, put the generated pmap file in it, create the main.js file if it does not exist, and put the pmap loading line in it (or insert it in an existing main.js file).

[30] There is also the part about validating customizable markup, which was proposed to work by adding some functionality to xgettext and msgfmt commands. But markup cannot be validated at present at all, so there would be no loss there.

[31] Although, at the time of writing (August 2011), there is a draft proposal in LibreOffice (a community-maintained fork of OpenOffice) to switch to Gettext as the native translation system.

[32] Both of these reasons are actually a sign of something being suboptimal in the technical part of the authoring process.

[33] Because Qt has argument placeholders (%1, %2...) instead of formatting directives, a formatting sequence could not be used for locale-formatting purposes. KDE also has its own locale system separate from Qt's, so during the KDE 3 era it was necessary to use locale-formatting calls explicitly.