Translation Scripting for KDE

Chusslove Illich

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2.

Revision History
Revision 0.2.1, 2005-09-04, CI
Updated performance benchmarks (the previous ones contained an embarrassing error, resulting in far too good results in one case).
Revision 0.2.0, 2005-08-25, CI
Interpolated syntax for scripting. Removed history stuff. Heavily revised examples. Some changes in implementation details.
Revision 0.1.1, 2005-08-03, CI
Security related limitations. One new Guile extension.
Revision 0.1.0, 2005-04-20, CI
More formalized usage of external scripts (scripting modules), added performance considerations, removed obsolete stuff.
Revision 0.0.0, 2004-08-16, CI
Initial release

Abstract

This document presents a proposal for a translation scripting system for KDE, intended to overcome some translation problems that have so far been practically unsolvable. It describes the scripting facilities for translators, their implementation in the KDE core, and some additional rules that KDE application programmers should adhere to. Examples of possible uses in existing KDE applications are also given.


Table of Contents

Introductory Notes
Using The Scripting System
Scripting in PO files
Scripting modules
Debugging scripts
Guile extensions
Scripting examples
Implementation
Changes to i18n interface
Some internal details
Performance considerations
Security limitations
Additional Notes on I18n to Programmers
Using full sentences revisited
Wrapping all user visible strings revisited
Acknowledgments

Introductory Notes

This document is a collection of several sections which logically belong in different places within the KDE documentation, but are presented together here to allow a more coherent review of this as yet experimental feature. It is transitional and will be abandoned at some point in the future.

The first section deals with the translator's view of the matter, the second covers the implementation in KDE's i18n handling structure, and the third contains some additional advice on i18n for programmers, which surfaced after the scripting model was introduced. Although each section is self-contained and possibly aimed at a different audience, when read together in the given order they provide insight into the claims made and the design decisions taken.

The most detailed section is the first one, because it is intended as the future KDE Translation Scripting Guide, and because at present it is very important to demonstrate the need for, and the scope of, a translation scripting system.

Feel free to send any comments, advice and additions regarding this document. Its DocBook source is available here.

Using The Scripting System

Scripting is implemented through GNU Guile, GNU's Scheme interpreter, so to use scripting in translations you should know the basics of the Scheme programming language. However, since Scheme supports many programming paradigms, if you have any programming or scripting experience, a quick introduction to Scheme should be enough for you to solve a great many practical translation problems. For more information on Scheme, see the Schemers.org website.

However, even if you have no idea of Scheme, it still might not be too hard to use scripting :) Since Scheme is very adaptable, there will be some very intuitive solutions already available, which you can use for your language as well. Or, if you have a specific need, it might turn out to be easy to implement in a way that looks and feels intuitive, so that someone else can do it for you quickly.

While scripting, you have at your disposal the full capabilities of Guile, as well as some extension functions added by the KDE implementation. The extensions are there to speed up some common procedures and to support weaker parts of Guile (e.g. it knows nothing about Unicode, which is essential in the translation process).

When you are doing ordinary, non-scripted translation, you only edit PO files. Once you deploy scripts, you will also be able to write custom scripting modules, containing functions to be used by top-level scripts within PO files.

Scripting in PO files

For example, let us take a classic message from kdelibs.po, with its translation into Serbian:

#: kparts/browserrun.cpp:306
msgid "&Open with '%1'"
msgstr "&Otvori pomoću „%1“"

Here, the placeholder %1 is the name of an application which may be used to open a certain document. In English, this message looks fine, but in Serbian the application name should be in the genitive case. If the application in question is "Konqueror", then its genitive in Serbian takes the ending -a, i.e. it would be "Konquerora".

This could be scripted very easily. First, we would indicate that the message in question is scripted by appending a magic character sequence |/| (called the "fence") to the end of the normal translation:

#: kparts/browserrun.cpp:306
msgid "&Open with '%1'"
msgstr "&Otvori pomoću „%1“"
"|/|"

After that, a simple top-level script would follow, so that full PO entry is:

#: kparts/browserrun.cpp:306
msgid "&Open with '%1'"
msgstr "&Otvori pomoću „%1“"
"|/|"
"&Otvori pomoću „$[genitive %1]“"

Now, this isn't much of a "script", is it? :) It only happened that %1 from the normal translation got expanded to $[genitive %1], pretty obviously stating that we want the genitive form of whatever application %1 is. The main element here is the keyword genitive, a Scheme function which you must define in a scripting module (explained in the next section).

This kind of syntax is called, borrowing from Perl jargon, interpolated syntax. The $[...] is an interpolation, and it contains a Scheme expression minus the outer pair of parentheses (the square brackets stand in for them). Once the expression is evaluated, the result it produces is substituted ("interpolated") back into the translation, replacing the original $[...]. Of course, there can be more than one interpolation in a single message.

Placeholders used inside interpolations, like the %1 above, are always UTF8 encoded string variables from the point of view of the Scheme expression. If an argument was originally, say, a number, it will still be represented as a string value. Also, all interpolations must return UTF8 strings; any other data type (of the many that a Scheme expression can return) will cause the complete script to fail.

So, what happens when a script does fail, or it is simply found to be syntactically incorrect at runtime? If you were wondering why there is still an ordinary, non scripted translation before the fence, this is one of the reasons: if the script fails in any way, ordinary translation will be used instead, as a fallback alternative.
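The fallback rule can be modelled in a few lines. The following is only a sketch in Python, not the KDE implementation: the names translate, substitute and the evaluate hook are hypothetical, and the real evaluation is done by Guile.

```python
FENCE = "|/|"

def substitute(text, args):
    """Replace placeholders such as %1 with their argument strings."""
    # Longest names first, so that %10 is not mistaken for %1 followed by 0.
    for name in sorted(args, key=len, reverse=True):
        text = text.replace(name, args[name])
    return text

def translate(msgstr, args, evaluate):
    """Return the scripted translation, or the ordinary one if scripting fails."""
    ordinary, fence, script = msgstr.partition(FENCE)
    plain = substitute(ordinary, args)
    if not fence:
        return plain                      # message is not scripted at all
    try:
        result = evaluate(script, args)   # hypothetical hook into the interpreter
    except Exception:
        return plain                      # script raised an error or did not parse
    # Anything but a string counts as a failure too.
    return result if isinstance(result, str) else plain

def broken(script, args):
    """Stand-in for an interpreter that rejects the script."""
    raise SyntaxError("unbalanced brackets")

entry = "&Otvori pomoću „%1“|/|&Otvori pomoću „$[genitive %1]“"
print(translate(entry, {"%1": "Konqueror"}, broken))
```

With the broken evaluator the call falls back to the text before the fence, with %1 filled in the ordinary way.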

Just for completeness, and perhaps to make things clearer for those who understand Scheme, let us state what actually happens to a script given in interpolated syntax: it gets transformed into a lambda expression, one taking string parameters named the same as the placeholders, and doing string concatenation of the literal portions and the expressions inside interpolations. For the example above, the script in transformed form would be:

(lambda (%1)
    (string-append "&Otvori pomoću „" (genitive %1) "“"))

This lambda expression then gets applied to arguments that come in with the message at runtime. Now you can also see why script will fail if any interpolated expression returns anything but a string.
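To make the transformation concrete, here is a rough Python sketch that produces the Scheme lambda text from a script in interpolated syntax. It is an illustration under assumptions, not the actual parser: the function names are invented, bare placeholders in literal text are assumed to become plain variable references, and a closing bracket inside an interpolation is not handled.

```python
import re

PLACEHOLDER = re.compile(r"%(?:n|\d+)")
INTERPOLATION = re.compile(r"\$\[(.*?)\]", re.S)

def scheme_string(text):
    """Quote text as a Scheme string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def literal_parts(text):
    """Split literal text into string literals and bare placeholder variables."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(text):
        if m.start() > pos:
            parts.append(scheme_string(text[pos:m.start()]))
        parts.append(m.group(0))
        pos = m.end()
    if pos < len(text):
        parts.append(scheme_string(text[pos:]))
    return parts

def to_lambda(script):
    """Build Scheme lambda source text from a script in interpolated syntax."""
    names = set(PLACEHOLDER.findall(script))
    # %n (the plural argument) first, then %1, %2, ... in numeric order.
    params = sorted(names, key=lambda p: -1 if p == "%n" else int(p[1:]))
    parts = []
    pos = 0
    for m in INTERPOLATION.finditer(script):
        parts.extend(literal_parts(script[pos:m.start()]))
        parts.append("(" + m.group(1).strip() + ")")
        pos = m.end()
    parts.extend(literal_parts(script[pos:]))
    return "(lambda (%s) (string-append %s))" % (" ".join(params), " ".join(parts))

print(to_lambda("&Otvori pomoću „$[genitive %1]“"))
```

For the example message this prints the same lambda expression as shown above, on a single line.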

Scripting modules

Top-level scripts written in a PO file should be kept short, as it is much easier to implement bigger functions in a scripting module. This is a Scheme source file containing a proper Guile module definition, which you can edit with a Scheme-friendly editor, employing Scheme syntax highlighting, etc.

A scripting module has a special name and position in the KDE tree, analogous to the PO file which needs it. If the PO file is named foo.po and located in $KDEDIR/share/locale/lang/LC_MESSAGES, the scripting module file should be named foo.scm and installed in $KDEDIR/share/locale/lang/LC_SCRIPTS. Whenever KDE loads the PO file, it will also automatically load the scripting module associated with it, if present.

In the example presented above, we used the function genitive in the top-level script for a message in the PO file kdelibs.po. Therefore, this function should be implemented in the scripting module file kdelibs.scm, and this is how that implementation could look:

;; Start of kdelibs.scm

(define-module (sr LC_SCRIPTS kdelibs))

(define-public (genitive app)
    (cond
        ;; ------------ Nominative -- Genitive ---
        ((string=? app "Konqueror")  "Konquerora")
        ((string=? app "KWrite")     "KWritea")
        ((string=? app "Konsole")    "Konsole")
        ;; etc.
        (else app)))

;; End of kdelibs.scm

The first line defines a Guile module, using Guile's extension define-module. The module name is given within the inner parentheses; it is the relative path to the file kdelibs.scm, starting from $KDEDIR/share/locale, with the file extension omitted and with whitespace instead of slashes as the separator.
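The naming rule can be summarized as a small path computation. The Python sketch below only restates the convention described here; the function name scripting_module_for and the /opt/kde prefix are made up for the example.

```python
import posixpath

def scripting_module_for(po_path):
    """Return (scm_path, module_name) for a PO file in .../lang/LC_MESSAGES."""
    msg_dir, po_name = posixpath.split(po_path)
    lang_dir, category = posixpath.split(msg_dir)
    if category != "LC_MESSAGES" or not po_name.endswith(".po"):
        raise ValueError("not a PO file in an LC_MESSAGES directory: " + po_path)
    base = po_name[:-len(".po")]
    lang = posixpath.basename(lang_dir)
    # Same language directory, LC_SCRIPTS instead of LC_MESSAGES, .scm instead of .po.
    scm_path = posixpath.join(lang_dir, "LC_SCRIPTS", base + ".scm")
    # Module name: path relative to share/locale, spaces instead of slashes.
    module_name = "(%s LC_SCRIPTS %s)" % (lang, base)
    return scm_path, module_name

print(scripting_module_for("/opt/kde/share/locale/sr/LC_MESSAGES/kdelibs.po"))
```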

With help of Guile's extension define-public we define genitive to be outside-visible (exported) function which is taking argument app. In the body of the procedure, there is only one function (a macro to be precise), cond, a Scheme built-in.

Each argument to cond is expression containing a conditional and a return value expression. If conditional of current argument is true, return value is produced, otherwise next argument is evaluated. If no conditional matches, value of expression following else is returned.

So, genitive simply compares nominative form of application name with known applications (using Scheme built-in string=?) and returns genitive form if application name is matched, otherwise it falls back to the nominative form itself.

Keep in mind that when a function has to do a large number of comparisons of this kind (hundreds or thousands), a hash-based dictionary search should be used instead. Fortunately, facilities for that are already provided by Guile, and ready solutions, for you to just copy into your scripting modules, are available.
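As an illustration of the idea only (in Python rather than Guile, with a table holding just the three names from the example), a dictionary-based genitive does a single hash lookup instead of walking through a chain of comparisons, and keeps the same fallback to the nominative:

```python
# Nominative -> genitive; a real table would hold many more entries.
GENITIVES = {
    "Konqueror": "Konquerora",
    "KWrite": "KWritea",
    "Konsole": "Konsole",
}

def genitive(app):
    """Return the genitive form if known, otherwise the name unchanged."""
    return GENITIVES.get(app, app)

print(genitive("Konqueror"), genitive("SomeOtherApp"))
```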

Debugging scripts

[[TODO... Err, in fact, yet to be decided how :)]]

Whenever a script fails, a debugging message will be displayed on stdout (msgid and any arguments) as problem in area 173 ("kdecore (KLocale)") of KDE debug output system. This area might not be included by default in output, use command kdebugdialog to check it. All debugging messages connected to scripting will have the word Transcript in them, so you can grep the output.

Guile extensions

KDE's translation scripting system provides some additional Scheme functions, which you can use in your scripts. Currently, there are two groups of extensions:

  • Extensions which provide translator with interface to KDE specific i18n/locale matters.

  • String handling extensions, which give ability to correctly handle national scripts, missing by default from Guile.

Interface to KDE locale structure

Extensions in this group allow you to reach some data that exists internally within the scripting system, which might be useful in some situations.

  • msgstr

    Returns the ordinary, non-scripted translation. Useful when there are only some specific cases to be handled, and the ordinary translation is enough otherwise.

  • arg ordinal

    Returns an argument string passed to the current message, defined by an ordinal. Ordinal is simply index of the argument in ordered array of arguments; plural argument, if present, counts as first. For example, if message contains placeholders %n, %1 and %3, their ordinals will be 0, 1 and 2, respectively. If ordinal is out of range, false (#f) is returned.

  • msgid

    Returns original message, with placeholders intact (i.e. without expansion).
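The ordinal rule for arg can be pictured with a short Python sketch. It is a model of the description above, not the extension itself; both function names and the sample argument values are hypothetical.

```python
def argument_ordinals(placeholders):
    """Map placeholder names to ordinals: plural argument first, then by number."""
    ordered = sorted(set(placeholders),
                     key=lambda p: -1 if p == "%n" else int(p[1:]))
    return {name: index for index, name in enumerate(ordered)}

def arg(arguments, ordinal):
    """Return the argument string at the given ordinal, or False if out of range."""
    if 0 <= ordinal < len(arguments):
        return arguments[ordinal]
    return False

print(argument_ordinals(["%3", "%n", "%1"]))
print(arg(["5", "first", "third"], 0), arg(["5", "first", "third"], 7))
```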

Unicode string handling

Unicode strings are not implemented as a special data type, they are just normal Guile strings in UTF8 format. That means that you can use all built-in Guile functions to handle Unicode strings, however any function which deals with specific characters in string may not give expected result.

For example, string-downcase should return the string in lower case, but any non-English letters in it will not be lowercased; on the other hand, the function string=? will always work, as it compares strings as a whole.

In short, whenever you are operating on pure English strings, feel free to use built-in string handling functions, but consider some of those described below when handling strings in your language.

  • ustring->list str

    This function is intended for the similar purpose as built-in string->list, but instead of list of characters, it will return list of strings each containing one UTF8 character of given string. This is because single UTF8 character may be composed of more than one byte, and since built-in Guile characters are single byte only, it takes a string to hold it.

  • list->ustring uslist

    This function is intended for the similar purpose as built-in list->string, but instead of list of characters, it takes list of UTF8 strings uslist and returns single UTF8 string which is concatenation of those in list.

  • ustring-downcase str

    This function takes UTF8 string str and returns properly lowercased version of it (unlike string-downcase, which will not lowercase non English letters).

  • ustring-upcase str

    Everything goes like in case of ustring-downcase, just now uppercased string is returned.

  • ustring-downcase-start str

    This function will take the UTF8 string str and return a UTF8 string with the first non-whitespace character lowercased. It can be useful when something may start with a lower or upper case letter, depending on whether it is at the beginning or in the middle of a sentence.

  • ustring-upcase-start str

    Same as ustring-downcase-start, but it will uppercase first non whitespace character instead.
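Why a list of strings is needed can be seen by looking at the bytes. The Python sketch below is only an analogy: it splits UTF8 bytes at character boundaries the way ustring->list is described to, and mimics ustring-upcase-start on an already decoded string. It assumes well-formed UTF8 input and is not the code of the extensions.

```python
def ustring_to_list(raw):
    """Split UTF-8 encoded bytes into a list of one-character byte strings."""
    chars = []
    for byte in raw:
        # Bytes of the form 10xxxxxx continue the current character;
        # anything else starts a new one.
        if (byte & 0xC0) != 0x80 or not chars:
            chars.append(bytearray())
        chars[-1].append(byte)
    return [bytes(c) for c in chars]

def ustring_upcase_start(text):
    """Uppercase the first non-whitespace character of a decoded string."""
    start = len(text) - len(text.lstrip())
    return text[:start] + text[start:start + 1].upper() + text[start + 1:]

raw = "Grčka".encode("utf-8")
print(len(raw), [c.decode("utf-8") for c in ustring_to_list(raw)])
print(ustring_upcase_start("  čaj"))
```

The word has five characters but six bytes, since the letter č takes two bytes in UTF8.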

Scripting examples

Examples given here are taken from real applications, in order to demonstrate both the potentials of scripting and linguistic problems that frequently occur.

Cases of arguments

As soon as one moves away from caseless language, like English, trouble with correct inflection of non literal/non numerical arguments of messages creeps in. Take following two messages from kgeography.po (placeholders stand for names of countries), translated into Serbian language:

#: divisioncapitalasker.cpp:32
msgid "The capital of %1 is..."
msgstr "Prestonica %1 je..."

#: mapasker.cpp:177
msgid "Please click on %1"
msgstr "Kliknite na %1"

In Serbian, which has several cases of nouns, and case endings are appended to the noun itself, these translations sound quite "illiterate". Both have nominative case of country name, although in translation of "The capital of %1 is..." it should be genitive case, and in translation of "Please click on %1" accusative case.

(To be quite fair, KGeography also defines expanded messages of this kind (like "The capital of Greece is..."), but that makes its PO file having over 4000 messages. So the scripted solution is still going to come in handy.)

From the scripting viewpoint, this is the same problem as described in introductory example, so you could solve it in the same way. But, in this case, we can also go for a more elegant solution. Note that in same PO file, you have country names defined separately, for example (more or less):

#: mapsdatatranslation.cpp:115
msgid "Greece"
msgstr "Grčka"

So, we know that by the time one of those problematic messages above comes about, the appropriate country translation will have been invoked already. This fact allows us to set needed forms dynamically in PO file itself, rather than statically in a scripting module.

With this approach in mind, we provide needed cases of country names by scripting their translations like this:

#: mapsdatatranslation.cpp:115
msgid "Greece"
msgstr "Grčka"
"|/|"
"$[set-form 'genitive \"Grčke\" 'accusative \"Grčku\"]"

Intention here is pretty obvious, but some words about the details. The whole script is one big interpolation using function set-form, which we must define in scripting module (but it will be a short one!) This function takes case-form pairs in turn, where case is a symbol (denoted by single quote in front), and forms are strings. Since inside interpolation is Scheme code, strings must have quotes, thus the quoting of forms. Since here we just want to set the forms, and not to mess with translation, set-form must also return ordinary translation itself.

The problematic messages can now be scripted like this:

#: divisioncapitalasker.cpp:32
msgid "The capital of %1 is..."
msgstr "Prestonica %1 je..."
"|/|"
"Prestonica $[get-form 'genitive %1] je..."

#: mapasker.cpp:177
msgid "Please click on %1"
msgstr "Kliknite na %1"
"|/|"
"Kliknite na $[get-form 'accusative %1]"

Again, these scripts are intuitive. We must only define function get-form in scripting module, same as set-form from before. So here they are:

(define form-hash (make-hash-table 31))

(define-public (set-form case form . others)
    (hash-set! form-hash (cons case (msgstr)) form)
    (if (not (null? others))
        (apply set-form others)
        (msgstr)))

(define-public (get-form case base)
    (hash-ref form-hash (cons case base)))

Unfortunately, there is some condensed Scheme code involved here, but it's at least short. And you can keep this in your kdelibs.scm, so that you have dynamic form setting available in any KDE application (as kdelibs.po/.scm is always being read).

One special point to note in the code above is the calls to the function msgstr. This function reports the ordinary translation, which is used as part of the hash key (the other part being the case), and as the return value of set-form (so that the big form-setting interpolation evaluates to the ordinary translation, as we said it should).
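For readers less at home in Scheme, the same bookkeeping can be spelled out in Python. This is a sketch of the logic only: the ordinary translation is passed in explicitly here, whereas the Scheme version obtains it by calling (msgstr), and a missing form yields None where the Scheme hash-ref yields #f.

```python
forms = {}  # (case, ordinary translation) -> inflected form

def set_form(msgstr, *pairs):
    """Store case/form pairs for msgstr and return msgstr itself."""
    for case, form in zip(pairs[0::2], pairs[1::2]):
        forms[(case, msgstr)] = form
    return msgstr

def get_form(case, base):
    """Return the stored form, or None if it was never set."""
    return forms.get((case, base))

# The country message sets the forms and still evaluates to "Grčka" ...
print(set_form("Grčka", "genitive", "Grčke", "accusative", "Grčku"))
# ... and later messages pick the form they need.
print("Prestonica " + get_form("genitive", "Grčka") + " je...")
```

A form that was never set is not a string, so a script asking for it fails and the ordinary translation is used.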

One can even define set-form in such a way that no extra quoting is needed to set forms, and that get-form is not needed at all. That would reduce the scripting syntax to a bare minimum:

#: mapsdatatranslation.cpp:115
msgid "Greece"
msgstr "Grčka"
"|/|"
"$[set-form genitive Grčke accusative Grčku]"

#: divisioncapitalasker.cpp:32
msgid "The capital of %1 is..."
msgstr "Prestonica %1 je..."
"|/|"
"Prestonica $[genitive %1] je..."

Alas, to make this possible, set-form will have to be a bit longer and a lot uglier (but hey, you'll have it ready to use...)

Additional plural forms

Frequently it happens that plural form handling, although being able to produce grammatically correct translations, sounds bad in some language. Take a look at this message from kfindpart.po, translated into Serbian language:

#: kfinddlg.cpp:113 kfinddlg.cpp:205
#, c-format
msgid ""
"_n: one file found\n"
"%n files found"
msgstr ""
"%n fajl je pronađen\n"
"%n fajla su pronađena\n"
"%n fajlova je pronađeno"

In the English original, for the singular case, instead of the slightly awkward "1 file found", the programmer used the nicer alternative "one file found". The Serbian translator cannot do the same, because in Serbian the singular form is also used for all numbers ending in 1 (except those ending in 11). In his translation all forms must contain the plural placeholder %n, producing an awkward result for the singular case. Furthermore, it would be nice if there could also be a special form for 0, i.e. when no files are found. The script to achieve all this is here:

#: kfinddlg.cpp:113 kfinddlg.cpp:205
#, c-format
msgid ""
"_n: one file found\n"
"%n files found"
msgstr ""
"%n fajl je pronađen\n"
"%n fajla su pronađena\n"
"%n fajlova je pronađeno"
"|/|"
"$[for-n 0 \"Nijedan fajl nije pronađen\""
"        1 \"Jedan fajl je pronađen\"]"

Script is again just one big interpolation, containing call to function for-n, which is being given the special forms for 0 and 1. So, for-n should return one of those forms if applicable, or one of ordinary forms otherwise. The definition of for-n would go like this:

(define-public (for-n num form . others)
    (if (= num (string->number (arg 0)))
        form
        (if (not (null? others))
            (apply for-n others)
            (msgstr))))

The part (= num (string->number (arg 0))) compares our special case (num) and plural argument ((string->number (arg 0))). Note how we get the plural argument: first, we call (arg 0), which returns the first argument that came with the message, which is the plural argument in case of plural messages; then, since all arguments come as strings, we must convert it to number using Scheme function string->number. Now, if special case matches the plural argument, we return corresponding form. Otherwise, if there are some special forms left ((not (null? others))), we go on to examine them ((apply for-n others)), and if not, we return ordinary translation ((msgstr)).

Multiple plurals

What happens if original message contains more than one argument requesting plural form? This case is unsolvable by normal plural handling, the only thing one can do is make it the least bad. In kalarm.po you can find such a case (translation into Serbian):

#: messagewin.cpp:560
msgid ""
"_n: in 1 hour %1 minutes' time\n"
"in %n hours %1 minutes' time"
msgstr ""
"u roku od %n sata %1 minute\n"
"u roku od %n sata %1 minuta\n"
"u roku od %n sati %1 minuta"

Translator could handle only plural of hours, but not plural of minutes. One possible scripted solution would be this:

#: messagewin.cpp:560
msgid ""
"_n: in 1 hour %1 minutes' time\n"
"in %n hours %1 minutes' time"
msgstr ""
"u roku od %n sata i %1 minute\n"
"u roku od %n sata i %1 minuta\n"
"u roku od %n sati i %1 minuta"
"|/|"
"u roku od %n $[plural %n sata sata sati] "
"i %1 $[plural %1 minute minuta minuta]"

This one might be a bit harder to parse at first sight. It is actually a single sentence of the form "u roku od %n $[...] i %1 $[...]", where the two interpolations are calling function plural with appropriate argument and three forms, one of which should be returned based on argument value. The first interpolation gets %n and forms of "hours", and the second %1 and the forms of "minutes".

The function plural is to be defined in a scripting module, and it is obviously language dependent. We define it as a macro, so that we can use it as shown above, without needing to quote forms which are just single words. For the Serbian language, here is how it could look:

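;; A macro receives its arguments unevaluated, so the forms can be given
;; as bare words. They are turned into strings at expansion time, while
;; the number string is examined when the expanded code runs.
(define-macro (plural numstr form1 form2 form3)
    `(let ((numstr ,numstr))
        (cond
            ((string-suffix? "11" numstr) ,(->string form3))
            ((string-suffix? "12" numstr) ,(->string form3))
            ((string-suffix? "13" numstr) ,(->string form3))
            ((string-suffix? "14" numstr) ,(->string form3))
            ((string-suffix?  "1" numstr) ,(->string form1))
            ((string-suffix?  "2" numstr) ,(->string form2))
            ((string-suffix?  "3" numstr) ,(->string form2))
            ((string-suffix?  "4" numstr) ,(->string form2))
            (else ,(->string form3)))))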

Function cond takes a look at cases of passed number string (numstr) and chooses one of three forms according to Serbian language rules. cond will go through conditions in turn, and return the form associated to first condition that is true; if none is, the form corresponding to final else is returned.
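To check the rule against a few numbers, the selection logic can be restated in Python. This sketch mirrors only the order of the tests (the teens first, then the last digit); it is an ordinary function, so the forms have to be quoted, unlike in the Scheme macro.

```python
def plural(numstr, form1, form2, form3):
    """Choose a Serbian plural form from the ending of a number string."""
    if numstr.endswith(("11", "12", "13", "14")):
        return form3
    if numstr.endswith("1"):
        return form1
    if numstr.endswith(("2", "3", "4")):
        return form2
    return form3

for n in ("1", "3", "5", "11", "21", "112"):
    print(n, plural(n, "bajt", "bajta", "bajtova"))
```

The loop pairs 1 and 21 with "bajt", 3 with "bajta", and 5, 11 and 112 with "bajtova".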

Function string-suffix? is built in and checks if the first argument string is suffix of second argument string, thus checking number endings in this case. Function ->string is a custom one, used here to make forms into strings in case they weren't supplied as such (like in the previous script). You can define it like this:

(define-public (->string thingy)
    (cond
        ((string? thingy) thingy)
        ((symbol? thingy) (symbol->string thingy))
        ((number? thingy) (number->string thingy))
        (else (error "Cannot convert into string:" thingy))))

Unimplemented plural forms

Frequently it happens that programmer didn't implement plural handling, using instead only plural or ending nouns with "(s)". Although this is usually considered as a bug in application, and therefore should be promptly fixed, KDE applications are full of such cases.

Moreover, it sometimes happens that implementing plural handling is both unnecessary from the point of English language, and would be very tedious (or even impossible) to do. Consider following example from konversation.po, translated into Serbian language:

#: ircinput.cpp:310
msgid ""
"You are attempting to paste a large portion of text "
"(%1 bytes or %2 lines) "
"into the chat. This can cause connection resets or flood "
"kills. Do you really want to continue?"
msgstr ""
"Pokušavate da prenesete veliki deo teksta "
"(%1 bajtova ili %2 linija) "
"u ćaskanje. To može izazvati resetovanja veze ili poplavna "
"izbacivanja. Želite li zaista da nastavite?"

From the context, "a large portion of text", it is obvious that singulars will never occur, so having only plural forms is perfectly fine for English. However, if you recall discussion about plural forms in Serbian in example of additional plural forms, you can see that ordinary translation cannot cover these cases.

But what can the programmer do about this problem, especially considering that everything is fine in English, so that any further work becomes unnecessary overhead? He would have to perform some acrobatics, handling both plurals outside of the message and then splicing them in, at the risk that it will still not work for some languages.

Instead, it is much easier for the translator to script it, using the function plural, as described in the example of multiple plurals. Here is what that script would look like:

#: ircinput.cpp:310
msgid ""
"You are attempting to paste a large portion of text "
"(%1 bytes or %2 lines) "
"into the chat. This can cause connection resets or flood "
"kills. Do you really want to continue?"
msgstr ""
"Pokušavate da prenesete veliki deo teksta "
"(%1 bajtova ili %2 linija) "
"u ćaskanje. To može izazvati resetovanja veze ili poplavna "
"izbacivanja. Želite li zaista da nastavite?"
"|/|"
"Pokušavate da prenesete veliki deo teksta "
"(%1 $[plural %1 bajt bajta bajtova] ili "
"%2 $[plural %2 linija linije linija]) "
"u ćaskanje. To može izazvati resetovanja veze ili poplavna "
"izbacivanja. Želite li zaista da nastavite?"

Script is the same sentence as ordinary translation, but with fixed translations of "bytes" and "lines" replaced by interpolations of function plural on appropriate arguments and forms.

Implementation

Should you like the described capabilities of translation scripting, the main question is: when can it be introduced into KDE? Unfortunately, the implementation details described below would break binary compatibility of the KDE libraries, so it cannot be introduced before KDE 4.

Changes to i18n interface

The basic translation facility, in its non-scripted variant, is the function i18n(), which takes a message string with argument placeholders, and returns the translated QString, again with placeholders. Afterwards, the arg() methods of class QString replace the placeholders with real arguments. This is how i18n() is declared, in kdelibs/kdecore/klocale.h:

QString i18n(const char *text);

And here is a typical call of i18n() in application code:

general->latencyLabel->setText(
    i18n("%1 milliseconds (%2 fragments with %3 bytes)")
         .arg(latencyInMs).arg(fragmentCount).arg(fragmentSize));

This kind of interface cannot be used if scripting is to be employed, because the arguments come into play only after the translation has already been made. Instead, translation must be delayed until all the arguments have been supplied, so we need some sort of argument-capturing scheme.

To this end, i18n() functions are switched to be only wrappers for new KI18n class:

KI18n i18n(const char *text)
{
  return KI18n(text);
}

Class KI18n is inherited from QString, and provides capturing of arguments through overridden arg() methods of QString:

class KI18n : public QString
{
  private:
    KI18n(const char* text);
    //...
  public:
    KI18n arg(const QString& a, int fieldWidth = 0) const;
    //...
};

Constructors of KI18n do the job of old i18n() functions, and more. They are made private because we want only i18n() functions to be used for wrapping strings (so that message extraction tool can do its job properly), and to avoid implicit conversions to KI18n. This means that i18n() functions are declared as friends of KI18n.

Inheritance from QString means that existing code feels almost no effect of the change. Few and far between problems encountered were all connected to appearance of chained implicit conversions; where one had implicit Something->QString, which is legal, now it would need implicit Something->QString->KI18n, which is illegal. This was occurring mainly in ternary condition statements.

Now it is easy to capture arguments and control when the translation is performed. First, constructor will count number of placeholders, and set number of already supplied arguments to 0. Then, overriding arg() methods will store each added argument within class, increase number of supplied arguments, and check if it is equal to counted number of placeholders. If it is, translation will be performed, with arguments supplied to scripting system.

There are also some failsafes employed. It may happen that script fails, or that class never gets enough arguments, i.e. gets used as QString before it can evaluate script. This other case can happen if programmer did something like this:

QString msg = i18n("%1 milliseconds (%2 fragments with %3 bytes)");
general->latencyLabel->setText(msg
    .arg(latencyInMs).arg(fragmentCount).arg(fragmentSize));

In order to have at least ordinary translation in these cases, already in constructor of KI18n underlying QString will be initialized with ordinary translation, and each arg() method will also update underlying QString. Only when all the arguments are supplied and script evaluation succeeds, underlying QString will be replaced with what is returned by the script.
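The capture-then-evaluate behaviour, including both failsafes, can be modelled in a few lines of Python. This is a toy model, not the KI18n class: the class name DeferredMessage, its attributes and the callable standing in for the script are all hypothetical, and the text attribute plays the role of the underlying QString.

```python
import re

class DeferredMessage:
    """Toy stand-in for KI18n: captures arguments, then tries the script."""

    def __init__(self, ordinary, script=None):
        self.template = ordinary          # ordinary translation with placeholders
        self.script = script              # hypothetical callable, None if unscripted
        names = set(re.findall(r"%\d+", ordinary))
        self.names = sorted(names, key=lambda p: int(p[1:]))
        self.args = []
        self.text = ordinary              # usable as a plain string at any time
        self._try_script()                # a message without placeholders is complete

    def _try_script(self):
        if self.script is None or len(self.args) != len(self.names):
            return
        try:
            result = self.script(*self.args)
        except Exception:
            return                        # script failed, ordinary translation stays
        if isinstance(result, str):
            self.text = result            # script succeeded, use its result

    def arg(self, value):
        self.args.append(str(value))
        # Failsafe: keep the ordinary translation up to date after every argument.
        filled = dict(zip(self.names, self.args))
        self.text = re.sub(r"%\d+",
                           lambda m: filled.get(m.group(0), m.group(0)),
                           self.template)
        self._try_script()
        return self

msg = DeferredMessage("Otvori pomoću „%1“", lambda app: "Otvori pomoću „" + app + "a“")
print(msg.arg("Konqueror").text)
```

If the object is used before all arguments have arrived, or if the script fails, text still holds the ordinary translation with whatever arguments were supplied so far.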

As for the other two variants of i18n(), which handle context info and plural forms, they are also replaced with two wrappers to analogous constructors of KI18n. For these cases, conceptually everything is the same, just there are a few additional internal details to be considered.

There is also the possibility to disable script evaluations, by setting boolean flag TranScript=false in user configuration file $KDEHOME/share/config/kdeglobals, section [Locale].

Some internal details

[[Probably no point in expanding this]]

To implement scripting system, files kdelibs/kdecore/klocale.h, kdelibs/kdecore/klocale.cpp and kdelibs/kdecore/Makefile.am were modified, and kdelibs/kdecore/ktranscript.h and kdelibs/kdecore/ktranscript.cpp were added.

Within klocale.* files, following changes to existing code were made:

  • i18n() functions were modified to return newly added KI18n class instead of QString, as described previously.

  • translate() methods of class KLocale were upgraded to optionally report which language they actually took the translation from. This is needed to resolve which scripting module should be used when two or more languages are defined, and more than one of them has a scripting module.

  • Class KLocale got two new methods, useTranScript() and setUseTranScript(), to report and set whether script evaluation is enabled. They refer to the new boolean variable m_useTranScript within the internal class KLocalePrivate.

Performance considerations

To determine the limit of performance hit that scripting introduces, we shall observe the performance of a loop executing almost nothing but an i18n call of certain type, and compare that with the same loop in non-scripted KDE. Three kinds of i18n calls are considered.

Simple messages (SIMPLE)

Simple messages are those taking no arguments (i.e. containing no placeholders) and requiring no scripting. Since normally most messages (say 99%) wouldn't be scripted, this case will show the pure overhead caused by the scripting infrastructure.

The i18n call inside the loop references random messages from kdelibs.po. The code looks like this (vector msgids contains UTF8 encoded msgids):

srand(0);
for (uint i = 0; i < BIGNUM; i++)
    i18n(msgids.at(rand() % msgids.size()));

Form handling messages (FORMS)

In this example the message takes three arguments, all of which must have proper grammatical form. This is an exaggeration of one expected use of scripting: obtaining proper forms of words or phrases in different contexts.

The i18n code is this:

QValueList<QString> ra; // Radio alphabet, 26 words.
ra.append("Alpha"); ra.append("Bravo"); ra.append("Charlie");
...
ra.append("Xray"); ra.append("Yankee"); ra.append("Zulu");

uint len = ra.size();
srand(0);
for (uint i = 0; i < BIGNUM; i++)
    i18n("Just checking: %1, %2, %3...\n")
        .arg(ra[rand() % len])
        .arg(ra[rand() % len])
        .arg(ra[rand() % len]);

And the scripted translation in PO file is:

msgid "Just checking: %1, %2, %3..."
msgstr "Samo proveravam: %1, %2, %3..."
"|/|"
"Samo proveravam: $[radio %1], $[radio %2], $[radio %3]..."

Workhorse of this script is function radio, defined in scripting module. For efficiency of search, it uses Guile's hash tables with symbols as keys:

(define radio-codes (make-hash-table 31))
(hash-set! radio-codes 'alpha "avala")
(hash-set! radio-codes 'bravo "beograd")
(hash-set! radio-codes 'charlie "cetinje")
...
(hash-set! radio-codes 'xray "iks")
(hash-set! radio-codes 'yankee "ipsilon")
(hash-set! radio-codes 'zulu "zagreb")

This means that arguments passed to the script must be normalized before they are turned into symbols for hash keys. Normalizing is done by removing any spaces, dashes and parentheses from the argument string and then lowercasing it, as in function norms:

(define (norms str)
    (ustring-downcase
        (string-delete str
            (lambda (c) (or
                (char=? c #\space)
                (char=? c #\-)
                (char=? c #\()
                (char=? c #\)))))))

Finally, function radio has this form:

(define-public (radio code)
    (let ((trcode (hash-ref radio-codes (string->symbol (norms code)))))
        (if trcode trcode code)))

The let binding normalizes the string code, transforms it to a symbol and tries to find its value in the hash table. The if expression then returns the translation if one was found, otherwise the original argument.
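For readers more at home in C++ than in Scheme, the same normalize-then-look-up logic can be sketched as below. This is only an illustrative counterpart of norms and radio, not part of the proposed system. It assumes ASCII-only lowercasing (whereas the script above uses ustring-downcase) and contains only the six table entries shown above.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Counterpart of norms: drop spaces, dashes and parentheses, then lowercase.
// Assumption: ASCII-only lowercasing, unlike ustring-downcase in the script.
std::string norms(const std::string& str) {
    std::string out;
    for (unsigned char c : str) {
        if (c == ' ' || c == '-' || c == '(' || c == ')')
            continue;
        out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Counterpart of radio: the translated code word, or the argument unchanged.
std::string radio(const std::string& code) {
    // Only the entries shown in the listing above; the elided ones stay elided.
    static const std::unordered_map<std::string, std::string> radioCodes = {
        {"alpha", "avala"}, {"bravo", "beograd"}, {"charlie", "cetinje"},
        {"xray", "iks"}, {"yankee", "ipsilon"}, {"zulu", "zagreb"},
    };
    const auto it = radioCodes.find(norms(code));
    return it != radioCodes.end() ? it->second : code;
}
```

Because of the normalization, spellings such as "X-ray" and "Xray" reach the same table entry, while a word missing from the table is returned unchanged.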

Plural messages (PLURALS)

When a message has plural forms, the translator may want some forms in addition to those provided by KDE rules for his language (e.g. for 0 or 1). This is another expected use of scripting. It involves more processing of arguments than the FORMS case, and it stresses the i18n call for scripted plural handling, whose implementation has the biggest overhead (compared to ordinary and context-info i18n calls).

The loop with i18n call is simple:

srand(0);
for (uint i = 0; i < BIGNUM; i++)
    i18n("Commissioned one ship in total.",
         "Commissioned %n ships in total.", rand() % 100);

The programmer has used nicer wording for case of unity, putting "one" instead of awkward "1". Translator may not be able to do so, if in his language the form for 1 is needed for other numbers as well. Therefore he writes the special plural handling function, which can deliver forms for 0 and 1 as special cases, and normal forms otherwise. The scripted translation in PO file looks like this:

msgid "_n: Commissioned one ship in total.\n"
"Commissioned %n ships in total."
msgstr ""
"Porinut je ukupno %n brod.\n"
"Porinuta su ukupno %n broda.\n"
"Porinuto je ukupno %n brodova."
"|/|"
"$[plural-zero-one %n "
"    \"Nije porinut nijedan brod.\" "
"    \"Porinut je samo jedan brod.\"]"

Here, function plural-zero-one will take the plural argument and two special forms, for 0 and 1. This function is defined in the scripting module like this:

(define-public (plural-zero-one numstr spec0 spec1)
    (let ((num (string->number numstr)))
        (cond
            ((= num 0) spec0)
            ((= num 1) spec1)
            (else (msgstr)))))

plural-zero-one first converts the argument to Guile number (remember, all arguments to scripts are passed as strings), checks if it is 0 or 1, and returns appropriate special plural form if so. Otherwise, it returns result of msgstr function, which is the non-scripted translation, and is exactly one of the ordinary plural forms that we need in this case.
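The following C++ sketch shows why the translator cannot simply write "one" into the first plural form, and how the special cases sit in front of the ordinary forms. It is an illustration only. The function pluralIndex is an assumption: it encodes the usual three-form rule for Serbian and is not quoted from KDE. The function ordinaryForm is a hypothetical stand-in for what msgstr returns.

```cpp
#include <cassert>
#include <string>

// Index of the ordinary plural form for n (0, 1 or 2).
// Assumption: the usual three-form rule for Serbian, not quoted from KDE.
int pluralIndex(long n) {
    assert(n >= 0);
    if (n % 10 == 1 && n % 100 != 11)
        return 0;  // 1, 21, 31, ...
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 12 || n % 100 > 14))
        return 1;  // 2-4, 22-24, ...
    return 2;      // 0, 5-20, 25-30, ...
}

// Hypothetical stand-in for msgstr: the ordinary, non-scripted translation.
std::string ordinaryForm(long n) {
    static const char* const forms[] = {
        "Porinut je ukupno %n brod.",
        "Porinuta su ukupno %n broda.",
        "Porinuto je ukupno %n brodova.",
    };
    std::string text = forms[pluralIndex(n)];
    text.replace(text.find("%n"), 2, std::to_string(n));
    return text;
}

// Counterpart of plural-zero-one; numstr arrives as a string, as in scripts.
std::string pluralZeroOne(const std::string& numstr,
                          const std::string& spec0,
                          const std::string& spec1) {
    const long num = std::stol(numstr);  // throws if numstr is not a number
    if (num == 0) return spec0;
    if (num == 1) return spec1;
    return ordinaryForm(num);
}
```

Under this rule 21 selects the same ordinary form as 1, so that form has to keep the %n placeholder, and only the scripted special case can say "samo jedan".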

Benchmark results

The performance is measured in thousands of translated messages per second, kmsg/s. This unit gives convenient numbers to compare, but it can also be given an absolute meaning: if a menu with 50 entries mustn't spend more than 50 ms (0.05 seconds) getting its translations, then the i18n layer must deliver roughly 1 kmsg/s.

Aside from pitting the scripted against non-scripted KDE, also of interest is scripted KDE having scripting turned off by the user (so he gets only ordinary translations).

The results on an AthlonXP 2000 (1.67 GHz) machine, using GCC 4.0.1, are shown in the following tables:

Table 1. Performance in kmsg/s

                               SIMPLE    FORMS    PLURALS
Non-scripted KDE               327.87    188.68     60.61
Scripted KDE                   251.57      6.77      9.82
Scripted KDE, scripting off    258.73    124.22     49.75

On a not too fast machine by today's standards, in the worst case (FORMS) the scripting system was able to deliver 6.77 kmsg/s, much more than what is tentatively needed (1 kmsg/s). Furthermore, results show that when scripting is turned off, user will certainly not be able to feel any difference in performance.
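The arithmetic behind this comparison is simple enough to sketch in C++. The helper names below are made up for illustration. They rely only on the fact that messages per millisecond and kmsg/s are the same quantity.

```cpp
#include <cassert>

// Throughput in kmsg/s; messages per millisecond is the same quantity.
double kmsgPerSecond(double messages, double milliseconds) {
    assert(milliseconds > 0.0);
    return messages / milliseconds;
}

// Time in milliseconds needed to translate the given number of messages
// at the given rate in kmsg/s.
double millisecondsFor(double messages, double kmsgPerSec) {
    assert(kmsgPerSec > 0.0);
    return messages / kmsgPerSec;
}
```

kmsgPerSecond(50, 50) gives the required rate of 1 kmsg/s. At the worst-case figure of 6.77 kmsg/s quoted above, millisecondsFor(50, 6.77) comes to about 7.4 ms for the 50-entry menu, against the 50 ms budget.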

One other thing of possible interest is the distribution of total execution time between the run of the Guile interpreter and all the i18n code executed around it. This distribution is given in the following table.

Table 2. Distribution of execution time

               FORMS    PLURALS
Interpreter     71%       48%
C++ code        29%       52%

Security limitations

Due to the fact that KDE allows translation data to come from user side, the scripting system might introduce security holes. Therefore, scripting call is not executed if current effective user ID is root and real user ID is not root.

Thus, some applications, like Kdm, will not be able to use scripting system. This is not a big limitation, as such applications are rare and typically heavily administration oriented.
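The check described above can be sketched with the standard POSIX calls getuid() and geteuid(). The function names are hypothetical, and this is not the actual code in klocale.cpp, only a minimal illustration of the stated condition.

```cpp
#include <sys/types.h>
#include <unistd.h>

// True if a scripting call may be executed for the given real and
// effective user IDs: refused only when the effective ID is root
// while the real ID is not.
bool scriptingAllowed(uid_t realUid, uid_t effectiveUid) {
    const bool elevated = (effectiveUid == 0 && realUid != 0);
    return !elevated;
}

// The same check applied to the current process.
bool scriptingAllowedForProcess() {
    return scriptingAllowed(getuid(), geteuid());
}
```

A process started by root itself (both IDs zero) and an ordinary user process both pass; only the setuid-style combination is refused.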

Additional Notes on I18n to Programmers

With the introduction of the translation scripting system, it turned out that the existing instructions on programming for i18n were no longer fully adequate. All those instructions still hold, but some points need to be made stricter.

Using full sentences revisited

The fact that full sentences should be used for user visible messages, rather than convenient but non language-portable cuts and splits, is by now well adhered to in KDE code. With scripting, arguments to messages are in translator's scope as well, which makes some previously perfectly i18n friendly code no longer such. Consider this code snippet from appletop_mnu.cpp:

QString text = isButton ? (isMenu ? i18n("&Move %1 Menu") :
                                    i18n("&Move %1 Button")) :
                          i18n("&Move %1");
insertItem(SmallIcon("move"), text.arg(title), Move);

Programmer here assumes that he can first get text as a translation of one of the basic messages in the first statement, and then add argument title to it in the second statement. But in translation scripting enabled KDE, since arg() methods of ordinary QString are "dumb", this would deprive the translator of title to operate on (perhaps he would like to put the title in the correct grammatical case?).

To make this scripting friendly, programmer should put arguments together with i18n calls, because arg() methods of KI18n are "smart":

QString text = isButton ? (isMenu ? i18n("&Move %1 Menu").arg(title) :
                                    i18n("&Move %1 Button").arg(title)) :
                          i18n("&Move %1").arg(title);
insertItem(SmallIcon("move"), text, Move);

In summary, it is really the same rule, but with a clarification: Do not split sentences, and since arguments are a part of sentence, do not split them either.

If splitting out arguments is really convenient for some reason, it can be done provided that the type of the message string is changed from QString to KI18n:

KI18n text = isButton ? (isMenu ? i18n("&Move %1 Menu") :
                                  i18n("&Move %1 Button")) :
                        i18n("&Move %1");
insertItem(SmallIcon("move"), text.arg(title), Move);

This way the scripting system will also work as expected.
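The difference between "dumb" and "smart" arg() can be illustrated with a small self-contained C++ sketch. It assumes, as the text suggests, that a KI18n-like object collects its arguments and produces the final string only at the end. DeferredMessage, ArgScript, substitute and shout are hypothetical names and do not mirror the real KI18n interface.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a translator's script acting on one argument.
using ArgScript = std::function<std::string(const std::string&)>;

// "Dumb" substitution: replace the first occurrence of %n in text with value.
std::string substitute(std::string text, int n, const std::string& value) {
    assert(n >= 1);
    const std::string mark = "%" + std::to_string(n);
    const std::string::size_type pos = text.find(mark);
    if (pos != std::string::npos)
        text.replace(pos, mark.size(), value);
    return text;
}

// Hypothetical message type that keeps its arguments until the end.
class DeferredMessage {
public:
    explicit DeferredMessage(std::string tmpl) : tmpl_(std::move(tmpl)) {}

    // "Smart" arg(): remember the argument instead of substituting it at once.
    DeferredMessage arg(const std::string& a) const {
        DeferredMessage copy(*this);
        copy.args_.push_back(a);
        return copy;
    }

    // Final conversion: the script sees every argument before substitution.
    std::string resolve(const ArgScript& script) const {
        std::string out = tmpl_;
        for (std::size_t i = 0; i < args_.size(); ++i)
            out = substitute(out, static_cast<int>(i) + 1, script(args_[i]));
        return out;
    }

private:
    std::string tmpl_;
    std::vector<std::string> args_;
};

// Toy "script": uppercase the argument (ASCII only).
std::string shout(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        out += static_cast<char>(std::toupper(c));
    return out;
}
```

With DeferredMessage("&Move %1 Menu").arg("clock").resolve(shout) the toy script gets to process "clock" before it is placed into the message. Calling substitute directly on an already finished string places the argument as-is, with no script ever seeing it.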

Wrapping all user visible strings revisited

A rule also closely adhered to is that every user visible string should be wrapped in i18n call. But what exactly constitutes "user visible string"? It went without saying that user visible string is any literal constant possibly visible to user, i.e. one which is extracted by tools for creating translation templates.

This is not exactly true, but the difference was not important before translation scripting arrived. Here is an example from konq_guiclients.cc:

QDomElement subMenu = m_doc.createElement( "menu" );
...
text.appendChild( m_doc.createTextNode( i18n( "Preview In" ) ) );
...
for (; it != end; ++it, ++idx )
{
  addEmbeddingService( subMenu, idx, (*it)->name(), *it );
  inserted = true;
}

Programmer here adds to a menu new submenu, "Preview In", and in following loop, adds some entries to that submenu. Titles of those entries are retrieved by (*it)->name() (application names actually), and are not enclosed in i18n call, although they are user visible.

So, the rule to wrap all user visible strings has been neglected here, but it didn't matter, because the translator couldn't really have done anything with it. Not so with scripting enabled: now the translator can put application names in the submenu into the correct grammatical case for the logical sentence "Preview In Appname". Programmer should therefore modify the loop in the following way:

for (; it != end; ++it, ++idx )
{
  addEmbeddingService( subMenu, idx,
    i18n("Preview in -> Appname", "%1").arg((*it)->name()), *it );
  inserted = true;
}

Using context info variant of i18n call, programmer explains the situation and supplies application name as argument for translator to handle.

Rule of wrapping user visible strings in i18n calls did not change either. It just became more consistent: Wrap all user visible strings, be they literal constants or not.

Acknowledgments

Kudos to Federico Cozzi for pointing out some solutions used here early on, which I was quick to dismiss at that time :)

Nicolas Goutte and Krzysztof Lichota discussed many ideas for how to revise i18n interface for applications.

Krzysztof Lichota also swayed me to use the more intuitive interpolated syntax, and proposed the approach for dynamic setting of forms and attributes.