XLIFF: An Aid To Localization
Introduction
Translators today can expect to receive documents for translation in any one of several formats:
- HTML
- DocBook
- Microsoft Word (many possible versions)
- XML (many possible DTDs!)
- FrameMaker
- Software resource bundles (many different formats, such as .properties, .po, .msg, and .java)
- etc.
From a translator’s point of view, this is a difficult mix to deal with. You would need to maintain several editing tools and be proficient in many file formats, knowing the syntax and grammar of each, all before you have even started to translate the content.
Localization engineers face a similar problem: it’s difficult to write tools for each file format. For example, if your boss asks you to calculate the number of new words for translation between the last delivery and the current one, you need either a single tool capable of dealing with all formats or a separate tool for each format.
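To make the new-word calculation concrete, here is a minimal Python sketch. It assumes the hard part is already done: each delivery has been converted from its source format into a plain list of translatable segments. The function names and the word-counting rule (runs of word characters) are illustrative assumptions, not part of any particular tool.

```python
import re


def count_words(text):
    """Count words as runs of word characters (a deliberately simple rule)."""
    return len(re.findall(r"\w+", text))


def new_word_count(previous_segments, current_segments):
    """Sum the words in segments that did not appear in the previous delivery."""
    seen = set(previous_segments)
    return sum(count_words(s) for s in current_segments if s not in seen)


# Hypothetical segment lists extracted from two deliveries of the same product.
previous = ["Open the file.", "Save your changes."]
current = [
    "Open the file.",
    "Save your changes.",
    "Close the window when you are done.",
]

print(new_word_count(previous, current))
```

The counting logic is trivial once the segments are in a common form. The cost lies in writing and maintaining an extractor for every source format, which is the problem a single interchange format is meant to remove.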
Normally during localization, files are processed by tools such as translation memories and machine translation systems. Translation memory systems, known as TM systems, work by looking up segments in a database containing a large number of previously translated segments and their translations. (Segments are pieces of source files, usually sentences, that can be translated reasonably independently.) The database might contain segments that match the input segment exactly or segments that are similar to the segment presented for translation. These translations are then provided to the translator as suggested translations for each segment.
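The lookup described above can be sketched in a few lines of Python. This is only an illustration: the in-memory dictionary, the `tm_lookup` name, and the 0.75 threshold are assumptions made for the example, and the standard library's `difflib.SequenceMatcher` stands in for whatever similarity measure a real TM system uses.

```python
from difflib import SequenceMatcher


def tm_lookup(segment, memory, threshold=0.75):
    """Return (score, stored_source, translation) suggestions, best first.

    An exact match is returned on its own with a score of 1.0. Otherwise,
    every stored segment whose similarity reaches the threshold is offered
    as a fuzzy match.
    """
    if segment in memory:
        return [(1.0, segment, memory[segment])]
    suggestions = []
    for stored, translation in memory.items():
        score = SequenceMatcher(None, segment, stored).ratio()
        if score >= threshold:
            suggestions.append((score, stored, translation))
    return sorted(suggestions, key=lambda s: s[0], reverse=True)


# A tiny hypothetical English-to-French memory.
memory = {
    "Save the file.": "Enregistrez le fichier.",
    "Print the page.": "Imprimez la page.",
}

exact = tm_lookup("Save the file.", memory)   # exact match
fuzzy = tm_lookup("Save the files.", memory)  # similar, but not identical
print(exact)
print(fuzzy)
```

A fuzzy match is only a suggestion: the translator still has to adapt the stored translation to the new segment.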
Machine translation systems, known as MT systems, are another type of translation technology. Instead of using a large database of existing translations, a machine translation system uses a set of language-specific linguistic rules that describe how to translate sentences into the target language.
The translations these systems produce might undergo some post-editing, and any remaining untranslated text is given to human translators to complete. Translations are then reviewed, and sometimes commented on, corrected, or retranslated. Source formats tend not to have support for these localization processes.
XLIFF, which stands for XML Localization Interchange File Format, is a format for exchanging localization data. XLIFF could be used to exchange data between companies, such as a software publisher and a localization vendor, or between localization tools, such as TM systems and MT systems.
What is XLIFF?
XLIFF is an XML-based format that enables translators to concentrate on the text to be translated. Because it is a standard, it also makes localization engineering easier: once you have converters written for your source file formats, you can write new tools that deal only with XLIFF and not worry about the original file format. It also supports a full localization process by providing tags and attributes for review comments, the translation status of individual strings, and metrics such as word counts of the source sentences.
The XLIFF format grew out of a collaboration between a number of companies, including Sun Microsystems, but was soon brought under the management of an OASIS Technical Committee. In April 2002, the first Committee Specification for XLIFF was published. This is available at http://www.oasis-open.org/committees/xliff/documents/xliff-specification.htm.
The XLIFF format aims to:
- Separate localizable text from formatting.
- Enable multiple tools to work on source strings and add to the data about the string.
- Store information that is helpful in supporting a localization process.
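As an illustration of these aims, the Python sketch below parses a small XLIFF-style document and lists the strings that still need translation. The sample is hand-written and simplified: it uses element names from XLIFF 1.x (`trans-unit`, `source`, `target`, `note`) but has not been validated against the specification, and later XLIFF versions add an XML namespace that the code would have to account for. The `read_units` helper is a hypothetical name used only for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; not validated against the XLIFF specification.
SAMPLE_XLIFF = """\
<xliff version="1.0">
  <file original="hello.html" source-language="en" target-language="fr" datatype="html">
    <header/>
    <body>
      <trans-unit id="1">
        <source>Hello, world!</source>
        <target state="translated">Bonjour le monde !</target>
        <note>Greeting shown on the home page.</note>
      </trans-unit>
      <trans-unit id="2">
        <source>Click here to continue.</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def read_units(xliff_text):
    """Return one dict per trans-unit with its source, target, state, and notes."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iter("trans-unit"):
        target = unit.find("target")
        units.append({
            "id": unit.get("id"),
            "source": unit.findtext("source"),
            "target": None if target is None else target.text,
            "state": None if target is None else target.get("state"),
            "notes": [note.text for note in unit.findall("note")],
        })
    return units


units = read_units(SAMPLE_XLIFF)
untranslated = [u["id"] for u in units if u["target"] is None]
print(untranslated)
```

Nothing in the code depends on the original file being HTML. The translatable text, its translation status, and the reviewer's note are all available without touching the source format.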
Summary
In summary, XLIFF aids localization in a number of ways.
- XLIFF removes the complexities of localizing different types of source files.
- XLIFF provides a common platform for localization tools vendors to write to, thus increasing the number of tools available.
- XLIFF highlights the parts of a file that are important to the localization process.
- XLIFF provides support to the localization process, through its commenting features, support for phases, and metrics.
Authors: John Corrigan and Tim Foster are software engineers working on translation technologies at Sun Microsystems.
Also by Tim Foster: Translation Technology at Sun.