Drupal and TMGMT in Localization
A few days ago, I found myself working with one of my teammates on a new web localization project for one of our non-profit clients.
Drupal, an open-source CMS, is the usual web platform when working with this type of client.
In the past, we had recommended that the client’s dev team restructure their website to support TMGMT, a web localization toolkit for Drupal websites.
For them, it has been a game changer since they used to maintain hundreds of bilingual Word files (one per page). Now, they only have to export and import translations using TMGMT’s interface.
Briefly, TMGMT maps the entire site and assigns an ID to every translatable string. This way, the webmaster can export a given set of strings (called a “job”) as an XLF (XLIFF) file that is compatible with CAT tools.
Dissecting TMGMT’s XLF Files
The interesting aspect of these XLF files is their internal structure – a mix of XML and HTML entities.
Paragraphs
<trans-unit id="99][body][0][value" resname="99][body][0][value">
<source xml:lang="en"><p>"Source text goes here"</p></source>
<target xml:lang="es"/>
<note>Summary > Text</note>
</trans-unit>
Hyperlinks
<trans-unit id="100][field_sections][1][entity][field_content][1][entity][field_textarea][0][value" resname="100][field_sections][1][entity][field_content][1][entity][field_textarea][0][value">
<source xml:lang="en"><p>Visit <a href="https://www.localizationtimes.com">🔵 Anchored text 🔵</a> for the most up-to-date information.</p></source>
<target xml:lang="es"/>
</trans-unit>
Tables and images
<trans-unit id="100][field_image][0][entity][field_media_image][0][alt" resname="100][field_image][0][entity][field_media_image][0][alt">
<source xml:lang="en">This is the alternative text of an image</source>
<target xml:lang="es"/>
<note>Card/Search Image > Image > Alternative text</note>
</trans-unit>
<trans-unit id="99][field_sections][0][entity][field_right_column][0][value" resname="99][field_sections][0][entity][field_right_column][0][value">
<source xml:lang="en"><table><tbody><tr><td>&nbsp;</td><td><strong>Colorado Residents</strong></td><td><strong>Non-Residents</strong></td></tr><tr><td>Adults</td><td>$18</td><td>$22</td></tr><tr><td>Seniors (65 &amp; older)</td><td>$15</td><td>$19</td></tr><tr><td>Students (with ID)</td><td>$15</td><td>$19</td></tr><tr><td>Teachers (with ID)</td><td>$15</td><td>$19</td></tr><tr><td>Active Military &amp; Veterans (with ID)</td><td>$15</td><td>$19</td></tr><tr><td>Youth (18 &amp; younger)</td><td>Free</td><td>Free</td></tr><tr><td>Members</td><td>Free</td><td>Free</td></tr></tbody></table></source>
<target xml:lang="es"/>
</trans-unit>
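To inspect this structure outside a CAT tool, here is a minimal Python sketch that reads trans-units with the standard library. It makes a few assumptions: the file uses the XLIFF 1.2 namespace, the HTML inside the source element is stored escaped, and the ids are key segments joined by "][" (as the samples above suggest). The wrapper elements in SAMPLE and the read_units helper are illustrative, not part of TMGMT.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace; assumed to match the root element of the export.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Illustrative file: one trans-unit taken from the examples above,
# wrapped in a minimal XLIFF skeleton with the HTML escaped.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="job" source-language="en" target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="99][body][0][value" resname="99][body][0][value">
        <source xml:lang="en">&lt;p&gt;Source text goes here&lt;/p&gt;</source>
        <target xml:lang="es"/>
        <note>Summary &gt; Text</note>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def read_units(xliff_text):
    """Return a list of (id_parts, source_text) pairs, one per trans-unit."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.findall(".//x:trans-unit", NS):
        source = unit.find("x:source", NS)
        # Split the id on the "][" separator seen in the samples.
        id_parts = unit.get("id").split("][")
        # The XML parser unescapes &lt; and &gt;, so the HTML tags
        # come back as plain characters inside the text.
        units.append((id_parts, source.text or ""))
    return units


units = read_units(SAMPLE)
for id_parts, text in units:
    print(id_parts, text)
```

The source text comes back as a plain string with the paragraph tags still in it, which is the same thing a default XLIFF parser hands to the translator.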
CAT tools come with native file type configurations that simplify parsing. This removes the file engineering task from the project manager’s plate.
In the case above, Trados Studio will natively go to the translatable text within the “trans-unit” tags, but it will have a hard time dealing with the HTML entities.
You will get something like this:
<p>
“Source text goes here”</p>
<table><tbody><tr><td>&nbsp;</td><td>
Column Header</td></tr></tbody></table>
Our dear translators would cringe if they saw those HTML entities as plain text, so our task was to remove them.
As you can see, default parsers aren’t a one-size-fits-all solution, so human intervention will be fundamental to correctly parse these XLF files.
Parsing an XLF File from TMGMT in Trados Studio
By default, Trados Studio¹ recognizes our XLF file via its “XLIFF 1.2–1.2 v2.0” built-in parser.
You have two choices:
- Option A: Go with the built-in parser and add embedded content processing rules to deal with HTML entities.
- Option B: Create a new parser and leverage the HTML Embedded Content Processor.
This time, we went with option A and stripped the HTML entities out of our text with regular expressions.
- Paragraphs: \<p\> and \<\/p\>.
- Headings: \<h\d\> and \<\/h\d\>.
- Tables: \<(\/?)table\>, \<(\/?)tbody\>, \<(\/?)tr\>, and \<(\/?)td\>.
- Carriage returns: \<br\>. It was safe to remove them.
- Lists: \<(\/?)li\> (list items), \<(\/?)ul\> (unordered list), and \<(\/?)ol\> (ordered list).
- Formatting: \<(\/?)strong\> and \<(\/?)em\>. Additionally, we marked an option to make the formatting show in the translation editor without the tags.
- Hyperlinks: \<a.*?\> and \<\/a\>. This is enough to isolate hyperlinked, translatable text.
- Images: ^.*\.jpg
Unlike alternative text, references to image files must be hidden from the translators. Otherwise, they might translate the file names by mistake, which would break the link between the page and the actual image stored in the CMS database. For context, translators can refer to the website or a PDF snapshot of it.
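Before typing the expressions into Trados Studio, it can help to try them against a sample segment. The Python sketch below does that with the patterns exactly as listed. Trados Studio has its own regex engine, so treat this as an approximation; the visible_text and is_image_reference helpers and the sample strings are ours, for illustration only.

```python
import re

# Tag patterns copied from the list above.
TAG_PATTERNS = [
    r"\<p\>", r"\<\/p\>",
    r"\<h\d\>", r"\<\/h\d\>",
    r"\<(\/?)table\>", r"\<(\/?)tbody\>", r"\<(\/?)tr\>", r"\<(\/?)td\>",
    r"\<br\>",
    r"\<(\/?)li\>", r"\<(\/?)ul\>", r"\<(\/?)ol\>",
    r"\<(\/?)strong\>", r"\<(\/?)em\>",
    r"\<a.*?\>", r"\<\/a\>",
]

# Wrap each rule in a non-capturing group and join them into one alternation.
TAG_RE = re.compile("|".join(f"(?:{p})" for p in TAG_PATTERNS))

# The image rule matches everything from the start of the segment up to ".jpg".
IMAGE_RE = re.compile(r"^.*\.jpg")


def visible_text(segment):
    """Return the text left once every matching tag is hidden."""
    return TAG_RE.sub("", segment)


def is_image_reference(segment):
    """Return True when the segment would be caught by the image rule."""
    return IMAGE_RE.search(segment) is not None


sample = (
    '<p>Visit <a href="https://www.localizationtimes.com">Anchored text</a> '
    "for the most up-to-date information.</p>"
)
print(visible_text(sample))
print(is_image_reference("hero.jpg"))  # hypothetical file name
print(is_image_reference("This is the alternative text of an image"))
```

One thing this kind of test makes visible: \<a.*?\> matches any tag whose name starts with "a", not only anchors, so it is worth checking the export for tags such as abbr before relying on it.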
The downside of option A is that HTML entities such as the ampersand (&amp;) and the non-breaking space (&nbsp;) must stay as they are. We can’t convert them temporarily.
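For reference, this is what those entities stand for. The short Python sketch below decodes them with the standard html module. It only illustrates what the translator sees versus what the page renders; it is not something the built-in parser does, and decode_entities is a name we made up.

```python
import html


def decode_entities(text):
    """Convert HTML entities such as &amp; and &nbsp; to literal characters."""
    return html.unescape(text)


# Sample taken from the table export above, plus a trailing &nbsp;.
decoded = decode_entities("Seniors (65 &amp; older)&nbsp;")
print(repr(decoded))
```

The non-breaking space decodes to the character U+00A0, which looks like a regular space in most editors. That is one reason to leave it as a visible entity during translation.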
As a final note, you can create a list of expressions for all of the above. This will reduce the data entry time when configuring the embedded content processor.
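One possible way to shorten that list even further is to generate a single expression from the tag names. This is a sketch of the idea in Python, not the configuration we used: the build_pattern helper is hypothetical, and the formatting and hyperlink tags are left out because they were set up separately.

```python
import re

# Structural tags from the list above. "h\d" covers h1 to h6.
STRUCTURAL_TAGS = ["p", r"h\d", "table", "tbody", "tr", "td", "br", "li", "ul", "ol"]


def build_pattern(tags):
    """Build one expression matching the opening or closing form of each tag."""
    return r"\<\/?(?:" + "|".join(tags) + r")\>"


pattern = build_pattern(STRUCTURAL_TAGS)
print(pattern)

# Quick checks: structural tags match, formatting tags do not.
for tag in ["<p>", "</td>", "<h3>", "<tbody>", "<strong>"]:
    print(tag, bool(re.fullmatch(pattern, tag)))
```

The printed expression can then be pasted as a single rule, assuming the target regex engine supports non-capturing groups.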
Effective Parsing Makes Translators’ Lives Easier
Leaving these HTML entities unparsed would open the door to all kinds of mistakes during production (translation). On top of that, translators don’t enjoy placing tags by hand.
Sometimes, this is inevitable, especially when working with dynamic variables. But if we spend enough time making sure the text in scope for localization is as clean as possible, we will be saving the day.
Effective parsing simplifies the translation process.
¹ We used Trados Studio 2021 SR2 for this process.