Intro
Recently, I was asked to find a way to build multilingual glossaries leveraging translation memory content. My client had translated an entire year of educational content using professional translators in seven different languages.
The challenge was twofold:
- Build all the required glossaries in a relatively short timeframe.
- Have a final human review step to validate the terms without breaking the bank.
In other words, I had to create the glossary programmatically, using a pre-built list of English terms (source language), and then look for their translations, if any, in translation memories following an “exact match” approach.
My Approach to Building a Glossary With TMX Files
A Rookie Mistake and String/Sub-String Matching
Naturally, I started on the wrong foot: TheFuzz and FuzzySearch. These Python libraries find partial, near, or exact matches based on a familiar statistical approach, Levenshtein distance. They are easy to use, but extremely slow for large datasets (my case).
To give you an idea, finding ONE (🥲) exact match (case-sensitive) across 16,000 translation units took 45 seconds. To improve performance, and since I was working with Excel files, I tried the itertuples() approach and also pre-processed the source and target segments into a dictionary object.
I managed to improve the time down to ~30 seconds, but that wasn’t going to work on a list of terms beyond 600 entries. That’s when I realized my rookie mistake.
Python has built-in and ultra-fast string and substring matching, so why on earth was I using a third-party library for this? If my goal was to "filter out" the translation units that contained no terms from my list, it was as simple as a conditional check with Python's `in` operator. It worked!
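The filter boils down to the `in` operator over a dictionary of source-to-target segments. Here is a minimal sketch of that idea; the function names, sample rows, and term list are my own illustration (in the real script, the rows come from an Excel translation memory, e.g. via pandas' itertuples()):

```python
def build_tm_index(rows):
    """Map each source segment to its target segment.
    `rows` stands in for the pre-processed Excel data."""
    return {source: target for source, target in rows}

def filter_units(terms, tm_index):
    """Keep only terms that appear (case-sensitive, exact substring)
    in at least one source segment."""
    matches = {}
    for term in terms:
        for source, target in tm_index.items():
            if term in source:  # built-in substring check: no library needed
                matches[term] = (source, target)
                break  # the first hit is enough for glossary purposes
    return matches

# Hypothetical sample data for illustration only.
rows = [
    ("The lesson covers photosynthesis in plants.", "target segment 1"),
    ("Submit your homework on time.", "target segment 2"),
]
tm_index = build_tm_index(rows)
print(filter_units(["photosynthesis", "ecosystem"], tm_index))
```

Because `in` compiles down to optimized C-level string search, this scan over 16,000 units is effectively instantaneous compared to fuzzy matching.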
Now that I had my term filter, it was time to look into the target segments for the translation.
Perplexity’s Sonar for Term Extraction
Besides Microsoft Copilot, my client uses Perplexity for daily GPT conversations. Right now, their Sonar base model charges around one or two bucks per 1M tokens, with a 1.3-second sleep between requests (Tier 0 rate limit).
We are talking about ~13 minutes of processing time to build a 630-term glossary per language (630 requests × 1.3 s ≈ 13.7 minutes). Since Sonar handles multiple languages well, I decided to give it a shot.
The workflow overview would be as follows:
- With a given list of terms…
- Filter the Source column of a translation memory in Excel format…
- Store the source and target columns in a separate dictionary object…
- Send a prompt to Sonar, indicating the matched “Term” (list), its “Context” (source), and the “Translation” (target), and request to pull the term from the “Translation.”
- Collect the “Term” and its “Translation” in a separate dictionary…
- Build a new Excel glossary out of the previously built dictionary.
To prevent it from prepending generic introductions to the extracted term or fabricating information, I set the temperature to 0 (deterministic) and added several negative instructions to the prompt.
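The per-term request can be sketched as follows. The endpoint and payload shape follow Perplexity's OpenAI-compatible chat completions API, but the prompt wording, helper names, and sample values are my own illustration, not the exact production prompt:

```python
import json
import time
import urllib.request

API_URL = "https://api.perplexity.ai/chat/completions"

def build_payload(term, context, translation):
    """Build a deterministic extraction request for one glossary entry."""
    prompt = (
        f'Term: "{term}"\n'
        f'Context: "{context}"\n'
        f'Translation: "{translation}"\n'
        "Pull the exact translation of the Term from the Translation. "
        "Do not add introductions. Do not explain your answer. Do not "
        "invent text that is not in the Translation. Reply with the "
        "extracted term only."
    )
    return {
        "model": "sonar",
        "temperature": 0,  # deterministic: same input, same output
        "messages": [{"role": "user", "content": prompt}],
    }

def extract_term(term, context, translation, api_key):
    """Send one request to Sonar and return the extracted translation."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(term, context, translation)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    time.sleep(1.3)  # respect the Tier 0 rate limit
    return body["choices"][0]["message"]["content"].strip()
```

Looping `extract_term` over the filtered dictionary and writing the results back out produces the final Excel glossary.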
It worked! In 13 minutes, I had an almost-perfect English–Amharic glossary of ~300 terms. The last thing I added was an interactive wizard for the client's project managers, along with brief training on responsible-use best practices.
Next Steps and Future Improvements
Although this MVP fulfills my client’s needs, there’s ample room for improvement, namely:
- Include an auto-validation step to ensure the extracted terms are actually exact matches from the target column. This would be a great opportunity to use TheFuzz, since it can produce a similarity score we can carry over to the glossary.
- After auto-validation, if there are low-score entries, provide a new translation using Google Translate or another fast LLM, and let the human reviewer decide.
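The auto-validation idea can be sketched with the standard library alone. TheFuzz scores matches on a 0–100 Levenshtein-based scale; the snippet below approximates its partial-ratio behavior with difflib's SequenceMatcher as a stdlib stand-in, and the function names, glossary shape, and threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher

def similarity_score(extracted, target):
    """0-100 score for how well the extracted term matches some span of
    the target segment (a stdlib stand-in for TheFuzz's partial ratio)."""
    if not extracted:
        return 0
    best = 0.0
    window = len(extracted)
    # Slide over target substrings the same length as the extraction.
    for i in range(max(1, len(target) - window + 1)):
        chunk = target[i:i + window]
        best = max(best, SequenceMatcher(None, extracted, chunk).ratio())
    return round(best * 100)

def flag_for_review(glossary, threshold=90):
    """Return the entries whose score falls below the threshold, so a
    human reviewer (or a fallback translator) can revisit them."""
    return {
        term: entry
        for term, entry in glossary.items()
        if similarity_score(entry["extracted"], entry["target"]) < threshold
    }
```

An exact substring of the target scores 100, so anything below the threshold signals that the model paraphrased or fabricated, which is exactly what the human review step should catch.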
I look forward to sharing the script via GitHub and updating this blog post in the coming days.
🔗 You can download the latest version of the script from my GitHub repository.
Thanks for reading!