Training Microsoft Translator NMT to translate Cryptocurrency Whitepapers – Linh Nguyen

Crypto Investing in 2022

One of my principles as an investor is DYOR, or “Do your own research.” The markets are teeming with misinformation, peddlers, “pump n’ dump”ers, and… those who are subscribed to Wall Street Bets.

As the popularity, or should I say, hype of cryptocurrency as a form of investment increases, so does the risk of newbie investors following the pack without doing their due diligence on projects. It is best practice to “know what you’re investing in.” So, often, wise investors will read the whitepaper set out by the coin or project in order to understand the logic behind the project and determine whether it has actual viability or if it’s just a rug pull (See Squid Game Is Memecoin Warning With Wipeout After 230,000% Gain) or a project that works better in concept (See The LUNA and UST Crash Explained in 5 Charts).

The Need for non-English Cryptocurrency Whitepapers

However, the majority of blockchain projects are in English, and nearly all whitepapers are written in English – despite the number of global crypto users rising to over 300 million worldwide in 2021. The better access crypto users have of the projects and whitepapers in their language, the easier it is for them to make informed investment decisions.

A Goal of Training an English to Chinese Cryptocurrency Machine Translation Engine

With the goal of promoting a return to traditional investment practices of doing due diligence, I and a team of four set out to train a neural machine translation engine on translating cryptocurrency whitepapers. We began with an English to Chinese language pair. While the Chinese government has tightened regulations on cryptocurrency, there is no doubt that China will be a big player in the crypto/blockchain industry one day; there is speculation that China is also developing its own digital currency.

Step 0: Determining Pilot Objectives

To our knowledge, nothing of this sort has ever been done. We were going in cold.

The best thing to do when you don’t know what to do is determine where you want to end up and backtrack the steps to draw a roadmap.

We laid out three goals for the MT engine.

Pricing Goals	PEMT 30% cheaper than human translation
Timing Goals	PEMT 40% faster than human translation
Quality Goals	PEMT has no critical errors and errors in other categories are below 50 points per 1,000 words based on a modification of the MQM quality model.

You’ll see in the video but we successfully achieved the timing goal and were very close to achieving the pricing and quality goals.

Step 1: Gathering Data

The first step in training a neural machine translation engine is gathering the necessary data to feed the engine. This proved to be slightly difficult. For starters:

There are no high quality (if any at all) publicly available corpuses focused on cryptocurrency or blockchain
Documents were scattered across project sites, pdfs, and tmx files

We had to manually convert, align, and clean whitepapers found online to come up with the training, testing, and tuning data for the MT training.

Step 2: Train MT engine and Evaluate (and repeat…)

This is our workflow for the project.

While actually training the engine isn’t too difficult (a mere matter of uploading and choosing the correct files and waiting for the engine to train itself), it’s deciding on what elements to change for the next iteration in order to increase the BLEU score (an algorithm for evaluating the quality of machine-translated text) that proved difficult.

Should I add more whitepapers to the training data? Should I change the tuning data? Which translation units (individual translatable segments) should I delete from the data to keep it more relevant? How can I balance high quality data and my sanity?

I suggested we keep track of data cleaning filters like below:

Action	Conditional Expression
Delete incomplete sentences	Text_EN NOT LIKE ‘%.%’
Delete source and target TUs with less than 10 characters	LEN(Text_EN) < 10 LEN(Text_ZH_HANS) < 10
Delete source and target TUs with more than 200 characters	LEN(Text_EN) > 200 LEN(Text_ZH_HANS) > 200
Delete duplicate source and target TUs	Text_EN = Text_ZH_HANS
Delete text where either text is 10 times longer than the other	LEN(Text_EN)10 < LEN(Text_ZH_HANS) LEN(Text_ZH_HANS)10 < LEN(Text_EN)
Delete links	Text_EN LIKE ‘%www%’
Delete empty source and/or target	Flag entry
Delete duplicate TUs in the source or target	Flag entry

Evaluation of the MT is also crucial to maintaining quality. It’s not enough to rely on a BLEU score and think your translation is good. Getting a human evaluator to actually look at the translation is crucial to picking out the blind spots in the engine and deciding on what changes to make in the following iterations.

Our team prepared our lessons learned in a video presentation below.