
CHALLENGING MACHINE TRANSLATION ENGINES: SOME SPANISH-ENGLISH LINGUISTIC PROBLEMS PUT TO THE TEST

Abstract

This work is an evaluation of machine translation engines completed in 2018 and 2021, inspired by Isabelle, Cherry & Foster (2017) and Isabelle & Kuhn (2018). The challenge consisted of testing the MT engines Google Translate, Bing and DeepL on the translation of certain linguistic problems normally found when translating from Spanish into English. The divergences representing a “challenge” to the engines were of morphological and lexical-syntactical types. The absolute winner of the challenge was DeepL; in second place was Bing from Microsoft; and Google was the engine that handled the linguistic problems most poorly. Over time, comparing the engines three years apart, DeepL was the only one that enhanced its performance, correcting a problem it previously had with a test sentence. This was not the case for the other two; on the contrary, their translations were of lower quality. These machines do not seem to be consistent in the manner in which they are improved. These findings may be valuable for translators who work with these systems as pre- or post-editors, so that their efforts may be better directed.

Keywords
Machine Translation; Pre-editing; Post-editing; Google Translate; Bing; DeepL

Introduction

“Machine Translation Systems are perhaps the electronic translation tools that attract the most public attention, especially among non-translators” (Austermühl & Kortenbruck, 2001, p. 153), and it seems they will keep drawing the attention of all kinds of users, most particularly translators and language professionals. Each round of improvements makes these engines superior to their previous versions, and these enhancements will only make them more acceptable for professional use.

Besides the upgrades, regular evaluations should also be made to test their performance in different situations. These are usually done not only by the developers of such software, but also by the people who use (or want to use) these systems. There are different approaches to assessing the quality of machine translation (MT), and several considerations may be relevant:

  • Reasons to use it (communication, publication, “gisting” or enabling meaning)

  • Standard of quality (“professional”, “human parity”, “fit-for-purpose”, “good enough”)

  • Evaluators (developers, translators, professors, students)

  • Consequences of quality expectation (direct and indirect, short-long term, stakeholders, entities)

  • Aspect being evaluated (a sentence at a time, a paragraph, specific linguistic features)

  • Other factors (type of MT, domain, text type, language pair)

  • User acceptance (preference, use, perceptions)

  • Automatic metrics vs. human measures (Marshman, 2018, p. 3-17).

From an engineering perspective, Philipp Koehn, in his book Neural Machine Translation (2020), provides an account of the myriad methods for evaluating the progress of machine translation over time and the quality of the translations it produces. Stakeholders in the language industry, computational linguistics, engineering, and translation companies consider these to be “best practices.”

Koehn classifies the forms of MT assessment into three categories:

  1. Task-based evaluation (which includes real-world tasks, content understanding and translator productivity).

  2. Human assessment (adequacy and fluency; ranking; continuous scale; crowd-sourcing evaluations; human translation edit rate).

  3. Automatic metrics (BLEU, the Meteor metric, TER, characTER, and bootstrap resampling); a minimal scoring sketch follows below.
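As a concrete illustration of that third category, the following minimal sketch scores a machine hypothesis against a human reference with BLEU and TER using the sacreBLEU library (my choice of tooling, not Koehn's or this paper's; the sentence pair is also illustrative, echoing the “suitcase” output discussed later in the Results section):

```python
# Minimal automatic-metric sketch with sacreBLEU (pip install sacrebleu).
# The sentence pair is illustrative: the reference comes from this paper's
# challenge set, the hypothesis resembles the engines' output for it.
from sacrebleu.metrics import BLEU, TER

hypotheses = ["I'll carry that suitcase for you"]   # machine output
references = [["I'll carry that case for you"]]     # one reference stream

# effective_order avoids zero scores on very short segments like this one
bleu = BLEU(effective_order=True).corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)

print(f"BLEU: {bleu.score:.1f}")  # n-gram overlap, 0-100, higher is better
print(f"TER:  {ter.score:.1f}")   # edit rate, lower is better
```

A single-word substitution still scores fairly well on both metrics, which is precisely why challenge sets targeting specific linguistic phenomena complement such corpus-level scores.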

Incorporating human evaluation is a paramount method for testing MT engines, as the considerations listed above make clear.

Testing MT engines has been a fertile research area for some years, and there are relevant studies in human assessment and post-editing that relate to translators’ work. Brita Banitz (2020), for instance, used human assessment along with the TER score to evaluate rule-based and statistical machine translation, explaining their performance on English-German phrases taken from Mark Twain’s The Awful German Language.

In a similar vein, Coraline Doan (2021), in her thesis Comparing Encoder-Decoder Architectures for Neural Machine Translation: A Challenge Set Approach, focuses on human evaluation of MT engines from a translator’s perspective. Her methodology was inspired by the precursors of challenge sets (as is this paper): Isabelle, Cherry & Foster (2017). The set was for English-to-French MT and employed Jean Delisle and Marco Fiola’s La traduction raisonnée (3rd edition) to that end.

In another thought-provoking study, Guerberof-Arenas & Toral (2020) asked readers to evaluate three different translations of a fictional story from English into Catalan, produced in three ways: machine translation, post-edited machine translation, and unaided (human) translation. The creativity of those translations was evaluated from the readers’ viewpoint and, as might be expected, creativity was reported to be highest when translators were involved in the process.

On the subject of post-editing studies, Parra Escartín & Goulet (2021) conducted an interesting inquiry into post-editing for publication purposes performed by people other than professional translators. Their aim was “to determine whether the physician-participants would be in a position to submit research papers for publication using a general machine-translation engine followed by post-editing” (Parra Escartín & Goulet, 2021, p. 91). The results indicated that the quality of such post-editing would not be good enough for that purpose, making clear that the MT versions would have to be post-edited by a language professional.

As the previous examples suggest, human involvement in the machine translation process, whether as an evaluator, a post-editor or, evidently, a translator, brings about better outcomes.

Our approach

Bearing in mind that “even with ongoing automation in many aspects of translation service, revision and post-editing rely on human skill and expertise” (Konttinen, Salmi & Koponen, 2021), we believe it is crucial that human translators be able to assess MT engines themselves and learn from others’ experiences of how to do so. To this end, this work revolves around MT testing and assessment by a human translator: the author.

The evaluation of machine translation engines in this paper builds on the research of Pierre Isabelle, Colin Cherry & George Foster (2017) in their study “A Challenge Set Approach to Evaluating Machine Translation”, and on the investigation conducted by Pierre Isabelle & Roland Kuhn (2018), “A Challenge Set for French → English Machine Translation”. Both studies set the course of action for this paper, as described below.

The first challenge was completed on November 25, 2018 (the second, three years later in 2021), and consisted of testing the MT engines Google Translate, Bing and DeepL on the translation of certain linguistic problems normally found when translating from Spanish into English. These divergences were thought to represent a “challenge” to the engines, and the findings would help determine the quality of the MTs assessed.

To elucidate the examples and, subsequently, the results in the next section, the author presents tables with all these linguistic problems. The three machines evaluated for this task are web-based and are claimed to produce high-quality translations. Google Translate and DeepL have neural architectures, whereas Bing appeared to use a statistical approach in its free version, although Microsoft had announced advances in its neural version and has probably already released it. Whatever their design, we expect interesting results.

Each table displays a particular problem and a single test sentence, together with the output of the three engines that are part of this challenge; the human evaluation is marked with a ✔ or a ✖ to the right of each machine translation. The adequate sentences are also set in bold for easier identification. The assessment was based on proximity to the reference translation provided for each test sentence, and the way the MTs handled the linguistic problem in question was also considered. DeepL was the only MT that provided more than one option, and all of its options were included in the chart; if one of them was the right one, the translation was considered correct. Why? Because the machine was providing “options” to the translator, who would ultimately select the appropriate one. At the beginning and at the end of the Results and Discussion section, summary tables (for 2018 and 2021) of the performance of Google Translate, DeepL and Bing help to identify trends in the general execution of these engines.

Challenge set

The following evaluation was done by the author, essentially considering the performance of the engines on the linguistic phenomena presented. Ultimately, the author will offer further suggestions based on the MTs’ feasibility after the challenge experience. To clarify, the sentences used for this task were designed specifically for it: they are intentionally short and focused on phenomena that the author has found to be challenging for human translators when translating from Spanish into English. The question here is how well Google Translate, Bing and DeepL handle these same problems, in order to determine the quality of their output.

The challenge set consisted of five language structures, categorized into two types: morphological and lexical-syntactical. These constructions are typical or standard in the source language but unusual in the target language, which is why they can be challenging for the selected engines to translate. A brief explanation of how each language construction differs between Spanish and English accompanies each table in which a linguistic problem is evaluated.

For each language structure, three example sentences in the source language and their reference (correct human) translations are provided, along with the three machine-translated versions of each sentence (one per engine). In this way, the corpus consists of 15 Spanish sentences, 15 English (human) reference translations and 45 machine translations of the source-language sentences (3 × 15 sentences).
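For readers who want to reproduce the setup programmatically, here is a minimal sketch of how such a challenge set could be stored and counted; the structure and names are hypothetical, and only two of the fifteen items are shown:

```python
# Hypothetical representation of the challenge set described above;
# two sample items are shown, taken from the test sentences in this paper.
from dataclasses import dataclass

@dataclass
class TestItem:
    problem: str      # linguistic problem being tested
    source: str       # Spanish test sentence
    reference: str    # correct human (reference) translation

challenge_set = [
    TestItem("Sudden proposals in present tense to future statements",
             "Te llevo la maleta", "I'll carry that case for you"),
    TestItem("Inalienable possession",
             "La mujer ladeó un poco la cabeza",
             "The woman tilted her head a little"),
    # ... the remaining 13 items summarized below
]

engines = ["Google Translate", "Bing", "DeepL"]

# Each engine translates every source sentence; with the full set this
# yields the 45 machine outputs (3 engines x 15 sentences) to be judged.
print(f"{len(challenge_set)} items x {len(engines)} engines "
      f"= {len(challenge_set) * len(engines)} outputs with the sample data")
```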

A summary of the 15 test sentences is featured below (the machine translations are not included):

Morphological type

1. Sudden proposals in present tense to future statements

   Spanish test sentence → English reference translation
   Te llevo la maleta → I’ll carry that case for you
   ¿Se lo envuelvo? → Shall I wrap it for you?
   Yo lavo los platos hoy → I’ll wash the dishes today

2. Present tense to present continuous for future statements

   Spanish test sentence → English reference translation
   Me voy mañana a Paris → I am leaving/going to Paris tomorrow
   ¡Te casas pronto! → You are getting married soon!
   Salimos de viaje en una hora → We are leaving on a trip in an/one hour

3. Inalienable possession

   Spanish test sentence → English reference translation
   El pelo le llega a los hombros → Her hair falls just to her shoulders
   ¿Te cepillaste el cabello con cuidado? → Did you brush your hair carefully?
   La mujer ladeó un poco la cabeza → The woman tilted her head a little

4. Definite article in Spanish to zero article in English

   Spanish test sentence → English reference translation
   ¿Qué es la inmortalidad? → What is immortality?
   Agradezco a la vida lo que tengo → I thank life for what I have
   La política otorga poder a unos pocos → Politics gives power to a few

Lexical-syntactical type

1. Countable vs. uncountable nouns

   Spanish test sentence → English reference translation
   Compré dos muebles para la sala → I bought two pieces of furniture for the living room
   Tuvimos un clima agradable el mes pasado → We had nice weather last month
   Dame consejos para ser mejor persona → Give me (some) advice on how to be a better person

Results and discussion

The 2018 assessment starts with Table 1, which presents the overall performance of the MTs that took part in this challenge. Some interesting divergences will be pinpointed and analyzed one at a time, and lastly, the general performance for 2021 will be provided so as to compare their execution in a somewhat diachronic manner.

Table 1
Overall Performance 2018

As can be seen from Table 1, the absolute winner of the challenge was DeepL, with thirteen (out of fifteen) adequate translations. The two linguistic problems on which DeepL fell short were also the two most difficult for the other engines to handle. In second place comes Bing from Microsoft, with 9 acceptable translations and 6 inadequate ones. Last in order of performance is Google Translate, with 8 appropriate translations and 7 wrong suggestions. We may thus confirm that Google was the engine whose management of the linguistic problems at hand was the poorest. Now let us look at each case more closely.
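The arithmetic behind these counts (and the percentages summarized later in Table 18) is straightforward; a small sketch, using only the 2018 figures reported above, makes it explicit:

```python
# Adequacy rates for the 2018 run, from the counts reported in Table 1.
TOTAL = 15  # 5 linguistic problems x 3 test sentences each

adequate_2018 = {"DeepL": 13, "Bing": 9, "Google Translate": 8}

for engine, ok in sorted(adequate_2018.items(), key=lambda kv: -kv[1]):
    print(f"{engine}: {ok}/{TOTAL} adequate ({100 * ok / TOTAL:.1f}%)")
# DeepL: 13/15 adequate (86.7%)
# Bing: 9/15 adequate (60.0%)
# Google Translate: 8/15 adequate (53.3%)
```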

The first linguistic problem is of a morphological type (the first four belong to this category), and it was one of the two problems for which none of the MTs could produce an accurate translation of a particular sentence: sudden proposals in present tense rendered as future statements. This problem concerns unexpected proposals normally uttered by Spanish speakers in the present tense; the corresponding English version should resort to the future forms will or shall, otherwise the resulting options sound odd or atypical. For this problem, there were three test sentences.

The first one was Te llevo la maleta, and the translation proposed for it was “I’ll carry that case for you” (see Table 2). All of the MTs translated this one accurately; they merely changed the word “case” to “suitcase” (which are synonymous), and the verb tense was in the simple future, as expected.

Table 2
Problem 1 S1: Sudden Proposals in Present Tense to Future Statements

The second example was ¿se lo envuelvo?, whose corresponding reference translation was “Shall I wrap it for you?” (Table 3, below). Here Google Translate was the only one that mistranslated the phrase, using the simple present tense for the question (“Do I wrap it?”), which does not sound like a proposal or sudden offer in English. DeepL got it right in all three options it proposed, as did Bing.

Table 3
Problem 1 S2: Sudden Proposals in Present Tense to Future Statements

The third example was Yo lavo los platos hoy, for which the reference translation was “I’ll wash the dishes today” (Table 4). Interestingly, all three engines failed to provide an accurate translation and offered options in the simple present tense. I believe this happened because of the “tricky” word “today” in this last sentence. However, it was used because it is the only way to express the near future in Spanish while still using the present tense. Apparently, the engines did not discern this clue.

Table 4
Problem 1 S3: Sudden Proposals in Present Tense to Future Statements

Problem number 2 focuses on another verb tense change: this time from present tense to present continuous for future statements. The first statement, Me voy mañana a Paris, was proposed to be translated as “I am leaving/going to Paris tomorrow” (Table 5). All the engines did well on this first example.

Table 5
Problem 2 S1: Present Tense to Present Continuous for Future Statements

As for the second, ¡Te casas pronto! / “You are getting married soon!” (Table 6), Google’s system failed to render the example in the present continuous and used the simple present instead. DeepL and Bing did better.

Table 6
Problem 2 S2: Present Tense to Present Continuous for Future Statements

Source sentence 3 (in Table 7), Salimos de viaje en una hora, whose reference translation was “We are leaving on a trip in an/one hour”, received two unsatisfactory translations, from Bing and Google, which resorted to the simple past tense rather than the present continuous; their sentences made no sense whatsoever. DeepL gave a suitable translation that used the simple present, as in the source text, and this ended up being a good choice. It was not the expected answer, since the present continuous was originally believed to be appropriate, but DeepL proved to be “smart” enough to employ a different verb tense and still get it right.

Table 7
Problem 2 S3: Present Tense to Present Continuous for Future Statements

Problem 3 concerns inalienable possession: Spanish uses definite articles to refer to parts of the body, whereas English uses possessive adjectives instead. The first example, in Table 8, El pelo le llega a los hombros / “Her hair falls just to her shoulders”, turned out to be difficult for all three machines to translate. Both Google and Bing failed to use possessive adjectives and produced unacceptable translations. DeepL produced a partially correct translation, as it used a possessive in the second part of the sentence but not in the first. “Her hair” would have been a correct rendering of the first part of the statement, but as there was no previous referent, the machines interpreted El pelo as a general concept and did not use any possessive element here.

For the second part of the statement, DeepL provided two options, “his shoulders” and “her shoulders”, as the gender of the subject was not specified. A half point was awarded to this MT.

Table 8
Problem 3 S1: Inalienable Possession

The engines translated Example 2 (Table 9) easily and all provided accurate translations. The source sentence, ¿Te cepillaste el cabello con cuidado?, was translated as “Did you brush your hair carefully?”, as originally proposed in the reference translation.

Table 9
Problem 3 S2: Inalienable Possession

Lastly, source sentence number 3 was La mujer ladeó un poco la cabeza, with the proposed translation “The woman tilted her head a little”, on which all three systems were expected to perform satisfactorily; nevertheless, Bing was unsuccessful (see Table 10). No clear explanation can be given, except for the possibility that Bing cannot cope with identifying the possessive for a third-person subject, as it failed in examples 1 and 3, where he/she was the subject.

Table 10
Problem 3 S3: Inalienable Possession

The last of the morphological problems, number 4, concerned the use of the definite article in Spanish for a general concept, where English normally employs the zero article. The first example, ¿Qué es la inmortalidad?, was translated correctly by all the machines (Table 11). Their answers omitted la (the article “the”), as in the reference translation: “What is immortality?”

Table 11
Problem 4 S1: Definite Article in Spanish to Zero Article in English

Google, DeepL and Bing performed identically on the other two examples (Tables 12 and 13), so this particular problem did not challenge the MTs at all.

Table 12
Problem 4 S2: Definite Article in Spanish to Zero Article in English
Table 13
Problem 4 S3: Definite Article in Spanish to Zero Article in English

Problem number 5, of a lexical-syntactical type, concerned divergences between countable nouns in Spanish and their corresponding uncountable nouns in English. In example 1, in Table 14, Google and Bing failed to provide an adequate English translation of Compré dos muebles para la sala. DeepL provided an acceptable translation by quantifying the uncountable noun “furniture”: “I bought two pieces of furniture for the living room”.

Table 14
Problem 5 S1: Countable vs. Uncountable Nouns

For source sentence 2 (Table 15), Tuvimos un clima agradable el mes pasado, none of the systems proposed an accurate translation, as they all used a quantifier with the word “weather”, which is an uncountable noun in English. The reference translation was “We had nice weather last month”. The last sentence was easier for the three engines to handle.

Table 15
Problem 5 S2: Countable vs. Uncountable Nouns

In Table 16, the source sentence is Dame consejos para ser mejor persona, for which the reference translation was “Give me (some) advice on how to be a better person”, the word “some” being optional. Interestingly, Bing supplied a different word from “advice”, resorting to “tips” as a way to quantify the noun in the same manner as the Spanish. Overall, the systems do not seem to cope consistently well with the differences in the use of countable and uncountable nouns in the Spanish-English language combination.

Table 16
Problem 5 S3: Countable vs Uncountable Nouns

In general, we must give these machines some recognition for their performance, especially DeepL, which did surprisingly well on most of the test sentences used in this challenge. As can be seen from Table 1, seven of the fifteen sentences were not difficult for these engines to handle; all three translated those seven correctly. That is roughly 47% of the total. The most difficult linguistic problems were sudden proposals and countable vs. uncountable nouns, morphological and lexical-syntactical types of problems respectively.

Other aspects that were difficult for these MTs to handle were the change from present tense to present continuous (problem 2) and inalienable possession (problem 3), especially for Google and Bing.

2021 Assessment

To update and compare the performance of these engines over time, the author completed the same exercise three years later, on December 20, 2021. It is usually stated that machine translation improves every year, but challenging the engines with the same set of phrases and the same assessment criteria produced interesting results. Table 17, below, compares their performance, with the 2018 results on the left side of each box and the 2021 results on the right. This display was the most practical way to trace the evolution of their execution.

In the case of Google Translate, performance worsened substantially; it proved to be the least improved engine with regard to the linguistic problems analyzed. It even got wrong what it had translated correctly three years earlier in a set of three test sentences, such as the definite article to zero article problem.

Bing’s performance was much the same in terms of numbers, with only one additional inadequate sentence (see Table 18). However, some sentences that were acceptable in 2018 were now unacceptable, as in sudden proposals. By contrast, it became better at the countable vs. uncountable noun test sentences. This tells us that these machines do not seem to be very consistent in the manner in which they are improved.

DeepL was the only engine that enhanced its performance, producing only one inadequate sentence (in sudden proposals), whereas three years earlier it had two unsatisfactory results. We can deduce that DeepL maintained its high quality level compared with the other two.

Table 17
Compared Overall Performance 2018-2021

The numbers in Table 18 complement the previous analysis by summarizing the achievements and shortcomings identified across the two years.

Table 18
Comparison of percentages

Human assessment of MT output is considered subjective (Koehn, 2020; Rossi & Carré, 2022; Barreiro & Ranchhod, 2005), but as a strategy, we kept the test sentences short and clear so that the essence of the linguistic problems would not be lost or distorted in the translation or in the evaluation. As there was only one evaluator, the short-sentence strategy was useful for maintaining the focus on the specific problems.

Equally interesting, in her study challenging MT engines with short and long sentences, Doan (2021) reported that sentence length did not seem to affect the performance of the engines (Portage, Google Translate and DeepL) as she had originally thought. That said, short sentences might not be a limitation of this study either. However, another important limitation remains: the small number of test sentences used in the set. This should be taken into consideration in future studies with similar aims.

Conclusions

All things considered, why does learning about this matter? These findings are valuable information for translators who may work with these systems as post-editors, so that their efforts can be better directed. For example, knowing that some lexical problems (e.g. countable/uncountable nouns) trouble the engines, translators can focus their attention on whether the MT has translated this kind of phenomenon properly and worry less about the aspects that MTs are known to handle better. Another application is to pre-edit documents, where possible, to make them easier for the system to handle, as sketched below.
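As a concrete (and deliberately narrow) illustration of pre-editing, a rule could make the implicit Spanish future explicit before the text reaches the engine, neutralizing the very trap that defeated all three systems in problem 1; the function below is hypothetical and handles only that one known case:

```python
# Hypothetical pre-editing rule: rewrite an implicit present-tense future
# into an explicit future so the engine is not misled (cf. problem 1,
# where "Yo lavo los platos hoy" was mistranslated by all three MTs).
import re

def pre_edit(sentence: str) -> str:
    # Hand-written substitution for one known trouble case; a real
    # pre-editing pass would rest on broader linguistic rules.
    return re.sub(r"\bYo lavo\b", "Yo lavaré", sentence)

print(pre_edit("Yo lavo los platos hoy"))  # -> Yo lavaré los platos hoy
```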

Some other general observations can be drawn from the experience. First, Google is not as reliable as most people may think, at least as far as handling linguistic issues of this type is concerned. Second, the fact that some engines propose just one translation option makes them a poor choice for a translator looking for alternatives; DeepL provides options, and that is an asset. Third, there is a tendency to provide similar or identical translations for a given text or statement. Although creativity is part of human nature, these machines seem to lack it: even though this exercise was created to measure other aspects, two of the systems (Google and Bing) tended to provide identical or similar translations, whereas DeepL offered more than one option on most occasions. This absence of creativity was confirmed in the research conducted by Guerberof-Arenas & Toral (2020), in which readers assessed different types of translation. Fourth, we must still consider that the language employed for the testing was controlled, and the MTs seem to work outstandingly well on these short statements (except for Google). “The quality of MT output is closely connected to how MT-friendly the input is” (Austermühl, 2014, p. 163), and the input in this case was especially designed to probe specific linguistic problems. We need to keep an eye on engine upgrades (and on research) concerning the handling of longer texts, which would ideally be of greater help to the translator.

From my vantage point, it is somewhat easier to see now why professional translators are making more use of these systems and are already becoming literate in machine translation (machine translation literacy is in fact a recent concept in the field, introduced by Bowker & Buitrago Ciro (2019)). In this sense, knowledge about MT is now also recommended in translator education. Some strategic sub-competences specific to post-editing are “knowledge about MT systems and their capabilities” and “knowledge of typical MT errors” (Konttinen, Salmi & Koponen, 2021, p. 194). Developing such skills in translation schools would lead to a better integration of human translation with MT in the near future (Konttinen, Salmi & Koponen, 2021, p. 188) and lends support to exercises like this one.

What is surprising at this point is that DeepL is not as popular with the general public as Google Translate. Translators are more aware of it, and it is now more evident why it can be a valuable support for translation assignments. Hopefully, this challenge will not only make people aware of the possible drawbacks, but also show some of the advantages that engines such as DeepL could bring to the work of any professional translator. As Lagoudaki (2008, p. 262) states, “the use of MT is now considered a common practice among translators who prefer to have a rough draft of a translation before they produce a final translation, by editing the first draft”, and this practice seems a good idea after witnessing such a valuable addition.

The suggested use, however, would be for less specialized language, non-official translation and analyses of language use. In any case, machine translations will always need to be revised by a translator or a fluent target-language speaker. Language pair combinations matter as well and should be taken into account when assessing MT performance. We should consider that English, a lingua franca, combined with another widely spoken language (like Spanish, in this case) has a greater chance of good matches in machine translation.

Apparently, most people use machine translation for quick communication and “gisting”, as it is appropriate for transferring ideas in order to interact with others or function in the world. It would not matter much if it were used to translate a recipe or a section of an e-mail from a foreign friend. However, these engines are also being used for publication on the web and social media, and for more serious work in the field. In any case, post-editing is necessary to make the resulting translation more natural.

In conclusion, MT systems do not appear to threaten translators’ jobs in the near future, as their proposals are not yet flawless, but they are certainly getting better. Human parity does not seem achievable so far, but there could still be a harmonious collaboration between humans and machines. We had better delve into this technology now and become part of this functional relationship.

Acknowledgements

I would like to thank Professor Elizabeth Marshman for introducing the topic of challenge sets for testing machine translation systems in our doctoral class at the University of Ottawa, and for helping me select a challenge set that was realistic enough for these engines to tackle the linguistic problems in mind. My appreciation also goes to Peter Clabrough for his help in editing this paper.

References

  • Austermühl, Frank. Electronic Tools for Translators. London: Routledge, 2014.
  • Austermühl, Frank & Kortenbruck, Anke. “A Translator’s Sword of Damocles? An Introduction to Machine Translation”. In: Austermühl, Frank. Electronic Tools for Translators. London: Routledge, 2001/2014. p. 153-176.
  • Banitz, Brita. “Machine Translation: A Critical Look at the Performance of Rule-Based and Statistical Machine Translation”. Cadernos de Tradução, 40(1), p. 54-71, 2020. DOI: https://doi.org/10.5007/2175-7968.2020v40n1p54
  • Barreiro, Anabela & Ranchhod, Elisabete. “Machine Translation Challenges for Portuguese”. Lingvisticæ Investigationes, 28(1), p. 3-18, 2005. DOI: https://doi.org/10.1075/li.28.1.03bar
  • Bing Translator. Available at: https://www.bing.com/translator. Accessed in: Nov. 25, 2018 and Dec. 20, 2021.
  • Bowker, Lynne & Buitrago Ciro, Jairo. Machine Translation and Global Research: Towards Improved Machine Translation Literacy in the Scholarly Community. Bingley: Emerald Publishing Limited, 2019. DOI: https://doi.org/10.1108/978-1-78756-721-420191002
  • DeepL Traductor. Available at: https://www.deepl.com/es/translator. Accessed in: Nov. 25, 2018 and Dec. 20, 2021.
  • Doan, Coraline. Comparing Encoder-Decoder Architectures for Neural Machine Translation: A Challenge Set Approach. 2021. 274 f. Thesis (Master in Translation Studies) – University of Ottawa, Faculty of Arts, School of Translation and Interpretation, Ottawa, 2021.
  • Google Traductor. Available at: https://translate.google.com/. Accessed in: Nov. 25, 2018 and Dec. 20, 2021.
  • Guerberof-Arenas, Ana & Toral, Antonio. “The Impact of Post-Editing and Machine Translation on Creativity and Reading Experience”. Translation Spaces, 9(2), p. 255-282, 2020. DOI: https://doi.org/10.1075/ts.20035.gue
  • Isabelle, Pierre; Cherry, Colin & Foster, George. “A Challenge Set Approach to Evaluating Machine Translation”. In: Conference on Empirical Methods in Natural Language Processing, 55., 2017, Copenhagen. Proceedings […]. Copenhagen, Denmark: Association for Computational Linguistics, 2017. p. 2486-2496. DOI: https://doi.org/10.18653/v1/D17-1263
  • Isabelle, Pierre & Kuhn, Roland. “A Challenge Set for French → English Machine Translation”. arXiv preprint arXiv:1806.02725, 2018. DOI: https://doi.org/10.48550/arXiv.1806.02725
  • Koehn, Philipp. Neural Machine Translation. New York: Cambridge University Press, 2020. DOI: https://doi.org/10.1017/9781108608480
  • Konttinen, Kalle; Salmi, Leena & Koponen, Maarit. “Revision and Post-Editing Competences in Translator Education”. In: Koponen, Maarit; Mossop, Brian; Robert, Isabelle S. & Scocchera, Giovanna (Eds.). Translation Revision and Post-Editing: Industry Practices and Cognitive Processes. London: Routledge, 2021. p. 187-202.
  • Lagoudaki, Elina. “The Value of Machine Translation for the Professional Translator”. In: Conference of the Association for Machine Translation in the Americas, 8., 2008, Waikiki. Proceedings […]. Waikiki, USA: Association for Machine Translation in the Americas, 2008. p. 262-269. Available at: https://aclanthology.org/2008.amta-srw.4.pdf. Accessed in: Feb. 2, 2023.
  • Marshman, Elizabeth. Evaluating MT. Ottawa: University of Ottawa, 2018. p. 1-20.
  • Parra Escartín, Carla & Goulet, Marie-Josée. “When the Post-Editor Is Not a Translator”. In: Koponen, Maarit; Mossop, Brian; Robert, Isabelle S. & Scocchera, Giovanna (Eds.). Translation Revision and Post-Editing: Industry Practices and Cognitive Processes. London: Routledge, 2021. p. 89-106.
  • Rossi, Caroline & Carré, Alice. “How to Choose a Suitable Neural Machine Translation Solution: Evaluation of MT Quality”. In: Kenny, Dorothy (Ed.). Machine Translation for Everyone: Empowering Users in the Age of Artificial Intelligence. Berlin: Language Science Press, 2022. p. 51-79. DOI: https://doi.org/10.5281/zenodo.6653406
