Presentations

(alphabetical order of presenters)

Irina Burukina

 

Irina Burukina completed her Diploma with Honours (BA equivalent) in June 2012 at St. Petersburg State University, Department of Linguistics, where she majored in Mathematical Linguistics. Her final qualifying paper was “Integration of multiword expressions in the RussNet thesaurus structure”.

In June 2014 she graduated with a Diploma with Honours (MA equivalent) from the Russian State University for the Humanities, Institute for Linguistics, where she majored in Computational Linguistics. Her final qualifying paper was “Syntax of implicit possessives in Russian. Implicit possessives recognition and translation”. Irina gained further research experience working from 2010 to 2012 as a research assistant on the RussNet project (automatic extraction of Russian collocations using semantic and statistical approaches) and in 2013 as a research assistant on the General Internet Corpus of Russian project.

Her work experience:
June 2012 – present: Linguist, Lexical Semantics Group, Technology Development Department, ABBYY Headquarters (linguistic software company), Moscow, Russia.
June 2011 – May 2012: Linguist, Tree Syntax Group, Technology Development Department, ABBYY Headquarters (linguistic software company), Moscow, Russia.

Translating implicit elements in RBMT

This research focuses on Ru <-> En RBMT of asymmetrical linguistic markers. In English, overt pronouns are required to mark possessive relations, whereas in Russian implicit possessives are regularly used. I argue that Russian implicit possessives should be treated like overt pronouns: they should be recognized in text and attributed to an appropriate antecedent. In En -> Ru translation, overt pronouns should in several cases be deleted; in Ru -> En translation, explicit possessives should in several cases be synthesized, subject to the properties of their antecedents. I examine how modern Russian MT systems deal with these tasks and explore the main problems that arise. I also introduce a rule for En -> Ru MT that helps to increase translation accuracy. The research is based on ABBYY Compreno © machine translation technologies.

Implicit possessives

Implicit possessives (IP) are zero possessive pronouns, used with inalienable nouns (kinship terms and body-part nouns) in the positions that can be occupied by overt possessives (pronominals or reflexives). They can be used as either deictic elements (when they point out entities in non-linguistic “real” context) or proper anaphors (when they are coreferential with entities mentioned in the same text).

(1) a. Mamy        net v   dome.
Mother.GEN not in house.PREP
‘My mother is not in the house.’
b. Chto  sluchilos’?    Ruka           bolit?
What      happened    Hand.NOM aches
‘What happened? Is your hand aching?’
c. Petya        pozvonil    mame.
Peter.NOM  called         mother.DAT
‘Peter called his mother.’

Ru <-> En automatic translation

I argue that taking IP into account helps to increase the effectiveness of Ru <-> En RBMT systems. Two tasks should be considered. First, in En -> Ru translation, overt pronouns should be deleted to obtain more accurate results. Second, in Ru -> En translation, explicit possessives should be synthesized in place of implicit ones, taking into account the properties of their antecedents.

All currently existing Ru <-> En machine translation systems (rule-based as well as statistical) show rather poor results when analyzing constructions with IP.

(2) Petya           podoshel    k    mame.
Peter.NOM came up     to   mother.DAT
‘Peter came up to his mother.’
(Google, SBMT) Peter went to my mother.
(Yandex, SBMT) Petya went to her mother.
(PROMT, RBMT) Petya approached to mother.
(Compreno, Hybrid MT) Petya walked over to the mother.

(3) Peter will be happy if Masha talks to her mother.
(Google) Питер будем рады, если Маша разговаривает с матерью.
‘Peter (we) will be glad if Masha talks to mother.’
(Yandex) Питер будет счастлив, если бы Маша говорит ей мать.
‘Peter will be happy if Masha talks to her (non-reflexive) mother.’
(PROMT) Питер будет счастлив, если Маша будет говорить со своей матерью.
‘Peter will be happy if Masha is talking to her (reflexive) mother.’
(Compreno) Питер будет счастлив, если Маша будет говорить со своей матерью.
‘Peter will be happy if Masha is talking to her (reflexive) mother.’

SBMT cannot treat IP properly; it offers no opportunity for anaphora resolution or for the deletion of the appropriate overt pronouns. Most modern RBMT systems have no rules for IP.

Problems

There are several major problems.

First, it is impossible to fully formalize pragmatics. Implicit possessives are ambiguous: in many cases (especially in direct speech) they may be used deictically to refer to the speaker.

Second, even though I have analyzed IP in different contexts and identified their main properties, these properties should be considered regularities rather than strict rules. There are always exceptions, and they can be crucial for evaluation.

Third, the semantics of predicates should also be taken into account.

Proposed rule

English has no possessive reflexives; only pronouns such as his and her are used. Sentences with such pronouns can be ambiguous.

(4) Bill asked John to call his father.

However, when translating such sentences into Russian, a system must unambiguously identify the antecedent, since either the reflexive svoi or a pronominal must be chosen. One of the interpretations is lost, and an incorrect translation may result.

I propose a rule to be applied at the analysis stage during En -> Ru MT.

In simple sentences: if the subject (A) and the direct object (B) share the same grammatical and semantic characteristics, a kinship term with an overt possessive is used in the same sentence, and A or B is chosen as the antecedent of that possessive, then the pronoun should be deleted.

In composite sentences: if the subject of the main clause (A) and the subject or direct object of the subordinate clause (B) share the same grammatical and semantic characteristics, a kinship term with an overt possessive is used in the subordinate clause, and A or B is chosen as the antecedent of that possessive, then the pronoun should be deleted.

(5)
a. Bill introduced John to his father.
b. Mary asked Jane to call her mother.

c. The boys wanted the girls to show the books to their mothers.
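The deletion conditions above can be sketched in code. This is a minimal illustration, not ABBYY's implementation: the feature dictionaries, the kinship list, and the collapsing of the simple- and composite-sentence cases into a single check over a candidate pair (A, B) are all simplifying assumptions; a real system would read this information off the parser's analysis structures.

```python
# Hypothetical, simplified sketch of the proposed En -> Ru deletion rule.
# A and B are the two candidate antecedents (subject and object, or the
# main-clause and subordinate-clause arguments in the composite case).

KINSHIP_TERMS = {"mother", "father", "sister", "brother"}  # illustrative subset

def should_delete_possessive(a, b, possessive):
    """Delete the overt possessive if (1) it modifies a kinship term,
    (2) A and B share grammatical/semantic features (here: gender, number),
    and (3) its antecedent has been resolved to A or B."""
    if possessive["head_noun"] not in KINSHIP_TERMS:
        return False
    same_features = (a["gender"] == b["gender"] and a["number"] == b["number"])
    antecedent_ok = possessive["antecedent"] in (a["id"], b["id"])
    return same_features and antecedent_ok

# (4) Bill asked John to call his father.
bill = {"id": "A", "gender": "masc", "number": "sg"}
john = {"id": "B", "gender": "masc", "number": "sg"}
his_father = {"head_noun": "father", "antecedent": "B"}
print(should_delete_possessive(bill, john, his_father))  # True: 'his' is dropped
```

When the pronoun is dropped, the Russian synthesis side need not choose between svoi and a pronominal, which is exactly the disambiguation the rule avoids.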

The rule has been implemented in the ABBYY Compreno MT system and successfully verified.


Kurt Eberle

 

Kurt Eberle completed his dissertation and habilitation in linguistics and computational linguistics at the University of Stuttgart in 1991 and 2004, respectively. He holds master's degrees in Romance languages and mathematics from the universities of Freiburg and Heidelberg (1983 and 1987).

From 1987 until 1997 he was involved in various NLP projects at the University of Stuttgart and at IBM Research in Heidelberg. In 1997 he joined the MT group there, where he was responsible for the development of German-French translation.

In 1999 he was one of the co-founders of Lingenio GmbH (named Linguatec Entwicklung & Services at that time).

Since 2007 Kurt Eberle has been an associate professor at the University of Heidelberg, and since 2009 he has been general manager of the company.

He has published approximately 50 articles and monographs in the fields of MT, syntax and semantics, and has designed and managed a number of innovative products in the fields of MT and dictionaries at IBM and at Linguatec Development & Services/Lingenio.

 


AutoLearn < word >

AutoLearn extracts new translation relations for words and multiword expressions of any category from bilingual texts of any size, with high quality, and prepares the information found as a conventional dictionary entry, with morpho-syntactic and semantic classifications and contextual use conditions. The learning function uses Lingenio's MT system and the analysis components it is integrated in as knowledge sources, and adapts the dictionary, and as a consequence the MT system it is connected to, to the needs of the user. Manual intervention is restricted to a very small number of difficult cases and can be carried out easily in an ergonomic graphical user interface, without effortful training. This is enabled by the underlying MT architecture, with its rule-based core and additional statistical features. The use conditions attached to the new dictionary entries are abstracted from the local representation that the considered word or expression is part of in the considered reference(s). They restrict the corresponding translation to cases similar to the reference(s) the relation has been extracted from, thus avoiding interference with any alternative translations contained in the dictionary. A basic version is already available in the current version of Lingenio's translate.

AutoLearn is a technology that enables the user to automatically extract new bilingual dictionary entries, of high quality and with linguistic annotations, from bilingual texts and to integrate them in MT. The extracted relations may refer to single words and multiword expressions of any category. They comprise linguistic descriptions of the source and target expressions, including the corresponding morphological, syntactic and semantic categorizations (lemma, stem, inflection class, subcategorization (case) frame, semantic hyperonym) and conditions on use in context. The entries can be edited in the graphical dictionary user interface of the MT system, and the default descriptions can be changed if needed.

The extraction process and the integration of the new entries into the machine translation process require no training. Only if other translations of the word or expression in question are already available is sparse training needed, to balance the conditions of the new suggestion against those of the existing ones so that the existing suggestions are not hindered and the new one is integrated correctly. This training does not relate to the entire MT system, however, but only to the word, its translation and its references in the available corpus.

This is an important advantage over other terminology extraction tools. Others are the display of the extracted relations as conventional dictionary entries with linguistic descriptions, user-friendly interactive editing and modification options, and the immediate usability of the verified entries. The technology has the potential to be significantly extended in the longer run and to be combined with other translation and text-analysis functions. Its interpretation functions make it possible to quickly extend standardized bilingual terminology information from the translator workbench (TBX standard) with detailed linguistic annotations at a very low error rate and to integrate it into the MT system.

The background of AutoLearn is a hybrid MT architecture with a rule-based core and statistical features (cf. Babych et al. 2012). After aligning the sentences of a bilingual text, the method carries out structural analyses of a corresponding source-target sentence pair and then uses the translation knowledge of the MT system to relate words and expressions of the source structure and the target structure to each other. The structures that are related to each other develop from syntactic slot grammar analyses through a mapping to a more abstract, shallow semantic representation level. Slot grammar is a unification-based dependency grammar developed at IBM in the context of rule-based machine translation (McCord 1989a) and proven by application to many languages during IBM's Logic-based Machine Translation project (McCord 1989b) and later, among other things, in IBM Watson (McCord et al. 2012). The shallow semantic analysis level corresponds roughly to a compact underspecified representation of the predicate-argument structure of the sentence which, using information from the discourse representation of the context, can be extended to a flat underspecified discourse representation structure (FUDRS) of the corresponding theory (FUDRT); cf. Eberle (2002, 2004). FUDRT is an extension of (Underspecified) Discourse Representation Theory (DRT/UDRT); cf. Kamp (1981), Kamp and Reyle (1993), Reyle (1993).

The advantage of the FUDRT representation for the purpose of AutoLearn is that it abstracts from the details of the surface structure and concentrates on representing the events and states, and the corresponding roles, as described in the sentence. Since on this level source and target representations are typically more similar to each other than on the syntactic level, it is much better suited for the extraction of correspondences. Nevertheless, once extracted, the correspondences may be enriched with information from the 'lower' (morphological and syntactic) representation levels. As the analysis components include defaults for handling words unknown to the system, the AutoLearn technology can also output entries, including linguistic annotations, for such words and expressions (of the source or the target language). The mapping between expressions of the source and target sentence provides different levels of certainty; relations the system can be sure about are called anchors. In a cascaded procedure, a best coverage of source and target sentence is searched for. From this, the new relations are extracted together with the linguistic annotations they bear and with use conditions abstracted from the local representation structure they are part of.
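The anchor-based step can be illustrated with a deliberately simplified sketch. This is not Lingenio's code: it works on flat token lists rather than FUDRS structures, and the dictionary, the sentence pair and the single-leftover heuristic are illustrative assumptions only.

```python
# Toy sketch of anchor-based candidate extraction: word pairs the
# bilingual dictionary already relates serve as anchors; in the simplest
# case, a single unanchored word on each side yields a candidate entry.

def extract_candidates(src_tokens, tgt_tokens, known_dict):
    anchors = [(s, t) for s in src_tokens for t in tgt_tokens
               if known_dict.get(s) == t]
    anchored_src = {s for s, _ in anchors}
    anchored_tgt = {t for _, t in anchors}
    leftover_src = [s for s in src_tokens if s not in anchored_src]
    leftover_tgt = [t for t in tgt_tokens if t not in anchored_tgt]
    if len(leftover_src) == 1 and len(leftover_tgt) == 1:
        # One unmatched word per side: propose a new translation relation.
        return [(leftover_src[0], leftover_tgt[0])]
    return []  # ambiguous cases would go through the cascaded procedure

known = {"the": "der", "dog": "Hund"}
print(extract_candidates(["the", "dog", "barks"],
                         ["der", "Hund", "bellt"], known))
# [('barks', 'bellt')]
```

In the real system the candidate would then be annotated (lemma, inflection class, frame, hyperonym) and furnished with use conditions abstracted from the representation it came from.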

A basic version of AutoLearn is already available in the current version of Lingenio's MT series translate. The more comprehensive version described here will be integrated into the next version, which is currently in development.

——————–
Literature

Bogdan Babych, Kurt Eberle, Johanna Geiß, Mireia Ginestí-Rosell, Anthony Hartley, Reinhard Rapp, Serge Sharoff and Martin Thomas (2012): Design of a Hybrid High Quality Machine Translation System. Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), EACL 2012, Avignon.
Kurt Eberle (2002): Tense and Aspect Information in a FUDR-based German French Machine Translation System. In: Hans Kamp and Uwe Reyle (eds.), How we say WHEN it happens. Contributions to the theory of temporal reference in natural language, pp. 97–148. Niemeyer, Tübingen, Linguistische Arbeiten, vol. 455.
Kurt Eberle (2004): Flat underspecified representation and its meaning for a fragment of German. Habilitation thesis, Universität Stuttgart.
Hans Kamp (1981): A Theory of Truth and Semantic Representation. In: J.A.G. Groenendijk, T.M.V. Janssen and M.B.J. Stokhof: Formal Methods in the Study of Language, Mathematical Centre Tract, Amsterdam.
Hans Kamp and Uwe Reyle (1993): From Discourse to Logic, Kluwer Academic Publishers, Dordrecht.
Michael C. McCord (1989a): Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. Natural Language and Logic 1989: 118-145
Michael C. McCord (1989b): Design of LMT: A Prolog-Based Machine Translation System. Computational Linguistics 15(1): 33-52 (1989)
Michael C. McCord, J. William Murdock and Branimir Boguraev (2012): Deep parsing in Watson. IBM Journal of Research and Development 56(3).
Uwe Reyle (1993): Dealing with ambiguities by underspecification: Construction, representation, and deduction, Journal of Semantics 10(2), pp. 123-179


Kevin Flanagan

 

Kevin Flanagan is pursuing a full-time PhD at Swansea University.

He spent many years as a software developer, using language skills with French clients, prior to starting work as a freelance technical translator. Having used a number of translation memory (TM) systems in that capacity, he developed a prototype TM system providing more effective sub-sentential recall. His PhD research focusses on extending and refining the system, formalising the theoretical principles and delivering a production-ready implementation.


Filling in the gaps: what we need from subsegment TM recall

Alongside increasing use of Machine Translation (MT) in translator workflows, Translation Memory (TM) continues to be a valuable tool providing complementary functionality, and is a technology that has evolved in recent years, in particular with developments around subsegment recall that attempt to leverage more content from TM data than segment-level fuzzy matching. But how fit-for-purpose is subsegment recall functionality, and how do current CAT tool implementations differ? This paper presents results from the first survey of translators to gauge their expectations of subsegment recall functionality, cross-referenced with a novel typology for describing subsegment recall implementations. Next, performance statistics are given from an extensive series of tests of four leading CAT tools whose implementations approach those expectations. Finally, a novel implementation of subsegment recall, ‘Lift’, will be demonstrated (integrated into SDL Trados Studio 2014), based on subsegment alignment and with no minimum TM size requirement or need for an ‘extraction’ step, recalling fragments and identifying their translations within the segment even with only a single TM occurrence and without losing the context of the match. A technical description will explain why it produces better performance statistics for the same series of tests and in turn meets translator expectations more closely.

Translation Memory (TM) has been credited with creating a ‘revolution’ in the translation industry (Robinson 2003: 31). While Machine Translation (MT) – in particular, Statistical Machine Translation (SMT) – is once again transforming how the industry works, and according to Pym, “expected to replace fully human translation in many spheres of activity” (Pym, 2013), TM still very much has a place, either when used alongside MT, or for projects where MT is not used. Widely-used CAT tools such as SDL Trados Studio and Wordfast Pro – products built around TM – provide MT system integration allowing translators to benefit from MT and TM translations, reflecting the assertion from Kanavos and Kartsaklis that “MT – when combined properly with Translation Memory (TM) technologies – is actually a very useful and productive tool for professional translation work” (Kanavos and Kartsaklis 2010). TM results may be valued alongside MT not least because of a distinction noted by Teixeira, that “TM systems show translators the ‘provenance’ and the ‘quality’ of the translation suggestions coming from the memory, whereas MT systems display the ‘best translation suggestion possible’ without any indication of its origin or degree of confidence” (Teixeira 2011: 2). Waldhör describes the implementation of a ‘recommender’ system intended to exploit such provenance distinctions (Waldhör 2014). Provenance factors aside, TM can complement MT in providing immediate recall of new translation content in a project, without any SMT retraining requirement or risk of ‘data dilution’ (Safaba [no date]), and can be used where there is too little (or no) relevant data with which to train an SMT engine.

Nevertheless, the segment-oriented nature of TM has seemed to restrict its usefulness, in ways to which MT provides an alternative. Bonet explains that, for the TMs at the DGT, “Many phrases were buried in thousands of sentences, but were not being retrieved with memory technology because the remainder of the sentence was completely different” (Bonet 2013: 5), and that SMT trained on those memories enabled some of that ‘buried’ content to be recalled and proposed to translators. In principle, TM subsegment recall – automatically finding phrases within segments that have been translated before, when they occur in a new text, and automatically identifying the corresponding translated phrase in the previously-translated segment – should recover all that content, with all the aforementioned TM benefits that complement MT. In practice, while described by Zetzsche as “probably the biggest and the most important development in TM technology” (Zetzsche 2014), implementations in TM systems of subsegment recall vary widely, and fall very short of that level of capability.
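The core idea of subsegment recall, finding previously translated phrases buried inside otherwise different segments, can be shown with a toy sketch. This is not the 'Lift' implementation: real systems also identify the translated counterpart of each recalled phrase, and the string-containment test, token n-gram search and minimum length are simplifying assumptions.

```python
# Toy subsegment recall: collect token n-grams (length >= min_len) of a
# new segment that occur inside the source side of existing TM segments.

def subsegment_matches(new_segment, tm_sources, min_len=3):
    tokens = new_segment.split()
    found = set()
    for n in range(len(tokens), min_len - 1, -1):     # longest first
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if any(phrase in src for src in tm_sources):
                found.add(phrase)
    return found

tm = ["the committee approved the annual report without amendment"]
print(subsegment_matches("the rapporteur presented the annual report", tm))
# {'the annual report'}
```

Even with a single TM occurrence, the phrase is recalled; the harder step, which 'Lift' addresses via subsegment alignment, is then pinpointing its translation inside the stored target segment without losing the context of the match.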

This presentation focusses on TM subsegment recall in three ways. Results are presented from the first survey of translators to gauge their expectations of subsegment recall functionality, cross-referenced with a novel typology for describing subsegment recall implementations. Next, performance statistics are given from an extensive series of tests of four leading current-generation CAT tools whose implementations approach those expectations. Finally, a novel implementation of subsegment recall, ‘Lift’, will be demonstrated (integrated into SDL Trados Studio 2014), based on subsegment alignment and with no minimum TM size requirement or need for an ‘extraction’ step, recalling fragments and identifying their translations within the segment even with only a single TM occurrence and without losing the context of the match. A technical description will explain why it produces better performance statistics for the same series of tests and in turn meets translator expectations more closely.


Nizar Ghoula and Jacques Guyot

 

Nizar Ghoula is a PhD candidate in Information Systems at the University of Geneva.

His fields of interest include the semantic web, knowledge representation and ontology alignment.
He joined the OLANTO team in order to collaborate on the development and enhancement of many CAT tools.

He is also a project manager within the Executive department of the University of Geneva, designing a solution for executive programme management.

 

Dr. Jacques Guyot is a senior computer scientist with over 20 years of experience in turning breakthrough technologies into professional solutions. He is also a researcher at the University of Geneva. He holds a PhD in Computer Science from the University of Geneva.

 

 

Prof. Gilles Falquet

Gilles is a professor and researcher at the Centre for Computer Sciences of the University of Geneva. His research is centered around the issue of “accessing knowledge” and its focus includes knowledge-based indexing and information retrieval. Gilles holds a PhD in computer science from the University of Geneva.


Terminology Management revisited

Large repositories that publish and share terminological, ontological and linguistic resources are available to support the development and use of translation. However, despite the availability of language resources within online repositories, some language associations cannot be found (rare languages, uncommon combinations, etc.). Consequently, multiple tools for composing linguistic and terminological resources offer the possibility to create the missing language associations. These generated resources need to be validated before they can be used effectively. Manually checking them is a tedious task, and in some cases nearly impossible, due to the large number of entities and associations to go through or to the lack of expertise in both languages. To generate sound and safe content, tools are therefore needed that automatically validate the resources and filter out associations that make no sense. Such a validation tool is itself based on external resources, such as parallel corpora, which need to be either collected or created and filtered. To address these matters we propose a tool that generates new terminological resources (myTerm) and filters them using a parallel corpus generated by another tool (myPREP). We describe our repository-based methodology for terminology management and present its evaluation.

As a consequence of the availability of large repositories publishing and sharing terminological, ontological and linguistic resources on the Web, we notice a significant improvement in the quality of automatic and semi-automatic translation. Nevertheless, these resources are not yet available for all possible pairs of languages. For example, to run or enhance a translation process from a language A to a language B, we may need a dictionary or a terminology associating both languages. Despite the existence of these types of resources for A and B within online repositories, none of them may directly associate A and B (particularly in the case of rare languages or uncommon combinations). One way to address this issue is to use available language resources to generate the missing ones. Hence, automatically deriving terminologies by transitivity has become a very common procedure for producing supporting resources for language services. Within this context, the Olanto foundation (www.olanto.org) produces open-source tools for professionals. One of the latest tools in development by Olanto is the myTerm terminology manager. Based on previous research on building a repository of multilingual terminological and ontological resources [Ghoula et al. 2010], we identified the following objectives for such a tool:

  • Compatibility of the resources representation models with TBX (basic);
  • Ability to manage a large number of terminological resources;
  • Ability to support a large number of standards and formalisms for resources representations (TBX, UTX, DXDT, GlossML, …);
  • Availability of XML-based representation models for structured resources that do not correspond to all standards or formalisms (e.g. JIAMCATT: www.jiamcatt.org).

In myTerm, resources are imported into the terminology manager's repository and attached to a hypergraph in which terminological resources from different domains connect languages to each other, either directly or indirectly by transitivity or composition. For example, if we have a glossary EN->FR and another glossary FR->DE, we can use composition to generate a new glossary EN->DE. It is well known that polysemy within the two resources can produce associations between pairs of terms that do not make sense. For example, starting from the association time->temps in EN->FR and the associations temps->Zeit and temps->Wetter in FR->DE, the composition produces two term associations for EN->DE: time->Zeit and time->Wetter. Consequently, this kind of operation on terminological resources is not completely safe in terms of sense. The resulting terminological resource therefore has to be filtered to detect and remove term associations that are meaningless with regard to the languages involved. Manually checking these resources is a tedious task and in some cases even impossible, due to the large number of entities and connections to go through or to the lack of expertise in both languages.
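The composition step, and the spurious association it can create, can be reproduced in a few lines. This is a sketch under simplifying assumptions (glossaries as plain dictionaries mapping a term to a list of translations), not myTerm's data model.

```python
# Glossary composition by transitivity: EN->FR composed with FR->DE
# yields EN->DE, including the spurious pair caused by the polysemy
# of French 'temps' ('time' / 'weather').

en_fr = {"time": ["temps"]}
fr_de = {"temps": ["Zeit", "Wetter"]}

def compose(g1, g2):
    out = {}
    for src, pivots in g1.items():
        for pivot in pivots:
            for tgt in g2.get(pivot, []):
                out.setdefault(src, []).append(tgt)
    return out

print(compose(en_fr, fr_de))
# {'time': ['Zeit', 'Wetter']} -- time->Wetter is the spurious association
```

The output contains both the correct pair time->Zeit and the meaningless pair time->Wetter, which is exactly why the filtering step described next is needed.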

To solve this issue and generate sound and safe terminological resources, we need tools that automatically filter and validate this type of resource and remove associations that do not make sense. In the context of composing ontology alignments, we encountered the issue of inconsistent mappings, which can be solved using reasoning and confidence measures to filter mappings [Ghoula et al. 2013]. Unfortunately, for associations in terminological resources there are no standards or use cases allowing the application of confidence measures. However, it is possible to use a parallel corpus of aligned sentences in the two languages to assign a confidence measure to associations between pairs of words. This measure is based on the co-occurrence of the two terms in the sentences of the corpus. To retrieve the aligned sentences containing the associated terms we use the indexer [Guyot et al. 2006] of our myCAT tool.

Going back to our example, we can find in the EN->DE corpus a number of co-occurrences of time and Zeit that confirm the time->Zeit association (« Members shall furnish statistics and information within a reasonable time… » -> « Die Mitglieder legen Statistiken und Angaben innerhalb einer angemessenen Zeit … »), whereas no co-occurrence confirms the time->Wetter association (« Measurement shall be carried out in fine weather with little wind. » -> « Die Messungen sind bei klarem Wetter und schwachem Wind vorzunehmen. »). The reference parallel corpora are produced by myPREP, Olanto's text aligner. This tool automatically aligns pairs of documents from a multilingual corpus at the sentence level and generates a translation memory in TMX format. It also generates comparable corpora.
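A minimal co-occurrence-based confidence measure can be sketched as follows. The exact formula used by myTerm is not specified here; the ratio below, the substring matching and the two-pair corpus are illustrative assumptions.

```python
# Sketch of co-occurrence filtering over aligned sentence pairs: the
# confidence of a term association is the fraction of source sentences
# containing the source term whose aligned target sentence contains the
# target term. Associations with zero confidence are discarded.

def confidence(src_term, tgt_term, aligned_pairs):
    src_hits = sum(1 for s, _ in aligned_pairs if src_term in s)
    co_hits = sum(1 for s, t in aligned_pairs
                  if src_term in s and tgt_term in t)
    return co_hits / src_hits if src_hits else 0.0

corpus = [
    ("within a reasonable time", "innerhalb einer angemessenen Zeit"),
    ("at any time", "zu jeder Zeit"),
]
print(confidence("time", "Zeit", corpus))    # 1.0 -> keep time->Zeit
print(confidence("time", "Wetter", corpus))  # 0.0 -> drop time->Wetter
```

On this toy corpus the measure keeps time->Zeit and removes time->Wetter, mirroring the example above.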

In this paper we describe our approach, present the architecture of the myTerm repository, and define operations for producing, managing and filtering terminological resources for language association. We explain in detail the computation of the measures that filter term associations based on term co-occurrence.


Najeh Hajlaoui

 

Najeh Hajlaoui received his PhD in computer science from Joseph Fourier University (Grenoble, France) in 2008, for work on the multilingualization of e-commerce systems handling spontaneous utterances in natural language.

In 2002 he received his MS in information systems from Joseph Fourier University, together with the Joint European Diploma MATIS (Management and Technology of Information Systems).

He is currently Senior Researcher and Project Manager for Machine Translation at the European Parliament in Luxembourg (since August 2013).

Before joining the Idiap Research Institute in December 2011, he was a Research Fellow at the University of Wolverhampton (UK) in 2011, a postdoctoral researcher at Orange Labs (Lannion, France) in 2010, and an associate lecturer at Jean Monnet University (Saint-Étienne, France) from 2007 to 2009.


SMT for restricted sublanguage in CAT tool context at the European Parliament

This paper shows that it is possible to efficiently develop Statistical Machine Translation (SMT) systems that are useful for specific sublanguages in a real context of use, even when the part of the test set that has an exact match with Translation Memories (TM) is excluded so that the systems can be integrated into CAT tools.

Because we believe in the proximity of sublanguages, even though it is still hard to define a sublanguage in practice, we propose, within the framework of the MT@EP project at the European Parliament (EP), to develop SMT systems specific to each EP parliamentary committee, optimised for restricted sublanguages and constrained by the EP's particular translation requirements.

Sublanguage-specific systems provide better results than generic systems for EP domains, showing a very significant quality improvement (5-25% BLEU), mainly due to the specificity of the EP context and to the proximity of the sublanguages. This approach is also satisfactory for pairs of under-resourced languages, such as the Slavic families and German.

The sublanguage-specific systems will be integrated into the EP translation workflow to improve TM results, offering previous human translations as the first priority. The development of an algorithm to translate only unmatched segments with MT is in progress.

This paper shows that it is possible to efficiently develop Statistical Machine Translation (SMT) systems that are useful for specific sublanguages in a real context of use, even when the part of the test set that has an exact match with Translation Memories is excluded so that the systems can be integrated into CAT (Computer-Aided Translation) tools. This means that the included part is quite different from the existing translations and consequently harder to translate, even for an SMT system trained on the same translation data.

Because we believe in the proximity of sublanguages, even though it is still hard to define a sublanguage in practice, we propose, within the framework of the MT@EP project at the Directorate-General for Translation (DG TRAD) of the European Parliament (EP), to develop SMT systems specific to each EP parliamentary committee, optimised for restricted sublanguages and constrained by the EP's particular translation requirements.

In fact, in our previous research we showed that an SMT system works very well for a small sublanguage with a very small training corpus (fewer than 10,000 words). This suggests that, in the case of very small sublanguages, statistical MT may be of sufficient quality starting from a corpus 100 to 500 times smaller than for the general language. In this work we demonstrate the validity of this approach in a real and restricted context of use, clarifying and answering some related questions. We also describe the type of resources we need, mainly Thematic Translation Memories (TTM), and present some promising results.

In general, a sublanguage is a subset of the language identified with a particular semantic domain or a linked family of domains. For instance, in our context, health, environment, economy, etc. each seem to constitute a restricted sublanguage. In the context of the existing applications developed in our DG TRAD, one of the constraints to take into account concerning the use of MT in the workflow is to use the ad-hoc vocabulary employed to translate EP documents (amendments, laws, etc.). Our objective is to reuse an existing base of translation memory data to better translate unmatched sentences, to be validated by translators. Our technical choice involves automatic selection of data to resolve problems of context and quality.

The first results are promising. SMT systems have been developed mainly for English-to-French in three domains: Economy (ECON), Environment (ENVI) and Budgetary Control (CONT). The results show a significant improvement of 5-25% in BLEU score over generic MT systems, depending on the domain and/or the language pair. This is mainly due to lexical convergence, which is the main characteristic of a restricted sublanguage, and also to the specificity of the EP context and the proximity of the sublanguages. This approach is also satisfactory for pairs of under-resourced languages, such as the Finno-Ugric or Slavic languages and German.

In contrast to the huge volumes of data used to develop generic engines (more than 20 million sentences), the training data used to develop specific systems are very small (fewer than 100 thousand sentences). However, the choice of the data sets is very important: a carefully selected training set avoids the introduction of ambiguous translations, and a complete test set is more representative of the domain than a single EP document. With a single EP document, the translation result cannot be generalized, since it depends on how well the document happens to match the training data. Consequently, it is very important to use a test set that is representative of the domain.

By excluding sentences that have an exact match in the translation memories, we reduce the representativeness of the domain, but we still obtain an improvement of 5-15% in BLEU score over a generic system. The performance of a specific system is also proportional to the coverage of the domain, which is usually reached after a certain size of training data. We propose some indicators of the domain coverage.
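The coverage indicator itself is left unspecified above; one simple candidate (our own illustration, not necessarily the authors' metric) is the out-of-vocabulary (OOV) rate of a representative test set against the training vocabulary:

```python
def oov_rate(train_sentences, test_sentences):
    """Share of test-set tokens unseen in training: a rough coverage indicator."""
    train_vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in train_vocab)
    return unseen / len(test_tokens)

# Toy example: a lower OOV rate suggests better domain coverage.
train = ["the committee adopted the budget", "the budget was amended"]
test = ["the committee amended the budget", "new fisheries policy"]
print(oov_rate(train, test))  # 3 of 8 test tokens are unseen -> 0.375
```

As the training data approaches full coverage of the domain's restricted vocabulary, this rate flattens out, which matches the observation that coverage is "usually reached after a certain size of training data".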

The sublanguage-specific systems will be integrated into the EP translation workflow to improve Translation Memory (TM) results, offering previous human translations as a priority. To give Translation Memories further priority, the development of an algorithm to translate only unmatched segments with MT is in progress. The segments sent to MT include all sentences whose best match score falls below an exact match (e.g. between 82% and 99%).
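As a rough illustration of such a routing step (the thresholds and labels below are our own sketch, not the MT@EP implementation), a segment could be served from the TM on an exact match, offered as a fuzzy match inside the 82-99% band, and sent to the sublanguage-specific engine otherwise:

```python
from difflib import SequenceMatcher

def route_segment(segment, tm_sources, lo=0.82, hi=0.99):
    """Serve a segment from the TM on a (near-)exact match, offer it as a
    fuzzy match inside the [lo, hi] band, and send it to MT otherwise."""
    best = max((SequenceMatcher(None, segment, s).ratio() for s in tm_sources),
               default=0.0)
    if best > hi:
        return "tm-exact"
    if lo <= best <= hi:
        return "fuzzy-match"
    return "mt"

tm = ["the committee adopted the report"]
print(route_segment("the committee adopted the report", tm))  # tm-exact
print(route_segment("completely unrelated sentence", tm))     # mt
```

A production workflow would use the CAT tool's own fuzzy-match scoring rather than a character-level ratio, but the routing logic is the same.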


Miguel A. Jiménez-Crespo

 

Miguel A. Jiménez-Crespo is an Assistant Professor in the Department of Spanish and Portuguese at Rutgers University, where he directs the MA program in Spanish Translation and Interpreting.

He holds a PhD in Translation and Interpreting Studies from the University of Granada, Spain.

His research focuses on the intersection of translation theory, translation technology, digital technologies, corpus-based translation studies and translation training.

He is the author of Translation and Web Localization published by Routledge in 2013, and has published extensively in the top international journals in the discipline of Translation Studies.


Beyond prescription: What empirical studies are telling us about localization crowdsourcing

During the last two decades we have witnessed the emergence of the digital era that now permeates most aspects of human lives (Cronin 2013; Folaron 2012).

Translation, as the enabler of these transnational flows, has gained increasing attention in this context, and has expanded considerably, due to the democratizing, participatory and open nature of the Web 2.0. This dramatic shift has given rise to new phenomena and practices, such as crowdsourcing and collaborative translation practices enabled by web platforms (O'Hagan 2011; Olohan 2013).

The Localization Industry has initially responded to this challenge with prescriptive accounts of crowdsourcing initiatives and of best practices (e.g. European Commission 2012; DePalma and Kelly 2011; Munro 2011).

Translation Studies has recently begun to turn its attention to this phenomenon, mostly from an empirical perspective. Studies have focused on motivation of volunteers to participate (Camara forthcoming; Olohan 2014; McDonough-Dolmaya 2012; O’Brien and Schäler 2010) and corpus-based studies of translation quality and naturalness (Jimenez-Crespo forthcoming, 2013; Olvera and Gutierrez 2012).

This presentation reviews the findings of these studies and interrelates them with the prescriptive accounts from the industry. In doing so, it will help bridge the existing gap between the localization industry and Translation Studies.

During the last two decades we have witnessed the emergence of the digital era that now permeates most aspects of human lives (Cronin 2013; Folaron 2012). This digital revolution has enabled a globalized world in which people, cultures, knowledge, businesses and communications move across borders in ever increasing volumes.

Translation, as the enabler of these transnational flows has expanded considerably, partly due to the dynamic nature of the Web 2.0.

This dramatic shift has given rise to new phenomena and practices, such as crowdsourcing and other volunteer translation of digital content (Babych et al 2012; O'Hagan 2013; Jiménez-Crespo 2013a; Olohan 2014). These new practices are reshaping society's views and attitudes towards translation, as well as Translation Studies as a discipline.

Different stakeholders in this field have responded differently to this new exciting phenomenon.

The Localization Industry has initially responded to this challenge with prescriptive accounts of crowdsourcing initiatives and of best practices (e.g. European Commission 2012; Désilets and Van der Meer 2011; DePalma and Kelly 2011; Mesipuu 2012).

Computational Linguistics and Machine Translation have also delved into the development of workflows and models to harness the knowledge of the crowd (e.g. Shimohata et al. 2001; Zaidan and Callison-Burch 2011; Morera-Mesa and Filip 2013).

Translation Studies has recently begun to turn its attention to this phenomenon, mostly from an empirical perspective. Empirical research has mostly focused on motivation of volunteers to participate in translation initiatives (Camara forthcoming; Olohan 2014; McDonough Dolmaya 2012; O’Brien and Schäler 2010), as well as corpus-based studies of translation quality and naturalness in crowdsourced texts (Jimenez-Crespo 2013b; Olvera and Gutierrez 2012).

This presentation reviews the findings of these studies and interrelates them with the prescriptive accounts from the industry.

What can these studies add to the way the localization industry approaches crowdsourcing? How can Translation Studies research help develop new approaches to localization crowdsourcing?

In doing so, it will help bridge the existing gap between the localization industry and Translation Studies.
——————–
References

Babych, B. et al. (2012). “MNH-TT: a Collaborative Platform for Translator Training”. Translating and the Computer 34. Online. Available HTTP: <http://www.mt-archive.info/Aslib-2012-Babych.pdf>
Camara, L. (forthcoming). “Motivation for Collaboration in TED Open Translation Project”. International Journal of Web Based Communities.
DePalma, D. A. and Kelly, N. (2011). “Project management for crowdsourced translation. How user translated content projects work in real life”. In Dunne, K. J. and Dunne, E. S. (eds.) Translation and localization project management: The art of the possible. Amsterdam, Philadelphia: John Benjamins Publishing Company, pp. 379-408.
Désilets, A. and Van der Meer, J. (2011). “Co-creating a repository of best practices for collaborative translators”. In O’Hagan, M. (ed.) Linguistica Antverpienisia New Series – Themes in Translation Studies. Translation as a Social Activity, 10/2011, pp. 27-45.
European Commission (2011). Crowdsourcing. Brussels: Directorate General of Translation.
Jimenez-Crespo, M. (2013a). Translation and Web Localization. London: Routledge.
Jiménez-Crespo, M. A. (2013b). “Crowdsourcing, Corpus Use, and the Search for Translation Naturalness: A Comparable Corpus Study of Facebook and Non-Translated Social Networking Sites”. TIS: Translation and
Interpreting Studies, 8: 23-49.
Morera-Mesa, J.J. and D. Filip. (2013). “Selected Crowdsourced Translation Practices”. ASLIB Translating and the Computer 35, 2013.
McDonough Dolmaya, J. (2012). “Analyzing the Crowdsourcing Model and its Impact on Public Perceptions of Translation”. The Translator, 18(2): 167-191.
Mesipuu, M. (2012). “Translation crowdsourcing and user-translator motivation at Facebook and Skype”. Translation Spaces, 1:33-53.
Munday, J. (2011). Introducing Translation Studies. London: Routledge.
O’Brien, S. and Schäler, R. (2010). “Next Generation Translation and Localization. Users are Taking Charge”. Translating and the Computer Conference, 17-18 November 2010, London. Online. Available HTTP: <http://doras.dcu.ie/16695/1/Paper_6.pdf>
O’Hagan, M. (2013). “The Impact of New Technologies on Translation Studies: A Technological Turn?”. In Millán-Varela, C. and Bartrina, F. (ed), Routledge Handbook of Translation Studies. London: Routledge, pp. 503- 518.
Olohan, M. (2014). “Why do you Translate? Motivation to Volunteer and TED translation”. Perspectives: Studies in Translatology, 7:1, 17-33.
Perez, E and O. Carreira (2011). “Evaluación del Modelo de Crowdsourcing Aplicado a la Traducción de Contenidos en Redes Sociales: Facebook”. In Calvo Encinas, E. et al. (eds.), La Traductología Actual: Nueva Vías de Investigación en la Disciplina. Granada: Comares, pp. 99-118
Zaidan, O. F. and Callison-Burch, C. (2011). “Crowdsourcing Translation: Professional Quality from Non-Professionals”. Proceedings of the 49th Annual Meeting of the Association of Computational Linguistics, pp. 1120-1129. Online. Available HTTP: <http://www.cs.jhu.edu/%7Eozaidan/AOC/turk-trans_Zaidan-CCB_acl2011.pdf>


Terence Lewis

 

Terence Lewis, MITI, entered the world of translation as a young brother in an Italian religious order, when he was entrusted with the task of translating some of the founder’s speeches into English. His religious studies also called for a knowledge of Latin, Greek and Hebrew.

After some years in South Africa and Brazil, he severed his ties with the Catholic Church and returned to the UK where he worked as a translator, lexicographer (Harrap’s English-Brazilian Portuguese Business Dictionary) and playwright.

As an external translator for Unesco he translated texts ranging from Mongolian cultural legislation to a book by a minor French existentialist.

At the age of 50 he taught himself to program and wrote a Dutch-English machine translation application which has been used to translate documentation for some of the largest engineering projects in Dutch history.

For the past 15 years he has devoted himself to the study and development of translation technology.


Getting the best out of a mixed bag

This presentation discusses the development and implementation of an approach to the combination of machine translation and translation memory technologies in a TM-vendor- and platform-independent environment.

In this workflow the machine translation system itself is able to consult and draw upon the content of any number of relevant translation memories and phrase tables containing subsegments (n-grams) prior to the operation of the rule-based stage of the machine translation process.

Since the machine translation engine directly searches for matches in the TMX files it is possible to enjoy the benefits of translation memory without deploying a commercial translation memory application.

The output of the process is a TMX file containing a varying mixture of TM-generated and MT-generated sentences. Translators can import this file into their respective translation memory applications for post-editing or “sanity checking”.

The author has designed this workflow using his own language engineering tools written in Java. However, this workflow could be easily implemented using NLP tools available in the Open Source community.

There is broad agreement today that improvements in the fluency of machine translation output can be achieved by the use of approaches that harness human translations. This paper discusses the development and implementation of an approach to the combination of machine translation and translation memory technologies in a TM-vendor- and platform-independent environment. Early methods for combining machine translation and translation memory tools involved using an analysis made by translation memory software to produce an export file containing unknown segments, which were then fed into a machine translation system. This technique has been superseded in practice by the introduction of MT plug-ins, which are now available in the major commercial translation memory applications. Many professional translators use these plug-ins in a “pre-translate” stage, in preference to fuzzy matches below 70%, to produce draft translations which they then revise.

The author has designed and implemented an approach whereby the machine translation system itself is able to consult and draw upon the content of any number of relevant translation memories and phrase tables containing subsegments (ngrams) prior to the operation of the rule-based stage of a machine translation process. The translation memories are built from bitext in the public domain, the work of human translators in organisations with which the author has a contractual relationship and post-edited machine translations. All this data is stored in TMX files. Since the machine translation engine directly searches for matches in the TMX files it is possible to enjoy the benefits of translation memory without deploying a commercial translation memory application. The output of the process is a TMX file containing a varying mixture of TM-generated and MT-generated sentences. Translators import this file into their respective translation memory applications for “sanity checking”, post-editing or full-blown revision.
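The TM-first, MT-fallback behaviour described above can be sketched as follows. This is a minimal illustration assuming a simple TMX layout (one `tuv` per language carrying an `xml:lang` attribute and a `seg` child), not the author's Java implementation:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under its expanded namespace form.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def load_tmx(path, src="nl", tgt="en"):
    """Build an exact-match source->target lookup table from a TMX file."""
    table = {}
    for tu in ET.parse(path).iterfind(".//tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg")
                for tuv in tu.iterfind("tuv")}
        if src in segs and tgt in segs:
            table[segs[src]] = segs[tgt]
    return table

def translate(sentence, tm_table, mt_engine):
    """TM first, MT fallback: the 100% match behaviour described above."""
    hit = tm_table.get(sentence)
    return hit if hit is not None else mt_engine(sentence)
```

Because the lookup operates directly on the TMX data, no commercial TM application is needed at this stage; the MT engine plays the role of the fallback for everything the memories do not cover.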

The search of the translation memories will only retrieve 100% matches. In practice, many sentences in the input file will match at subsegment level with so-called “fuzzy matches”. In the automated workflow set up by the author, subsegments of the translation units contained in the translation memories are stored in phrase tables. These phrase tables are constructed by fragmenting or decomposing the sentences contained in the translation memories through the application of a series of syntactic rules. The end result superficially resembles the phrase tables used in some statistical machine translation applications. At runtime the phrase tables are consulted after the translation memories. They typically comprise noun phrases and verb phrases. Unlike the results of the translation memory search, which are complete sentences, these subsegments are tagged and may even contain some semantic information. At this stage in the workflow the in-process sentence may comprise a mixture of strings from the source sentence and tagged subsegments in the target language. An example of such an in-process sentence (for the Dutch-English pair) would be:
for the purposes of this agreement legal entity means any natural or any legal person created under national law at its place of establishment or under Community law en die rechtspersoonlijkheid bezit en in eigen naam ongeacht welke rechten en plichten kan hebben.

This sentence would then be piped into the rule-based engine which would attempt to resolve the parts of the sentence that are untagged. In this resolution, the engine is able to avail itself of the syntactic information provided by the tags, thereby achieving a more accurate translation than a “fuzzy match”.

Whilst certainly not the only approach to automated translation that combines data from translation memories with the results of rule-based translation, this process has the advantage of being executable in a totally automated way from the command line and produces an output file with very few “unknowns” that can be imported into any commercial translation memory application. The process itself, however, does not necessitate the use of a commercial TM environment.

The author has designed this workflow using his own language engineering tools written in Java. However, this workflow could be easily implemented using NLP tools available in the Open Source community.


Victoria Porro

 

Victoria Porro holds a bachelor's and a master's degree in Translation Studies and Translation Technologies.

She joined the Translation Technologies Department of the University of Geneva as a Research and Teaching Assistant in June 2012.

Since then, she has devoted most of her time to the EU-funded ACCEPT project, in which she participates, and she is currently designing a PhD project on post-editing and machine translation.

She is most interested in opening new lines of research in post-editing and advocates for the recognition of post-editing as a highly skilled activity.

 

Johanna Gerlach

Johanna Gerlach started working as a Research and Teaching Assistant at the Translation Technologies Department of the University of Geneva in 2008.

She contributed to the MedSLT and CALL-SLT projects, developing linguistic resources for German.

Currently, she is involved in the ACCEPT European project, investigating pre- and post-editing technologies for user generated content.

In 2012, she began working on her PhD thesis, which focuses on the development and evaluation of pre-editing rules for French forum content.

Other co-authors Pierrette Bouillon and Violeta Seretan provided no bio data.


Rule-based automatic post-processing of SMT output to reduce human post-editing effort: a case study

User-generated content (UGC) now represents a large share of the informative content available on the web. However, its uneven quality can hinder both readability and machine-translatability, preventing sharing of knowledge between language communities.

The ACCEPT project (http://www.accept-project.eu/) aims at solving this issue by improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies. Within this project, we have developed linguistic post-editing rules intended to reduce post-editing effort, by automatically correcting the most frequent errors before submitting MT output to the post-editor.

In the present study, we focus on English to French SMT and describe and evaluate post-editing rules for French. The post-editing rules treat two types of phenomena: (1) MT-specific errors, and (2) general spelling and grammar rules.

To quantify the usefulness of these rules, we developed a tool that checks automatically if the post-editors have kept the modifications produced by the rules. In order to evaluate this tool, we will compare the obtained results with those produced by a manual evaluation.

Since the emergence of the web 2.0 paradigm, forums, blogs and social networks are increasingly used by online communities to share technical information. User-generated content (UGC) now represents a large share of the informative content available on the web.

However, its uneven quality can hinder both readability and machine-translatability, preventing sharing of knowledge between language communities (Jiang et al, 2012; Roturier and Bensadoun, 2011).

The ACCEPT project (http://www.accept-project.eu/) aims at solving this issue by improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies. Within this project, the forums used are those of Symantec, one of the project partners. Pre-editing and post-editing rules are developed and applied using the technology of another project partner: the AcrolinxIQ engine (Bredenkamp et al, 2000). This rule-based engine uses a combination of NLP components and enables the development of declarative rules, written in a formalism similar to regular expressions, based on the syntactic tagging of the text. Specific plugins allow compliance with pre- and post-editing rules to be checked in real time in the Symantec forum interface (Roturier et al, 2013).

During the first year of the project, we found that pre-editing significantly improves MT output quality (Lehmann et al, 2012; Gerlach et al, 2013a). Further work (Gerlach et al, 2013b) has shown that pre-editing that improves SMT output quality also has a positive impact on bilingual post-editing time. We are now developing post-editing rules intended to reduce post-editing effort, by automatically correcting the most frequent errors before submitting MT output to the post-editor.

Although several studies describe post-editing rules and evaluate results using automatic metrics or fluency-adequacy measures (Guzman, 2008; Valotkaite et al, 2012), few look into actual use of the modifications produced by the rules. In the present study, we describe and evaluate post-editing rules for French, and propose an automatic evaluation method based on the tokens kept by human post-editors. In what follows, we briefly describe the rules and evaluation methodology.

For French, we developed 26 automatic monolingual rules, which treat two types of phenomena: (1) MT-specific errors (je suis en espérant -> j’espère), and (2) general spelling and grammar errors (commentaires apprécié -> commentaires appréciés). We used different resources to develop the rules: types of edits, bilingual terminology extraction on the source and the raw translation, and spell-checking of the raw translation. Over a corpus of 10,000 sentences, the rules flag 57% of sentences.
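To illustrate what such monolingual rules do, the two examples above can be approximated with plain regular expressions. This is only a stand-in for the AcrolinxIQ formalism, which operates on syntactically tagged text rather than raw strings:

```python
import re

# Two declarative rules approximating the examples above; the real ACCEPT
# rules are written against syntactic tags, not surface strings.
POST_EDITING_RULES = [
    (re.compile(r"\bje suis en espérant\b"), "j'espère"),           # MT-specific error
    (re.compile(r"\b(commentaires) apprécié\b"), r"\1 appréciés"),  # agreement error
]

def apply_rules(sentence):
    """Apply every rule; report whether the sentence was flagged (modified)."""
    flagged = False
    for pattern, replacement in POST_EDITING_RULES:
        sentence, count = pattern.subn(replacement, sentence)
        flagged = flagged or count > 0
    return sentence, flagged

print(apply_rules("vos commentaires apprécié"))
```

The `flagged` flag corresponds to the 57% flag rate reported above: a sentence counts as flagged as soon as any rule fires on it.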

For the evaluation, we applied our rules automatically on an unseen corpus and asked post-editors to manually post-edit this data in a bilingual setting. To quantify the usefulness of our rules, we developed a tool that checks automatically if the post-editors kept the tokens modified by the rules. In order to evaluate this tool, we will compare the results obtained automatically with those produced by a manual evaluation. We will also look at the correlation with TER and post-editing activity in terms of time and keystrokes.
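A minimal sketch of such a checking tool, under the assumption that it tests whether rule-modified tokens are still present in the post-edited sentence (the project's actual tool may use token alignment rather than bag-of-words membership):

```python
def kept_ratio(rule_tokens, postedited_sentence):
    """Fraction of rule-modified tokens still present after human post-editing."""
    if not rule_tokens:
        return 1.0
    post_tokens = set(postedited_sentence.lower().split())
    kept = sum(1 for tok in rule_tokens if tok.lower() in post_tokens)
    return kept / len(rule_tokens)

# The rule produced "appréciés"; the post-editor kept that token.
print(kept_ratio(["appréciés"], "Vos commentaires appréciés merci"))  # 1.0
```

A ratio near 1.0 over a corpus would indicate that the automatic corrections are genuinely reducing post-editing effort rather than being undone by the post-editors.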
——————–
Bredenkamp, A., Crysmann B., and Petrea, M. (2000). Looking for errors: A declarative formalism for resource-adaptive language checking. In Proceedings of LREC 2000. Athens, Greece.
Gerlach, J., Porro, V., Bouillon, P., and Lehmann, S. (2013a). La préédition avec des règles peu coûteuses, utile pour la TA statistique des forums ? In Proceedings of TALN/RECITAL 2013. Sables d’Olonne, France.
Gerlach, J., Porro, V., Bouillon, P., and Lehmann, S. (2013b). Combining pre-editing and post-editing to improve SMT of user-generated content. In Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice. Nice, France.
Guzmán, R. (2008). Advanced automatic MT post-editing. In Multilingual, #95, vol.19, issue 3, pp. 52-57.
Jiang, J., Way, A., and Haque, R. (2012). Translating User-Generated Content in the Social Networking Space. In Proceedings of AMTA 2012, San Diego, USA.
Lehmann, S., Gottesman, B., et al (2012). Applying CNL Authoring Support to Improve Machine Translation of Forum Data. In Kuhn, T., Fuchs, N., eds., Controlled Natural Language. Third International Workshop, pp. 1-10.
Roturier, J., and Bensadoun, A. (2011). Evaluation of MT Systems to Translate User Generated Content. In Proceedings of the MT Summit XIII, pp. 244-251.
Roturier, J., Mitchell, L., and Silva, D. (2013). The ACCEPT Post-Editing Environment: a Flexible and Customisable Online Tool to Perform and Analyse Machine Translation Post-Editing. In Proceedings of MT Summit XIV Workshop on Post-editing Technology and Practice. Nice, France.
Valotkaite, J. and Asadullah, M. (2012). Error Detection for Post-editing Rule-based Machine Translation. In Proceedings of the AMTA 2012-WPTP2. San Diego, USA.


Nasredine Semmar

 

Nasredine Semmar obtained his Ph.D. at the University of Paris-Sud (France) in 1995, on multimedia software localization.

He worked from 1996 to 2000 as an R&D engineer at Lionbridge Technologies – Bowne Global Solutions, where he designed and implemented tools for Computer-Aided Translation and participated in delivering the multilingual version of MS Windows 2000.

He then joined SAP – Business Objects where he worked from 2000 to 2002 as an expert in software internationalization and localization.

Since 2002, he has been working as a research scientist at the Vision and Content Engineering Laboratory (LVIC), where he has implemented Arabic language support in the CEA LIST NLP platform for a cross-language search engine and has developed several tools for sentence and word alignment from parallel corpora.

He is the convenor of the work group “Multilingual information representation” of the ISO/TC37/SC4 and he participates as an expert in promoting the MLIF standard (Multilingual information framework).

Co-authors Othman Zennaki and Meriama Laib provided no biographical data.


Using Cross-Language Information Retrieval and Meaning-Text Theory in Example-Based Machine Translation

 

In this presentation, I will describe the CEA LIST Example-Based Machine Translation (EBMT) prototype, which uses a hybrid approach combining cross-language information retrieval and statistical language modelling. This approach consists, on the one hand, in indexing a database of sentences in the target language and considering each sentence to translate as a query to that database and, on the other hand, in evaluating the sentences returned by a cross-language search engine against a statistical language model of the target language in order to obtain the n-best list of translations. The English-French EBMT prototype has been compared to the state-of-the-art Statistical Machine Translation system MOSES, and experimental results show that the proposed approach performs best on specialized domains.

 

1 Introduction

Parallel corpora are only available for a limited number of language pairs, and the process of building these corpora is time-consuming and expensive. In the field of news, for example, there are enough corpora, including bilingual ones, in particular between English and the languages that are economically most important; in all other fields, the available corpora are not sufficient to make statistical machine translation approaches operational.

The main idea of the hybrid approach used in the CEA LIST Example-Based Machine Translation prototype is, on the one hand, to use only a monolingual corpus in the target language in order to be independent of the availability of parallel corpora and, on the other hand, to use transfer lexicons and rules to produce translations which are grammatically correct. For each sentence to translate, the cross-language search engine returns a set of sentences in the target language with their linguistic properties (lemma, grammatical category, gender, number and syntactic dependency relations). These properties are combined with translation candidates provided by transfer lexicons and rules. The result of this combination is evaluated against a statistical language model learned from the target-language corpus to produce the n-best list of translations.
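The final ranking step, evaluating candidate sentences against a target-language model, can be sketched with a toy add-one-smoothed bigram model (our illustration; the prototype's actual language model is not specified here):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train a tiny add-one-smoothed bigram language model; returns a scorer."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # contexts (everything but </s>)
        bigrams.update(zip(toks, toks[1:]))
    v = len(vocab)
    def logprob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
                   for a, b in zip(toks, toks[1:]))
    return logprob

def n_best(candidates, logprob, n=3):
    """Rank retrieved/assembled candidates by language-model score."""
    return sorted(candidates, key=logprob, reverse=True)[:n]

lm = train_bigram_lm(["the cat sat", "the cat ran"])
print(n_best(["the cat sat", "sat cat the"], lm, n=1))
```

The point of the sketch is the architecture: retrieval and transfer propose candidates, and a monolingual target-language model, trainable without any parallel data, picks the n-best.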

2 The CEA LIST Example-Based Machine Translation Prototype

The CEA LIST EBMT prototype is composed of:

  • A cross-language search engine to extract sentences or sub-sentences of the target language from the textual database which correspond to a total or a partial translation of the sentence to translate.
  • A bilingual reformulator for lexical and syntactic transfer of the sentence to translate into the target language.
  • A generator of translations which consists, on the one hand, in assembling the results returned by the cross-language search engine and the bilingual reformulator, and on the other hand, in choosing the best translations according to a statistical language model learned from the target language corpus.

3 The Cross-language Search Engine

The role of the cross-language search engine is to retrieve for each user’s query translations from an indexed monolingual corpus. This cross-language search engine is based on a deep linguistic analysis of the query and the monolingual corpus to be indexed (Semmar et al., 2006). It is composed of the following modules:

  • The linguistic analyzer LIMA (Besançon et al., 2010) which includes a morphological analyzer, a Part-Of-Speech tagger and a syntactic analyzer. This analyzer processes both sentences to be indexed in the target language and sentences to translate in order to produce a set of normalized lemmas, a set of named entities and a set of compound words with their grammatical tags.
  • A statistical analyzer that computes, for the sentences to be indexed, concept weights based on concept frequencies in the database.
  • A comparator which computes intersections between sentences to translate and indexed sentences and provides a relevance weight for each intersection. It retrieves the ranked and relevant sentences from the indexes according to the corresponding reformulated query (sentence to translate) and then merges the results obtained for each language taking into account the original words of the query (before reformulation) and their weights in order to score the returned sentences.
  • A reformulator to expand queries (sentences to translate) during the search. The expansion is used to infer from the original query words other words expressing the same concepts. The expansion can be in the same language or in a different language by using bilingual lexicons.
  • An indexer to build the inverted files of the sentences to be indexed on the basis of their linguistic analysis and to store these sentences in a database.

4 The Bilingual Reformulator

Because the indexed monolingual corpus does not contain the entire translation of each sentence, a mechanism to extend the translations returned by the cross-language search engine is needed. This is the role of the bilingual reformulator, which consists, on the one hand, in transforming the syntactic structure of the sentence to translate into the target language and, on the other hand, in translating its words. The reformulator uses the English-French bilingual lexicon of the cross-language search engine to translate words and a dozen linguistic rules to transform syntactic structures. These rules create hypothesis translations for the sentence to translate.

5 The Generator of Translations

The generator of translations produces a correct sentence in the target language by using the syntactic structure of the translation candidate. A flexor is used to obtain the right forms of the translation candidate's words: it transforms the lemmas of the target-language sentence into their surface (inflected) forms, using the grammatical category, gender and number returned by the cross-language search engine.
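A minimal sketch of such a flexor, using a toy lookup table keyed on (lemma, category, gender, number); real French morphology would of course require a full morphological dictionary:

```python
# Toy inflection table keyed on (lemma, category, gender, number); a real
# flexor would rely on a complete morphological lexicon.
INFLECTIONS = {
    ("chat", "noun", "masc", "plural"): "chats",
    ("cheval", "noun", "masc", "plural"): "chevaux",
    ("grand", "adj", "fem", "singular"): "grande",
}

def flexor(lemma, category, gender, number):
    """Turn a lemma plus the grammatical features returned by the search
    engine into a surface form, falling back to the lemma itself."""
    return INFLECTIONS.get((lemma, category, gender, number), lemma)

print(flexor("cheval", "noun", "masc", "plural"))  # chevaux
```

The fallback to the bare lemma mirrors the common case where the citation form and the surface form coincide.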
———-
References
[1]     Nasredine Semmar, Meriama Laib and Christian Fluhr. A Deep Linguistic Analysis for Cross-language Information Retrieval. Proceedings of LREC, Italy, 2006.
[2]     Besançon Romaric, Gaël De Chalendar, Olivier Ferret, Faiza Gara, Meriama Laib, Olivier Mesnard and Nasredine Semmar. LIMA: A Multilingual Framework of Linguistic Analysis and Linguistic Resources Development and Evaluation. Proceedings of LREC, Malta, 2010.


Eduard Šubert

Eduard Šubert studies informatics and mathematics at Czech Technical University in Prague.

He was introduced to computational linguistics in a course at the university.
Among other interests, Eduard is responsible for creating science-popularising video content for his faculty’s YouTube channel.

Additionally, he works on a computer simulation of the lens-polishing process for the Academy of Sciences of the Czech Republic.

Ondrej Bojar graduated in computer science in 2003 and received his Ph.D. in computational
linguistics in 2008 at the Faculty of Mathematics and Physics, Charles University in Prague.
He now works as an assistant professor at the faculty. His main research interest is machine
translation. He participated in the Johns Hopkins University Summer Engineering Workshop
in 2006 as a member of the Moses team. Since then, he has regularly taken part in WMT
shared translation tasks, mainly with systems based on Moses and adapted for English-to-Czech
translation. He was the main local organizer of MT Marathons 2009 and 2013 held in Prague.

Twitter Crowd Translation – Design and Objectives

Co-authors: Eduard Šubert and Ondřej Bojar

In this presentation we describe how, in our project, we aim to build an online infrastructure for providing translation to social media and for gathering relevant training data to support machine translation of such content.
Social networks have gained tremendous popularity, but the natural language barrier remains.
While machine translation can perform satisfactorily on edited texts, it is still nearly unusable on noisy, badly written messages on social networks.

We endeavour to solve this inadequacy by crowdsourcing.

1 Introduction

This paper presents Twitter Crowd Translation (TCT), our project aimed at the development of an online infrastructure serving two purposes: (1) providing online translation for social media and (2) gathering relevant training data to support machine translation of such content. We focus on Twitter and the open-source machine translation toolkit Moses. Our project relies heavily on unpaid voluntary work.

In Section 2, we provide the motivation for both goals of our work. Section 3 describes the overall design of our tool in terms of “social engineering”, and Section 4 complements it with the technical aspects.

2 Motivation

Social networks have gained tremendous popularity and have successfully replaced many established means of communication. While geographical location of the users has little to no impact on communication, the obstacle of languages used remains.

For stable and long-lasting content, the problem is less severe: services such as Wikipedia have shown that volunteers are able to provide translations into many languages. Machine translation is easy to train on such content and delivers moderately good results.

On the other hand, social networks are used in a streaming fashion, Twitter being the most prominent example. Anybody can contribute a message, which is forwarded to a number of followers. These, in turn, are flooded with messages from the sources they select. Given the constant flow of new information, nobody looks back at older messages.

Providing translation to “streaming networks” is much more challenging. The input is much noisier, significantly reducing MT output quality, and the community is less interested in providing manual translations.

The social motivation of our project is to break the language barrier for streaming social networks. The technological motivation is to advance MT quality by collecting more and better-fit data. What Wikipedia and on-line MT services manage for stable content, we would like to achieve for streaming networks and casual, unedited content.

3 Design of TCT

We see two main reasons for people to contribute to community translation of Wikipedia and other projects: sharing the information (“What is useful for me in my language may be useful for others.”), and self-promotion (“I will gain good reputation by contributing well received translations.”). We designed our project in accordance with these findings.

TCT should be as thin a layer as possible, to cause minimal disruption: the majority of users stay within their platform, in this case Twitter.

To better explain the processes of TCT, we assign users roles: Author, Selector, Translator, Judge and Recipient.

Figure 1 (below) summarizes the workflow: a tweet in a foreign language is posted by an Author and observed by a Selector. The Selector does not fully understand the message and submits it for translation into the language of his choice. Our TCT server collects this request and forwards it to human and machine Translators. Translations are collected and Judges evaluate their quality; high-confidence machine translation may bypass this step. The best translation is tweeted to the Selector and other Recipients by our server. The same user can take several roles in the process.


Figure 1: Twitter Crowd Translation in a nutshell.

We think that each of the user groups profits from using TCT. The Author gains a bigger audience. The Selector achieves full understanding of the tweet. The Translator and Judge practice their language skills, and the Translator is placed in the TCT hall of fame. Finally, the Recipient gains more understandable content.

4 Technical Aspects of TCT

To remain within the Twitter platform, the Selector submits messages as tweets marked with the hashtag #tctrq, and TCT uses the Twitter REST API to search the Twitter feed for such tweets.
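The selection step amounts to filtering the collected tweets for the request hashtag. The sketch below illustrates this logic on hard-coded sample data rather than live Twitter REST API calls, which the real system would use.

```python
# Sketch of the tweet-selection step: keep only tweets carrying the
# TCT request hashtag. In the real system, the tweet list would come
# from the Twitter REST API search endpoint.
TCT_TAG = "#tctrq"

def select_requests(tweets):
    """Return the tweets marked with the TCT request hashtag."""
    return [t for t in tweets if TCT_TAG in t["text"].split()]

sample = [
    {"user": "alice", "text": "Jak se vede? #tctrq"},
    {"user": "bob", "text": "Nothing to translate here"},
]
print(select_requests(sample))  # only alice's tweet is a request
```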

Once tweets are collected, Translators are notified via e-mail, to which they respond with translations.

Judges are required to contribute via the TCT website, where they evaluate the quality of translations by blind one-to-one comparison. An interesting feature is password-less registration: Translators are the only group required to register, but their interaction is strictly e-mail based, and all necessary settings are accessed via expiring links sent by e-mail on request.
——————–
References

Moses – http://www.statmt.org/moses/
Twitter – http://twitter.com/


Tengku Sepora Tengku Mahadi

 

Tengku Sepora Tengku Mahadi (Associate Professor Dr.) is Dean of the School of Languages, Literacies and Translation at Universiti Sains Malaysia. She lectures in translation theories and practice and supervises research in Translation Studies at MA and PhD levels.

She is the author of Text-wise in Translation (2006), and co-author of Corpora in Translation: A Practical Guide (2010).


Losses and Gains in Computer-Assisted Translation: Some Remarks on Online Translation of English to Malay

The paper begins with a concise investigation of the significance of translation technology in modern life, as well as of machine and computer-assisted translation. It then surveys the technology accessible to translators and examines the losses and gains of the tools applied in computer-assisted translation, including electronic dictionaries, which are conventionally divided into online and offline dictionaries. Subsequently, the paper studies the influence of online dictionaries on the professional translator, considering to what extent translation can be accurate.

Loss in machine translation is inevitable because English and Malay are two entirely different, unrelated languages.

Online dictionaries and translation software cannot replace the human translator and guarantee high-quality translations. Online dictionaries and other translation aids merely accelerate and facilitate the translation process by reducing the time required for translation.

The aim of the paper is to examine new technologies in machine translation tools in order to investigate the losses and gains in translation from English to Malay using online dictionaries.
Machine translations employing online dictionaries are compared with a translation done by a human translator to analyse the probable errors in machine-translated texts.

Despite their efficiency and promise, such tools do not remove the human from the loop. A high-quality translation results from the combination of electronic technologies and the translator’s skills: good knowledge of a foreign language and of the theory of translation. Programs and translation software will not replace humans even in the long-term future, at least not until actual high-performance artificial intelligence is created. Therefore, much depends on the translator’s personality and professional experience, while electronic systems are useful, necessary and sometimes required supplements.

The paper, therefore, seeks answers to the following questions:

  • What are the losses and gains in computer-assisted translation between English and Malay?
  • Which online and offline dictionaries are genuinely useful to translators?
  • Do the new technologies threaten the livelihood of the translator?
  • Is automated translation understandable?

Anne Marie Taravella

Anne Marie Taravella, cert. tr. (OTTIAQ) is a doctoral student and part-time faculty at Université de Sherbrooke, in Québec (Canada), as well as a member of Ordre des traducteurs, terminologues et interprètes du Québec (OTTIAQ).

She holds a BA in Translation and an MA in Translation Studies, both from Montreal-based Concordia University. She is also a graduate of Université de Paris-IX Dauphine, France, in Management Science.

Anne Marie is now pursuing a Doctorate in Business Administration (DBA) at the Faculty of Administration of Université de Sherbrooke, under the supervision of Alain O. Villeneuve, DBA.
Her research interests are translation work organization, adoption of information technologies in organizations, workplace well-being and positive organizational scholarship.

Her doctoral research focuses on the variation of language specialists’ affective states in the workplace.

Her research is supported by the Canadian Social Sciences and Humanities Research Council.


Affective Impact of the use of Technology on Employed Language Specialists: An Exploratory Qualitative Study

A well-established fact in the information systems literature is the importance of human aspects of technology use.

In our doctoral research, we look into the emotional effort that employed language specialists have to put into their daily work, in the light of the increased use of language technology tools (LTT) by language service providers.

In 2011 and 2012, we conducted qualitative studies to understand how LTT were perceived by language specialists. We observed translators and other language specialists at work and conducted 12 in-depth interviews. We noticed that respondents often mentioned affective constructs, such as stress or anxiety, even when not prompted to describe their affective state.

We then reanalyzed our transcripts and written notes in search of answers to the following specific question: “What affective variables do language specialists spontaneously mention when asked to describe their use of LTT?” Using content analysis, we found that respondents often mention some form of occupational stress, or relief of occupational stress, along with other affective variables, in relation to the use of LTT.

We argue that emotional well-being and stress relief should be measured and serve as a guide for the design and implementation of language technology tools.

A well-established fact in the information systems literature is the importance of human aspects of technology use. When Glass, Ramesh and Vessey (2004) compared Computer science (CS), Software engineering (SE) and Information systems (IS) as the main academic subdivisions of the computing discipline, they found that IS was the only research field conducting analysis at the behavioral level, while the other two conduct analysis at the technical level. Using machine translation, or refusing to work as a post-editor, are examples of behaviors that can be analysed from the Information systems point of view.

According to Affective Events Theory (Weiss and Cropanzano, 1996), reactions to work events (e.g. the introduction of a new technology) shape workers’ affective states; in turn, affective reactions cause workers to adopt specific behaviors. Thus, a better understanding of language specialists’ behaviors at work (e.g. why and when they use language technology tools like translation memory and machine translation) starts with a better understanding of their affective reactions to new technology tools. In our doctoral research, we look into the emotional effort that employed language specialists (terminologists, translators, editors) have to put into their daily work, in the light of the increased use of language technology tools (LTT) by language service providers.

In 2011 and 2012, we conducted qualitative studies to understand how LTT were perceived by language specialists. We observed translators and other language specialists at work and conducted 12 in-depth interviews. Respondents were mainly working in Québec (Canada). When analyzing our data for our 2012 study report, we noticed that the language specialists observed at work and the interview respondents often mentioned affective constructs, such as stress or anxiety, even when not prompted to describe their affective state. We then reanalyzed our transcripts and written notes in search of answers to the following specific question: “What affective variables do language specialists spontaneously mention when asked to describe their use of LTT?” Using content analysis, we found that respondents often mention some form of occupational stress, or relief of occupational stress, along with other affective variables, in relation to the use of LTT.

This is an interesting result, for the well-being of the human resources that use LTT is never mentioned as a design criterion. Yet, as O’Brien (2012) reminds us very firmly: “[I]t is how the technology is created, or implemented, that has a dehumanising effect. Technology created without consideration for the task or end users removes those end users from the equation.”

We argue that emotional well-being and stress relief should be measured and serve as a guide for the design and implementation of those tools. Paraphrasing Désilets et al. (2009), who argue that “translators might better be served by the research community if it was better informed about their work practices” (p. 1), we argue that translators might better be served by the LTT community if it was better informed about their affective reactions to LTT.

This research was supported by the Social Sciences and Humanities Research Council (Government of Canada). It is partly based on data that were first collected for a Mitacs-Accelerate internship fostered by AILIA, a Canadian language industry association.
——————–
Works cited

Désilets, A., Mélançon, C., Patenaude, G., & Brunette, L. (2009). How translators use tools and resources to resolve translation problems: an ethnographic study. Retrieved on June 18, 2014 from http://www.mt-archive.info/MTS-2009-Desilets-2.pdf.
Glass, R. L., Ramesh, V., & Vessey, I. (2004). An analysis of research in computing disciplines. Communications of the ACM, 47(6), 87-94.
O’Brien, S. (2012). Translation as human-computer interaction. Translation Spaces, 1, 101-122.
Weiss, H. M., & Cropanzano, R. (1996). Affective events theory: A theoretical discussion of the structure, causes and consequences of affective experiences at work. Research in Organizational Behavior, 18, 1-74.


Antonio Toral

Antonio Toral (Dr.), Research Fellow at Dublin City University (DCU).
Obtained his MSc in Computer Science in 2004 and PhD in Computational Linguistics in 2009 from the Universitat d’Alacant (Spain).
He worked as a researcher in CNR-ILC (Italy) from 2007 to 2009, involved in the EU-FP7 projects KYOTO and FLaReNet.
He joined DCU in 2010 where he has been working as a postdoctoral researcher to date, in the EU-FP7 projects Abu-MaTran (coordinator), QTLaunchPad, PANACEA and CoSyne.
He has published more than 70 peer-reviewed papers, has served in the scientific committee of international conferences and workshops and has reviewed papers for three indexed journals of the field.
He has also organised evaluation tasks at the SemEval and EVALITA forums.

Andy Way  (Prof.)

Current affiliations:

  1. Deputy Director of Centre for Next Generation Localisation
  2. Professor of Computing, School of Computing, Dublin City University

Obtained BSc (Hons) in 1986, MSc in 1989, and PhD in 2001 from the University of Essex, Colchester, UK.
1988—1991 worked at the University of Essex, UK, on the Eurotra MT project.
Joined DCU in 1991. Promoted to Senior Lecturer in 2001 and Associate Professor in 2006. DCU Senior Albert College Fellow 2002—03. IBM CAS Scientist 2003—2011. SFI Fellow 2005—2011.
Grants totaling over €9 million since 2000. CNGL PI for Integrated Language Technologies 2007—11, and Deputy Director CNGL 2014–to date.
Over 230 peer-reviewed papers in leading journals and conferences.
Currently supervising 4 students on PhD programmes of study, and has in addition graduated 20 PhD and 11 MSc students.

2010—13 on leave of absence from DCU. From 2010-12 Director of Language Technology at Applied Language Solutions, and from 2012-2013 Director of Machine Translation at Lingo24.
President of the European Association for Machine Translation (2009—to date), and President of the International Association for Machine Translation (2011—13).
Editor of the Machine Translation journal (2007—to date).


Is Machine Translation Ready for Literature?

Given the current maturity of Machine Translation (MT), demonstrated by its growing adoption by industry (where it is mainly used to assist with the translation of technical documentation), we believe now is the time to assess the extent to which MT is useful to assist with translating literary text.

Our empirical methodology relies on the fact that the applicability of MT to a given type of text can be assessed by analysing parallel corpora of that particular type and measuring (i) the degree of freedom of the translations (how literal are the translations) and (ii) the narrowness of the domain (how specific or general that text is). Hence, we tackle the problem of measuring the translatability of literary text by comparing the degree of freedom of translation and domain narrowness for such texts to texts in two other domains which have been widely studied in the area of MT: technical documentation and news.

Moreover, we present a pilot study on MT for literary text where we translate a novel between two Romance languages.

The automatic evaluation results (66.2 BLEU points and 23.2 TER points) would be considered, in an industrial setting, as extremely useful for assisting human translation.

The field of Machine Translation (MT) has evolved very rapidly since the emergence of statistical approaches two decades ago (Brown et al., 1993). MT is nowadays a reality throughout the industry, which continues to adopt this technology as it results in improved translation productivity, at least for technical domains (Plitt and Masselot, 2010).

Having reached this level of maturity, we explore the viability of current state-of-the-art MT for literature, the last bastion of human translation. To what extent is MT useful for literature? At first glance, these two terms (MT and literature) might seem incompatible, but the truth is — to the best of our knowledge — that the applicability of MT to literature has not been studied rigorously from an empirical point of view.

The applicability of Statistical MT (SMT) to translate a given type of text for a given pair of languages can be studied by analysing two properties of the relevant parallel data.

  1. Degree of freedom of the translation. While literal translations can be learnt reasonably well by the word alignment component of SMT, free translations result in problematic alignments.
  2. Narrowness of the domain. Constrained domains lead to good SMT results. This is due to the fact that in narrow domains lexical selection is not really an issue and relevant terms occur frequently, which allows the SMT model to learn their translations accurately.

We conclude that the narrower the domain and the smaller the degree of freedom of the translation, the more applicable SMT is. This is why SMT performs well on technical documentation while results are substantially worse for more open and unpredictable domains such as news (cf. WMT translation task series ²).

We propose to study the applicability of SMT to literary text by comparing the degree of freedom and the narrowness of parallel corpora for literature to those of other domains widely studied in the area of MT (technical documentation and news). Such a corpus study can be carried out using a set of automatic measures: the degree of freedom of the translation can be approximated by the perplexity of the word alignment, while the narrowness of the domain can be assessed with measures such as repetition rate (Bertoldi et al., 2013) and perplexity with respect to a language model.
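As a rough illustration of why repetition-style measures separate narrow from open domains, the sketch below computes a simplified repetition statistic: the fraction of n-gram types occurring more than once, averaged over n = 1..4. Note that Bertoldi et al. (2013) define repetition rate more carefully (over sliding windows of fixed size); this is only a toy variant on invented example data.

```python
from collections import Counter

def repetition_rate(tokens, max_n=4):
    """Simplified repetition statistic: for each n-gram order, the
    fraction of n-gram types that occur more than once; averaged."""
    rates = []
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        if not counts:
            continue  # text shorter than n
        repeated = sum(1 for c in counts.values() if c > 1)
        rates.append(repeated / len(counts))
    return sum(rates) / len(rates)

tech = "click ok click ok click cancel".split()
novel = "the old man walked slowly along the quiet shore".split()
# The narrow technical snippet repeats far more than the literary one.
print(repetition_rate(tech) > repetition_rate(novel))
```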

Therefore, in order to assess the translatability of literary text with MT, we put the problem in perspective by comparing it to the translatability of other widely studied types of text. Instead of considering the translatability of literature as a whole, we root the study along two axes:

  1. Relatedness of the language pair: from pairs of languages that belong to the same family (e.g. Romance languages), through languages that belong to the same group (e.g. Romance and Germanic languages of the Indo-European group) to unrelated languages (e.g. Germanic and Sino-Tibetan languages).
  2. Literary genre: from novels to poetry.

We hypothesise that the degree of applicability of SMT to literature depends on these two axes. Between related languages, translations should be more literal and complex phenomena (e.g. metaphors) might simply transfer to the target language, while they might have more complex translations between unrelated languages. Regarding literary genres, in poetry the preservation of form might be considered relevant while in novels it may not.

As a preliminary study, we evaluated the translation of a recent best-selling novel for a related language pair (Spanish to Catalan). The scores obtained (66.2 BLEU points and 23.2 TER points) would be considered, in an industrial setting, as very useful for assisting human translation (e.g. by means of post-editing or interactive MT). We expect these scores to generalise to other related language pairs such as Spanish—Portuguese or Spanish—Italian.³

In summary, we have proposed a methodology to assess the applicability of MT to literature which aims to give an indication of how well SMT could be expected to perform on literary texts compared to the performance of this technology on technical documentation and news.

While we may be far from having MT that is useful to assist with the translation of poetry between distant languages such as English and Chinese, we have provided evidence that state-of-the-art MT can already be useful to assist with the translation of novels between related languages.

——————-

  1. The only work on MT for literature we are aware of (Genzel et al., 2010) translates poetry by constraining an SMT system to produce translations that obey particular length, meter and rhyming constraints. Form is preserved at the price of producing a worse translation. However, this work does not study the viability of MT to assist with the translation of poetry.
  2. http://www.statmt.org/wmt14/translation-task.html
  3. The lexical similarity between Spanish and Catalan (0.85) is close to that between Spanish and Italian (0.82) and Spanish and Portuguese (0.89). http://en.wikipedia.org/wiki/Lexical_similarity

——————–
References:

Bertoldi, N., Cettolo, M. and Federico, M. (2013). Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation. Machine Translation Summit 2013.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation. Computational Linguistics, 19(2):263–313.
Genzel, D., Uszkoreit, J., and Och, F. (2010). Poetic statistical machine translation: rhyme and meter. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 158–166. Association for Computational Linguistics.
Plitt, M., and Masselot, F. (2010). A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93, 7–16.


Tom Vanallemeersch

 

Tom Vanallemeersch is a researcher at KU Leuven, Centre for Computational Linguistics.

He has been working in the language technology sector for twenty years, both in academia and industry. His activities mainly involve translation memories, machine translation and alignment of bilingual resources.

He performed work in these fields at Xplanation, LNE International, Lessius University College, Systran (project in collaboration with Centre for Computational Linguistics) and the European Commission.

Other types of language technology he dealt with are multilingual dictionary processing (University of Liège), text-to-speech (Lernout and Hauspie), text mining (Temis) and terminology extraction (coordinator of project at Dutch Language Union).

 

Vincent Vandeghinste is a post-doctoral researcher at KU Leuven. He joined the Centre for Computational Linguistics in 2000 and has participated in many different national and international projects. He obtained his PhD in 2008 with a thesis based on his work for the METIS-II project on machine translation for low-resource languages. He has been involved in machine translation since 2004 and was a member of the Executive Committee of the European Association for Machine Translation in 2011, as the local organiser of the annual conference (EAMT-2011). He teaches Computational Linguistics in the Bachelor of Language and Literature programme and Natural Language Processing in the advanced Master of Artificial Intelligence programme.

 

 


Improving fuzzy matching through syntactic knowledge

Fuzzy matching in translation memories (TM) is mostly string-based in current CAT tools. These tools look for TM sentences highly similar to an input sentence, using edit distance to detect the changes required to convert one sentence to another. Current CAT tools use limited or no linguistic knowledge in this procedure.

In the recently started SCATE project, which aims at improving translators’ efficiency, we apply syntactic fuzzy matching in order to detect abstract similarities and to increase the number of fuzzy matches. We parse TM sentences in order to create hierarchical structures identifying constituents or dependencies. We calculate TER (Translation Edit Rate) between an existing human translation of an input sentence and the translation of its fuzzy matches in TM. This allows us to assess the usefulness of syntactic matching with respect to string-based matching.

In an extended scenario, we pretranslate parts of an input sentence by combining fuzzy matches with the word alignment of a statistical MT system applied to TM.

The output of the system, which deals with the untranslated parts, is compared to the existing human translation.

Techniques for retrieving fuzzy matches from translation memories (TM) are mostly string-based in current CAT tools. Given a sentence to be translated (the query sentence), these tools look for source language sentences in a TM which are highly similar. Such sentences are typically the ones that have the shortest edit distance, i.e. require few changes in order to be converted into the query sentence. Current CAT tools use limited or no linguistic knowledge while looking for similar sentences or for sentence fragments in a TM.
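For concreteness, string-based fuzzy matching of the kind described above can be sketched as retrieving the TM source sentence with the smallest word-level edit distance to the query sentence. This is a minimal sketch, not the implementation of any particular CAT tool.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost  # substitution
                           ))
        prev = cur
    return prev[-1]

def best_fuzzy_match(query, tm_sources):
    """Return the TM source sentence closest to the query sentence."""
    return min(tm_sources,
               key=lambda s: edit_distance(query.split(), s.split()))

tm = ["press the start button", "remove the cover plate"]
print(best_fuzzy_match("press the stop button", tm))  # press the start button
```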

In the recently started SCATE project, which primarily aims at the improvement of translators’ efficiency, we will use syntactic information while performing fuzzy matching, in order to detect sentences which are not only similar on the level of words but also on the more abstract, syntactic level. This is expected to increase the number of fuzzy matches (recall) and, as a consequence, to make translations more consistent and increase speed of translation.

In order to perform syntactic fuzzy matching, we apply a parser to the source sentences in the TM and to the query sentence. This results in parse trees, which are hierarchical structures identifying constituents (noun phrases, clauses, etc.) or dependencies at different levels. The comparison of the parse tree of the query sentence to a possibly very large set of TM parse trees is a computationally complex problem for which no efficient standard solutions exist. Therefore, we experiment with a strategy that converts parse trees to strings capturing part of the hierarchical information and applies efficient fuzzy matching techniques devised for strings.
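A minimal sketch of the tree-to-string idea: a parse tree is linearised into a bracketed string, after which efficient string fuzzy matching can approximate tree comparison. The nested-tuple tree format below is invented for illustration.

```python
def linearise(node):
    """Convert a parse tree, given as nested (label, children) tuples
    with (label, word) leaves, into a bracketed string."""
    label, content = node
    if isinstance(content, str):  # leaf: a word
        return f"({label} {content})"
    return "(" + label + " " + " ".join(linearise(c) for c in content) + ")"

tree = ("S", [("NP", [("DT", "the"), ("NN", "button")]),
              ("VP", [("VB", "works")])])
print(linearise(tree))  # (S (NP (DT the) (NN button)) (VP (VB works)))
```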

Fuzzy matching can be performed and evaluated in different scenarios. One scenario involves presenting the translation of a fuzzy match in TM to the user, as is the case in most CAT tools. Another scenario involves the use of the translation of a fuzzy match for pretranslating parts of the query sentence before submitting the latter to a machine translation (MT) system which will deal with the untranslated parts. In this second scenario, which is a type of integration of MT with TM, we need to detect the translation of matched parts, for instance by using the word alignment produced by a statistical MT system applied to the TM.

To evaluate fuzzy matching techniques in both of the above scenarios, we hold out a set of sentence pairs from the TM and use their source sentences as query sentences, applying fuzzy matching for these query sentences against the TM. As an evaluation metric, we use TER (Translation Edit Rate, a type of edit distance). In the first scenario, we calculate TER between the translation of the matching sentence in the TM and the translation of the query sentence; for syntactic fuzzy matching to have added value, it needs to result in lower TER scores than string-based matching. In the second scenario, we calculate TER between the MT output for the query sentence and the existing (human) translation of the query sentence.

In a later stage of SCATE, which is a four-year project, we will extend the above-mentioned pretranslation strategy by aligning not only words, but also parse trees in the source and target language. The results will be used in the context of both statistical MT and a linguistically oriented MT system.

As SCATE involves an industrial advisory committee, which steers the project towards the development of techniques useful for a production environment, we will involve human translators who perform post-editing.


Marion Wittkowsky

 

Marion Wittkowsky has been a lecturer in the Department of International Technical Communication at the Flensburg University of Applied Sciences in Germany since 2007.

She teaches courses in technical writing, technical translation, and applied computer linguistics.

Prior to her position at the university, she worked for ten years as a technical translator, project manager, and finally business unit manager at a language service provider.

A major focus of her translation work was post-editing the machine translation of SAP release notes.


Integrating Machine Translation (MT) in the Higher Education of Translators and Technical Writers

Language experts in the technical communication field, such as technical writers and technical translators, today have to develop and apply strategies for coping with the increasing amount of documentation to be written and translated.

Applying controlled language in the pre-editing step of MT is one methodology that can produce acceptable target texts, which may then be post-edited depending on the target audience and the quality level expected.

This presentation will show how students in the Master's programme at the Flensburg University of Applied Sciences can combine their language skills with MT technology, and how their work results in a better understanding and acceptance of MT.

Today’s requirements for future technical translators and technical writers have changed dramatically since the turn of the millennium. The reasons for this change are, among others, newly developed language technologies, vastly changing working environments for language specialists in the field of technical documentation, and the fact that highly complex machinery and procedures require that much more technical documentation be written and translated. In their article “Controlled Translation: A New Teaching Scenario Tailor-made for the Translation Industry”, Torrejón and Rico (2002) describe an approach to teaching translation students pre-editing and post-editing of texts for machine translation systems at the Universidad Europea in Madrid.

At the time of their publication, the inclusion of pre-editing could definitely be seen as a new aspect in the education of translators. Pre-editing texts by applying controlled language is a sound MT approach because, as a side effect, it increases the quality of the source documents.

Although the use of MT in the translation industry has increased, numerous companies still refuse to work with MT. Since their motives have not been researched, one can only guess at their reasons for rejecting MT. Perhaps companies have unrealistic expectations of machine translation output, or are deterred by the initial effort required when working with MT. Yet a variety of successful projects with different approaches to MT show that this effort can indeed result in benefits for the companies involved.

At the University of Applied Sciences in Flensburg, Germany, our Master's programme in International Technical Communication includes a course project in which MT is taught on the basis of the following research questions:

“Does controlled language applied to source texts affect the quality of target texts produced by rule-based MT? What differences can be achieved using different writing styles?”

In the last few years, we have been working with Lucy LT (machine translation solution from Lucy Software), which allows students to add or edit terminology in an integrated lexicon.

The project at the University of Applied Sciences in Flensburg consists of the following phases:

  • Translating authentic sample texts (DE or EN), descriptive and instructive, from various technical fields without pre-editing.
  • Analyzing and evaluating the unedited translated texts.
  • Adding or editing terminology as required.
  • Resolving syntactical ambiguities.
  • Editing the text in several steps with respect to syntax, grammar, and style, with a translation following every completed step.

There are only two restrictions the students have to observe:

  • The intended information of the source text may not be changed.
  • The grammar of the source text must always be correct.

The results of these numerous translations sometimes differ considerably, depending on how far the students go in adapting the source text. Of course, short sentences produce better output, and avoiding ambiguities solves many problems. However, this cannot be achieved by applying controlled language alone.
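Controlled-language rule sets vary between projects; as a purely illustrative sketch (the rules and thresholds below are our own assumptions, not the Flensburg course material), a minimal checker might flag overly long sentences and pronouns that often have unclear antecedents:

```python
import re

# Illustrative rules only; real controlled-language rule sets
# for rule-based MT are far more extensive.
MAX_WORDS = 20
AMBIGUITY_TRIGGERS = ["it", "this", "which"]  # pronouns with potentially unclear antecedents

def check_sentence(sentence):
    """Return a list of controlled-language warnings for one source sentence."""
    warnings = []
    words = re.findall(r"[A-Za-z']+", sentence)
    if len(words) > MAX_WORDS:
        warnings.append(f"sentence too long ({len(words)} words)")
    for trigger in AMBIGUITY_TRIGGERS:
        if trigger in (w.lower() for w in words):
            warnings.append(f"possible ambiguous reference: '{trigger}'")
    return warnings

print(check_sentence("Remove the cover and check it."))  # flags the pronoun 'it'
```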

One further aspect is also very important in the project. “What do we expect from a text that has been translated using rule-based MT?” The students learn how the MT system functions and try to adapt the source texts accordingly to obtain an optimized output. Yet, the quality levels after all the changes have been made vary enormously.

Although we have yet to find a universal solution for generating high-quality machine translation output, students gain a broad understanding of rule-based machine translation. Master's students who are educated technical translators or technical writers learn to reflect on the advantages and disadvantages of accepting lower-quality target texts, provided that the texts are understandable for a specific target audience. In addition, students obtain first-hand experience in improving source texts through appropriate editing. Should a post-editing step be required, translators who are involved in the whole process can easily execute this step. For post-editing, well-educated translators are needed, as being a native speaker is not enough.
——————–
References:

Torrejón, Enrique; Rico, Celia (2002): “Controlled Translation: A New Teaching Scenario Tailor-made for the Translation Industry.” In: Proceedings of the 6th European Association for Machine Translation (EAMT) Workshop “Teaching Machine Translation”, Manchester, pp. 107–116.


Andrzej Zydroń

 

Andrzej Zydroń, MBCS CITP, CTO @ XTM International, is one of the leading IT experts on Localization and related Open Standards. Zydroń sits, or has sat, on the following Open Standard technical committees:

  • LISA OSCAR GMX
  • LISA OSCAR xml:tm
  • LISA OSCAR TBX
  • W3C ITS
  • OASIS XLIFF
  • OASIS Translation Web Services
  • OASIS DITA Translation
  • OASIS OAXAL
  • ETSI LIS
  • DITA Localization
  • Interoperability Now!
  • Linport

Andrzej has been responsible for the architecture of the essential word and character count standard GMX-V (Global Information Management Metrics eXchange), as well as the revolutionary xml:tm (XML-based text memory) standard, which will change the way in which we view and use translation memory.

Andrzej is also chair of the OASIS OAXAL (Open Architecture for XML Authoring and Localization) reference architecture technical committee which provides an automated environment for authoring and localization based on Open Standards.

He has worked in IT since 1976 and has been responsible for major successful projects at Xerox, SDL, Oxford University Press, Ford of Europe, DocZone and Lingo24 in the fields of document imaging, dictionary systems and localization.

Andrzej is currently working on new advances in localization technology based on XML and linguistic methodology.

Highlights of his career include:

1. The design and architecture of the European Patent Office patent data capture system for Xerox Business Services.

2. Writing a system for the automated optimal typographical formatting of generically encoded tables (1989).

3. The design and architecture of the Xerox Language Services XTM translation memory system.

4. Writing the XML and SGML filters for SDL International’s SDLX Translation Suite.

5. Assisting Oxford University Press, the British Council and Oxford University in work on the New Dictionary of National Biography.

6. Design and architecture of Ford’s revolutionary CMS Localization system and workflow.

7. Technical Architect of XTM International’s revolutionary Cloud based CAT and translation workflow system: XTM.

Specific areas of specialization:

1. Advanced automated localization workflow
2. Author memory
3. Controlled authoring
4. Advanced translation memory systems
5. Terminology extraction
6. Terminology management
7. Translation-related web services
8. XML-based systems
9. Web 2.0 translation-related technology


The Dos and Don’ts of XML document localization

 

XML is now ubiquitous: from Microsoft Office to XHTML and Web Services it is at the core of electronic data communications. The separation of form and content, which is inherent within the concept of XML, makes XML documents easier to localize than those created with traditional proprietary text processing or composition systems.

Nevertheless, decisions made during the creation of the XML structure and authoring of documents can have a significant effect on the ease with which the source language text can be localized. For example, the inappropriate use of syntactical tools can have a profound effect on translatability and cost. It may even require complete re-authoring of documents in order to make them translatable.

This presentation highlights the potential pitfalls in XML document design regarding ease of translation and provides concrete guidance on how to avoid them.


Indeed, a very high proportion of XML documents are candidates for translation into other languages.

The translation industry was quite slow in adapting to the specific nature of XML, treating it as just another electronic format. XML is much more than that: it is an extensible markup language that can define new vocabularies, and it has become the single most important format, transcending all electronic documents from web pages through word processing to artistic text formatting. A thorough understanding of how XML works, and of how to adapt XML for localization, is required to translate XML documents correctly. A lot of damage in terms of increased costs and lost effectiveness can occur at the initial stages of an XML implementation: done incorrectly, your costs can escalate disproportionately.

A must for anyone involved in translating XML documentation.
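One frequently cited pitfall of this kind is storing translatable text in attribute values, where it cannot carry inline markup and is often missed or left unprotected by localization tools. As a minimal illustration (the guideline is a common one, but the allow list, heuristic, and code are our own sketch, not part of the presentation), a document can be scanned for suspicious attribute values:

```python
import xml.etree.ElementTree as ET

# Attributes whose values are normally identifiers, not prose.
NON_TRANSLATABLE = {"id", "href", "src", "class"}

def attribute_text_warnings(xml_string):
    """Flag attribute values that look like translatable prose.

    Heuristic: multi-word values in attributes not on the allow list.
    Such text is better placed in element content, where translation
    tools can segment and translate it."""
    root = ET.fromstring(xml_string)
    warnings = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name not in NON_TRANSLATABLE and len(value.split()) > 1:
                warnings.append(f"<{elem.tag}> attribute '{name}': {value!r}")
    return warnings

doc = '<step id="s1" caption="Open the front cover">Open the cover.</step>'
print(attribute_text_warnings(doc))  # flags the 'caption' attribute
```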