(alphabetical order of presenters)
Translating implicit elements in RBMT
This research focuses on Ru <-> En RBMT of asymmetrical linguistic markers. In English, overt pronouns are required to mark possessive relations, whereas in Russian implicit possessives are regularly used. I argue that Russian implicit possessives should be treated like overt pronouns, that is, they should be recognized in the text and attributed to their appropriate antecedents. In En -> Ru translation, overt pronouns should in several cases be deleted, and in Ru -> En translation, explicit possessives should in several cases be synthesized according to their antecedents. I examine how modern Russian MT systems deal with these tasks and explore the main problems that arise. I also introduce a rule for En -> Ru MT that helps to increase translation accuracy. The research is based on ABBYY Compreno© machine translation technologies.
AutoLearn<word>
AutoLearn extracts new translation relations for words and multiword expressions of any category from bilingual texts of any size, with high quality, and prepares the information found as a conventional dictionary entry – with morpho-syntactic and semantic classifications and contextual use conditions. The learning function uses Lingenio’s MT system and the analysis components in which it is integrated as its knowledge source, and adapts the dictionary – and, as a consequence, the MT system it is connected to – to the needs of the user. Manual intervention is restricted to a very small number of difficult cases and can be carried out easily in an ergonomic graphical user interface, without the need for extensive training. This is enabled by the underlying MT architecture, with its rule-based core and additional statistical features. The use conditions attached to the new dictionary entries are abstracted from the local representation of which the considered word or expression is part in the considered reference(s). They restrict the corresponding translation to cases similar to the reference(s) the relation has been extracted from, thus avoiding interference with any alternative translations contained in the dictionary. A basic version is already available in the current version of Lingenio’s translate.
Filling in the gaps: what we need from subsegment TM recall
Alongside increasing use of Machine Translation (MT) in translator workflows, Translation Memory (TM) continues to be a valuable tool providing complementary functionality, and is a technology that has evolved in recent years, in particular with developments around subsegment recall that attempt to leverage more content from TM data than segment-level fuzzy matching. But how fit-for-purpose is subsegment recall functionality, and how do current CAT tool implementations differ? This paper presents results from the first survey of translators to gauge their expectations of subsegment recall functionality, cross-referenced with a novel typology for describing subsegment recall implementations. Next, performance statistics are given from an extensive series of tests of four leading CAT tools whose implementations approach those expectations. Finally, a novel implementation of subsegment recall, ‘Lift’, will be demonstrated (integrated into SDL Trados Studio 2014), based on subsegment alignment and with no minimum TM size requirement or need for an ‘extraction’ step, recalling fragments and identifying their translations within the segment even with only a single TM occurrence and without losing the context of the match. A technical description will explain why it produces better performance statistics for the same series of tests and in turn meets translator expectations more closely.
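The core idea can be illustrated with a toy example. The sketch below is not the Lift algorithm itself, but a minimal illustration of subsegment recall under simplifying assumptions: it reports every maximal word fragment of the input segment (above a minimum length) that occurs in a TM source segment, together with the TM entry it came from, so that the fragment’s translation can be located in context even from a single occurrence.

```python
# Minimal illustration of subsegment recall (not the Lift algorithm itself):
# report every maximal word fragment of the input that also occurs in a TM
# source segment, together with the TM entry it came from.

def subsegment_matches(input_segment, tm_entries, min_len=2):
    """tm_entries: list of (source, target) pairs; returns matched fragments."""
    words = input_segment.lower().split()
    matches = []
    for src, tgt in tm_entries:
        src_lower = src.lower()
        i = 0
        while i < len(words):
            best = None
            # grow the longest fragment starting at position i that appears in src
            for j in range(i + min_len, len(words) + 1):
                fragment = " ".join(words[i:j])
                if fragment in src_lower:  # simplification: plain substring test
                    best = fragment
                else:
                    break
            if best:
                matches.append({"fragment": best, "tm_source": src, "tm_target": tgt})
                i += len(best.split())
            else:
                i += 1
    return matches

tm = [("the brake fluid reservoir must be checked weekly",
       "le réservoir de liquide de frein doit être vérifié chaque semaine")]
print(subsegment_matches("check the brake fluid reservoir before each trip", tm))
```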
Nizar Ghoula and Jacques Guyot
Terminology Management revisited
Large repositories for publishing and sharing terminological, ontological and linguistic resources are available to support the development and use of translation. However, despite the availability of language resources within online repositories, some natural language associations cannot be found (rare languages, uncommon combinations, etc.). Consequently, multiple tools for composing linguistic and terminological resources offer the possibility of creating the missing language associations. These generated resources need to be validated before they can be used effectively. Manually checking them is a tedious task and in some cases nearly impossible, due to the large number of entities and associations to go through or to the lack of expertise in both languages. To produce sound and safe content, tools are needed to automatically validate and filter out associations that make no sense. Such a validation tool itself relies on external resources, such as parallel corpora, which need to be either collected or created and filtered. To address these issues we propose a set of tools to generate new terminological resources (myTerm) and to filter them using a parallel corpus generated by another tool (myPREP). We describe our methodology for terminology management based on a repository and present its evaluation.
SMT for restricted sublanguage in CAT tool context at the European Parliament
This paper shows that it is possible to efficiently develop Statistical Machine Translation (SMT) systems that are useful for a specific type of sublanguage in a real context of use and that can be integrated into CAT tools, even when excluding the part of the test set that has an exact match in the Translation Memories (TM).
Because we believe in the proximity of sublanguages, even though it is still hard to define a sublanguage in practice, we propose, within the framework of the MT@EP project at the European Parliament (EP), to develop SMT systems specific to each EP Parliamentary Committee, optimised for restricted sublanguages and constrained by the EP’s particular translation requirements.
Sublanguage-specific systems provide better results than generic systems for EP domains, showing a very significant quality improvement (5-25% in BLEU score), mainly due to the specificity of the EP context and to the proximity of sublanguages. This approach is also satisfactory for under-resourced language pairs, such as those involving Slavic languages and German.
The sublanguage-specific systems will be integrated into the EP translation workflow to improve TM results, giving priority to previous human translations. The development of an algorithm to translate only unmatched segments with MT is in progress.
Miguel A. Jiménez-Crespo
Beyond prescription: What empirical studies are telling us about localization crowdsourcing
During the last two decades we have witnessed the emergence of the digital era that now permeates most aspects of human lives (Cronin 2013; Folaron 2012).
Translation, as the enabler of these transnational flows, has gained increasing attention in this context and has expanded considerably, due to the democratizing, participatory and open nature of Web 2.0. This dramatic shift has given rise to new phenomena and practices, such as crowdsourcing and collaborative translation practices enabled by web platforms (O’Hagan 2011; Olohan 2013).
The localization industry initially responded to this challenge with prescriptive accounts of crowdsourcing initiatives and of best practices (e.g. European Commission 2012; De Palma and Kelly 2011; Munro 2011).
Translation Studies has recently begun to turn its attention to this phenomenon, mostly from an empirical perspective. Studies have focused on the motivation of volunteers to participate (Camara forthcoming; Olohan 2014; McDonough-Dolmaya 2012; O’Brien and Schäler 2010) and on translation quality and naturalness from a corpus-based perspective (Jimenez-Crespo forthcoming, 2013; Olvera and Gutierrez 2012).
This presentation reviews the findings of these studies and interrelates them with the prescriptive accounts from the industry. In doing so, it will help bridge the existing gap between the localization industry and Translation Studies.
Getting the best out of a mixed bag
This presentation discusses the development and implementation of an approach to combining machine translation and translation memory technologies in a TM vendor- and platform-independent environment.
In this workflow the machine translation system itself is able to consult and draw upon the content of any number of relevant translation memories and phrase tables containing subsegments (n-grams) prior to the operation of the rule-based stage of the machine translation process.
Since the machine translation engine directly searches for matches in the TMX files it is possible to enjoy the benefits of translation memory without deploying a commercial translation memory application.
The output of the process is a TMX file containing a varying mixture of TM-generated and MT-generated sentences. Translators can import this file into their respective translation memory applications for post-editing or “sanity checking”.
The author has designed this workflow using his own language engineering tools written in Java. However, this workflow could be easily implemented using NLP tools available in the Open Source community.
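As a minimal illustration of the TMX-consultation step described above (a sketch only, using the Python standard library; the file name and language codes are illustrative, and inline markup inside segments is ignored), a machine translation process could load exact matches directly from a TMX file as follows:

```python
# Sketch: reading exact segment matches straight from a TMX file
# (TMX structure: tu / tuv / seg), without a commercial TM application.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def load_tmx(path, src_lang="en", tgt_lang="de"):
    """Return a dictionary mapping source segments to target segments."""
    memory = {}
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            memory[segs[src_lang]] = segs[tgt_lang]
    return memory

tm = load_tmx("project.tmx")              # hypothetical TMX file
segment = "Press the start button."
translation = tm.get(segment)             # exact TM match, or None -> send to MT
```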
Rule-based automatic post-processing of SMT output to reduce human post-editing effort: a case study
User-generated content (UGC) now represents a large share of the informative content available on the web. However, its uneven quality can hinder both readability and machine-translatability, preventing sharing of knowledge between language communities.
The ACCEPT project (http://www.accept-project.eu/) aims at solving this issue by improving Statistical Machine Translation (SMT) of community content through minimally-intrusive pre-editing techniques, SMT improvement methods and post-editing strategies. Within this project, we have developed linguistic post-editing rules intended to reduce post-editing effort, by automatically correcting the most frequent errors before submitting MT output to the post-editor.
In the present study, we focus on English-to-French SMT and describe and evaluate post-editing rules for French. The post-editing rules treat two types of phenomena: (1) MT-specific errors, and (2) general spelling and grammar errors.
To quantify the usefulness of these rules, we developed a tool that automatically checks whether the post-editors have kept the modifications produced by the rules. In order to evaluate this tool, we will compare the results obtained with those produced by a manual evaluation.
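By way of illustration only (these are not the actual ACCEPT rules), a rule-based post-processing step of the kind described above can be sketched with a few regular expressions covering well-known French phenomena such as elision before a vowel and mandatory contractions:

```python
# Illustrative sketch only (not the actual ACCEPT rules): regex-based
# post-processing of French MT output for elision and mandatory contractions.
import re

RULES = [
    # elision of one-letter words (le, de, je, ne, se, ...) before a vowel
    (re.compile(r"\b([djlmnst])e ([aeiouéèê])", re.IGNORECASE), r"\1'\2"),
    (re.compile(r"\bde le\b"), "du"),      # "de le"  -> "du"
    (re.compile(r"\bde les\b"), "des"),    # "de les" -> "des"
    (re.compile(r"\bà le\b"), "au"),       # "à le"   -> "au"
]

def postprocess(sentence):
    """Apply each correction rule in order to a raw MT output sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

print(postprocess("Il parle de le arbre à le voisin."))
# -> "Il parle de l'arbre au voisin."
```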
Using Cross-Language Information Retrieval and Meaning-Text Theory in Example-Based Machine Translation
In this presentation, I will describe the CEA LIST Example-Based Machine Translation (EBMT) prototype, which uses a hybrid approach combining cross-language information retrieval and statistical language modelling. This approach consists, on the one hand, in indexing a database of sentences in the target language and considering each sentence to translate as a query to that database, and, on the other hand, in evaluating sentences returned by a cross-language search engine against a statistical language model of the target language in order to obtain the n-best list of translations. The English-French EBMT prototype has been compared to the state-of-the-art Statistical Machine Translation system MOSES and experimental results show that the proposed approach performs best on specialized domains.
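The pipeline can be sketched schematically as follows; the cross-language retrieval engine and the language model below are simple placeholders (a toy bigram model), not the CEA LIST components:

```python
# Schematic sketch of the described pipeline; the retrieval engine and the
# language model are simple placeholders, not the CEA LIST components.
import math
from collections import Counter

class BigramLM:
    """Tiny add-one-smoothed bigram language model over target-language text."""
    def __init__(self, sentences):
        self.unigrams, self.bigrams = Counter(), Counter()
        for s in sentences:
            tokens = ["<s>"] + s.lower().split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1)
                     / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

def translate(source_sentence, cross_language_search, lm, n_best=5):
    """cross_language_search: placeholder for a CLIR engine that treats the
    source sentence as a query and returns candidate target-language sentences
    from the indexed database; the language model then reranks them."""
    candidates = cross_language_search(source_sentence)
    return sorted(candidates, key=lm.logprob, reverse=True)[:n_best]
```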
Twitter Crowd Translation – Design and Objectives
Co-authors: Eduard Šubert and Ondřej Bojar
In this presentation we will describe how, in our project, we aim to build an online infrastructure for providing translations of social media content and for gathering relevant training data to support machine translation of such content.
We endeavour to address this inadequacy through crowdsourcing.
Tengku Sepora Tengku Mahadi
Losses and Gains in Computer-Assisted Translation: Some Remarks on Online Translation of English to Malay
The paper begins with a concise investigation of the significance of translation technology in modern life, as well as of machine translation and computer-assisted translation. It then surveys the technology accessible to translators and examines the losses and gains of the tools applied in computer-assisted translation, including electronic dictionaries, which are conventionally divided into online and offline dictionaries. Subsequently, the paper studies the influence of online dictionaries on the professional translator, asking to what extent the resulting translation can be accurate.
Loss in machine translation is inevitable because English and Malay are two entirely different, unrelated languages.
Online dictionaries and translation software cannot replace the human translator or guarantee high-quality translations; they merely accelerate and facilitate the translation process by reducing the time required for translation.
The aim of the paper is to examine new technologies in machine translation tools in order to investigate the losses and gains in translation from English to Malay when using online dictionaries.
Anne Marie Taravella
Affective Impact of the use of Technology on Employed Language Specialists: An Exploratory Qualitative Study
A well-established fact in the information systems literature is the importance of human aspects of technology use.
In our doctoral research, we look into the emotional effort that employed language specialists have to put into their daily work, in light of the increased use of language technology tools (LTT) by language service providers.
In 2011 and 2012, we conducted qualitative studies to understand how LTT were perceived by language specialists. We observed translators and other language specialists at work and conducted 12 in-depth interviews. We noticed that respondents often mentioned affective constructs, such as stress or anxiety, even when not prompted to describe their affective state.
We then reanalyzed our transcripts and written notes in search of answers to the following specific question: “What affective variables do language specialists spontaneously mention when asked to describe their use of LTT?” Using content analysis, we found that respondents often mention some form of occupational stress, or relief of occupational stress, along with other affective variables, in relation to the use of LTT.
We argue that emotional well-being and stress relief should be measured and serve as a guide for the design and implementation of language technology tools.
Is Machine Translation Ready for Literature?
Given the current maturity of Machine Translation (MT), demonstrated by its growing adoption by industry (where it is mainly used to assist with the translation of technical documentation), we believe now is the time to assess the extent to which MT is useful to assist with translating literary text.
Our empirical methodology relies on the fact that the applicability of MT to a given type of text can be assessed by analysing parallel corpora of that particular type and measuring (i) the degree of freedom of the translations (how literal are the translations) and (ii) the narrowness of the domain (how specific or general that text is). Hence, we tackle the problem of measuring the translatability of literary text by comparing the degree of freedom of translation and domain narrowness for such texts to texts in two other domains which have been widely studied in the area of MT: technical documentation and news.
Moreover, we present a pilot study on MT for literary text where we translate a novel between two Romance languages.
The automatic evaluation results (66.2 BLEU points and 23.2 TER points) would be considered, in an industrial setting, extremely useful for assisting human translation.
Improving fuzzy matching through syntactic knowledge
Fuzzy matching in translation memories (TM) is mostly string-based in current CAT tools. These tools look for TM sentences highly similar to an input sentence, using edit distance to detect the changes required to convert one sentence to another. Current CAT tools use limited or no linguistic knowledge in this procedure.
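For reference, the string-based baseline can be sketched in a few lines: word-level edit distance converted into the familiar fuzzy-match percentage (a simplified illustration; real CAT tools add tokenisation, penalties and formatting handling):

```python
# Simplified illustration of string-based fuzzy matching: word-level edit
# distance converted into the familiar fuzzy-match percentage.
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        curr = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_score(input_sentence, tm_sentence):
    a, b = input_sentence.lower().split(), tm_sentence.lower().split()
    if not a and not b:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / max(len(a), len(b)))

def best_match(input_sentence, tm, threshold=70.0):
    """tm: list of (source, target) pairs; return the best entry above threshold."""
    scored = [(fuzzy_score(input_sentence, src), src, tgt) for src, tgt in tm]
    best = max(scored, default=(0.0, None, None))
    return best if best[0] >= threshold else None
```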
In the recently started SCATE project, which aims at improving translators’ efficiency, we apply syntactic fuzzy matching in order to detect abstract similarities and to increase the number of fuzzy matches. We parse TM sentences in order to create hierarchical structures identifying constituents or dependencies. We calculate TER (Translation Edit Rate) between an existing human translation of an input sentence and the translation of its fuzzy matches in TM. This allows us to assess the usefulness of syntactic matching with respect to string-based matching.
In an extended scenario, we pretranslate parts of an input sentence by combining fuzzy matches with the word alignment of a statistical MT system applied to TM.
The output of the system, which deals with the untranslated parts, is compared to the existing human translation.
Integrating Machine Translation (MT) in the Higher Education of Translators and Technical Writers
Language experts in the technical communication field such as technical writers and technical translators today have to develop and apply strategies on how to cope with the increasing amount of documentation to be written and translated.
Controlled language applied in the pre-editing step of MT is one methodology that could offer a solution for producing acceptable target texts, which may also be post-edited, depending on the target audience and the quality level expected.
This presentation will show how students in the Master programme at the Flensburg University of Applied Sciences can combine their language skills and the MT technology, and how their work will result in a better understanding and acceptance of MT.
The Dos and Don’ts of XML document localization
XML is now ubiquitous: from Microsoft Office to XHTML and Web Services it is at the core of electronic data communications. The separation of form and content, which is inherent within the concept of XML, makes XML documents easier to localize than those created with traditional proprietary text processing or composition systems.
Nevertheless, decisions made during the creation of the XML structure and authoring of documents can have a significant effect on the ease with which the source language text can be localized. For example, the inappropriate use of syntactical tools can have a profound effect on translatability and cost. It may even require complete re-authoring of documents in order to make them translatable.
This presentation highlights the potential pitfalls in XML document design regarding ease of translation and provides concrete guidance on how to avoid them.
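As one concrete, illustrative example of such a pitfall (a sketch under the common assumption that translatable prose should live in element content rather than in attribute values, where extraction and TM tools handle it poorly), a simple script can flag attribute values that look like sentences:

```python
# Illustrative check of one well-known pitfall (not an exhaustive audit):
# translatable prose stored in attribute values is awkward for extraction
# and TM tools, so we flag attribute values that look like sentences.
import xml.etree.ElementTree as ET

def flag_attribute_text(path, min_words=3):
    """Return (tag, attribute, value) triples whose value looks like prose."""
    findings = []
    for _event, elem in ET.iterparse(path):
        for name, value in elem.attrib.items():
            if len(value.split()) >= min_words:   # crude heuristic for prose
                findings.append((elem.tag, name, value))
    return findings

for tag, attr, value in flag_attribute_text("manual.xml"):   # hypothetical file
    print(f'<{tag} {attr}="{value}">: consider moving this text into element content')
```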