Andrzej Zydroń is CTO at XTM International and is a well-known IT expert on Localization and related Open Standards. Zydroń sits, or has sat, on the following Open Standard Technical Committees: LISA OSCAR GMX, LISA OSCAR xml:tm, LISA OSCAR TBX, W3C ITS, OASIS XLIFF, OASIS Translation Web Services, OASIS DITA Translation, OASIS OAXAL, ETSI LIS, DITA Localization, Interoperability Now! and Linport.
Zydroń has been responsible for the architecture of the essential GMX-V (Global Information Management Metrics eXchange) word and character count standard, as well as the revolutionary xml:tm standard, which will change the way in which we view and use translation memory. He is also head of the OASIS OAXAL (Open Architecture for XML Authoring and Localization) technical committee. Zydroń has worked in IT since 1976 and has been responsible for major successful projects at Xerox, SDL, Oxford University Press, Ford of Europe, DocZone and Lingo24.

Andrzej Zydroń

will present…

The Tipping Point: further significant advances in Machine Translation


Abstract

If I have seen further, it is by standing on the shoulders of giants…
Sir Isaac Newton, 15 February 1676

Introduction

In 1978 Kurzweil Computer Products launched the first commercial Optical Character Recognition (OCR) product. It had mixed reviews and was reasonably good for clean, Courier-font published text. Initial quality on a mix of typefaces was only around 90%, which was too low for general acceptance. By 1990 overall quality for commercial OCR products had reached 97%, which was still too low, but by 1999 it had reached 99%: the tipping point at which general use and acceptance of OCR was a given. We are approaching a similar point with Machine Translation (MT).

Machine Translation

The great advances in MT over the past decade were the result of a number of technologies coming together at the right time:
1. Big Data, in the form of the availability of very large parallel multilingual corpora
2. The alignment of bilingual corpora at the sentence/segment level
3. The application of Bayesian probabilities to work out the word and phrase alignments for each segment in the form of Hidden Markov Models (HMM)
4. The use of the word and phrase alignments to ‘decode’ new text

Statistical Machine Translation (SMT) has proved to be a very important advance in MT. Whereas previous attempts at MT could only deal with very restricted language pairs and controlled source-language text, SMT is totally unconstrained in this respect: given enough aligned segment data, it can attempt any language pair.

Great improvements have been made to the basic SMT models, but the core concept of SMT remains the same: 'guessing', on the basis of a Bayesian probability model, the most probable alignments between the source and target languages at the word and/or phrase level.

The fundamental premise of SMT is that we do not have access to bilingual dictionary data, so the only option is to work out the most probable word and phrase alignments statistically.
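
To make this premise concrete, here is a minimal sketch of an IBM Model 1-style expectation-maximisation loop, the simplest member of the alignment-model family that the HMM alignment models mentioned above build on. It is purely illustrative and not the FALCON or DCU implementation: the toy corpus, the uniform initialisation and the fixed number of iterations are all assumptions chosen for brevity.

    from collections import defaultdict

    # Toy sentence-aligned corpus (source, target) - illustrative data only.
    corpus = [
        ("the house", "das haus"),
        ("the book", "das buch"),
        ("a book", "ein buch"),
    ]
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]

    src_vocab = {w for src, _ in pairs for w in src}
    tgt_vocab = {w for _, tgt in pairs for w in tgt}

    # t(target_word | source_word), initialised uniformly: we start with no dictionary at all.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))

    for _ in range(10):                  # a few EM iterations suffice on a toy corpus
        count = defaultdict(float)       # expected co-occurrence counts
        total = defaultdict(float)       # normalisation totals per source word

        # E-step: spread each target word's probability mass over the source words
        # of its sentence, in proportion to the current translation probabilities.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    delta = t[(sw, tw)] / norm
                    count[(sw, tw)] += delta
                    total[sw] += delta

        # M-step: re-estimate the translation probabilities from the expected counts.
        for (sw, tw), c in count.items():
            t[(sw, tw)] = c / total[sw]

    # The most probable word alignments emerge without any dictionary input.
    for sw in sorted(src_vocab):
        best = max(tgt_vocab, key=lambda tw: t[(sw, tw)])
        print(sw, "->", best, round(t[(sw, best)], 3))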

The flying FALCON

FALCON (http://falcon-project.eu/) is a European Union-funded FP7 project comprising Trinity College Dublin, Dublin City University, Easyling/SKAWA, Interverbum/TermWeb and XTM International. FALCON stands for Federated Active Linguistic data CuratiON and is largely the brainchild of David Lewis, Research Fellow at Trinity College Dublin. FALCON initially had the following important goals:
1. To establish a formal standard model for Linked Language and Localisation Data as a federated platform for data sharing, based on an RDF metadata schema (a purely illustrative RDF sketch follows this list).
2. To integrate the Skawa/Easyling proxy-based website translation solution, the Interverbum/TermWeb web-based advanced terminology management system and the XTM web-based translation management and computer-assisted translation products in one seamless platform.
3. To integrate and improve SMT performance, benefiting from the L3Data federated model as an integral part of the project, as well as to integrate the DCU SMT engine with XTM.
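
As a flavour of what linked language and localisation data can look like in RDF, the sketch below builds a few triples with the Python rdflib library. The namespace, property names and resource URIs are hypothetical, invented purely for illustration; they are not the actual FALCON/L3Data vocabulary.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    # Hypothetical namespace - NOT the real FALCON/L3Data vocabulary,
    # just an illustration of linked language and localisation data.
    L3 = Namespace("http://example.org/l3data#")

    g = Graph()
    g.bind("l3", L3)

    segment = URIRef("http://example.org/project1/segment/42")
    g.add((segment, RDF.type, L3.TranslationUnit))
    g.add((segment, L3.sourceText, Literal("Press the start button.", lang="en")))
    g.add((segment, L3.targetText, Literal("Drücken Sie die Starttaste.", lang="de")))

    # Link the segment to a shared terminology entry and record provenance -
    # the kind of cross-tool data a federated platform would share.
    term = URIRef("http://example.org/termbase/start-button")
    g.add((segment, L3.usesTerm, term))
    g.add((term, L3.preferredLabel, Literal("Starttaste", lang="de")))
    g.add((segment, L3.translationOrigin, Literal("post-edited SMT")))

    print(g.serialize(format="turtle"))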

During the initial investigations into improvements to phrase handling and the like, we 'stumbled' across BabelNet. I had been aware of BabelNet (http://www.babelnet.org) previously, but the implications did not register until I started to play around with the datasets and the API.

The Tower of Babel(Net)

BabelNet is a truly marvellous project funded by the European Research Council as part of MultiJEDI (Multilingual Joint word sensE DIsambiguation). BabelNet is a multilingual lexicalized semantic network and ontology. What is impressive about it is its sheer size, quality and scope: BabelNet 2.5 contains 9.5 million entries across 50 languages. This is truly Big Lexical Data, astounding in the breadth and depth of its semantic coverage.

Big Lexical Data has the potential to remove the 'blindfold' that has constrained SMT to date, significantly improving both accuracy and performance through bilingual dictionaries and word sense disambiguation.
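
As an illustration of how such a resource could feed an MT pipeline, here is a minimal sketch that asks the BabelNet HTTP API for the target-language senses of a source lemma, exactly the kind of bilingual dictionary and sense data SMT has had to manage without. The endpoint path, version prefix, parameter names and response fields are assumptions based on the public BabelNet REST documentation and may differ between API versions; the key is a placeholder and the use of the Python requests library is my own choice.

    import requests

    # Placeholder key - real keys are issued on registration at babelnet.org.
    BABELNET_KEY = "YOUR-API-KEY"

    # NOTE: endpoint, version prefix and parameter names are assumptions based on
    # the public BabelNet REST documentation; check the current version before use.
    BASE_URL = "https://babelnet.io/v5/getSenses"

    def lookup_translations(lemma, search_lang="EN", target_lang="DE"):
        """Return candidate target-language lexicalisations for a source lemma."""
        params = {
            "lemma": lemma,
            "searchLang": search_lang,
            "targetLang": target_lang,
            "key": BABELNET_KEY,
        }
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()

        translations = set()
        for sense in response.json():
            # The exact JSON layout varies across API versions; the fields below
            # reflect a v5-style response with a nested "properties" object.
            props = sense.get("properties", sense)
            if props.get("language") == target_lang:
                translations.add(props.get("simpleLemma") or props.get("fullLemma"))
        return sorted(t for t in translations if t)

    if __name__ == "__main__":
        # A sense-ambiguous word: the distinct synsets returned are what a word
        # sense disambiguation step would choose between before translation.
        print(lookup_translations("bank"))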

BabelNet will continue to grow in size and scope over the next few years, adding further online dictionary data such as IATE (http://iate.europa.eu/) and other multilingual open data resources.

The 21st century has ushered in significant advances in our understanding of how human intelligence works at the systems level. What intelligence is and how it works are questions that have only recently been addressed. The seminal work of Jeff Hawkins, the primary pioneer in these hitherto uncharted waters, has had a profound effect on our understanding of how the human brain actually functions in the computing sense.

Hawkins’ theories have had a profound effect on the next generation of both computer hardware and software engineers. The single-pipe Turing architecture has reached its limits. All attempts at building true artificial intelligence on current ideas and notions have failed to deliver. Deep neural networks and related techniques have failed to provide any real advance in our attempts to create true artificial intelligence.

It is time to take a different approach. This requires reverse-engineering what human and mammalian brains do effortlessly, something current software engineering attempts have failed spectacularly to deliver. Take the simple matter of a young dog running and catching a ball in mid-flight: a two-year-old dog does it effortlessly. To program a robot to do the same would require around 50 man-years of effort and is currently beyond the scope of any organisation apart, possibly, from the US Department of Defense.

Human and mammalian brains are extremely slow in comparison with today's processors, and yet their capacity to learn and react to their environment is astonishing. Until Jeff Hawkins' seminal work On Intelligence there were no good or bad theories of how this is achieved: there were simply none. Hawkins has laid out the architectural and computational basis of intelligence and how we can harness it in the next generation of computer architectures, which are radically different from the Turing architecture used by today's computers.

The work of Jeff Hawkins has been fundamental in furthering our understanding of intelligence. Its impact on machine translation will be significant, with implications that will play out over the next 20 years. The current generation of machine translation can be described as an advanced form of Mechanical Turk: no understanding is required of the computer, and in fact it cannot have any form of understanding. John Searle's Chinese Room thought experiment highlighted the limitations of our current approach to automated translation.

Jeff Hawkins’ theories centre on the neocortex, its structure and the way we learn and process the world around us, including language. Cortical computing will have a very profound impact on our daily lives and the way we can use truly sapient machines for translation in the future.