Great improvements have been made to the underlying SMT models, but the core concept of SMT remains the same: using a Bayesian probability model to 'guess' the most probable alignments between the source and target languages at the word and/or phrase level.
The fundamental premise of SMT is that we have no access to bilingual dictionary data, so the only option is to infer the most probable word and phrase alignments from the parallel text itself.
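The alignment-guessing idea can be illustrated with a toy version of the classic IBM Model 1 expectation-maximisation procedure. This is a minimal sketch for illustration only; the tiny corpus is invented, and production SMT systems are far more sophisticated:

```python
from collections import defaultdict

# Toy illustration of the SMT premise: estimate word-translation
# probabilities t(e|f) from a parallel corpus alone, with no
# dictionary. Minimal IBM Model 1 EM sketch; corpus is invented.
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

src_vocab = {f for s, _ in corpus for f in s.split()}
tgt_vocab = {e for _, t_ in corpus for e in t_.split()}

# Start from a uniform distribution: every pairing is equally likely.
t = {(e, f): 1.0 / len(tgt_vocab) for f in src_vocab for e in tgt_vocab}

for _ in range(20):                         # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:                 # E-step: expected counts
        for e in tgt.split():
            norm = sum(t[(e, f)] for f in src.split())
            for f in src.split():
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():         # M-step: re-normalise
        t[(e, f)] = c / total[f]

# Co-occurrence statistics alone pull the right pairs together,
# e.g. t("house"|"haus") ends up far higher than t("the"|"haus").
print(sorted(t.items(), key=lambda kv: -kv[1])[:4])
```

Even on three sentence pairs, the probabilities concentrate on the correct pairings after a few iterations, which is exactly the 'guessing from co-occurrence' that SMT scales up to millions of sentence pairs.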
The flying FALCON
FALCON (http://falcon-project.eu/) is a European Union funded FP7 project comprising Trinity College Dublin, Dublin City University, Easyling/SKAWA, Interverbum/TermWeb and XTM International. FALCON stands for Federated Active Linguistic data CuratiON and is largely the brainchild of David Lewis, Research Fellow at Trinity College Dublin. FALCON initially had the following important goals:
1. To establish a formal standard model for Linked Language and Localisation Data as a federated platform for data sharing based on an RDF metadata schema.
2. To integrate the Skawa/Easyling proxy-based website translation solution, the Interverbum/TermWeb web-based advanced terminology management product and the XTM web-based translation management and computer-assisted translation products into one seamless platform.
3. To improve SMT performance by exploiting the L3Data federated model as an integral part of the project, and to integrate the DCU SMT engine with XTM.
During the initial investigations into improvements to phrase handling, we 'stumbled' across BabelNet. I had been aware of BabelNet (http://www.babelnet.org) previously, but the implications did not register until I started to play around with the datasets and the API.
The Tower of Babel(Net)
BabelNet is a truly marvellous project funded by the European Research Council as part of MultiJEDI (Multilingual Joint word sensE DIsambiguation). BabelNet is a multilingual lexicalized semantic network and ontology. What is truly impressive about BabelNet is the sheer size, breadth and depth of its semantic data: BabelNet 2.5 contains 9.5 million entries across 50 languages. This is truly Big Lexical Data.
Big Lexical Data has the potential to remove the 'blindfold' that has hampered SMT to date, significantly improving both accuracy and performance through bilingual dictionaries and word sense disambiguation.
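To see why word sense disambiguation matters for translation, consider a simplified Lesk-style sketch: pick the sense whose gloss overlaps most with the surrounding sentence, then translate according to that sense. The tiny sense inventory below is invented for illustration; a real system would draw glosses and translations from a resource such as BabelNet:

```python
# Simplified Lesk-style word sense disambiguation for the ambiguous
# English word "bank". The sense inventory and German translations
# here are hypothetical stand-ins for what a lexical resource like
# BabelNet would supply.
senses = {
    "bank_river": {
        "gloss": "sloping land beside a body of water such as a river",
        "de": "Ufer",
    },
    "bank_finance": {
        "gloss": "financial institution that accepts deposits and lends money",
        "de": "Bank",
    },
}

def lesk(context_words, senses):
    """Return the sense whose gloss shares the most words with the context."""
    def overlap(sense_id):
        gloss_words = set(senses[sense_id]["gloss"].split())
        return len(gloss_words & set(context_words))
    return max(senses, key=overlap)

sentence = "he sat on the bank of the river and watched the water".split()
best = lesk(sentence, senses)
print(best, "->", senses[best]["de"])  # the river sense wins on overlap
```

A purely statistical system with no sense inventory has to hope the phrase table happens to capture the right context; a sense-aware system can choose 'Ufer' over 'Bank' directly.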
BabelNet will continue to grow in size and scope over the next few years adding further online dictionary data such as IATE (http://iate.europa.eu/) and other multilingual open data resources.
The 21st century has ushered in significant advances in the understanding of how human intelligence works at the systems level. What intelligence is and how it works are questions that have only recently been addressed. The seminal work by Jeff Hawkins, who has been the primary pioneer in these hitherto uncharted waters, has had a profound effect on our understanding of how the human brain actually functions in the computing sense.
Hawkins’ theories have had a profound effect on the next generation of both computer hardware and software engineers. The single-pipe Turing architecture has reached its limits. All attempts at building true artificial intelligence based on current ideas and notions have failed to deliver. Deep neural networks and related techniques have failed to provide any real advance in our attempt to harness the potential of creating true artificial intelligence.
It is time to take a different approach: reverse engineering what human and mammalian brains do effortlessly and what current software engineering has failed spectacularly to deliver. Take the simple matter of a young dog running and catching a ball in mid-flight. A two-year-old dog does it effortlessly. Programming a robot to do the same would require around 50 man-years of effort and is currently beyond the scope of any organisation apart from possibly the US Department of Defence.
Human and mammalian brains are extremely slow in comparison with today’s processors, and yet their capacity to learn and react to their environment is astonishing. Until Jeff Hawkins’ seminal work On Intelligence there were no good or even bad theories of intelligence: there were simply none. Hawkins has laid out the architectural and computing basis for intelligence and how we can harness it in the next generation of computer architectures, which are radically different from the Turing architecture used by today’s computers.
The work of Jeff Hawkins has been fundamental in furthering our understanding of intelligence, and its impact on machine translation will be significant over the next 20 years. The current generation of machine translation can be described as an advanced form of Mechanical Turk: no understanding is required of the computer, and in fact it cannot have any form of understanding. John Searle’s Chinese Room thought experiment highlighted this limitation of our current approach to automated translation.
Jeff Hawkins’ theories centre on the neocortex, its structure and the way we learn and process the world around us, including language. Cortical computing will have a very profound impact on our daily lives and the way we can use truly sapient machines for translation in the future.