Martin Benjamin is the founder and executive director of the Kamusi Project, an international collaborative effort to produce learning and lexical resources for African languages, and a Senior Scientist in the Distributed Information Systems Laboratory at the École Polytechnique Fédérale de Lausanne. He received a doctorate in Anthropology from Yale University for research on health and development. He has since conducted applied research in Tanzania on health and gender, particularly issues of maternal health and anemia, with support from UNICEF, Danida (Denmark) and the Micronutrient Initiative (Canada). He has consulted extensively with the World Bank Indigenous Knowledge Program, and was the main translator for the Google interface in Swahili for several years. He also taught Anthropology and Swahili at Wesleyan University in the US until 2006, before leaving the academy to pursue the development of the Kamusi Project full time. Benjamin founded the Kamusi Project in 1994, when he was a graduate student looking for dictionary resources to aid his learning of Swahili. He conceived of a participatory model in which community members shared the tasks of creating a comprehensive Swahili dictionary, entry by entry. With the support of grants from the US Department of Education and an inter-university Consortium on Language Teaching and Learning, the Kamusi Project grew into the world’s most-used resource for the Swahili language, with more than 60,000 entries and over a million lookups per month. In 2007, as the Kamusi model expanded to cover additional African languages that fell outside of the core university teaching mission, Yale released the project to the World Language Documentation Centre, a UK-based public charity.
Subsequently, Benjamin shepherded the project through the process of formation as two independent legal non-profit entities, Kamusi Project USA for activities that originate in the US, and Swiss-based Kamusi Project International for working with partners worldwide. During the past eight years, Benjamin has led several new undertakings for the Kamusi Project. He was the principal investigator on a grant from the National Endowment for the Humanities for a trilingual dictionary among Kinyarwanda (the language of Rwanda), Swahili, and English. He has also been an active member of the African Network for Localisation, where, with the support of the Canadian International Development Research Centre, he has managed projects to create computer locales for nearly 100 African languages, and to create comprehensive software localization terminologies for a dozen languages. He has also worked with Microsoft to refine their Swahili technology glossary and their translations for Windows and Office; in 2011 he was recognized by the Microsoft Local Language Program “for outstanding contributions to the release of Windows 7 and Office 2010 in Kiswahili”. Kamusi was recognized as a launch partner in the White House Big Data Initiative in 2013. Benjamin is currently focused on two major activities for the Kamusi Project. The first is the Global Online Living Dictionary (Kamusi GOLD), which will be a major lexical resource that interlinks numerous languages around the world; he led a pilot with more than 20 languages in 2013, and is now working with partners on in-depth development of languages including Songhay, Kirundi and Vietnamese. The second is KamusiTERMS (Kamusi for Technology, Economy, Rights, Medicine, and Science), which will be a multilingual, multi-domain term bank for African development. These projects involve the coordination of numerous partners throughout Africa and elsewhere, including software developers, language experts, and domain specialists.
Benjamin resides in Lausanne, Switzerland, and travels frequently to pursue Kamusi Project activities with partners far and wide.

Martin Benjamin

will present the poster…

Kamusi Pre:D – Source-Side Disambiguation and a Sense Aligned Multilingual Lexicon

Co-authored by Amar Mukunda and Jeff Allen


Kamusi Pre:D offers a new approach to translation by disambiguating word senses on the source side, and matching those senses to human-confirmed vocabulary equivalents in any target language. Pre:D has the potential for much greater term accuracy than statistical approaches, which must be performed anew for each language pair and which become increasingly unreliable as the number of senses increases. In the Kamusi Pre:D interface, documents are matched against the multilingual Kamusi Project lexicon. Terms that have multiple senses in the Kamusi dataset are highlighted in the source document, much as misspellings are in a spellchecker. When the user hovers over a term flagged as ambiguous, the various sense definitions are displayed. After the user selects the intended meaning, a known equivalent term for the target language is passed to a computer-assisted translation (CAT) or machine translation (MT) system.
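
The flow described above (match the document against the lexicon, flag terms with more than one sense, then return the equivalent for the sense the user picks) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the toy lexicon, the Sense class, and the functions flag_ambiguous and resolve are hypothetical names invented here, not part of the Kamusi software or its data.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One sense of a source-language term (hypothetical structure)."""
    sense_id: str
    definition: str
    # Maps a target language code to a human-confirmed equivalent term.
    equivalents: dict = field(default_factory=dict)


# Toy lexicon invented for this sketch; real Kamusi data is not shown here.
LEXICON = {
    "spring": [
        Sense("spring-1", "the season between winter and summer", {"fr": "printemps"}),
        Sense("spring-2", "a coiled piece of metal", {"fr": "ressort"}),
    ],
    "water": [Sense("water-1", "a clear liquid", {"fr": "eau"})],
}


def flag_ambiguous(text):
    """Return (term, start_offset, senses) for each word with more than one sense."""
    flagged = []
    for match in re.finditer(r"\w+", text):
        senses = LEXICON.get(match.group().lower(), [])
        if len(senses) > 1:
            flagged.append((match.group(), match.start(), senses))
    return flagged


def resolve(term, sense_id, target_lang):
    """Return the target-language equivalent for the chosen sense, or None if absent."""
    for sense in LEXICON.get(term.lower(), []):
        if sense.sense_id == sense_id:
            return sense.equivalents.get(target_lang)
    return None
```

In this sketch, flag_ambiguous("The spring needs water") flags only "spring", since "water" has a single sense, and resolve("spring", "spring-2", "fr") returns "ressort". A None result stands for a sense that has no confirmed equivalent yet in the requested target language.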

The Kamusi lexicon is an expanding resource that is working toward monolingual sense-disambiguated dictionaries for each language, with parallel or similar concepts marked and linked across languages to create a multilingual semantic matrix. Within the data design, multi-word expressions (MWEs) are treated as lexicalizable concepts and marked for potential separability; Pre:D will locate MWEs that are in the database in the source language, whether or not they are separated, and suggest appropriate translation equivalents. Similarly, word forms are stored as data elements associated with a specific sense of a lemma, so inflections can be identified and mapped back to the definition, even for MWEs.
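
As a rough illustration of this data design, the Python sketch below stores word forms against a lemma and sense, and locates multi-word expressions whether or not their parts are adjacent. Everything in it is an assumption made for the example: the FORMS table, the MWE class, the max_gap window, and the sample entries are hypothetical and do not reflect the actual Kamusi schema or matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MWE:
    """A multi-word expression with a separability flag (hypothetical structure)."""
    lemma: str
    parts: tuple      # component words, in order
    separable: bool   # True if other words may intervene between parts


# Each stored word form points back to a (lemma, sense_id) pair.
FORMS = {
    "ran": ("run", "run-1"),
    "runs": ("run", "run-1"),
    "running": ("run", "run-1"),
}

MWES = [
    MWE("look up", ("look", "up"), separable=True),
    MWE("by and large", ("by", "and", "large"), separable=False),
]


def lookup_form(form):
    """Map an inflected form back to its (lemma, sense_id), or None if unknown."""
    return FORMS.get(form.lower())


def find_mwes(tokens, max_gap=3):
    """Return (lemma, positions) for each MWE found in the token list.

    Separable MWEs may have up to max_gap intervening tokens between
    consecutive parts; non-separable MWEs must be contiguous.
    """
    lowered = [t.lower() for t in tokens]
    found = []
    for mwe in MWES:
        for start, token in enumerate(lowered):
            if token != mwe.parts[0]:
                continue
            positions = [start]
            for part in mwe.parts[1:]:
                prev = positions[-1]
                window = 1 + (max_gap if mwe.separable else 0)
                last = min(prev + window, len(lowered) - 1)
                nxt = next(
                    (i for i in range(prev + 1, last + 1) if lowered[i] == part),
                    None,
                )
                if nxt is None:
                    break
                positions.append(nxt)
            else:
                # Every part was located, so record the match.
                found.append((mwe.lemma, positions))
    return found
```

With these toy entries, find_mwes(["Please", "look", "the", "word", "up"]) finds the separated expression "look up" at positions 1 and 4, while "by far and large" does not match the non-separable "by and large". Handling inflected parts of an MWE would combine the two lookups, which is left out here to keep the sketch short.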

Responsiveness to missing entries is important to the functioning of Kamusi Pre:D. If a user does not find the intended sense of a term among the options, the source sentence can be transmitted to Kamusi as a contextual example for the production of a new entry. Terms that exist in the source language data but have not been produced in target languages will be given elevated priority in the workflow, with the potential for participants in Kamusi lexicon development to provide reliable vocabulary equivalents for missing vocabulary within a workable timeframe. Kamusi has a crowdsourcing system in which members of a speaker community play games that result in validated language data. Missing terms will be submitted to that system, and the results will be incorporated in the larger data set and also transmitted to the original requester.
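
One way to picture the elevated priority for missing terms is a priority queue of requests, sketched below in Python. This is a guess at a possible mechanism, not a description of the Kamusi workflow: the RequestQueue class, the two priority levels, and the request fields are all hypothetical, and the abstract does not say how priorities are actually assigned.

```python
import heapq
import itertools

# Assumed priority levels for this sketch; a lower number is served first.
PRIORITY_MISSING_EQUIVALENT = 0  # sense exists in the source data, no target term yet
PRIORITY_NEW_SENSE = 1           # user did not find the intended sense at all


class RequestQueue:
    """Holds requests for missing lexicon data, most urgent first (hypothetical)."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority requests stay in arrival order.
        self._counter = itertools.count()

    def submit(self, priority, term, target_lang, context, requester):
        """Queue a request along with the source sentence that triggered it."""
        request = {
            "term": term,
            "target_lang": target_lang,
            "context": context,      # source sentence kept as a contextual example
            "requester": requester,  # who should receive the validated result
        }
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        """Pop the most urgent request, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A request submitted with PRIORITY_MISSING_EQUIVALENT is popped before any request with PRIORITY_NEW_SENSE, regardless of arrival order. The requester field is what would let a validated result be sent back to the person who asked for it.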

The first generation of the Pre:D software is intended to be ready for demonstration by the time of the TC37 meeting. Pipeline features, including full MWE support and push/pull integration with lexicon development, will be added as soon as core features are operational. When complete, Kamusi Pre:D will be ported as a front-end service to provide vocabulary for CAT and MT applications. Individual users will find Pre:D to be an essential tool for accurate vocabulary translation among a wide range of language pairs, most currently unserved, while organizations will recognize significant advantages in time, effort, and quality by disambiguating a document one time for concepts that can be rendered appropriately across numerous languages.