Machine Translation, vol.21, no.1, March 2007, pp.1-28


A method of creating new valency entries

Sanae Fujita • Francis Bond

Received: 25 April 2006 / Accepted: 20 February 2008 / Published online: 28 June 2008
Abstract Information on subcategorization and selectional restrictions in a valency dictionary is important for natural language processing tasks such as monolingual parsing, accurate rule-based machine translation and automatic summarization. In this paper we present an efficient method of assigning valency information and selectional restrictions to entries in a bilingual dictionary, based on information in an existing valency dictionary. The method is based on two assumptions: words with similar meaning have similar subcategorization frames and selectional restrictions; and words with the same translations have similar meanings. Based on these assumptions, new valency entries are constructed for words in a plain bilingual dictionary, using entries with similar source-language meaning and the same target-language translations. We evaluate the effects of various measures of semantic similarity.

Keywords Valency dictionary • Bilingual dictionary • Similarity • Merge


Machine Translation, vol.21, no.1, March 2007, pp.29-53


Methods for extracting and classifying pairs of cognates and false friends

Ruslan Mitkov • Viktor Pekar • Dimitar Blagoev • Andrea Mulloni

Received: 18 January 2007 / Accepted: 27 February 2008 / Published online: 17 May 2008
Abstract The identification of cognates has attracted the attention of researchers working in the area of Natural Language Processing, but the identification of false friends is still an under-researched area. This paper proposes novel methods for the automatic identification of both cognates and false friends from comparable bilingual corpora. The methods are not dependent on the existence of parallel texts, and make use of only monolingual corpora and a bilingual dictionary necessary for the mapping of co-occurrence data across languages. In addition, the methods do not require that the newly discovered cognates or false friends are present in the dictionary and hence are capable of operating on out-of-vocabulary expressions. These methods are evaluated on English, French, German and Spanish corpora in order to identify English-French, English-German, English-Spanish and French-Spanish pairs of cognates or false friends. The experiments were performed in two settings: (i) assuming ‘ideal’ extraction of cognates and false friends from plain-text corpora, i.e. when the evaluation data contains only cognates and false friends, and (ii) a real-world extraction scenario where cognates and false friends have to first be identified among words found in two comparable corpora in different languages. The evaluation results show that the developed methods identify cognates and false friends with very satisfactory results for both recall and precision, with methods that incorporate background semantic knowledge, in addition to co-occurrence data obtained from the corpora, delivering the best results.

Keywords Cognates • Faux amis • Orthographic similarity • Distributional similarity • Semantic similarity • Translational equivalence
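
Illustration (not part of the published abstract): the keywords list orthographic similarity as one ingredient, but the abstract does not name the measure used. The sketch below assumes the longest common subsequence ratio (LCSR), a measure often used when looking for cognate candidates; the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed orthographic similarity: LCS length over the longer word's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # English "night" and German "nacht" share the subsequence n-h-t.
    print(lcsr("night", "nacht"))
```

A high score only marks a pair as orthographically similar. Separating true cognates from false friends still needs the semantic evidence the abstract describes.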


Machine Translation, vol.21, no.1, March 2007, pp.55-68


Power shifts in web-based translation memory

Ignacio Garcia

Received: 1 August 2007 / Accepted: 20 February 2008 / Published online: 18 April 2008
Abstract Web-based translation memory (TM) is a recent and little-studied development that is changing the way localisation projects are conducted. This article looks at the technology that allows for the sharing of TM databases over the internet to find out how it shapes the translator’s working environment. It shows that so-called pre-translation—until now the standard way for clients to manage translation tasks with freelancers—is giving way to web-interactive translation. Thus, rather than interacting with their own desktop databases as before, translators now interface with each other through server-based translation memories, so that a newly entered term or segment can be retrieved moments later by another translator working at a remote site. The study finds that, while the interests of most stakeholders in the localisation process are well served by this web-based arrangement, it can involve drawbacks for freelancers. Once an added value, technical expertise becomes less of a determining factor in employability, while translators lose autonomy through an inability to retain the linguistic assets they generate. Web-based TM is, therefore, seen to risk disempowering and de-skilling freelancers, relegating them from valued localisation partners to mere servants of the new technology.

Keywords Translation memory • Localization • Internationalization • Machine-aided translation • Web-based translation


Machine Translation, vol.21, no.2, June 2007, pp.77-94


Semi-supervised model adaptation for statistical machine translation

Nicola Ueffing • Gholamreza Haffari • Anoop Sarkar

Received: 31 July 2007 / Accepted: 23 April 2008 / Published online: 10 June 2008
Abstract Statistical machine translation systems are usually trained on large amounts of bilingual text (used to learn a translation model), and also large amounts of monolingual text in the target language (used to train a language model). In this article we explore the use of semi-supervised model adaptation methods for the effective use of monolingual data from the source language in order to improve translation quality. We propose several algorithms with this aim, and present the strengths and weaknesses of each one. We present detailed experimental evaluations on the French-English EuroParl data set and on data from the NIST Chinese-English large-data track. We show a significant improvement in translation quality on both tasks.

Keywords Statistical machine translation • Self-training • Semi-supervised learning • Domain adaptation • Model adaptation


Machine Translation, vol.21, no.2, June 2007, pp.95-119


Evaluating machine translation with LFG dependencies

Karolina Owczarzak • Josef van Genabith • Andy Way

Received: 31 October 2007 / Accepted: 22 May 2008 / Published online: 6 August 2008
Abstract In this paper we show how labelled dependencies produced by a Lexical-Functional Grammar parser can be used in Machine Translation evaluation. In contrast to most popular evaluation metrics based on surface string comparison, our dependency-based method does not unfairly penalize perfectly valid syntactic variations in the translation, shows less bias towards statistical models, and the addition of WordNet provides a way to accommodate lexical differences. In comparison with other metrics on a Chinese-English newswire text, our method obtains high correlation with human scores, both on a segment and system level.

Keywords Machine translation • Evaluation metrics • Lexical-Functional Grammar • Labelled dependencies
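
Illustration (not part of the published abstract): the abstract does not give the scoring formula, so the sketch below assumes an F-score over labelled dependency triples of the form (relation, head, dependent). The triples are written by hand here; no LFG parser is involved, and the function name is hypothetical.

```python
from collections import Counter


def dependency_f_score(candidate, reference):
    """Assumed metric: harmonic mean of precision and recall over dependency triples."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # "John resigned yesterday" and "Yesterday John resigned" differ as strings
    # but would yield the same triples, so word order is not penalized.
    reference = [("subj", "resign", "john"), ("adj", "resign", "yesterday")]
    reordered = [("adj", "resign", "yesterday"), ("subj", "resign", "john")]
    print(dependency_f_score(reordered, reference))
```

Comparing sets of triples rather than word sequences is what lets a valid reordering score as a full match.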


Machine Translation, vol.21, no.2, June 2007, pp.121-133


Capturing practical natural language transformations

Kevin Knight

Received: 19 March 2008 / Accepted: 10 July 2008 / Published online: 6 August 2008
Abstract We study automata for capturing the transformations in practical natural language processing (NLP) systems, especially those that translate between human languages. For several variations of finite-state string and tree transducers, we survey answers to formal questions about their expressiveness, modularity, teachability, and generalization. We conclude that no formal device yet captures everything that is desirable, and we point to future research.

Keywords Translation • Automata


Machine Translation, vol.21, no.3, September 2007, pp.139-163


Automatic extraction of translations from web-based bilingual materials

Qibo Zhu • Diana Inkpen • Ash Asudeh

Received: 14 September 2007 / Accepted: 7 August 2008 / Published online: 20 September 2008
Abstract This paper describes the framework of the StatCan Daily Translation Extraction System (SDTES), a computer system that maps and compares web-based translation texts of Statistics Canada (StatCan) news releases in the StatCan publication The Daily. The goal is to extract translations for translation memory systems, for translation terminology building, for cross-language information retrieval and for corpus-based machine translation systems. Three years of officially published statistical news release texts at were collected to compose the StatCan Daily data bank. The English and French texts in this collection were roughly aligned using the Gale-Church statistical algorithm. After this, boundary markers of text segments and paragraphs were adjusted and the Gale-Church algorithm was run a second time for a more fine-grained text segment alignment. To detect misaligned areas of texts and to prevent mismatched translation pairs from being selected, key textual and structural properties of the mapped texts were automatically identified and used as anchoring features for comparison and misalignment detection. The proposed method has been tested with web-based bilingual materials from five other Canadian government websites. Results show that the SDTES model is very efficient in extracting translations from published government texts, and very accurate in identifying mismatched translations. With parameters tuned, the text-mapping part can be used to align corpus data collected from official government websites; and the text-comparing component can be applied in prepublication translation quality control and in evaluating the results of statistical machine translation systems.

Keywords Automatic translation extraction • Bitext mapping • Machine translation • Parallel alignment • Translation memory system


Machine Translation, vol.21, no.3, September 2007, pp.165-181


Pivot language approach for phrase-based statistical machine translation

Hua Wu • Haifeng Wang

Received: 1 April 2008 / Accepted: 11 August 2008 / Published online: 23 September 2008
Abstract This paper proposes a novel method for phrase-based statistical machine translation based on the use of a pivot language. To translate between languages Ls and Lt with limited bilingual resources, we bring in a third language, Lp, called the pivot language. For the language pairs Ls-Lp and Lp-Lt, there exist large bilingual corpora. Using only Ls-Lp and Lp-Lt bilingual corpora, we can build a translation model for Ls-Lt. The advantage of this method lies in the fact that we can perform translation between Ls and Lt even if there is no bilingual corpus available for this language pair. Using BLEU as a metric, our pivot language approach significantly outperforms the standard model trained on a small bilingual corpus. Moreover, with a small Ls-Lt bilingual corpus available, our method can further improve translation quality by using the additional Ls-Lp and Lp-Lt bilingual corpora.

Keywords Pivot language • Phrase-based statistical machine translation • Scarce bilingual resources
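
Illustration (not part of the published abstract): the abstract says a source-target model can be built from source-pivot and pivot-target corpora but does not give the formula. One common way to do this, assumed here, is to marginalize over pivot phrases: p(t|s) = sum over p of p(p|s) * p(t|p). The toy tables, probabilities, and function name below are hypothetical.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_pivot, pivot_to_tgt):
    """Build p(tgt | src) by summing p(pivot | src) * p(tgt | pivot) over pivot phrases."""
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_to_tgt.items()}


if __name__ == "__main__":
    # Toy French-English and English-German phrase tables (made-up probabilities).
    fr_en = {"maison": {"house": 0.75, "home": 0.25}}
    en_de = {"house": {"Haus": 1.0}, "home": {"Haus": 0.5, "Heim": 0.5}}
    print(pivot_phrase_table(fr_en, en_de))
```

The sketch needs no direct source-target data, which matches the scenario in the abstract where no Ls-Lt corpus is available.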


Machine Translation, vol.21, no.4, December 2007, pp.187-207


Bilingual LSA-based adaptation for statistical machine translation

Yik-Cheung Tam • Ian Lane • Tanja Schultz

Received: 27 March 2008 / Accepted: 31 October 2008 / Published online: 19 November 2008
Abstract We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation (SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework, model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated on the Chinese-English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores. Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously, the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE SMT system.

Keywords Bilingual latent semantic analysis • Latent Dirichlet-tree allocation • Cross-lingual language model adaptation • Lexicon adaptation • Topic distribution transfer • Statistical machine translation


Machine Translation, vol.21, no.4, December 2007, pp.209-252


Simultaneous translation of lectures and speeches

Christian Fügen • Alex Waibel • Muntsin Kolss

Received: 27 August 2008 / Accepted: 4 November 2008 / Published online: 22 November 2008
Abstract With increasing globalization, communication across language and cultural boundaries is becoming an essential requirement of doing business, delivering education, and providing public services. Due to the considerable cost of human translation services, only a small fraction of text documents and an even smaller percentage of spoken encounters, such as international meetings and conferences, are translated, with most resorting to the use of a common language (e.g. English) or not taking place at all. Technology may provide a potentially revolutionary way out if real-time, domain-independent, simultaneous speech translation can be realized. In this paper, we present a simultaneous speech translation system based on statistical recognition and translation technology. We discuss the technology, various system improvements and propose mechanisms for user-friendly delivery of the result. Over extensive component and end-to-end system evaluations and comparisons with human translation performance, we conclude that machines can already deliver comprehensible simultaneous translation output. Moreover, while machine performance is affected by recognition errors (and thus can be improved), human performance is limited by the cognitive challenge of performing the task in real time.

Keywords Simultaneous translation • Interpretation • Speech-to-speech translation • Spoken language translation • Machine translation • Speech recognition • Lecture recognition • Lectures • Speeches