and the Lexicon
Third International EAMT Workshop
(Lecture Notes in Artificial Intelligence 898)
[ISBN: 3 540 59040 4]
Machine-readable versions of everyday dictionaries have been seen as a likely source of information for use in natural language processing because they contain an enormous amount of lexical and semantic knowledge. However, after fifteen years of research, the results appear to be disappointing. No comprehensive evaluation of machine-readable dictionaries (MRDs) as a knowledge source has been made to date, although this is necessary to determine what, if anything, can be gained from MRD research. To this end, this paper will first consider the postulates upon which MRD research has been based over the past fifteen years, discuss the validity of these postulates, and evaluate the results of this work. We will then propose possible future directions and applications that may exploit these years of effort, in the light of current directions not only in NLP research but also in fields such as lexicography and electronic publishing.
This paper deals with multiword lexemes (MWLs), focussing on two types of verbal MWLs: verbal idioms and support verb constructions. We discuss the characteristic properties of MWLs, namely non-standard compositionality, restricted substitutability of components, and restricted morpho-syntactic flexibility, and we show how these properties may cause serious problems during the analysis, generation, and transfer steps of machine translation systems. In order to cope with these problems, MT lexicons need to provide detailed descriptions of MWL properties. We list the types of information which we consider the necessary minimum for a successful processing of MWLs, and report on some feasibility studies aimed at the automatic extraction of German verbal multiword lexemes from text corpora and machine-readable dictionaries.
The compilation of specialist terminology requires an understanding of how specialists coin and use terms of their specialisms. We show how an exploitation of the pragmatic features of specialist terms will help in the semi-automatic extraction of terms and in the organisation of terms in terminology data banks.
The Cambridge Language Survey is a research project whose activities centre around the use of an Integrated Language Database, whereby a computerised dictionary is used for intelligent cross-reference during corpus analysis, for example searching for all the inflections of a verb rather than just the base form. Types of grammatical coding and semantic categorisation appropriate to such a computerised dictionary are discussed, as are software tools for parsing, finding collocations, and performing sense-tagging. The weighted evaluation of semantic, grammatical, and collocational information to discriminate between word senses is described in some detail. Mention is made of several branches of research including the development of parallel corpora, semantic interpretation by sense-tagging, and the use of a Learner Corpus for the analysis of errors made by non-native speakers. Sense-tagging is identified as an under-exploited approach to language analysis and one for which great opportunities for product development exist.
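As a rough illustration of what a weighted evaluation of grammatical, semantic, and collocational evidence might look like, the sketch below scores each candidate sense against its context and picks the highest-scoring one. The `Sense` record, the weights, and the sample entries for "bank" are all hypothetical and are not taken from the Cambridge Language Survey.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    """A hypothetical dictionary sense with the three kinds of coding."""
    name: str
    pos: str                                   # grammatical code
    categories: set = field(default_factory=set)   # semantic categories
    collocates: set = field(default_factory=set)   # typical neighbouring words

# Assumed weights: collocations count most, grammar least.
WEIGHTS = {"grammar": 1.0, "semantic": 2.0, "collocation": 3.0}

def score_sense(sense, context_pos, context_categories, context_words):
    """Weighted sum of the evidence in the context that supports this sense."""
    score = 0.0
    if sense.pos == context_pos:
        score += WEIGHTS["grammar"]
    score += WEIGHTS["semantic"] * len(sense.categories & context_categories)
    score += WEIGHTS["collocation"] * len(sense.collocates & context_words)
    return score

def tag_sense(senses, context_pos, context_categories, context_words):
    """Return the sense with the highest weighted score."""
    return max(
        senses,
        key=lambda s: score_sense(s, context_pos, context_categories, context_words),
    )

# Illustrative entries only.
BANK_SENSES = [
    Sense("bank/finance", "noun", {"money"}, {"account", "loan"}),
    Sense("bank/river", "noun", {"water"}, {"river", "muddy"}),
]

best = tag_sense(BANK_SENSES, "noun", {"water"}, {"the", "river", "was", "muddy"})
print(best.name)
```

In this toy context both senses match on grammar, so the decision is carried by the semantic category and the collocates, which is the kind of trade-off a weighting scheme is meant to settle.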
Current approaches to computational lexicology in language technology are knowledge-based (competence-oriented) and try to abstract away from specific formalisms, domains, and applications. This results in severe complexity, acquisition, and reusability bottlenecks. As an alternative, we propose a particular performance-oriented approach to Natural Language Processing based on automatic memory-based learning of linguistic (lexical) tasks. The consequences of the approach for computational lexicology are discussed, and the application of the approach to a number of lexical acquisition and disambiguation tasks in phonology, morphology, and syntax is described.
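The general idea of memory-based learning can be sketched as follows: training amounts to storing labelled examples, and a new instance is classified by analogy with the most similar stored examples. The code is a minimal sketch under assumptions of our own (a simple overlap distance, majority voting among the `k` nearest examples, and a toy plural-suffix task keyed on the last two letters of a noun); it is not the system described in the paper.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedClassifier:
    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (features, label) pairs

    def train(self, instances):
        # "Learning" is just storing the examples; nothing is abstracted.
        self.memory.extend(instances)

    def classify(self, features):
        # Rank stored examples by similarity and let the k nearest vote.
        ranked = sorted(
            self.memory, key=lambda item: overlap_distance(item[0], features)
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

# Toy data (hypothetical): last two letters of a noun -> plural suffix.
EXAMPLES = [
    (("c", "h"), "es"),  # church
    (("s", "s"), "es"),  # class
    (("o", "x"), "es"),  # box
    (("a", "t"), "s"),   # cat
    (("o", "g"), "s"),   # dog
    (("e", "n"), "s"),   # pen
]

clf = MemoryBasedClassifier(k=1)
clf.train(EXAMPLES)
print(clf.classify(("s", "h")))  # "dish": nearest stored examples end in -ch / -ss
print(clf.classify(("o", "g")))  # "log": identical to the stored "dog" pattern
```

No rules about sibilants are written down anywhere; the behaviour comes entirely from the stored examples and the similarity metric.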
Typed feature formalisms (TFF) play an increasingly important role in NLP and, in particular, in MT. Many of these systems are inspired by Pollard and Sag's work on Head-Driven Phrase Structure Grammar (HPSG), which has shown that a great deal of syntax and semantics can be neatly encoded within TFF. However, syntax and semantics are not the only areas in which TFF can be beneficially employed. In this paper, I will show that TFF can also be used as a means to model finite automata (FA) and to perform certain types of logical inferencing. In particular, I will (i) describe how FA can be defined and processed within TFF and (ii) propose a conservative extension to HPSG, which allows for a restricted form of semantic processing within TFF, so that the construction of syntax and semantics can be intertwined with the simplification of the logical form of an utterance. The approach which I propose provides a uniform, HPSG-oriented framework for different levels of linguistic processing, including allomorphy and morphotactics, syntax, semantics, and logical form simplification.
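To give a loose picture of how an automaton can be written down as attribute-value structures, the sketch below encodes a small morphotactic automaton (any number of prefixes, one stem, any number of suffixes) as nested records. Plain Python dictionaries stand in for feature structures here; the attribute names `FINAL` and `ARCS`, the state names, and the morpheme classes are assumptions for illustration and do not reproduce the encoding proposed in the paper or any real typed feature formalism.

```python
# Each state is an attribute-value record: FINAL says whether the state
# accepts, ARCS maps a morpheme class to the name of the next state.
AUTOMATON = {
    "start": {"FINAL": False, "ARCS": {"prefix": "start", "stem": "stem_seen"}},
    "stem_seen": {"FINAL": True, "ARCS": {"suffix": "stem_seen"}},
}

def accepts(automaton, symbols, state="start"):
    """Follow the arcs for each symbol; accept if we end in a final state."""
    for symbol in symbols:
        arcs = automaton[state]["ARCS"]
        if symbol not in arcs:
            return False  # no transition defined for this symbol
        state = arcs[symbol]
    return automaton[state]["FINAL"]

print(accepts(AUTOMATON, ["prefix", "stem", "suffix"]))
print(accepts(AUTOMATON, ["suffix"]))
```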
This paper aims at providing a broad overview of the situation in
The software company SAP translates its documentation into more than 12 languages. To support the translation department, SAPterm is used as a traditional terminology database for all languages, and the machine translation system METAL for German-to-English translation. The maintenance of the two terminology databases in parallel, SAPterm and the METAL lexicons, requires a comparison of the entries in order to ensure terminological consistency. However, due to the differences in the structure of the entries in SAPterm and METAL, an automatic comparison has not yet been implemented. The search for a solution has led to the consideration of using another existing SAP tool, called Proposal Pool.
The IBM lexicon and terminology management system TransLexis provides an integrated solution for developing and maintaining lexical and terminological data for use by humans and computer programs. In this paper, the conceptual schema of TransLexis, its user interface, and its import and export facilities will be described. TransLexis takes up several ideas emerging from the reuse discussion. In particular, it strives for a largely theory-neutral representation of multilingual lexical and terminological data, it includes export facilities to derive lexicons for different applications, and it includes programs to import lexical and terminological data from existing sources.
This paper describes the work that was undertaken in the Glossasoft project in the area of terminology management. Some of the drawbacks of existing terminology management systems are outlined and an alternative approach to maintaining terminological data is proposed. The approach we advocate relies on knowledge-based representation techniques. These are used to model conceptual knowledge about the terms included in the database, general knowledge about the subject domain, application-specific knowledge, and, of course, language-specific terminological knowledge. We consider the multifunctionality of the proposed architecture to be one of its major advantages. To illustrate this, we outline how the knowledge representation scheme we suggest could be drawn upon in message generation and machine-assisted translation.
Translating technical texts may cause many problems concerning terminology, even for the professional technical translator. For this reason, tools such as terminological databases or termbanks have been introduced to support the user in finding the most suitable translation. Termbanks are a type of machine-readable dictionary and contain extensive information on technical terms. But a termbank offers more possibilities than providing users with the electronic version of a printed dictionary. This paper describes a multilingual termbank, which was developed within the ESPRIT project Translator's Workbench. The termbank allows the user to create, maintain, and retrieve specialised vocabulary. In addition, it offers the user the possibility to look up definitions, foreign language equivalents, and background knowledge. In this paper, an introduction to the database underlying the termbank and the user interface is given with the emphasis lying on those functions which initiate the user into a new subject by allowing him or her to navigate through a terminology field. It will be shown how, by clustering the term explanation texts and by linking them to a type of semantic network, such functions can be implemented.
In this article, I will discuss different types of lexical co-occurrences and examine the requirements for representing them in a reusable lexical resource. I will focus the discussion on the delimitation of a limited set of descriptive parameters rather than on an exhaustive classification of idioms or multiword units. Descriptive parameters will be derived from a detailed discussion of the problem of how to determine adequate translations for such units. Criteria for determining translation equivalences between multiword units of two languages will be: the syntactic and the semantic structure as well as functional, pragmatic, and stylistic properties.
This essay introduces the first linguistic task of the DELIS project: to undertake a corpus-based examination of the syntactic and semantic properties of perception vocabulary in five languages: English, Danish, Dutch, French, and Italian. The theoretical background is Fillmore's Frame Semantics. The paper reviews some of the variety of facts to be accounted for, particularly in the specialization of sense associated with some collocations, and the pervasive phenomenon of Intensionality. Through this review, we aim to focus our understanding of cross-linguistic variation in this one domain, both by noting specific differences in word-sense correlation, and by exhibiting a general means of representation.
In this paper, we introduce the methodology for the construction of dictionary fragments under development in DELIS. The approach advocated is corpus-based, computationally supported, and aimed at the construction of parallel monolingual dictionary fragments which can be linked to form translation dictionaries without many problems.
The parallelism of the monolingual fragments is achieved through the use of a shared inventory of descriptive devices, one common representation formalism (typed feature structures) for linguistic information from all levels, as well as a working methodology inspired by onomasiology: treating all elements of a given lexical semantic field consistently with common descriptive devices at the same time.
It is claimed that such monolingual dictionaries are particularly easy to relate in a machine translation application. The principles of such a combination of dictionary fragments are illustrated with examples from an experimental HPSG-based interlingua-oriented machine translation prototype.