Abstract: |
The purpose of information retrieval (IR) is to find all documents in a collection that are relevant to a user's query. The central task of Natural Language Processing (NLP) for IR is to transform potentially ambiguous natural language queries and documents into unambiguous internal representations on which matching and retrieval can take place. Several levels of NLP can be used for this purpose: morphological, lexical, syntactic and semantic analysis. The LIC2M cross-language information retrieval system is a weighted Boolean search engine over syntactic structures produced by a linguistic analysis of the query and the documents. The system is composed of a linguistic analyzer, a statistical analyzer, a reformulator, an indexer, a comparator and a search engine, and is designed to work on Arabic, Chinese, English, French, German and Spanish. Arabic is highly productive, both derivationally and inflectionally. Definite articles, conjunctions, particles and other prefixes can attach to the beginning of a word, and large numbers of suffixes can attach to the end. Moreover, newspaper Arabic texts are often completely or partially vowelled, and an unvowelled word can correspond to a set of possible vowelled words with different meanings. For information retrieval, this abundance of forms, lexical variability and orthographic alternatives results in a greater likelihood of mismatch between the form of a word in a query and the forms found in documents relevant to the query. Improving the retrieval effectiveness of an Arabic information retrieval system therefore requires specific processing for vowellation and stemming. In this paper we present an Arabic linguistic analyzer used in a cross-lingual information retrieval application. We focus in particular on the morphological module and the linguistic resources used at the different analysis levels.
|