520 |
|
|
|b With the present effort, we propose to investigate the results of applying the Right-Truncated Index-Based Web Search Engine in order to determine its usefulness for storing and retrieving Arabic documents. The Right-Truncated Index-Based Web Search Engine is a program that reads any set of Arabic documents, accepts a query, and then processes both the documents and the query. It then selects (predicts) the documents most relevant to the submitted query. The program comprises a morphological component and a mathematical one. The morphological component allows the researcher to run either a stemming algorithm or a right-truncation algorithm. The chief advantage of the stemming algorithm is that it uses the least possible amount of storage for indexing, because it maps inflected and derived terms to a single indexed stem word. The right-truncation algorithm, on the other hand, reduces storage to a lesser degree but, compared with the stemming algorithm, increases the probability of retrieving relevant (user-favorable) documents. One purpose of our investigation is to compare the efficiency of these two indexing mechanisms. The mathematical component accepts the output of the right-truncation algorithm and then uses term frequency and inverse document frequency (TF-IDF) to establish the relative importance of each document with respect to the terms of the query. This component computes the TF-IDF term-weighting scheme by multiplying the inverse-document-frequency array by the term-frequency array for each term contained in every document. It then computes the cosine similarity between the query vector and each individual document vector in the collection. The greater the cosine similarity between the query vector and a document vector, the more relevant the document is to the query. 
Expressed differently, the greater the cosine similarity between the terms of the query and a document that contains those terms, the higher the probability that the document will correspond to the user's interest, thereby improving the retrieval power of the query.
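The pipeline summarized above (right truncation, TF-IDF weighting, cosine ranking) can be illustrated with a minimal Python sketch. It is not the engine's actual implementation. It assumes that right truncation means keeping a fixed-length leading prefix of each term in logical character order, that terms are separated by whitespace, and that IDF is the unsmoothed log(N / df). The function names and the prefix length `keep=4` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def right_truncate(term, keep=4):
    # Assumption: right truncation keeps the first `keep` characters
    # (logical order) and drops whatever follows.
    return term[:keep]


def tf_idf_vectors(docs, keep=4):
    # Tokenize on whitespace and truncate every term.
    tokenized = [[right_truncate(t, keep) for t in d.split()] for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Assumption: plain log(N / df) with no smoothing.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # TF-IDF weight = term frequency * inverse document frequency.
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, idf, vectors


def query_vector(query, vocab, idf, keep=4):
    # Query terms outside the collection vocabulary are ignored.
    tf = Counter(right_truncate(t, keep) for t in query.split())
    return [tf[t] * idf[t] for t in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat its similarity as 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs, keep=4):
    # Return (score, document index) pairs, most similar first.
    vocab, idf, vectors = tf_idf_vectors(docs, keep)
    q = query_vector(query, vocab, idf, keep)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

As a usage example, calling `rank("كتابات", ["كتاب الطالب", "مدرسة المدينة", "كتابة الطالب"])` truncates the query and the first term of the first and third documents to the same four-character prefix, so those two documents receive equal, nonzero scores, while the second document shares no truncated term with the query and scores zero. A stemmer could be substituted for `right_truncate` to compare the two indexing mechanisms under the same weighting scheme.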
|