Arabic Text Classification Using Dynamic N-Gram

Title in another language: تصنيف النصوص العربية باستخدام الانغرام المتغير
Main author: Al Omoush, Safaa Qasim (Author)
Other authors: Samawi, Venus W. (Advisor)
Date (Gregorian): 2013
Place: Mafraq
Pages: 1 - 52
MD number: 819023
Content type: Theses
Language: English
Degree: Master's thesis
University: Al al-Bayt University
College: Prince Hussein Bin Abdullah College for Information Technology
Country: Jordan
Databases: Dissertations
Abstract: An N-gram is a subsequence of N items from a given sequence. For noisy text, N-grams are an ideal solution, so we use them to represent text documents. In the literature, the term sometimes refers to sequences that are not ordered or consecutive; in this thesis, an N-gram is a chain of N consecutive characters. Few studies have used a static value of N for Arabic text classification and information retrieval. With a static N-gram, the text is segmented into N-grams of the same length (the value of N), such as 3, 4, or 5. The problem with this representation is that any word or stem with fewer than N characters is discarded as useless. For example, if N = 4, every word with fewer than 4 letters is ignored. Our work develops an automated system for classifying Arabic text documents using N-grams as the text representation. We propose a dynamic N-gram, in which N is determined dynamically from the word length, to reduce the common grams that may belong to entirely different words. To study whether the dynamic N-gram improves classification accuracy, we built both a traditional static N-gram system and the proposed dynamic N-gram system. The results of the two systems are compared in terms of accuracy, recall, precision, and F-measure. The F-measure is a standard statistical measure of classifier performance; it is an average based on precision and recall. The proposed system consists of several phases: document preprocessing, document feature extraction, construction of the classifier, and document classification. We constructed two classifiers: a Naïve Bayes (NB) classifier and a Dice-measure distance classifier.
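The contrast between static and dynamic segmentation can be sketched in Python. This is only an illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, and the rule N = min(word length, max_n) is assumed, since the abstract says only that N is determined from the word length.

```python
def static_ngrams(word, n):
    """Character n-grams of fixed length n.

    A word shorter than n produces no grams, which is the
    information loss the abstract attributes to static N-grams.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dynamic_ngrams(word, max_n=4):
    """Character n-grams whose length adapts to the word.

    ASSUMPTION: N = min(len(word), max_n). The abstract does not
    give the exact rule; this one simply keeps short words as a
    single gram instead of discarding them.
    """
    n = min(len(word), max_n)
    if n == 0:
        return []
    return static_ngrams(word, n)


if __name__ == "__main__":
    for w in ["من", "مكتبة"]:  # a 2-letter and a 5-letter word
        print(w, static_ngrams(w, 4), dynamic_ngrams(w, 4))
```

With N = 4, the static version returns an empty list for the two-letter word, while the assumed dynamic rule keeps it as one gram.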
Finally, in the classification phase, we evaluated the proposed system on the Diab dataset and calculated the standard evaluation measures mentioned above. The classification results were promising (F-measure = 98.87% with the Dice-measure classifier). We also found that the Dice-measure classifier performs better when the dynamic N-gram is used.
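A Dice-based classifier and the F-measure can be sketched as follows. The details are assumptions, since the abstract does not give them: documents and classes are represented here as sets of N-grams, a document is assigned to the class with the highest Dice coefficient, and the F-measure is taken to be the usual harmonic mean of precision and recall. All function names and the toy profiles are hypothetical.

```python
def dice_similarity(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two gram collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc_grams, class_profiles):
    """Return the label whose profile is most similar to the document.

    ASSUMPTION: each class is summarized by one set of N-grams.
    """
    return max(
        class_profiles,
        key=lambda label: dice_similarity(doc_grams, class_profiles[label]),
    )


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the common F1 form)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy profiles for illustration only; not data from the thesis.
    profiles = {"sports": {"ab", "bc", "zz"}, "politics": {"xx", "yy"}}
    print(classify({"ab", "bc"}, profiles))
    print(f_measure(0.5, 1.0))
```

Using sets ignores gram frequencies; a frequency-weighted profile would be an equally plausible reading of the abstract.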
