Abstract
This paper defines a Standard Arabic Profiling (SAP) toolset that helps researchers perform textual analysis and compare different Arabic corpora. Because tools for the Arabic language are needed, we present the SAP toolset to simplify the textual analysis process. The approach consists of three profilers. The first is the Part of Speech (POS) profiler, which gives a statistical analysis of a given document. The second is the vocabulary profiler, which gives the user an indication of the vocabulary used in a document with reference to the Open Source Arabic Corpus (OSAC) of two news agencies (CNN and BBC); it does so by computing the similarity between the document and the corpus using the log-likelihood measure. The third, newly added, is the readability profiler, which is used to 1) assess the readability level of a document according to the Flesch Reading Ease readability formula, and 2) measure the simplicity and ambiguity levels of the document. We describe the current part-of-speech profiler of this toolset and how its functionality can be extended to embrace vocabulary and readability profiling.
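The two measures named in the abstract can be illustrated with a minimal Python sketch. This is not the SAP implementation: the function names and parameters are hypothetical, the log-likelihood shown is the standard corpus-comparison form for a single word, and the Flesch Reading Ease constants are those of the original English formula. How the toolset counts syllables in Arabic text, or whether it adjusts the constants, is not stated in the abstract, so the counts are taken as inputs here.

```python
import math


def log_likelihood(freq_doc, size_doc, freq_ref, size_ref):
    """Log-likelihood score for one word in a document versus a reference corpus.

    freq_doc / freq_ref: occurrences of the word in each text.
    size_doc / size_ref: total number of words in each text.
    (All names are illustrative, not taken from the SAP toolset.)
    """
    combined = freq_doc + freq_ref
    total = size_doc + size_ref
    # Expected frequencies if the word were equally common in both texts.
    expected_doc = size_doc * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    # A zero count contributes nothing (limit of x * ln(x) as x -> 0).
    if freq_doc > 0:
        score += freq_doc * math.log(freq_doc / expected_doc)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease with the original English constants (an assumption)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
```

A word that is equally frequent in both texts scores zero under `log_likelihood`, and the score grows as the relative frequencies diverge. Higher `flesch_reading_ease` values indicate text that is easier to read.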
Original language | English (US) |
---|---|
Pages (from-to) | 222-229 |
Number of pages | 8 |
Journal | International Journal of Machine Learning and Computing |
Volume | 9 |
Issue number | 2 |
DOIs | |
State | Published - Apr 1 2019 |
Externally published | Yes |
Keywords
- Part-of-speech tagging (POST)
- Software
- Arabic natural language processing
- Text analysis
ASJC Scopus subject areas
- Computer Science Applications
- Information Systems and Management
- Artificial Intelligence