This paper presents MASRAD, a terminology dataset for Arabic terminology management, and a method with supporting tools for its semi-automatic construction. The entries in MASRAD are pairs $(f,a)$, where $f$ is a foreign (non-Arabic) term that appears in a specialized, academic, or field-specific book next to its Arabic counterpart $a$. MASRAD-Ex systematically extracts these pairs as a first step in constructing MASRAD. MASRAD helps improve term consistency in academic translations and specialized Arabic documents, and helps automate cross-lingual text processing. MASRAD-Ex leverages translated terms that occur organically in Arabic books and considers several candidate pairs for each term phrase. The candidate Arabic terms occur next to the foreign terms and vary in length. MASRAD-Ex computes lexicographic, phonetic, morphological, and semantic similarity metrics for each candidate pair, and uses three approaches to decide on the best candidate: heuristic, machine learning, and machine learning with post-processing. This paper presents MASRAD after thorough expert review and makes it available to the interested research community. The best-performing MASRAD-Ex approach achieved 90.5% precision and 92.4% recall.
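The candidate-selection idea described above can be illustrated with a minimal sketch. It is not the MASRAD-Ex implementation: the function names, the tiny transliteration table, the assumption that the Arabic candidates directly precede the foreign term, and the use of a single crude phonetic-style score in place of the four metric families are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Span = Tuple[str, ...]

# Hypothetical, deliberately tiny Arabic-to-Latin table (not from the paper).
AR2LAT = {
    "ا": "a", "ب": "b", "ت": "t", "خ": "kh", "د": "d", "ر": "r", "س": "s",
    "ظ": "z", "ك": "k", "ل": "l", "م": "m", "ن": "n", "و": "w", "ي": "y",
}


def candidate_spans(tokens: List[str], f_index: int, max_len: int = 3) -> List[Span]:
    """Return token spans of length 1..max_len that end right before tokens[f_index].

    Assumption: the Arabic term precedes the foreign term; the source only
    says the candidates occur "next to" it.
    """
    spans = []
    for n in range(1, max_len + 1):
        start = f_index - n
        if start < 0:
            break  # ran out of tokens to the left
        spans.append(tuple(tokens[start:f_index]))
    return spans


def transliterate(text: str) -> str:
    """Map known Arabic letters to Latin, keep whitespace as a space, drop the rest."""
    return "".join(AR2LAT.get(ch, " " if ch.isspace() else "") for ch in text)


def phonetic_score(foreign: str, span: Span) -> float:
    """Crude stand-in for a phonetic metric: string similarity after transliteration."""
    return SequenceMatcher(None, foreign.lower(), transliterate(" ".join(span))).ratio()


def best_candidate(foreign: str, spans: List[Span],
                   score: Callable[[str, Span], float]) -> Span:
    """Heuristic decision: keep the candidate span with the highest score."""
    return max(spans, key=lambda span: score(foreign, span))


# Usage: the foreign term sits at index 3, after three Arabic tokens.
tokens = ["يستخدم", "النظام", "بروتوكول", "protocol"]
spans = candidate_spans(tokens, f_index=3)
print(best_candidate("protocol", spans, phonetic_score))
```

Because `best_candidate` takes the scoring function as a parameter, a weighted combination of several metrics or a trained classifier's confidence could be swapped in without changing the candidate generation step.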