Discover Underlying Topics in Thai News Articles: A Comparative Study of Probabilistic and Matrix Factorization Approaches


Pitichotchokphokhin, Pimpitcha, Chuangkrud, Piyawat, Kalakan, Kongkan, Suntisrivaraporn, Boontawee, Leelanupab, Teerapong and Kanungsukkasem, Nont (2020) Discover Underlying Topics in Thai News Articles: A Comparative Study of Probabilistic and Matrix Factorization Approaches. In: 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2020-06-24, Phuket, Thailand.

Abstract

Topic modeling is an unsupervised learning approach that automatically discovers the hidden thematic structure in text documents. For text mining, topic modeling is a language-independent technique that disregards grammar and word order. Apart from semantic and structural issues, the Thai language is typically considered more complex than others because it lacks word delimiters and has a surfeit of compound words. Errors from word tokenization can create significant problems for downstream text processing, such as document retrieval, sentiment analysis, and machine translation, degrading the performance of text applications. Although word order is strongly correlated with semantic meaning, topic modeling has been widely reported to extract the latent information (also known as latent topics or latent semantics) encoded in documents. A few previous studies have examined topic modeling in the Thai language, but they mostly focused on upstream Natural Language Processing (NLP) steps, for example applying a refined stop-word list to, or adding N-grams to, a single specific topic modeling method. To our knowledge, this paper is the first to explore different topic modeling approaches, namely Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), in the Thai language and to compare their coherence. We also employ and compare a set of state-of-the-art evaluation metrics based on Topic Coherence.
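
To make the idea of a coherence-based evaluation concrete, the following is a minimal sketch of one widely used coherence score, UMass coherence, computed from document co-occurrence counts. The abstract does not state which coherence metrics the paper uses, so the choice of UMass here is an assumption for illustration only. The function name `umass_coherence` and the toy corpus are hypothetical, and the documents are assumed to be already tokenized (for Thai text, a word segmenter would have to run first).

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words.

    top_words is assumed to be ordered from most to least probable in the topic.
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(w_i, w_j) + 1) / d_j)
    return score


# Hypothetical toy corpus: two weather documents and two politics documents.
docs = [
    ["flood", "rain", "river"],
    ["rain", "storm", "flood"],
    ["election", "vote", "party"],
    ["vote", "party", "election"],
]

coherent = umass_coherence(["flood", "rain"], docs)  # words that co-occur
mixed = umass_coherence(["flood", "vote"], docs)     # words that never co-occur
print(coherent > mixed)
```

In this sketch, a topic whose top words appear together in the same documents scores higher than one that mixes unrelated words. A score of this kind can be computed for the top words of any topic model, which is what allows topics from LDA and NMF to be compared on the same scale.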

Item Type:

Conference or Workshop Item (Paper)

Identification Number (DOI):

Deposited by:

Automated system

Date Deposited:

2021-09-09 23:53:47

Last Modified:

2021-09-29 22:33:03
