When are Latent Topics Useful for Text Mining?: Enriching Bag-of-Words Representations with Information Extraction in Thai News Articles

Kanungsukkasem, Nont, Chuangkrud, Piyawat, Pitichotchokphokhin, Pimpitcha, Damrongrat, Chaianun and Leelanupab, Teerapong (2023) When are Latent Topics Useful for Text Mining?: Enriching Bag-of-Words Representations with Information Extraction in Thai News Articles. In: 15th Asian Conference on Intelligent Information and Database Systems (ACIIDS 2023), Thailand.

Abstract

The Bag-of-Words (BOW) model is a simple yet successful representation of text documents. However, it produces a sparse matrix in which most of the elements are zero. Topic modeling is an unsupervised learning method that can represent text documents in a low-dimensional space. Latent Dirichlet Allocation (LDA) is a topic modeling technique used for topic extraction and data exploration, with interpretable output. This paper presents a thorough study of the potential benefits of applying LDA, as a feature extraction method, to topic discovery and document classification in Thai news articles, in comparison with TF–IDF and Word2Vec. We also studied how many of the top Thai terms extracted by LDA with different numbers of topics are interpretable, meaningful, and representative of the corpus. In addition, a set of Topic Coherence measures was included in our study to estimate the degree of semantic similarity of the extracted topics. To compare the classification performance and optimization time of features from the different feature extraction methods, we experimented with various types of classifiers, e.g., Logistic Regression, Random Forest, and XGBoost.
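The sparsity problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the three English toy documents, the whitespace tokenizer, and the helper names `bag_of_words` and `sparsity` are assumptions made for illustration only (Thai text would first need a word segmenter, which is omitted here).

```python
from collections import Counter


def bag_of_words(docs):
    """Build a vocabulary and a dense document-term count matrix.

    Assumption: tokens are separated by whitespace (not valid for raw Thai text).
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per document, one column per vocabulary term.
        matrix.append([counts.get(term, 0) for term in vocab])
    return vocab, matrix


def sparsity(matrix):
    """Return the fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus, used only to show the effect.
docs = [
    "stock market rises",
    "football team wins match",
    "market falls as team loses",
]
vocab, matrix = bag_of_words(docs)
print(f"vocabulary size: {len(vocab)}, sparsity: {sparsity(matrix):.2f}")
```

Even in this tiny corpus most entries are zero, and the proportion of zeros typically grows as the vocabulary grows. This is the motivation for comparing BOW-style features against lower-dimensional representations such as LDA topic proportions.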

Item Type:

Conference or Workshop Item (Speech)

Subjects:

Subjects > Computer Science > Artificial Intelligence

Subjects > Computer Science > Machine Learning

Subjects > Computer Science > Computation and Language (Computational Linguistics and Natural Language and Speech Processing)

Deposited by:

Nont Kanungsukkasem

Date Deposited:

2024-11-19 12:26:31

Last Modified:

2024-12-02 11:54:19
