Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification

Aditya, Christian Sri Kusuma and Sumadi, Fauzi Dwi Setiawan (2023) Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 8 (4). pp. 781-788. ISSN 2503-2267

Text: Lampiran B.12 - Aditya Sumadi - Term Weighting TF-IDF ICF Term Distribution Centroid Text.pdf (386kB)
Text: Lampiran B.12 - Similarity - Aditya Sumadi - Term Weighting TF-IDF ICF Term Distribution Centroid Text.pdf (2MB)

Abstract

A text retrieval system needs a method that returns highly relevant documents in response to user requests. Term weighting is one of the most important stages of text representation. Term Frequency (TF) counts how often a word occurs in each document, while Inverse Document Frequency (IDF) accounts for how widely the word is spread across the document collection. However, TF-IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unequal a word's distribution across categories, the more important that word should be as a feature. This study develops a new term weighting method that weights terms by their frequency of occurrence in each class and integrates this with a centroid-based term distribution, which can minimize intra-cluster variance and maximize inter-cluster variance. Applied to SVM modeling on a dataset of 931 online news documents, the ICF.TDCB term weighting method gave the best results: the SVM model achieved an accuracy of 0.723, outperforming the other term weightings tested, namely TF.IDF, ICF, and TDCB.
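
The idea of weighting a term by how unevenly it is spread across classes can be illustrated with a minimal sketch. It assumes that ICF denotes inverse class frequency and uses common smoothed logarithmic forms for IDF and ICF; the abstract does not state the paper's exact formulas, so these are assumptions, and the centroid-based TDCB component is left out because it is not defined here. The function name `tf_idf_icf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Weight each term by TF * IDF * ICF.

    docs:   list of token lists, one per document
    labels: class label of each document
    The IDF and ICF forms below are assumed, not taken from the paper.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Class frequency: the set of classes in which each term occurs.
    term_classes = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_classes.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        doc_weights = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term]) + 1.0
            icf = math.log(n_classes / len(term_classes[term])) + 1.0
            doc_weights[term] = count * idf * icf
        weights.append(doc_weights)
    return weights


# Toy corpus: "goal" occurs only in the sport class, "team" in both classes.
docs = [
    ["goal", "match", "team"],
    ["vote", "election", "team"],
    ["goal", "striker"],
]
labels = ["sport", "politics", "sport"]
weights = tf_idf_icf(docs, labels)
print(weights[0])
```

In this toy corpus, "goal" and "team" both appear in two documents and so receive the same IDF, but "goal" is confined to one class and therefore gets the larger ICF and the larger final weight. This is the behavior the abstract describes: a term concentrated in few classes counts for more than one spread evenly across them.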

Item Type: Article
Keywords: Term Weighting; TF-IDF; ICF; Term Distribution; Centroid; Text
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Engineering > Department of Informatics (55201)
Depositing User: christianskaditya Christian Sri Kusuma Aditya, S.Kom., M.Kom
Date Deposited: 29 Apr 2024 04:38
Last Modified: 29 Apr 2024 04:38
URI: https://eprints.umm.ac.id/id/eprint/5931
