Comparative Study among Term Frequency-Inverse Document Frequency and Count Vectorizer towards K Nearest Neighbor and Decision Tree Classifiers for Text Dataset

Tula Kanta Deo; Rajesh Keshavrao Deshmukh; Gajendra Sharma

doi:10.3126/njmr.v7i2.68189

Authors

Tula Kanta Deo, Kalinga University, Raipur(CG), India
Rajesh Keshavrao Deshmukh, Kalinga University, Raipur(CG), India
Gajendra Sharma, Kathmandu University, Nepal

DOI:

https://doi.org/10.3126/njmr.v7i2.68189

Keywords:

Count Vectorizer, Decision Tree, K Nearest Neighbor, Term Frequency and Inverse Document Frequency

Abstract

Background: Text classification techniques are increasingly important given the exponential growth of textual data on the internet. Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (CV) are commonly used feature extraction methods. TF-IDF assigns weights to terms based on their frequency, whereas CV simply counts the occurrences of terms. This study evaluates and compares the performance of CV and TF-IDF with K Nearest Neighbor (KNN) and Decision Tree (DT) classifiers across text datasets.
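The difference between the two feature extraction methods can be illustrated with a minimal sketch. The corpus below is hypothetical, and the weighting idf = ln(N / df) is only one common TF-IDF variant; the abstract does not state which formula or library the authors used.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; not taken from the study's datasets).
docs = [
    "good dress good fit",
    "bad fit",
    "good quality",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def count_vector(tokens):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

# Document frequency: number of documents that contain each term.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
n_docs = len(tokenized)

def tfidf_vector(tokens):
    """Term count scaled by idf = ln(N / df); an assumed variant."""
    counts = Counter(tokens)
    return [counts[t] * math.log(n_docs / df[t]) for t in vocab]

cv = [count_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

for term, c, w in zip(vocab, cv[0], tfidf[0]):
    print(f"{term:8s} count={c} tfidf={w:.3f}")
```

In the first document, "good" has the highest count, but "dress" receives the highest TF-IDF weight because it appears in only one document of the corpus.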

Methodology: The investigation begins with preprocessing. Feature vectors are then created using both TF-IDF and CV and passed to the KNN and DT classifiers at the training stage. Experiments are run on two public Kaggle datasets: the Ukraine 10K tweets sentiment_analysis dataset and the Womens ecommerce clothing reviews dataset.
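As a rough illustration of how feature vectors feed a classifier, the sketch below runs a K Nearest Neighbor vote over count vectors. The training texts, the choice of cosine similarity, and k = 3 are all assumptions made for the example; the abstract does not report the study's preprocessing steps, distance metric, or hyperparameters, and the Decision Tree classifier is not sketched here.

```python
import math
from collections import Counter

# Hypothetical labelled training texts (not from the study's datasets).
train = [
    ("love this dress great fit", "positive"),
    ("great quality love it", "positive"),
    ("poor quality bad fit", "negative"),
    ("bad fabric poor stitching", "negative"),
]

def vectorize(text):
    """Sparse count vector: term -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(text, k=3):
    """Majority label among the k most similar training texts."""
    query = vectorize(text)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict("great fit love it"))
print(knn_predict("poor fabric bad"))
```

Swapping `vectorize` for a TF-IDF weighting would change the similarity scores but leave the voting step untouched, which is the comparison the study sets up.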

Findings: The average precision, recall, F1 score and accuracy of KNN with TF-IDF were 84.5%, 87%, 83% and 87% respectively, and of KNN with CV were 83.5%, 87%, 83.5% and 87% respectively. Similarly, the average precision, recall, F1 score and accuracy of DT with TF-IDF were 89%, 89%, 89% and 89% respectively, and of DT with CV were 89%, 89.5%, 89.5% and 89.5% respectively. The results obtained in this research are consistent with those of previous similar research.
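The four reported metrics can be computed as in the sketch below. It assumes support-weighted averaging over classes, which the abstract does not confirm; under that scheme recall always equals accuracy, which is at least consistent with the matching recall and accuracy figures above. The labels are hypothetical.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # share of true samples belonging to this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "pos"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```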

Conclusions: In this study, TF-IDF and CV perform almost identically for a given dataset and classifier.

Novelty: The novelty and contribution of this research lie in applying these classifiers and feature extraction methods to these datasets.

Author Biographies

Tula Kanta Deo, Kalinga University, Raipur(CG), India

Department of Computer Science and Engineering

Rajesh Keshavrao Deshmukh, Kalinga University, Raipur(CG), India

Department of Computer Science and Engineering

Gajendra Sharma, Kathmandu University, Nepal

Department of Computer Science and Engineering
