Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec

Authors
Kim, Donghwa; Seo, Deokseong; Cho, Suhyoun; Kang, Pilsung
Issue Date
March 2019
Publisher
ELSEVIER SCIENCE INC
Keywords
Document classification; Semi-supervised learning; TF-IDF; LDA; Doc2vec; Co-training
Citation
INFORMATION SCIENCES, v.477, pp.15 - 29
Indexed
SCIE
SCOPUS
Journal Title
INFORMATION SCIENCES
Volume
477
Start Page
15
End Page
29
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/67193
DOI
10.1016/j.ins.2018.10.006
ISSN
0020-0255
Abstract
The purpose of document classification is to assign the most appropriate label to a given document. The main challenges in document classification are insufficient label information and the unstructured, sparse format of documents. A semi-supervised learning (SSL) approach can be an effective solution to the former problem, whereas considering multiple document representation schemes can address the latter. Co-training is a popular SSL method that exploits different perspectives on the same example, expressed as distinct feature subsets. In this paper, we propose multi-co-training (MCT) to improve the performance of document classification. To increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency-inverse document frequency (TF-IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and a neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions. © 2018 Elsevier Inc. All rights reserved.
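The co-training idea in the abstract can be illustrated with a minimal sketch. This is not the authors' MCT implementation: the abstract does not specify the classifier, the confidence rule, or how pseudo-labels are exchanged, so everything below is an assumption made for illustration. In particular, the second view (`char_trigrams`) is a hypothetical stand-in for the LDA and Doc2Vec representations, the nearest-centroid scorer is a placeholder classifier, and all function names and the toy documents are invented. The sketch uses only the Python standard library.

```python
import math
from collections import Counter

def tfidf(docs):
    """View 1: TF-IDF weights over whitespace tokens (idf = log(n/df) + 1)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    out = []
    for d in docs:
        tf = Counter(d.split())
        out.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return out

def char_trigrams(docs):
    """View 2: character trigram counts (ASSUMED stand-in for LDA / Doc2Vec)."""
    return [Counter(d[i:i + 3] for i in range(len(d) - 2)) for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_scores(vecs, labeled, i):
    """Similarity of document i to each class centroid in one view."""
    sums = {}
    for j, y in labeled.items():
        centroid = sums.setdefault(y, Counter())
        for k, v in vecs[j].items():
            centroid[k] += v
    return {y: cosine(vecs[i], c) for y, c in sums.items()}

def co_train(docs, seed_labels, views, rounds=5):
    """Simplified co-training: each round, every view nominates its most
    confident unlabeled document, and the pseudo-labels join a shared pool."""
    vecs = [view(docs) for view in views]
    labeled = dict(seed_labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(docs)) if i not in labeled]
        if not unlabeled:
            break
        picks = {}
        for v in vecs:
            best = None
            for i in unlabeled:
                ranked = sorted(centroid_scores(v, labeled, i).items(),
                                key=lambda kv: kv[1], reverse=True)
                runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
                margin = ranked[0][1] - runner_up  # confidence proxy
                if best is None or margin > best[0]:
                    best = (margin, i, ranked[0][0])
            # If two views nominate the same document, the first view wins.
            picks.setdefault(best[1], best[2])
        labeled.update(picks)
    return labeled

# Toy corpus (invented): two seed labels, four unlabeled documents.
docs = [
    "the team won the football match",
    "stocks fell as the market closed",
    "the football team lost the match",
    "the market rallied and stocks rose",
    "a late goal won the football match",
    "investors sold stocks as the market fell",
]
seed = {0: "sports", 1: "finance"}
result = co_train(docs, seed, [tfidf, char_trigrams])
```

Here `result` maps every document index to a label: the two seeds keep their labels and the remaining four receive pseudo-labels over successive rounds. A third view could be added by appending another function to the `views` list, which is the sense in which extra representations widen the pool of perspectives.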
Appears in Collections
College of Engineering > School of Industrial and Management Engineering > 1. Journal Articles
