Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec

Kim, Donghwa; Seo, Deokseong; Cho, Suhyoun; Kang, Pilsung

doi:10.1016/j.ins.2018.10.006

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kim, Donghwa	-
dc.contributor.author	Seo, Deokseong	-
dc.contributor.author	Cho, Suhyoun	-
dc.contributor.author	Kang, Pilsung	-
dc.date.accessioned	2021-09-01T18:14:31Z	-
dc.date.available	2021-09-01T18:14:31Z	-
dc.date.created	2021-06-19	-
dc.date.issued	2019-03	-
dc.identifier.issn	0020-0255	-
dc.identifier.uri	https://scholar.korea.ac.kr/handle/2021.sw.korea/67193	-
dc.description.abstract	The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency-inverse document frequency (TF-IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions. (C) 2018 Elsevier Inc. All rights reserved.	-
dc.language	English	-
dc.language.iso	en	-
dc.publisher	ELSEVIER SCIENCE INC	-
dc.subject	TEXT	-
dc.subject	FREQUENCY	-
dc.title	Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec	-
dc.type	Article	-
dc.contributor.affiliatedAuthor	Kang, Pilsung	-
dc.identifier.doi	10.1016/j.ins.2018.10.006	-
dc.identifier.scopusid	2-s2.0-85055422604	-
dc.identifier.wosid	000456763600002	-
dc.identifier.bibliographicCitation	INFORMATION SCIENCES, v.477, pp.15 - 29	-
dc.relation.isPartOf	INFORMATION SCIENCES	-
dc.citation.title	INFORMATION SCIENCES	-
dc.citation.volume	477	-
dc.citation.startPage	15	-
dc.citation.endPage	29	-
dc.type.rims	ART	-
dc.type.docType	Article	-
dc.description.journalClass	1	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalWebOfScienceCategory	Computer Science, Information Systems	-
dc.subject.keywordPlus	TEXT	-
dc.subject.keywordPlus	FREQUENCY	-
dc.subject.keywordAuthor	Document classification	-
dc.subject.keywordAuthor	Semi-supervised learning	-
dc.subject.keywordAuthor	TF-IDF	-
dc.subject.keywordAuthor	LDA	-
dc.subject.keywordAuthor	Doc2vec	-
dc.subject.keywordAuthor	Co-training	-
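
The abstract describes co-training over several document representations but gives no procedural detail, so the following is only a minimal sketch of a generic multi-view co-training loop, not the authors' MCT implementation. Everything in it is an assumption made for illustration: the three views (word TF-IDF, character trigrams, word prefixes) are plain-Python stand-ins for the TF-IDF, LDA, and Doc2Vec representations, the nearest-centroid cosine classifier is an arbitrary choice, and the names `co_train`, `predict`, `centroids`, and `min_margin` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def word_tfidf(docs):
    """View 1: sparse TF-IDF weights over whitespace tokens."""
    toks = [d.lower().split() for d in docs]
    df = Counter(t for ts in toks for t in set(ts))
    n = len(docs)
    return [{t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(ts).items()}
            for ts in toks]

def char_trigrams(docs):
    """View 2 (stand-in for a second representation): character trigram counts."""
    return [dict(Counter(d.lower()[i:i + 3] for i in range(len(d) - 2))) for d in docs]

def word_prefixes(docs):
    """View 3 (stand-in for a third representation): counts of 4-letter word prefixes."""
    return [dict(Counter(w[:4] for w in d.lower().split())) for d in docs]

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroids(vecs, labels):
    """Sum the vectors of each class; cosine ignores scale, so no averaging is needed."""
    cents = defaultdict(Counter)
    for i, y in labels.items():
        cents[y].update(vecs[i])
    return cents

def predict(vec, cents):
    """Return (label, margin), where margin is the gap between the two best classes."""
    sims = sorted(((cosine(vec, c), y) for y, c in cents.items()), reverse=True)
    runner_up = sims[1][0] if len(sims) > 1 else 0.0
    return sims[0][1], sims[0][0] - runner_up

def co_train(views, seed_labels, max_rounds=20, min_margin=0.0):
    """Each round, every view pseudo-labels its single most confident unlabeled
    document and adds it to the label pool shared by all views."""
    labels = dict(seed_labels)
    n_docs = len(views[0])
    for _ in range(max_rounds):
        unlabeled = [i for i in range(n_docs) if i not in labels]
        if not unlabeled:
            break
        picks = {}
        for vecs in views:
            cents = centroids(vecs, labels)
            scored = [(i,) + predict(vecs[i], cents) for i in unlabeled]
            i, label, margin = max(scored, key=lambda s: s[2])
            if margin > min_margin:
                picks.setdefault(i, label)  # on conflict, the earlier view wins
        if not picks:
            break  # no view is confident enough; stop early
        labels.update(picks)
    return labels

# Toy corpus (stopwords already removed); only documents 0 and 1 carry labels.
docs = [
    "striker scored goal match",
    "bank raised interest rates",
    "goalkeeper saved match late goal",
    "interest rates hurt bank profits",
    "striker missed goal",
    "bank cut rates",
]
views = [word_tfidf(docs), char_trigrams(docs), word_prefixes(docs)]
labels = co_train(views, {0: "sports", 1: "finance"})
```

In this sketch the views only share pseudo-labels, never features, which is the usual co-training idea of letting each representation teach the others. Any other representation can be dropped in as an extra view, provided it yields one sparse dict per document.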

Files in This Item: There are no files associated with this item.

Appears in Collections: College of Engineering > School of Industrial and Management Engineering > 1. Journal Articles

Related Researcher

Kang, Pil sung: College of Engineering (School of Industrial and Management Engineering)
