Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kim, Donghwa | - |
dc.contributor.author | Seo, Deokseong | - |
dc.contributor.author | Cho, Suhyoun | - |
dc.contributor.author | Kang, Pilsung | - |
dc.date.accessioned | 2021-09-01T18:14:31Z | - |
dc.date.available | 2021-09-01T18:14:31Z | - |
dc.date.created | 2021-06-19 | - |
dc.date.issued | 2019-03 | - |
dc.identifier.issn | 0020-0255 | - |
dc.identifier.uri | https://scholar.korea.ac.kr/handle/2021.sw.korea/67193 | - |
dc.description.abstract | The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency-inverse document frequency (TF-IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions. (C) 2018 Elsevier Inc. All rights reserved. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | ELSEVIER SCIENCE INC | - |
dc.subject | TEXT | - |
dc.subject | FREQUENCY | - |
dc.title | Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Kang, Pilsung | - |
dc.identifier.doi | 10.1016/j.ins.2018.10.006 | - |
dc.identifier.scopusid | 2-s2.0-85055422604 | - |
dc.identifier.wosid | 000456763600002 | - |
dc.identifier.bibliographicCitation | INFORMATION SCIENCES, v.477, pp.15 - 29 | - |
dc.relation.isPartOf | INFORMATION SCIENCES | - |
dc.citation.title | INFORMATION SCIENCES | - |
dc.citation.volume | 477 | - |
dc.citation.startPage | 15 | - |
dc.citation.endPage | 29 | - |
dc.type.rims | ART | - |
dc.type.docType | Article | - |
dc.description.journalClass | 1 | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | - |
dc.subject.keywordPlus | TEXT | - |
dc.subject.keywordPlus | FREQUENCY | - |
dc.subject.keywordAuthor | Document classification | - |
dc.subject.keywordAuthor | Semi-supervised learning | - |
dc.subject.keywordAuthor | TF-IDF | - |
dc.subject.keywordAuthor | LDA | - |
dc.subject.keywordAuthor | Doc2vec | - |
dc.subject.keywordAuthor | Co-training | - |
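
The abstract describes co-training over several document representations. The following is a minimal sketch of that general idea, not the authors' MCT algorithm: each "view" is a feature matrix for the same documents, a classifier is fit per view on the labeled documents, and each view pseudo-labels the unlabeled documents it is most confident about. Everything here is an assumption made for illustration: the nearest-centroid classifier, the confidence score, the function names (`centroid_fit`, `centroid_predict`, `co_train`), the `rounds` and `per_round` parameters, and the toy vectors, which merely stand in for representations such as TF-IDF, LDA topic distributions, or Doc2Vec embeddings.

```python
import math
from collections import Counter


def centroid_fit(X, y):
    """Fit a nearest-centroid model: return {label: mean feature vector}."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def centroid_predict(model, row):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(row, mu)) for c, mu in model.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def co_train(views, labels, rounds=5, per_round=1):
    """Illustrative co-training loop (hypothetical, not the paper's procedure).

    views  : list of feature matrices, one per representation, same row order
    labels : list holding a class label, or None for an unlabeled document
    """
    labels = list(labels)
    for _ in range(rounds):
        labeled = [i for i, y in enumerate(labels) if y is not None]
        unlabeled = [i for i, y in enumerate(labels) if y is None]
        if not unlabeled:
            break
        proposals = {}  # doc index -> (label, confidence) of the most confident view
        for X in views:
            model = centroid_fit([X[i] for i in labeled],
                                 [labels[i] for i in labeled])
            scored = sorted(((centroid_predict(model, X[i]), i) for i in unlabeled),
                            key=lambda item: -item[0][1])
            for (label, conf), i in scored[:per_round]:
                if i not in proposals or conf > proposals[i][1]:
                    proposals[i] = (label, conf)
        # Pseudo-labels chosen by any view become training data for all views.
        for i, (label, _) in proposals.items():
            labels[i] = label
    return labels


# Toy data: two views of six documents, only two of which are labeled.
view_a = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
view_b = [[0.0, 2.0], [0.1, 1.8], [0.3, 1.9], [2.0, 0.0], [1.9, 0.2], [1.7, 0.1]]
seed_labels = ["sports", None, None, "politics", None, None]

result = co_train([view_a, view_b], seed_labels)
print(result)
```

In this sketch a pseudo-label proposed by one view is added to the shared labeled pool, so the other view trains on it in the next round. That exchange between views is the core of co-training. A real pipeline would replace the toy vectors with actual document representations and the centroid model with a stronger classifier.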