A DATA-DRIVEN TEXT SIMILARITY MEASURE BASED ON CLASSIFICATION ALGORITHMS
- Authors
- Cho, Su Gon; Kim, Seoung Bum
- Issue Date
- 2017
- Publisher
- UNIV CINCINNATI INDUSTRIAL ENGINEERING
- Keywords
- classification; sentence-term matrix; text similarity measure; text mining
- Citation
- INTERNATIONAL JOURNAL OF INDUSTRIAL ENGINEERING-THEORY APPLICATIONS AND PRACTICE, v.24, no.3, pp.328 - 339
- Indexed
- SCIE; SCOPUS
- Journal Title
- INTERNATIONAL JOURNAL OF INDUSTRIAL ENGINEERING-THEORY APPLICATIONS AND PRACTICE
- Volume
- 24
- Number
- 3
- Start Page
- 328
- End Page
- 339
- URI
- https://scholar.korea.ac.kr/handle/2021.sw.korea/86287
- ISSN
- 1072-4761
- Abstract
- Measuring text similarity is fundamental to many text mining applications. This paper proposes a new method, based on classification algorithms, for measuring the similarity between two texts. Specifically, a sentence-term matrix, which records the frequency of each term across a collection of sentences, is constructed and used to measure how accurately a classifier can distinguish the two texts. The idea rests on the observation that similar texts are difficult to tell apart, so classification accuracy between similar texts should be low. In comparative experiments against several widely used text similarity measures, using real data from the Machine Learning Repository at the University of California, Irvine, the proposed method outperformed the existing measures across the entire range of term selection filters.
- Appears in Collections
- College of Engineering > School of Industrial and Management Engineering > 1. Journal Articles
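
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier, a tokenizer, or a validation scheme, so the multinomial naive Bayes classifier, the regex-based sentence and term splitting, and the leave-one-out evaluation below are all assumptions chosen for brevity, and every function name is hypothetical. Each sentence becomes one row of a sentence-term matrix, labeled by the text it came from; the returned accuracy is expected to be lower when the two texts are more alike.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not the paper's method)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def term_counts(sentence):
    """One row of the sentence-term matrix, stored sparsely as term -> frequency."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def nb_predict(train_rows, train_labels, row, vocab_size):
    """Multinomial naive Bayes with Laplace smoothing (assumed classifier)."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_labels)):
        rows = [r for r, y in zip(train_rows, train_labels) if y == label]
        totals = Counter()
        for r in rows:
            totals.update(r)
        n_terms = sum(totals.values())
        # Log prior plus smoothed log likelihood of each term in the held-out row.
        score = math.log(len(rows) / len(train_rows))
        for term, freq in row.items():
            score += freq * math.log((totals[term] + 1) / (n_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def classification_accuracy(text_a, text_b):
    """Leave-one-out accuracy of telling sentences of text_a from those of text_b.

    Lower accuracy suggests the two texts are more similar.
    """
    rows_a = [term_counts(s) for s in split_sentences(text_a)]
    rows_b = [term_counts(s) for s in split_sentences(text_b)]
    if len(rows_a) < 2 or len(rows_b) < 2:
        raise ValueError("each text needs at least two sentences")
    rows = rows_a + rows_b
    labels = [0] * len(rows_a) + [1] * len(rows_b)
    vocab_size = len(set().union(*rows))
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        # Hold out sentence i and train on all the others.
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        correct += nb_predict(train_rows, train_labels, row, vocab_size) == label
    return correct / len(rows)


if __name__ == "__main__":
    pets = "Cats purr softly. Cats chase mice. Cats sleep often."
    markets = "Stocks fell sharply. Stocks rose today. Stocks trade daily."
    print("pets vs markets:", classification_accuracy(pets, markets))
    print("pets vs pets:   ", classification_accuracy(pets, pets))
```

With only a handful of sentences the accuracy estimate is very coarse; texts long enough to give a stable estimate, and a term selection filter of the kind the abstract mentions, would be needed before drawing conclusions from such a score.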