Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Cho, Hyeongmin | - |
dc.contributor.author | Lee, Sangkyun | - |
dc.date.accessioned | 2021-08-30T05:04:39Z | - |
dc.date.available | 2021-08-30T05:04:39Z | - |
dc.date.created | 2021-06-18 | - |
dc.date.issued | 2021-01 | - |
dc.identifier.issn | 2076-3417 | - |
dc.identifier.uri | https://scholar.korea.ac.kr/handle/2021.sw.korea/50624 | - |
dc.description.abstract | Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | MDPI | - |
dc.title | Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Lee, Sangkyun | - |
dc.identifier.doi | 10.3390/app11020472 | - |
dc.identifier.scopusid | 2-s2.0-85099251864 | - |
dc.identifier.wosid | 000610939300001 | - |
dc.identifier.bibliographicCitation | APPLIED SCIENCES-BASEL, v.11, no.2, pp.1 - 17 | - |
dc.relation.isPartOf | APPLIED SCIENCES-BASEL | - |
dc.citation.title | APPLIED SCIENCES-BASEL | - |
dc.citation.volume | 11 | - |
dc.citation.number | 2 | - |
dc.citation.startPage | 1 | - |
dc.citation.endPage | 17 | - |
dc.type.rims | ART | - |
dc.type.docType | Article | - |
dc.description.journalClass | 1 | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Chemistry | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalResearchArea | Materials Science | - |
dc.relation.journalResearchArea | Physics | - |
dc.relation.journalWebOfScienceCategory | Chemistry, Multidisciplinary | - |
dc.relation.journalWebOfScienceCategory | Engineering, Multidisciplinary | - |
dc.relation.journalWebOfScienceCategory | Materials Science, Multidisciplinary | - |
dc.relation.journalWebOfScienceCategory | Physics, Applied | - |
dc.subject.keywordAuthor | data quality | - |
dc.subject.keywordAuthor | large-scale | - |
dc.subject.keywordAuthor | high-dimensionality | - |
dc.subject.keywordAuthor | linear discriminant analysis | - |
dc.subject.keywordAuthor | random projection | - |
dc.subject.keywordAuthor | bootstrapping | - |
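
The abstract above names random projection and bootstrapping as the basis for computing class separability efficiently, but this record does not give the paper's formulas. The following Python sketch is therefore only an illustration under stated assumptions, not the authors' measure: it scores separability with a Fisher-style ratio (between-class over within-class scatter, as in linear discriminant analysis) evaluated along random unit directions, and averages the best score over bootstrap resamples. The function names and the parameters `n_directions`, `n_bootstrap`, and `sample_size` are hypothetical.

```python
import numpy as np


def fisher_ratio_1d(z, y):
    """Between-class over within-class scatter of 1-D projected values z."""
    overall_mean = z.mean()
    between = 0.0
    within = 0.0
    for label in np.unique(y):
        zc = z[y == label]
        between += zc.size * (zc.mean() - overall_mean) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    return between / within if within > 0 else float("inf")


def separability_sketch(X, y, n_directions=20, n_bootstrap=10,
                        sample_size=200, seed=0):
    """Illustrative separability score (assumption, not the paper's definition).

    For each bootstrap resample, project the data onto random unit
    directions and keep the largest 1-D Fisher ratio; return the mean
    of these maxima over all resamples.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(n_bootstrap):
        # Bootstrap: draw row indices with replacement.
        idx = rng.integers(0, n, size=min(sample_size, n))
        Xb, yb = X[idx], y[idx]
        # Random projection: Gaussian directions normalized to unit length.
        W = rng.standard_normal((d, n_directions))
        W /= np.linalg.norm(W, axis=0, keepdims=True)
        Z = Xb @ W
        best = max(fisher_ratio_1d(Z[:, j], yb) for j in range(n_directions))
        scores.append(best)
    return float(np.mean(scores))
```

Working on one-dimensional projections of small resamples avoids forming and inverting a full within-class scatter matrix, which is the costly step for high-dimensional data such as images. A dataset whose classes are far apart should receive a higher score than one whose classes overlap.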