Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kim, Gyeongmin | - |
dc.contributor.author | Son, Junyoung | - |
dc.contributor.author | Kim, Jinsung | - |
dc.contributor.author | Lee, Hyunhee | - |
dc.contributor.author | Lim, Heuiseok | - |
dc.date.accessioned | 2022-03-12T03:40:55Z | - |
dc.date.available | 2022-03-12T03:40:55Z | - |
dc.date.created | 2022-01-20 | - |
dc.date.issued | 2021 | - |
dc.identifier.issn | 2169-3536 | - |
dc.identifier.uri | https://scholar.korea.ac.kr/handle/2021.sw.korea/138674 | - |
dc.description.abstract | Tokenization is an essential first step in training a Pre-trained Language Model (PLM) and alleviates the Out-of-Vocabulary problem in Natural Language Processing. Because the tokenization strategy changes how a model represents linguistic units, the composition of input features should reflect the characteristics of the target language. This study asks: "Which tokenization strategy best captures the characteristics of the Korean language for the Named Entity Recognition (NER) task based on a language model?" We focus on tokenization because it significantly affects the quality of input features. We present two significant challenges that the agglutinative nature of Korean poses for the NER task, and we analyze quantitatively and qualitatively how each tokenization strategy copes with them. By adopting linguistic segmentation at the morpheme, syllable, and subcharacter levels, we demonstrate the effectiveness of each strategy and compare the performance of the PLMs built on it. We find that the most consistent strategy for these challenges of the Korean language is syllable-level tokenization based on SentencePiece. | - |
dc.language | English | - |
dc.language.iso | en | - |
dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | - |
dc.subject | REPRESENTATION | - |
dc.title | Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies | - |
dc.type | Article | - |
dc.contributor.affiliatedAuthor | Lim, Heuiseok | - |
dc.identifier.doi | 10.1109/ACCESS.2021.3126882 | - |
dc.identifier.scopusid | 2-s2.0-85119719307 | - |
dc.identifier.wosid | 000719560000001 | - |
dc.identifier.bibliographicCitation | IEEE ACCESS, v.9, pp.151814 - 151823 | - |
dc.relation.isPartOf | IEEE ACCESS | - |
dc.citation.title | IEEE ACCESS | - |
dc.citation.volume | 9 | - |
dc.citation.startPage | 151814 | - |
dc.citation.endPage | 151823 | - |
dc.type.rims | ART | - |
dc.type.docType | Article | - |
dc.description.journalClass | 1 | - |
dc.description.journalRegisteredClass | scie | - |
dc.description.journalRegisteredClass | scopus | - |
dc.relation.journalResearchArea | Computer Science | - |
dc.relation.journalResearchArea | Engineering | - |
dc.relation.journalResearchArea | Telecommunications | - |
dc.relation.journalWebOfScienceCategory | Computer Science, Information Systems | - |
dc.relation.journalWebOfScienceCategory | Engineering, Electrical & Electronic | - |
dc.relation.journalWebOfScienceCategory | Telecommunications | - |
dc.subject.keywordPlus | REPRESENTATION | - |
dc.subject.keywordAuthor | Hidden Markov models | - |
dc.subject.keywordAuthor | Korean pre-trained language model | - |
dc.subject.keywordAuthor | Linguistics | - |
dc.subject.keywordAuthor | Named entity recognition | - |
dc.subject.keywordAuthor | Semantics | - |
dc.subject.keywordAuthor | Solid modeling | - |
dc.subject.keywordAuthor | Syntactics | - |
dc.subject.keywordAuthor | Task analysis | - |
dc.subject.keywordAuthor | Tokenization | - |
dc.subject.keywordAuthor | agglutinative language | - |
dc.subject.keywordAuthor | linguistic segmentation | - |
dc.subject.keywordAuthor | natural language processing | - |
dc.subject.keywordAuthor | tokenization | - |
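
The abstract contrasts morpheme, syllable, and subcharacter segmentation. As a rough illustration of the latter two granularities, the sketch below splits a Korean string into syllable blocks and then into conjoining jamo using the standard Unicode arithmetic for precomposed Hangul syllables. It is not the tokenizers evaluated in the paper: the function names `to_syllables` and `to_subcharacters` are hypothetical, no SentencePiece model is involved, and morpheme-level segmentation is omitted because it requires an external morphological analyzer.

```python
# Illustrative sketch only; names are hypothetical and this is not the
# paper's tokenization pipeline.

# Conjoining jamo tables from the Unicode Hangul Jamo block.
CHOSEONG = [chr(0x1100 + i) for i in range(19)]           # initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]          # medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional final consonants

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def to_syllables(text):
    """Syllable-level units: one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def to_subcharacters(text):
    """Subcharacter-level units: decompose each Hangul syllable into jamo."""
    units = []
    for ch in to_syllables(text):
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            # syllable index = (initial * 21 + medial) * 28 + final
            initial, rest = divmod(index, 21 * 28)
            medial, final = divmod(rest, 28)
            units.append(CHOSEONG[initial])
            units.append(JUNGSEONG[medial])
            if final:  # final == 0 means the syllable has no final consonant
                units.append(JONGSEONG[final])
        else:
            # Non-Hangul characters are passed through unchanged.
            units.append(ch)
    return units


if __name__ == "__main__":
    word = "한국"
    print(to_syllables(word))
    print(to_subcharacters(word))
```

A subword model trained on either representation would see very different vocabularies: the syllable view keeps each block intact, while the subcharacter view expands every block into two or three jamo.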