Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies

Kim, Gyeongmin; Son, Junyoung; Kim, Jinsung; Lee, Hyunhee; Lim, Heuiseok

doi:10.1109/ACCESS.2021.3126882

Cited 0 times in Web of Science

Cited 0 times in Scopus

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kim, Gyeongmin	-
dc.contributor.author	Son, Junyoung	-
dc.contributor.author	Kim, Jinsung	-
dc.contributor.author	Lee, Hyunhee	-
dc.contributor.author	Lim, Heuiseok	-
dc.date.accessioned	2022-03-12T03:40:55Z	-
dc.date.available	2022-03-12T03:40:55Z	-
dc.date.created	2022-01-20	-
dc.date.issued	2021	-
dc.identifier.issn	2169-3536	-
dc.identifier.uri	https://scholar.korea.ac.kr/handle/2021.sw.korea/138674	-
dc.description.abstract	Tokenization is an important first step in training a Pre-trained Language Model (PLM), and it alleviates the Out-of-Vocabulary problem in Natural Language Processing. Because the tokenization strategy can change how a model understands language, the composition of input features should reflect the characteristics of the target language. This study addresses the question "Which tokenization strategy best captures the characteristics of the Korean language for the Named Entity Recognition (NER) task based on a language model?", focusing on tokenization because it significantly affects the quality of input features. We present two significant challenges that the agglutinative nature of Korean poses for the NER task. We then analyze, quantitatively and qualitatively, how each tokenization strategy copes with these challenges. By adopting various linguistic segmentation units such as morpheme, syllable, and subcharacter, we demonstrate their effectiveness and compare the performance of PLMs built on each tokenization strategy. We validate that the most consistent strategy for the challenges of the Korean language is syllable-level tokenization based on SentencePiece.	-
dc.language	English	-
dc.language.iso	en	-
dc.publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC	-
dc.subject	REPRESENTATION	-
dc.title	Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies	-
dc.type	Article	-
dc.contributor.affiliatedAuthor	Lim, Heuiseok	-
dc.identifier.doi	10.1109/ACCESS.2021.3126882	-
dc.identifier.scopusid	2-s2.0-85119719307	-
dc.identifier.wosid	000719560000001	-
dc.identifier.bibliographicCitation	IEEE ACCESS, v.9, pp.151814 - 151823	-
dc.relation.isPartOf	IEEE ACCESS	-
dc.citation.title	IEEE ACCESS	-
dc.citation.volume	9	-
dc.citation.startPage	151814	-
dc.citation.endPage	151823	-
dc.type.rims	ART	-
dc.type.docType	Article	-
dc.description.journalClass	1	-
dc.description.journalRegisteredClass	scie	-
dc.description.journalRegisteredClass	scopus	-
dc.relation.journalResearchArea	Computer Science	-
dc.relation.journalResearchArea	Engineering	-
dc.relation.journalResearchArea	Telecommunications	-
dc.relation.journalWebOfScienceCategory	Computer Science, Information Systems	-
dc.relation.journalWebOfScienceCategory	Engineering, Electrical & Electronic	-
dc.relation.journalWebOfScienceCategory	Telecommunications	-
dc.subject.keywordPlus	REPRESENTATION	-
dc.subject.keywordAuthor	Hidden Markov models	-
dc.subject.keywordAuthor	Korean pre-trained language model	-
dc.subject.keywordAuthor	Linguistics	-
dc.subject.keywordAuthor	Named entity recognition	-
dc.subject.keywordAuthor	Semantics	-
dc.subject.keywordAuthor	Solid modeling	-
dc.subject.keywordAuthor	Syntactics	-
dc.subject.keywordAuthor	Task analysis	-
dc.subject.keywordAuthor	Tokenization	-
dc.subject.keywordAuthor	agglutinative language	-
dc.subject.keywordAuthor	linguistic segmentation	-
dc.subject.keywordAuthor	natural language processing	-
dc.subject.keywordAuthor	tokenization	-
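
The abstract contrasts segmentation at the syllable and subcharacter levels. As a rough illustration of what those two granularities mean for Korean text, the sketch below splits a word into Hangul syllable blocks and into jamo (subcharacters) using Unicode normalization from the Python standard library. This is not the authors' pipeline: the function names are hypothetical, SentencePiece training is not shown, and morpheme-level segmentation is left out because it requires a morphological analyzer.

```python
import unicodedata


def syllable_tokens(text):
    """Split text into Hangul syllable blocks (one token per character).

    Hypothetical helper for illustration; whitespace is dropped.
    """
    return [ch for ch in text if not ch.isspace()]


def subcharacter_tokens(text):
    """Split text into jamo (subcharacters).

    NFD normalization decomposes each precomposed Hangul syllable into
    its conjoining jamo: leading consonant, vowel, optional final consonant.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return [ch for ch in decomposed if not ch.isspace()]


if __name__ == "__main__":
    word = "한국어"  # "Korean language"
    print(syllable_tokens(word))      # 3 syllable blocks
    print(subcharacter_tokens(word))  # 8 jamo: 3 + 3 + 2
```

Recomposing the jamo with NFC normalization returns the original word, so the subcharacter view loses no information; it only changes the unit on which a subword vocabulary would be learned.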

Files in This Item: There are no files associated with this item.

Appears in Collections: Graduate School > Department of Computer Science and Engineering > 1. Journal Articles
