Detailed Information


Sound event detection by pseudo-labeling in weakly labeled dataset

Authors
Park C.Kim D.Ko H.
Issue Date
Dec-2021
Publisher
MDPI
Keywords
Dilated convolution; Gated linear unit (GLU); Noise label; Noise loss; Segmentation mask; Weakly labeled sound event detection (WSED)
Citation
Sensors, v.21, no.24
Indexed
SCOPUS
Journal Title
Sensors
Volume
21
Number
24
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/135510
DOI
10.3390/s21248375
ISSN
1424-8220
Abstract
Weakly labeled sound event detection (WSED) is an important task, as it can ease the data collection effort required before constructing a strongly labeled sound event dataset. Recent high-performing deep-learning-based WSED approaches exploit a segmentation mask for detecting the target feature map. However, accurate detection in real streaming audio remains limited for the following reasons. First, the convolutional neural networks (CNNs) employed in the segmentation mask extraction process do not appropriately weight feature importance, because the features are extracted without pooling operations; at the same time, the small kernel size keeps the receptive field small, making it difficult to learn diverse patterns. Second, because feature maps are obtained in an end-to-end fashion, the WSED model is vulnerable to unknown content in the wild. These limitations can lead to undesired feature maps, such as noise, in unseen environments. This paper addresses these issues by constructing a more efficient model that employs a gated linear unit (GLU) and dilated convolution to mitigate the de-emphasis of important features and the limited receptive field. In addition, this paper proposes pseudo-label-based learning that classifies target content and unknown content by adding a 'noise label' and a 'noise loss', so that unknown content can be separated as much as possible through the noise label. The experiments are performed by mixing DCASE 2018 Task 1 acoustic scene data and Task 2 sound event data. The experimental results show that the proposed SED model achieves the best F1 performance, with 59.7% at 0 SNR, 64.5% at 10 SNR, and 65.9% at 20 SNR. These results represent improvements of 17.7%, 16.9%, and 16.5%, respectively, over the baseline.
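The two building blocks named in the abstract, the gated linear unit and dilated convolution, can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation; the function names and the 1-D setting are illustrative assumptions, chosen only to show how a GLU gates one half of the features with the other half, and how dilation widens a kernel's receptive field without adding parameters.

```python
import math


def sigmoid(z):
    # Logistic function used as the GLU gate.
    return 1.0 / (1.0 + math.exp(-z))


def glu(x):
    """Gated linear unit: split the feature vector x = [a; b] in half
    and compute a * sigmoid(b), so the gate b decides how much of each
    linear feature in a passes through."""
    d = len(x) // 2
    return [x[i] * sigmoid(x[d + i]) for i in range(d)]


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution with a dilated kernel.
    The effective receptive field is (len(kernel) - 1) * dilation + 1,
    so increasing the dilation widens the context seen per output
    without adding kernel weights."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    return [
        sum(kernel[j] * signal[t + j * dilation] for j in range(k))
        for t in range(len(signal) - span + 1)
    ]
```

For example, `glu([1.0, 2.0, 0.0, 0.0])` gates both features with `sigmoid(0) = 0.5`, and `dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2)` sums samples two steps apart, covering a span of three samples with a two-tap kernel.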
Files in This Item
There are no files associated with this item.
Appears in
Collections
College of Engineering > School of Electrical Engineering > 1. Journal Articles


