Detailed Information

Silent-PIM: Realizing the Processing-in-Memory Computing With Standard Memory Requests

Authors
Kim, Chang Hyun; Lee, Won Jun; Paik, Yoonah; Kwon, Kiyong; Kim, Seok Young; Park, Il; Kim, Seon Wook
Issue Date
Jan-Feb 2022
Publisher
IEEE COMPUTER SOC
Keywords
Silent-PIM; in-memory processing; standard memory requests; DMA; LSTM
Citation
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, v.33, no.2, pp.251 - 262
Indexed
SCIE
SCOPUS
Journal Title
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Volume
33
Number
2
Start Page
251
End Page
262
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/135224
DOI
10.1109/TPDS.2021.3065365
ISSN
1045-9219
Abstract
Deep Neural Network (DNN) and Recurrent Neural Network (RNN) applications, which are rapidly becoming attractive to the market, process large amounts of low-locality data; thus, memory bandwidth limits their peak performance. Many data centers therefore actively adopt high-bandwidth memory such as HBM2/HBM2E to resolve the problem. However, this approach does not provide a complete solution because it still transfers the data from the memory to the computing unit. Processing-in-memory (PIM), which performs the computation inside memory, has therefore attracted attention. However, most previous methods require modifying or extending core pipelines and memory system components such as memory controllers, which makes the practical implementation of PIM very challenging and expensive to develop. In this article, we propose Silent-PIM, which performs the PIM computation with standard DRAM memory requests; it thus requires no hardware modifications and allows the PIM memory device to perform the computation while servicing non-PIM applications' memory requests. We achieve this design goal by preserving the standard memory request behaviors and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM's offloading engine, so the PIM memory requests are processed quickly and the core is free to perform other tasks. We compared the performance of three Long Short-Term Memory (LSTM) model kernels on real platforms: the Silent-PIM modeled on an FPGA, a GPU, and a CPU. For (p x 512) x (512 x 2048) matrix multiplication with the batch size p varying from 1 to 128, the Silent-PIM performed up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at p = 1, the case without any data reuse. At p = 128, the case with the highest data reuse, the GPU performance was the highest, but the PIM performance was still higher than that of the CPU. Similarly, for (p x 2048) element-wise multiplication and addition, where there was no data reuse, the Silent-PIM always achieved higher performance than both the CPU and GPU. The results also showed that the PIM's energy-delay product (EDP) was superior to the others in all cases with no data reuse.
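The link between batch size p and data reuse in the abstract can be illustrated with a rough arithmetic-intensity estimate for the (p x 512) x (512 x 2048) product. The sketch below is not taken from the paper: the function name, the 4-byte element size, and the idealized traffic model (each operand read once, the result written once) are assumptions made only for illustration.

```python
def matmul_intensity(p, k=512, n=2048, bytes_per_elem=4):
    """Estimate FLOPs per byte for a (p x k) x (k x n) matrix product.

    Assumptions (not from the source): 4-byte elements, and ideal memory
    traffic in which both inputs are read once and the output written once.
    """
    flops = 2 * p * k * n                  # one multiply and one add per term
    elems = p * k + k * n + p * n          # input rows, weight matrix, output
    return flops / (elems * bytes_per_elem)


# Each weight element is used once per batch row, so reuse grows with p.
for p in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"p={p:3d}  weight reuse={p:3d}  FLOPs/byte={matmul_intensity(p):6.2f}")
```

Under these assumptions, the intensity at p = 1 stays below one FLOP per byte, which matches the abstract's description of that case as bandwidth-bound with no data reuse, while larger p amortizes the weight matrix over more rows.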
Appears in Collections
College of Engineering > School of Electrical Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Kim, Seon Wook
College of Engineering (School of Electrical Engineering)