Detailed Information

Cited 0 time in webofscience Cited 0 time in scopus
Metadata Downloads

Achieving the Performance of All-Bank In-DRAM PIM With Standard Memory Interface: Memory-Computation Decouplingopen access

Authors
Paik, YoonahKim, Chang HyunLee, Won JunKim, Seon Wook
Issue Date
2022
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Keywords
Memory-computation decoupling; in-memory processing; standard memory interface; all-bank execution
Citation
IEEE ACCESS, v.10, pp.93256 - 93272
Indexed
SCIE
SCOPUS
Journal Title
IEEE ACCESS
Volume
10
Start Page
93256
End Page
93272
URI
https://scholar.korea.ac.kr/handle/2021.sw.korea/147124
DOI
10.1109/ACCESS.2022.3203051
ISSN
2169-3536
Abstract
Processing-in-Memory (PIM) has been actively studied to overcome the memory bottleneck by placing computing units near or in memory, especially for efficiently processing low locality data-intensive applications. We can categorize the in-DRAM PIMs depending on how many banks perform the PIM computation by one DRAM command: per-bank and all-bank. The per-bank PIM operates only one bank, delivering low performance but preserving the standard DRAM interface and servicing non-PIM requests during PIM execution. The all-bank PIM operates all banks, achieving high performance but accompanying design issues like thermal and power consumption. We introduce the memory-computation decoupling execution to achieve the ideal all-bank PIM performance while preserving the standard JEDEC DRAM interface, i.e., performing the per-bank execution, thus easily adapted to commercial platforms. We divide the PIM execution into two phases: memory and computation phases. At the memory phase, we read the bank-private operands from a bank and store them in PIM engines' registers bank-by-bank. At the computation phase, we decouple the PIM engine from the memory array and broadcast a bank-shared operand using a standard read/write command to make all banks perform the computation simultaneously, thus reaching the computing throughput of the all-bank PIM. For extending the computation phase, i.e., maximizing all-bank execution opportunity, we introduce a compiler analysis and code generation technique to identify the bank-private and the bank-shared operands. We compared the performance of Level-2/3 BLAS, multi-batch LSTM-based Seq2Seq model, and BERT on our decoupled PIM with commercial computing platforms. In Level-3 BLAS, we achieved speedups of 75.8x , 1.2x, and 4.7x compared to CPU, GPU, and the per-bank PIM and up to 91.4% of the ideal all-bank PIM performance. Furthermore, our decoupled PIM consumed less energy than GPU and the per-bank PIM by 72.0% and 78.4%, but 7.4%, a little more than the ideal all-bank PIM.
Files in This Item
There are no files associated with this item.
Appears in
Collections
College of Engineering > School of Electrical Engineering > 1. Journal Articles

qrcode

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Researcher Kim, Seon Wook photo

Kim, Seon Wook
College of Engineering (School of Electrical Engineering)
Read more

Altmetrics

Total Views & Downloads

BROWSE