Recently (and, I suspect, for the whole coming year) I will be reading a great many papers... so I am starting this post to record my paper-reading notes.


Computer Architecture

Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP - A Case Study of Parallel Bioinformatics Workloads

This is a paper from HPCA 2006, but its methodology is still well worth borrowing today.

By analyzing the LLC performance of bioinformatics workloads, the paper finds that many cache lines in these workloads are shared by multiple cores, so policies that suit other workloads (such as partitioning the LLC into multiple private caches) no longer work here.

The paper first introduces their workloads (skipped here) and their simulator (a Pin-based plugin called SimCMPcache, which unfortunately was never open-sourced).

To analyze LLC usage (the sharing behavior of parallel workloads), the paper defines the following three metrics (a short sketch after the list shows how they could be computed from an access trace):

  • Shared Cache Line: a cache line shared by multiple threads (cores). These further split into read-only shared cache lines (read but never written) and read-write shared cache lines (both read and written, commonly used for inter-thread communication). The goal is to find out how many cores share each cache line, and whether each shared line is read-only or read-write.
  • Shared Access: the cache accesses that touch shared cache lines.
  • Active-Shared Access: accesses to cache lines that multiple different cores touch in turn.
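
To make these definitions concrete, here is a minimal Python sketch (not the paper's tooling; SimCMPcache was never released) that computes the three metrics from a hypothetical access trace of (core_id, line_addr, is_write) tuples:

```python
from collections import defaultdict

def sharing_metrics(trace):
    """Compute the three sharing metrics from an access trace.
    `trace` is a list of (core_id, line_addr, is_write) tuples --
    a hypothetical format, since SimCMPcache is not public."""
    readers = defaultdict(set)   # line -> cores that read it
    writers = defaultdict(set)   # line -> cores that write it

    # Pass 1: discover which cores read/write each line.
    for core, line, is_write in trace:
        (writers if is_write else readers)[line].add(core)

    def sharers(line):
        return readers.get(line, set()) | writers.get(line, set())

    # Pass 2: count Shared Accesses (touching a shared line) and
    # Active-Shared Accesses (the previous toucher was a different
    # core, i.e. cores take turns on the line).
    last_core = {}
    shared_accesses = active_shared = 0
    for core, line, is_write in trace:
        if len(sharers(line)) > 1:
            shared_accesses += 1
            if line in last_core and last_core[line] != core:
                active_shared += 1
        last_core[line] = core

    read_only_shared = [l for l in readers
                        if len(sharers(l)) > 1 and not writers.get(l)]
    read_write_shared = [l for l in writers if len(sharers(l)) > 1]
    return read_only_shared, read_write_shared, shared_accesses, active_shared
```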

The rest of the paper presents the experimental results. Start with Fig. 2 (the five bars in each case correspond to five tested cache capacities: 4MB/8MB/16MB/32MB/64MB LLC):

Managing shared last-level cache in a heterogeneous multicore processor

When a GPU and a CPU are integrated on the same die, they can share the Last Level Cache (LLC). Because the GPU runs a much larger number of threads, it can dominate accesses to the shared LLC. In many scenarios, however, CPU applications are more sensitive to the LLC than GPU applications, while GPU applications can usually tolerate relatively high memory access latency (they are not very sensitive to the LLC). For CPU cores, a change in cache miss rate is a direct indicator of cache sensitivity; on a GPU core, an increase in cache miss rate does not necessarily degrade performance, because the GPU kernel can hide memory latency by context switching among a large number of concurrently active threads. A new cache replacement policy, called HeLM, is proposed to exploit this asymmetry.

Under the HeLM policy, GPU LLC occupancy is limited by letting GPU memory accesses selectively bypass the LLC, so that cache-sensitive CPU applications can use a larger portion of the cache. Bypassing is enabled when the GPU cores exhibit sufficient TLP to tolerate memory access latency, or when the GPU application is not sensitive to LLC performance.

The authors implement three techniques: (1) measuring the LLC sensitivity of the GPU and CPU applications; (2) determining an effective TLP threshold; and (3) a Threshold Selection Algorithm (TSA) that continuously monitors workload characteristics, re-evaluates the TLP threshold at the end of every sampling period, and enforces the chosen threshold on all GPU cores.
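
A minimal sketch of the resulting bypass decision, assuming TLP is measured as the number of ready warps on a GPU core (the metric name, parameters, and logic here are my assumptions, not the paper's exact hardware):

```python
def should_bypass_llc(ready_warps: int, tlp_threshold: int,
                      gpu_cache_sensitive: bool) -> bool:
    """HeLM-style decision for a GPU memory access (sketch).
    Bypass the shared LLC when the core has enough thread-level
    parallelism to hide memory latency, or when the GPU app is
    insensitive to LLC performance anyway."""
    return ready_warps >= tlp_threshold or not gpu_cache_sensitive
```

At each sampling-period boundary, the TSA would then re-evaluate `tlp_threshold` and broadcast the new value to all GPU cores.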

The authors evaluate HeLM against DRRIP, MAT, SDBP, and TAP-RRIP, all normalized to LRU. HeLM outperforms all of these mechanisms in overall system performance.

Adaptive Insertion Policies for High Performance Caching

The traditional LRU mechanism inserts the incoming line at the MRU position (evicting the line at the LRU position), which leads to thrashing when a memory-intensive workload's working set is larger than the available cache size. The authors introduce a new mechanism, the Dynamic Insertion Policy, that protects the cache from thrashing with trivial overhead and improves the average hit rate.

1. The authors propose the LRU Insertion Policy (LIP), which inserts all incoming lines at the LRU position. The rationale: even when the working set exceeds the cache size, keeping a fraction of it resident long enough still yields cache hits, whereas pure LRU would flush the whole set.

2. Based on LIP, the authors propose the Bimodal Insertion Policy (BIP), which behaves like LIP but occasionally falls back to traditional LRU insertion. A bimodal throttle parameter e, a very small number, controls the fraction of incoming lines that are inserted at the MRU position.

3. They design experiments to measure the reduction in L2 MPKI under LIP and BIP with e = 1/64, 1/32, 1/16 on 16 different benchmarks. In general, BIP outperforms LIP, and the value of e does not affect the result much. LIP and BIP reduce MPKI by 10% or more on nine benchmarks; the remaining benchmarks are either LRU-friendly or have the knee of their MPKI curve below the cache size.

4. The authors further propose the Dynamic Insertion Policy (DIP), which dynamically chooses between BIP and LRU, whichever incurs fewer misses. It introduces the MTD as the main tag directory of the cache, plus two auxiliary tag directories: ATD-LRU, which always follows LRU, and ATD-BIP, which always follows BIP. A saturating counter, the Policy Selector (PSEL), tracks which auxiliary directory misses less. Keeping full ATDs costs too much hardware, so Dynamic Set Sampling (DSS) is used instead.

The key idea of DSS is that cache behavior can be estimated with high probability by sampling only a few sets. The paper then introduces Set Dueling, which keeps only the MTD and eliminates the separate ATD storage: a complement-select policy on the set-index bits designates each set either as a dedicated set (competing on behalf of one policy) or as a follower set (a combined sketch of BIP and set dueling follows point 5).

5. The experiments show that DIP reduces the MPKI of a 1MB 16-way L2 cache by 21% on average.
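
A minimal end-to-end sketch of DIP, combining BIP insertion with set dueling (constants such as the number of index bits, the PSEL width, and e are illustrative, not the paper's exact configuration):

```python
import random

class DIPCache:
    """Sketch of DIP for one cache: BIP vs. LRU chosen by set dueling."""

    def __init__(self, num_sets=1024, ways=16, psel_bits=10, epsilon=1/32):
        self.sets = [[] for _ in range(num_sets)]  # recency stacks, index 0 = MRU
        self.ways = ways
        self.epsilon = epsilon                     # bimodal throttle parameter
        self.psel_max = (1 << psel_bits) - 1
        self.psel = self.psel_max // 2             # saturating policy selector

    def dedicated_policy(self, s):
        # Complement-select (illustrative): compare high and low
        # index bits to dedicate a few sets to each policy.
        hi, lo = (s >> 5) & 0x1F, s & 0x1F
        if hi == lo:
            return "lru"
        if hi == (~lo & 0x1F):
            return "bip"
        return None                                # follower set

    def policy_for(self, s):
        ded = self.dedicated_policy(s)
        if ded:
            return ded
        # Follower sets use whichever policy PSEL currently favors.
        return "bip" if self.psel > self.psel_max // 2 else "lru"

    def access(self, s, line):
        stack = self.sets[s]
        if line in stack:                          # hit: promote to MRU
            stack.remove(line)
            stack.insert(0, line)
            return True
        # Miss: dedicated sets vote via the saturating PSEL counter.
        ded = self.dedicated_policy(s)
        if ded == "lru":
            self.psel = min(self.psel + 1, self.psel_max)  # vote for BIP
        elif ded == "bip":
            self.psel = max(self.psel - 1, 0)              # vote for LRU
        if len(stack) >= self.ways:
            stack.pop()                            # evict the LRU line
        # LRU inserts at MRU; BIP inserts at MRU only with small
        # probability epsilon, otherwise at LRU (LIP-style).
        if self.policy_for(s) == "lru" or random.random() < self.epsilon:
            stack.insert(0, line)
        else:
            stack.append(line)
        return False
```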

BEAR: Techniques for Mitigating Bandwidth Bloat in Gigascale DRAM Caches

The bandwidth of a DRAM cache is not spent only on data transfer. Some of it is consumed by secondary operations such as cache miss detection, fills on cache misses, and writeback lookups and content updates on dirty evictions from the last-level on-chip cache. Even prior work such as Alloy Cache still consumes 3.8x the bandwidth of an idealized DRAM cache that spends no bandwidth on secondary operations. In this work, the authors propose the Bandwidth Efficient Architecture (BEAR), which includes three components that reduce the bandwidth consumed by miss detection, miss fills, and writeback probes respectively.

The authors find that DRAM cache bandwidth bloat is attributable to six different cache operations: Hit Probe, Miss Probe, Miss Fill, Writeback Probe, Writeback Update, and Writeback Fill. Only the Hit Probe is useful bandwidth that services an LLC miss request. The authors therefore concentrate on three sources of bandwidth bloat (a sketch of the bypass decision follows the list):

  1. Bandwidth-Efficient Miss Fill. The authors propose a Bandwidth Aware Bypass (BAB) scheme that tries to free up the bandwidth consumed by Miss Fills while limiting the hit-rate degradation to a predetermined amount. It builds on the observation that not all inserted cache lines are re-referenced soon, so some Miss Fills can be bypassed without significantly hurting the hit rate.
  2. Bandwidth-Efficient Writeback Probe. A Writeback Probe is wasteful if the dirty line evicted from the on-chip LLC already exists in the DRAM cache. Since the DRAM cache is generally much larger than the on-chip LLC, a writeback is unlikely to miss in the DRAM cache, so a majority of Writeback Probes are useless. The cache architecture therefore needs a guarantee of whether a dirty line evicted from the on-chip LLC is present in the DRAM cache.
  3. Bandwidth-Efficient Miss Probe. If the architecture can guarantee whether a line is present in the DRAM cache, Miss Probe bandwidth bloat can be minimized. The authors survey previous designs such as Alloy Cache and Loh-Hill, which co-locate tag and data in the same DRAM row buffer; accessing one cache line therefore also reads the tags of adjacent lines, making extra presence information available.
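
As promised above, a hypothetical sketch of the Bandwidth Aware Bypass decision for Miss Fills; the reuse predictor, the hit-rate budget, and the probabilistic throttle are all assumptions used for illustration, not the paper's exact hardware:

```python
import random

def miss_fill_bypass(predicted_reuse: bool,
                     hit_rate_loss: float,
                     max_loss: float = 0.05,
                     bypass_prob: float = 0.9) -> bool:
    """Hypothetical Bandwidth Aware Bypass decision (sketch).

    Skip installing a missed line in the DRAM cache when it is
    not predicted to be re-referenced soon, but stop bypassing
    once the observed hit-rate degradation exceeds a fixed budget."""
    if predicted_reuse:
        return False       # install lines that are likely to hit
    if hit_rate_loss >= max_loss:
        return False       # budget exhausted: fall back to always-fill
    return random.random() < bypass_prob
```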

Finally, the authors integrate these components into BEAR. The experiments show that BEAR improves performance over the Alloy Cache by 10.1% and reduces DRAM cache bandwidth consumption by 32%.

BATMAN: Techniques for Maximizing System Bandwidth of Memory Systems with Stacked-DRAM

A tiered memory system combines two components: 3D-DRAM with high bandwidth and commodity DRAM with high capacity. Previous work tries to maximize the utilization of the 3D-DRAM bandwidth, but the commodity DRAM actually contributes a significant fraction of the overall system bandwidth, so those designs use the total bandwidth of the tiered system inefficiently.

For a tiered memory system in which the Far Memory (DDR-based DRAM) accounts for a significant fraction of the overall bandwidth, distributing memory accesses across both memories can substantially improve performance. The authors therefore propose Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that explicitly manages data placement so that memory accesses are distributed in proportion to the bandwidth ratio of the Near Memory and the Far Memory. The main idea: when the fraction of accesses served by the 3D-DRAM exceeds a threshold, BATMAN moves data from the 3D-DRAM to the commodity DRAM.
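
A minimal sketch of BATMAN's rebalancing check, assuming the hardware counts accesses to each tier over an interval (the constants and the migration granularity are assumptions):

```python
def batman_check(near_accesses: int, far_accesses: int,
                 near_bw_share: float = 0.8) -> str:
    """Illustrative BATMAN rebalancing check.

    near_bw_share is Near Memory's share of total system
    bandwidth (e.g., 0.8 if 3D-DRAM : commodity DRAM bandwidth
    is 4:1). If Near Memory serves a larger share of accesses
    than its share of bandwidth, some hot data should move to
    Far Memory so both tiers can stream in parallel."""
    total = near_accesses + far_accesses
    if total and near_accesses / total > near_bw_share:
        return "migrate data from Near to Far Memory"
    return "no action"
```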

The evaluation shows that BATMAN improves performance by 10% and energy-delay product by 13%, while incurring only eight bytes of hardware overhead and requiring negligible software modification.
