Non-inclusive cache method using pipelined snoop bus
A non-inclusive cache system includes an external cache and a plurality of on-chip caches each having a set of tags associated therewith, with at least one of the on-chip caches including data which is absent from the external cache. A pipelined snoop bus is ported to each of the set of tags of the plurality of on-chip caches and transmits a snoop address to the plurality of on-chip caches. A system interface unit is responsive to a received snoop request to scan the external cache and to apply the snoop address of the snoop request to the pipelined snoop bus. A plurality of response signal lines respectively extend from the plurality of on-chip caches to the system interface unit, each of the signal lines for transmitting a snoop response from a corresponding one of the on-board caches to the system interface unit. The set of tags can be implemented by dual-porting the cache tags, or by providing a duplicate and dedicated set of snoop tags.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to microprocessor architectures, and more particularly, the present invention relates to a method using a pipelined snoop bus for maintaining coherence among caches in a multiprocessor configuration.
2. Description of the Related Art
In multiprocessor systems, processor cache memories often maintain multiple copies of the same data object. When one processor alters one copy of the data object, it is necessary to somehow update or invalidate all other copies of the object which may appear elsewhere in the multiprocessor system. Thus, to ensure coherence among multiple copies, every valid write to one copy of an object must update or invalidate every other copy of the object.
Consider, for example, the conventional multiprocessor configuration shown in FIG. 1. To the left of the vertical dashed line sits the CPU chip 102, and located to the right of the dashed line are external (EXT.) components 104. Reference numeral 106 denotes an external cache (e-cache) which is visible to all processors and which interfaces with a main memory (not shown). Access to and from the main memory can only occur through the e-cache 106.
The CPU 102 contains multiple processors which share the main memory via the common memory bus (not shown) and the e-cache 106. When one processor is granted exclusive use of a data object, the object is placed in the external cache 106 and used on the CPU chip 102 until it is taken away or evicted from the e-cache 106. Illustrated within the CPU 102 are the on-board caches 108 and 110 associated with one processor. Cache 108 is a data cache (d-cache) for storing data as it is passed back and forth from the execution units of the processor, and cache 110 is an instruction cache (i-cache) holding instructions prior to execution by the processor's execution units.
Reference numeral 112 denotes an interface unit. When a processor desires exclusive use of an object from main memory, the corresponding interface unit 112 issues a snoop request. Snooping protocols are generally designed so that all memory access requests are observed by each cache. In the event of a coherent write, each cache is responsive to the snoop request to scan its directory to identify any copies of the object which may require invalidation or updating. However, to avoid searching every cache directory upon the occurrence of every coherent write, conventional systems adopt an "inclusive" approach to cache coherency.
The basic principle underlying cache coherency schemes is that when one processor is granted exclusive use of a data object, all other processors invalidate that data in their own memories. In the conventional inclusive cache coherency structure, the e-cache includes data existing in all the other caches on the chip. That is to say, any data that exists on the on-board caches of the chip must exist in the e-cache as well. If a data object gets evicted out of the e-cache or snooped out of the e-cache, it is removed from all the on-chip caches.
As such, referring to the flowchart of FIG. 2, when a snoop comes in from some other processor (step 202), the system interface unit 112 looks to the e-cache first to scan its contents (step 204), and if the data object is not there (NO at step 206), snooping is complete since the data object cannot exist on the on-board caches of the chip. Again, this is because every time something is evicted from the e-cache, it is invalidated on each of the on-board caches. If the data is found in the e-cache (YES at step 206), then the interface unit 112 sends out a signal to invalidate the data as it exists on the on-board caches.
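The filtering behavior of FIG. 2 can be illustrated with a minimal sketch. It assumes that Python sets stand in for the tag arrays; the names `InclusiveHierarchy`, `fill`, `snoop_invalidate` and `on_chip_lookups` are hypothetical and do not appear in the specification.

```python
class InclusiveHierarchy:
    """Toy model of the conventional inclusive arrangement (hypothetical names)."""

    def __init__(self):
        self.e_cache = set()                               # e-cache tags
        self.on_chip = {"d_cache": set(), "i_cache": set()}
        self.on_chip_lookups = 0                           # counts on-board snoop work

    def fill(self, cache_name, address):
        # Inclusion: anything held on chip is also held in the e-cache.
        self.e_cache.add(address)
        self.on_chip[cache_name].add(address)

    def snoop_invalidate(self, address):
        # Steps 204/206: scan the e-cache first.
        if address not in self.e_cache:
            return False                                   # filtered, snooping is complete
        # Hit: the object is snooped out of the e-cache and of every on-board cache.
        self.e_cache.discard(address)
        for tags in self.on_chip.values():
            self.on_chip_lookups += 1
            tags.discard(address)
        return True


h = InclusiveHierarchy()
h.fill("d_cache", 0x100)
missed = h.snoop_invalidate(0x200)   # e-cache miss: on-board caches are not touched
hit = h.snoop_invalidate(0x100)      # e-cache hit: both on-board caches are visited
```

Only the second snoop reaches the on-board caches, which is the filtering effect described above.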
Since snoop processing is complete when the data is not found in the e-cache, the conventional technique of looking first to the e-cache for the data has the effect of filtering the snoop requests applied to the on-board cache memories of the processors. This in turn reduces the average bandwidth of the on-board snoop processing.
However, the conventional scheme does suffer drawbacks. For example, each time a data object is evicted from the e-cache, it must be invalidated on each of the on-board memories to preserve the inclusiveness of the configuration. If the e-cache is a large direct-mapped cache, and something is evicted, it must be evicted (invalidated) in all the lower level caches as well, even if not necessary. This often results in inefficiencies, since the e-cache might have collisions which are not present in the on-board caches. This ultimately results in a reduction in the cache hit rate.
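The collision problem can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the e-cache is modeled as a four-line direct-mapped array indexed by `addr % E_CACHE_LINES`, the d-cache is modeled as a set with room to spare, and all names are hypothetical.

```python
E_CACHE_LINES = 4  # hypothetical size, chosen only to force a collision


class InclusiveDirectMapped:
    """Inclusive hierarchy with a direct-mapped e-cache (toy model)."""

    def __init__(self):
        self.e_cache = [None] * E_CACHE_LINES   # one address per line
        self.d_cache = set()                    # on-board cache with spare capacity
        self.back_invalidations = 0

    def load(self, addr):
        index = addr % E_CACHE_LINES
        victim = self.e_cache[index]
        if victim is not None and victim != addr:
            # Inclusion forces the evicted object out of the on-board cache too,
            # even though the on-board cache itself had no collision.
            if victim in self.d_cache:
                self.d_cache.discard(victim)
                self.back_invalidations += 1
        self.e_cache[index] = addr
        self.d_cache.add(addr)


m = InclusiveDirectMapped()
m.load(0)   # maps to e-cache line 0
m.load(4)   # also maps to line 0, so address 0 is evicted everywhere
```

After the second load, address 0 is no longer in the d-cache, so a later access to it misses on chip purely because of the e-cache collision.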
Further, it is always possible for a number of snoop requests to hit the e-cache in a row which require invalidates in the on-board memories, and thus, the chip must support this "peak" bandwidth. Thus, the filtering is of limited value since over any given stretch of time, it may be necessary to carry out on-board snoop processing at full bandwidth.
SUMMARY OF THE INVENTION
It is an object of the present invention to overcome or at least minimize the drawbacks associated with the conventional snooping scheme described above.
It is a further object of the present invention to provide a snoop process for ensuring cache coherency without use of an inclusive e-cache arrangement in which all data found in the on-chip caches must be present in the e-cache as well.
According to one aspect of the invention, a non-inclusive cache method is provided for a processor system having an external cache and a plurality of on-chip caches, including: including data in at least one of the on-chip caches which is absent from the external cache; scanning the external cache and applying a snoop address of a snoop request to a pipelined snoop bus in response to receipt of the snoop request; and transmitting the snoop address to the plurality of on-board caches via the pipelined snoop bus which is ported to each of a set of tags associated with the plurality of on-chip caches.
According to another aspect of the invention, the method further includes transmitting a snoop response from a corresponding one of the on-chip caches to a system interface unit via a plurality of response signal lines respectively extending from the plurality of on-chip caches to the system interface unit.
According to yet another aspect of the invention, the same number of multiple clock cycles is expended between transmission of the snoop address on the pipelined snoop bus and receipt of the snoop response from each of the on-chip caches.
According to still another aspect of the invention, the snoop request is either one of two types, a first type being a request to invalidate a data object contained in any of the on-chip caches, and a second type being a request to check for the presence of the data object in any of the on-chip caches.
According to another aspect of the invention, the set of tags includes a set of cache tags and a dedicated set of snoop tags duplicating the set of cache tags, and the pipelined snoop bus is ported to each of the dedicated set of snoop tags.
According to still another aspect of the invention, the set of tags includes a set of dual-port cache tags.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects and advantages of the present invention will become readily apparent to those skilled in the art from the description that follows, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram for explaining the conventional inclusive-type cache coherency configuration;
FIG. 2 is a simplified flowchart for explaining the snooping protocol of the configuration shown in FIG. 1;
FIG. 3 is a block diagram for explaining the non-inclusive cache coherency configuration of the present invention; and,
FIG. 4 is a simplified flowchart for explaining the snooping protocol of the configuration shown in FIG. 3.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention, an exemplary embodiment of which is shown in FIG. 3, presents a non-inclusive implementation of a cache coherency scheme. That is, there is no requirement that all data contained in the on-chip caches also be present in the off-chip e-cache as well.
Referring to FIG. 3, to the left of the vertical dashed line sits the CPU chip 302, and located to the right of the dashed line are external (EXT.) components 304. Reference numeral 306 denotes an external cache (e-cache) which is visible to all processors and which interfaces with a main memory (not shown). Access to and from the main memory can only occur through the e-cache 306.
The CPU chip 302 contains multiple processors which share the main memory via the common memory bus (not shown) and the e-cache 306. When one processor is granted exclusive use of a data object, the object is placed in the external cache 306 and used on the CPU chip 302 until it is taken away or evicted from the e-cache 306. Illustrated within the CPU chip 302 are the on-board caches 308 and 310 associated with one processor. Cache 308 is a data cache (d-cache) for storing data as it is passed back and forth from the execution units of the processor, and cache 310 is an instruction cache (i-cache) holding instructions prior to execution by the processor's execution units. Each processor may have other types of caches as well.
Each cache includes special snoop tags 308a and 310a which effectively duplicate the corresponding cache tags. Alternatively, the cache tags themselves may be dual-ported to provide a dedicated set of tag ports for snooping. In either case, the arrangement should support a full snoop bandwidth.
Reference numeral 312 denotes a system interface unit (SIU). When a processor desires exclusive use of an object from main memory, the corresponding interface unit 312 issues a snoop request. Reference numeral 314 is a dedicated pipelined snoop bus which transmits the snoops in a pipeline, and reference numerals 316 and 318 are dedicated snoop response lines.
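The duplicate-tag alternative can be pictured with a short sketch. It assumes sets in place of tag RAMs, and the class and method names (`OnChipCache`, `processor_lookup`, `snoop_lookup`) are hypothetical. The point shown is only that the snoop port reads its own copy of the tags, so snoop lookups do not compete with processor lookups.

```python
class OnChipCache:
    """On-chip cache with a duplicate set of snoop tags (toy model)."""

    def __init__(self, name):
        self.name = name
        self.cache_tags = set()   # read by the processor pipeline
        self.snoop_tags = set()   # duplicate copy read only by the snoop bus

    def fill(self, tag):
        # Both copies are updated together so they always agree.
        self.cache_tags.add(tag)
        self.snoop_tags.add(tag)

    def invalidate(self, tag):
        self.cache_tags.discard(tag)
        self.snoop_tags.discard(tag)

    def processor_lookup(self, tag):
        return tag in self.cache_tags

    def snoop_lookup(self, tag):
        return tag in self.snoop_tags


d_cache = OnChipCache("d-cache 308")
d_cache.fill(0x40)
snoop_hit = d_cache.snoop_lookup(0x40)   # served from the duplicate tags
```

A dual-ported tag array would replace the two sets with one set reachable through two independent read ports; the behavior seen by the snoop bus would be the same.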
The operation of the invention will now be described with reference to the flowchart of FIG. 4, as well as the block diagram of FIG. 3.
A snoop arrives at the system interface unit 312 (step 402), and, in the normal manner, the tags on the e-cache 306 are checked right away (step 404). A response is returned from the e-cache 306 indicating whether one of the e-cache tags matches the snoop address. Even in the event that the object is not found in the e-cache 306, it is still necessary to examine the on-chip caches. Again, the cache coherency system of the invention is non-inclusive, meaning that data may exist in an on-chip cache and not be present in the e-cache 306.
Thus, in addition to checking the e-cache in steps 404 and 406, the SIU must check the on-chip caches as well. This is illustrated by steps 408-422 which run in parallel to steps 404 and 406.
Initially, the SIU 312 applies the snoop to the pipelined snoop bus 314 (step 408), and the snoop address is checked against the special dedicated snoop tags of each cache (step 410). It is noted that there is no filtering of snoops in the present invention, and thus, the snoop tags are designed for full bandwidth.
In one embodiment of the invention, there are two types of snoop requests. The first simply invalidates the object, if it is present, and requires no response back. The second (called a "shared response" herein) checks for the presence of an object and thus requires a response back. In the case of the former (YES at step 412), the object in the cache is invalidated (step 416) if it exists, i.e., if there is a match between the snoop address and a cache tag (YES at step 414). In the case of the latter (NO at step 412), responses are sent back to the SIU 312 on lines 316 and 318 indicating that the data object is located in the on-chip caches (step 422) or is not located in the on-chip caches (step 420). The lines 316 and 318 are preferably one bit wide, indicating the presence or absence of the snooped data object in each cache.
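The unfiltered, non-inclusive check can be sketched as below. The function name `handle_snoop` and the dictionary layout are assumptions made for illustration, and the sequential Python calls stand in for lookups that the specification describes as running in parallel.

```python
def handle_snoop(address, e_cache_tags, on_chip_snoop_tags):
    """Check the e-cache and every on-chip snoop tag set for one snoop address.

    No result from the e-cache is used to skip the on-chip check, because an
    on-chip cache may hold data that the e-cache does not.
    """
    e_cache_hit = address in e_cache_tags                    # steps 404/406
    on_chip_hits = {
        name: address in tags                                # steps 408/410
        for name, tags in on_chip_snoop_tags.items()
    }
    return e_cache_hit, on_chip_hits


# Address 0x80 lives only in the d-cache, which the non-inclusive scheme permits.
e_cache_tags = {0x10, 0x20}
on_chip_snoop_tags = {"d_cache": {0x80}, "i_cache": set()}
e_hit, chip_hits = handle_snoop(0x80, e_cache_tags, on_chip_snoop_tags)
```

Here the e-cache misses while the d-cache hits, a case that an e-cache filter would have dropped.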
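The two request types and the one-bit responses can be sketched as follows. The enum `SnoopType` and the function `snoop_cache` are hypothetical names, and a set again stands in for the snoop tags of a single cache.

```python
from enum import Enum


class SnoopType(Enum):
    INVALIDATE = 1   # first type: no response is returned
    SHARED = 2       # second type: presence check, response required


def snoop_cache(snoop_type, address, tags):
    """Apply one snoop to one cache; return a 1-bit response or None."""
    hit = address in tags                    # step 414: tag match?
    if snoop_type is SnoopType.INVALIDATE:   # YES at step 412
        if hit:
            tags.discard(address)            # step 416
        return None
    return 1 if hit else 0                   # steps 422 / 420


d_tags = {0x100, 0x140}
line_316 = snoop_cache(SnoopType.SHARED, 0x100, d_tags)   # object present
snoop_cache(SnoopType.INVALIDATE, 0x100, d_tags)          # object removed, no reply
line_316_after = snoop_cache(SnoopType.SHARED, 0x100, d_tags)
```

Calling the function once per cache yields one bit per response line, matching the one-bit-wide lines 316 and 318.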
According to the invention, the snoop bus 314 is pipelined and goes to its own dedicated port of tags. Moreover, the snoops do not arrive at the tags in one cycle. Rather, in the embodiment of the invention, it takes two cycles to get to the chip tag, and then up to three cycles to go through them, and then another two cycles to go back to the SIU 312. Reference numeral 320 is a delay which is representative of the lengthening of the loops. Since it is generally not possible to get all the way across the chip and back in one cycle, the loops are designed to have the same operational length of a specific number of cycles. In this manner, the SIU 312 knows the timing (number of cycles) of the response back from the caches. Moreover, the pipelined bus is a part of the implementation that allows the system to work at the clock frequency.
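The fixed loop latency can be illustrated with a small cycle-level sketch. It assumes the full three cycles for the tag stage, giving a loop of 2 + 3 + 2 = 7 cycles, and it models the whole loop as a single shift register whose tag comparison is evaluated when the entry leaves the register. That is a simplification, and the names `LOOP_LATENCY` and `run_pipeline` are hypothetical.

```python
from collections import deque

TO_TAGS, TAG_LOOKUP, TO_SIU = 2, 3, 2           # cycle counts taken from the text
LOOP_LATENCY = TO_TAGS + TAG_LOOKUP + TO_SIU    # 7 cycles under the assumption above


def run_pipeline(snoop_addresses, tags):
    """Issue one snoop per cycle; return (issue_cycle, response_cycle, hit_bit)."""
    pipe = deque([None] * LOOP_LATENCY)         # one slot per pipeline stage
    pending = list(snoop_addresses)
    responses = []
    cycle = 0
    while pending or any(slot is not None for slot in pipe):
        address = pending.pop(0) if pending else None
        pipe.appendleft((cycle, address) if address is not None else None)
        finished = pipe.pop()                   # entry that has completed the loop
        if finished is not None:
            issue_cycle, snooped = finished
            responses.append((issue_cycle, cycle, int(snooped in tags)))
        cycle += 1
    return responses


results = run_pipeline([0x10, 0x20, 0x30], tags={0x20})
latencies = [done - issued for issued, done, _ in results]
```

Every entry in `latencies` equals `LOOP_LATENCY`, and a new snoop enters on each cycle without waiting for earlier ones, which is how the SIU can know in advance the cycle on which each response will return.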
Thus, according to the invention, the snoop bus and duplicated snoop tag RAMs are fully pipelined. All snoops (invalidate and share) are handled by all the on-chip caches since such caches may contain data not found in the e-cache. Also, shared responses are of a fixed latency to the snoop originator or system interface unit.
SRC=https://www.google.com.hk/patents/US6061766