PatentTips - Write Combining Buffer for Sequentially Addressed Partial Line Operations

SUMMARY OF THE INVENTION

The present invention pertains to a write combining buffer for use in a microprocessor. The microprocessor fetches data and instructions which are stored by an external main memory. The data and instructions are sent over a bus. The microprocessor then processes the data according to the instructions received. When the microprocessor completes a task, it writes the data back to the main memory for storage. In the present invention, a write combining buffer is used for combining the data of at least two write commands into a single data set, wherein the combined data set is transmitted over the bus in one clock cycle rather than two or more clock cycles. Thereby, bus traffic is minimized.

In the currently preferred embodiment, the write combining buffer is comprised of a single line having a 32-byte data portion, a tag portion, and a validity portion. The tag entry specifies the address corresponding to the data currently stored in the data portion. There is one valid bit corresponding to each byte of the data portion, which specifies whether that byte currently contains useful data. So long as subsequent write operations to the write combining buffer result in hits, the data is written to the buffer's data portion. In other words, write hits to the write combining buffer result in the data being combined with previous write data. But when a miss occurs, the line is reallocated, and the old data is written to the main memory. Only those bytes which have been written to, as indicated by the valid bits, are written back to the main memory. Each time the write combining buffer is allocated, the valid bits are cleared. Thereupon, the new data and its address are written to the write combining buffer.

DETAILED DESCRIPTION

Referring to FIG. 1, the computer system upon which a preferred embodiment of the present invention is implemented is shown as 100. Computer system 100 comprises a bus or other communication means 101 for communicating information, and a processing means 102 coupled with bus 101 for processing information. Processor 102 includes, but is not limited to microprocessors such as the Intel™ architecture microprocessors, PowerPC™, Alpha™, etc. System 100 further comprises a random access memory (RAM) or other dynamic storage device 104 (referred to as main memory), coupled to bus 101 for storing information and instructions to be executed by processor 102. Main memory 104 also may be used for storing temporary variables or other intermediate information during execution of instructions by processor 102. Computer system 100 also comprises a read only memory (ROM) and/or other static storage device 106 coupled to bus 101 for storing static information and instructions for processor 102, and a data storage device 107 such as a magnetic disk or optical disk and its corresponding disk drive. Data storage device 107 is coupled to bus 101 for storing information and instructions.

Referring now to FIG. 2, a block diagram illustrating an exemplary processor 102 incorporating the teachings of the present invention is shown. The exemplary processor 102 comprises an execution unit 201, a bus controller 202, a data cache controller 203, a data cache unit 204, and an instruction fetch and issue unit 205 with an integrated instruction cache 206. The elements 201-206 are coupled to each other as illustrated. Together they cooperate to fetch, issue, execute, and save execution results of instructions in a pipelined manner.

In the currently preferred embodiment, a write combining buffer 208 is implemented as part of the data cache unit 204. The write combining buffer 208 collects write operations which belong to the same cache line address. Several small write operations (e.g., string moves, string copies, bit block transfers in graphics applications, etc.) are combined by the write combining buffer 208 into a single, larger write operation. It is this larger write that is eventually sent over the bus, thereby maximizing the efficiency of bus transmission. In one embodiment, the write combining buffer 208 resides within the fill buffer 211.

The write combining function is an architectural extension to the cache protocol. Special micro-operations (uops are simple instructions, each including a micro-opcode, source fields, a destination, immediates, and flags) are defined for string store operations and for stores to USWC memory types, causing the data cache unit to default to a write combining protocol instead of the standard protocol. Write combining is allowed for the following memory types: Uncached Speculatable Write Combining (USWC), Writeback (WB), and Restricted Caching (RC). The USWC memory type is intended for situations where use of the cache should be avoided for performance reasons. The USWC memory type is also intended for situations where data must eventually be flushed out of the processor, but where delaying and combining writes is permissible for a short time. The WB memory type is conventional writeback cached memory. The RC memory type is intended for use in frame buffers. It is essentially the same as the WB memory type, except that no more than a given (e.g., 32K) amount of RC memory will ever be cached. The write combining protocol still maintains coherency with external writes. It avoids the penalty of coherence by deferring the coherence actions until eviction.

An implementation of the current invention is possible whereby there are multiple WC buffers, permitting interleaved writes to different addresses to be combined. This would create a weakly ordered memory model, however, and therefore could not be used by some existing programs.

The currently preferred embodiment has only one WC buffer, which is evicted on a miss. This permits the WC buffer to be used when write ordering is required; for example, it permits the invention to be used in the existing Intel™ Architecture block memory instructions REP STOSX and REP MOVSX.

The currently preferred embodiment uses a structure that already exists in the Intel™ Architecture microprocessor, the fill buffers. The fill buffers are a set of several (4) cache lines with byte granularity valid and dirty bits, used by the out-of-order microprocessor to create a non-blocking cache. The WC buffer is a single fill buffer marked to permit WC stores to be merged. When evicted, the WC fill buffer waits until normal fill buffer eviction.

In the currently preferred embodiment, only one write combining buffer is implemented. Physically, any fill buffer can be used as the write combining buffer. Since only one logical write combining buffer is provided, when a second write combining buffer is needed, an eviction process is initiated. During eviction, one of the following actions can occur. If all the bytes are written, and the write combining buffer is of a cacheable (i.e., RC or WB) type, then the data cache unit requests an AllocM transaction to the bus. An AllocM transaction is a bus transaction that causes all other processors to discard stale copies of the cache line without supplying the data. When this transaction is completed, the line is placed in the cache. If not all the bytes are written, and the write combining buffer is of a cacheable (i.e., RC or WB) type, then the data cache unit requests a read-for-ownership (RFO) transaction to the bus. The RFO transaction entails a read directing any other processor to supply data and relinquish ownership. Thereupon, the line is placed in the cache. If all the bytes are written and the write combining buffer is of the USWC type, then the data cache unit requests a writeback transaction to the bus. If not all the bytes are written, and the write combining buffer is of the USWC type, then the data cache unit evicts the write combining buffer. The eviction is performed as a sequence of up to four partial writes of four sets of data. The data cache unit supplies eight byte enables to the bus with each set. If a data set does not contain any written bytes, the data cache unit does not issue a partial write for that set.
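The four eviction cases above can be summarized as a small decision function. The following is a minimal sketch, not the patented logic itself: the names MemType and eviction_actions are hypothetical, the transactions are represented as plain tuples, and the bit order of the byte enables (bit i standing for byte i of an 8-byte set) is an assumption that the text does not specify.

```python
from enum import Enum

LINE_SIZE = 32   # bytes per write combining line, per the text
CHUNK_SIZE = 8   # each partial write carries eight byte enables


class MemType(Enum):
    USWC = "USWC"
    WB = "WB"
    RC = "RC"


def eviction_actions(mem_type, valid):
    """Return the bus actions used to evict one write combining line.

    `valid` holds 32 booleans, one per byte of the line.
    """
    if len(valid) != LINE_SIZE:
        raise ValueError("expected one valid flag per byte of the line")

    fully_written = all(valid)
    if mem_type in (MemType.WB, MemType.RC):
        # Cacheable line: AllocM when every byte was written,
        # otherwise read-for-ownership to obtain the missing bytes.
        return [("AllocM",)] if fully_written else [("RFO",)]

    if fully_written:
        return [("Writeback",)]

    # Partially written USWC line: up to four 8-byte partial writes,
    # skipping any set that contains no written bytes.
    actions = []
    for start in range(0, LINE_SIZE, CHUNK_SIZE):
        enables = valid[start:start + CHUNK_SIZE]
        if any(enables):
            # Assumed encoding: bit i of the mask enables byte i of the set.
            mask = sum(1 << i for i, written in enumerate(enables) if written)
            actions.append(("PartialWrite", start, mask))
    return actions
```

For instance, a USWC line in which only bytes 0, 1, and 17 were written would produce two partial writes in this model, one for the set starting at offset 0 and one for the set starting at offset 16; the two untouched sets generate no bus traffic.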

FIG. 3 shows a more detailed block diagram of the data cache unit 300. The data cache unit 300 includes a level 1 data cache 301 and a write combining buffer 302. The level 1 data cache 301 is a standard SRAM writeback cache memory. In a writeback configuration, the CPU updates the cache during a write operation. The actual main memory is updated when the line is discarded from the cache. Level 1 data cache 301 includes a data cache RAM portion 303 which is used to store copies of data or instructions. A separate tag RAM portion 304 is used as a directory of entries in data RAM portion 303. A number of tags corresponding to each entry are stored in tag RAM 304. A tag is that portion of the line address that is used by the cache's address mapping algorithm to determine whether the line is in the cache.

Write combining buffer 302 is comprised of a single line having a data portion 305, a tag portion 306, and a validity portion 307. Data portion 305 can store up to 32 bytes of user data. Not every byte need contain data. For example, the execution unit may choose to store data in alternating bytes. The validity portion 307 is used to store valid bits corresponding to each data byte of data portion 305. The valid bits indicate which of the bytes of data portion 305 contain useful data. In the above example wherein data is stored in alternating bytes, every other valid bit is set. In this manner, when the line in the write combining buffer 302 is written to the level 1 data cache 301, only those bytes containing valid data are stored.
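The alternating-byte example can be illustrated with a short sketch. It assumes the 32 valid bits are packed into one integer with bit i standing for byte i; that packing, and the helper names valid_mask and bytes_to_store, are illustrative choices rather than anything stated in the text.

```python
LINE_SIZE = 32  # bytes in the data portion, per the text


def valid_mask(written_offsets):
    """Build a 32-bit validity mask from the byte offsets that were written."""
    mask = 0
    for offset in written_offsets:
        if not 0 <= offset < LINE_SIZE:
            raise ValueError(f"offset {offset} is outside the line")
        mask |= 1 << offset
    return mask


def bytes_to_store(data, mask):
    """Return {offset: byte} for only those bytes whose valid bit is set."""
    return {off: data[off] for off in range(LINE_SIZE) if (mask >> off) & 1}


# Data stored in alternating bytes: offsets 0, 2, 4, ..., 30.
alternating = valid_mask(range(0, LINE_SIZE, 2))
line = bytes(range(LINE_SIZE))
kept = bytes_to_store(line, alternating)
```

In this model every other valid bit is set, so only 16 of the 32 bytes in the line would be stored when the line is written out.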

When data is being written to the data cache unit 300, there are three possible scenarios that can occur. First, there could be a level 1 data cache hit. A cache hit is defined as a data or instruction cycle in which the information being read or written is currently stored in that cache. In this situation, the data is directly copied to the level 1 data cache 301. For example, a write combine store byte uop (i.e., WC Stob instruction) having an address of <addr1> and data <data1> falls in this scenario because the tag column 304 of level 1 data cache 301 currently contains a tag of <addr1>. Thus, <data1> is stored in the data portion 303 of the level 1 data cache 301.

In the second scenario, the write operation results in a hit of the write combining buffer 302. In this case, the data is stored in the write combining data portion 305. For each byte that is written, the corresponding valid bit is set. For example, a write combine store byte uop (i.e., WC Stob) having an address of <addr2> has its data <data2> written to the data portion 305 of write combining buffer 302 because there is a miss of <addr2> in the level 1 data cache 301, and there is a hit of <addr2> in the write combining buffer 302. Any subsequent write operations that fall within the 32-byte data field will be written to the write combining buffer 302 until that line eventually is evicted and a new address (i.e., tag) is assigned. For example, suppose that the tag of the write combining buffer contains the address 0X12340. Subsequently, a write combine store word uop (i.e., WC Stow) to 0X12346 is received. Since the 0X12346 address falls within the 32-byte range of 0X12340, that word is stored in the write combining buffer. In contrast, if a WC Stow to address 0X12351 request is received, the write combining buffer must be reallocated because the address falls outside the 32-byte boundary.
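The hit test in this scenario amounts to comparing the line portion of the store address with the buffer's tag. The sketch below assumes that lines are aligned to 32-byte boundaries and that a store must lie entirely within one line to hit; both are assumptions, since the text does not say how the 32-byte range is anchored. The function names and the address 0x12380 are illustrative.

```python
LINE_SIZE = 32  # bytes covered by one tag, per the text


def line_tag(addr):
    """Return the address of the aligned 32-byte line containing addr."""
    return addr & ~(LINE_SIZE - 1)


def is_wc_hit(buffer_tag, addr, size=1):
    """True if a store of `size` bytes at `addr` lies wholly in the buffered line."""
    first = line_tag(addr)
    last = line_tag(addr + size - 1)
    return first == buffer_tag and last == buffer_tag


# A two-byte store to 0x12346 combines with a buffer tagged 0x12340,
# while a store to 0x12380 (a different line) forces reallocation.
hit = is_wc_hit(0x12340, 0x12346, size=2)
miss = is_wc_hit(0x12340, 0x12380, size=2)
```

Masking off the low five address bits is one common way to derive a line tag for a 32-byte line; a real implementation would compare only the upper address bits in hardware.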

In the third scenario, there is a complete miss to both the level 1 data cache 301 and the write combining buffer 302. For this scenario, the contents in the write combining buffer 302 are purged to the main memory (not shown). All of the valid bits are then cleared. The new data is stored in the data portion 305; its address is stored in the tag portion 306; and the valid bits corresponding to those bytes of data which were written are set. For example, a write combine store byte uop (i.e., WC Stob) having an address of <addr3> will result in a miss of both the level 1 data cache 301 and the write combining buffer 302. Hence, the <data2> currently stored in write combining buffer 302 is written to the main memory at a location specified by <addr2>. Thereupon, <data3> can be stored in the data portion 305 of write combining buffer 302. Its address <addr3> is stored in the tag portion 306, and the appropriate valid bit(s) are set. It should be noted that the execution of the write combining procedure is transparent to the application program.

In one embodiment, the DCU 300 includes the fill buffers 308. Fill buffers 308 are comprised of multiple lines 309. Each of these multiple lines 309 is divided into state, data, tag, and validity fields. The state for one of these lines can be write combining (WC) 311.

FIG. 4 is a flow chart showing the steps of the write combining procedure of the present invention. The processor continues its execution process until a write combine (WC) store request is encountered, step 401. A WC store request can be generated in several ways. In one instance, a special write combine uop is used to indicate that a store is to be write combined. In another instance, a particular block of memory is designated for applying the write combining feature. In other words, any store having an address that falls within that particular block of memory is designated as a write combine store.

Once a write combine store request is received, a determination is made as to whether the request results in a write combining buffer hit, step 402. This is accomplished by comparing the current tag of the write combining buffer with the store's address. If there is a hit (i.e., the two addresses match), the store is made to the write combining buffer, step 403. The corresponding valid bit(s) are set, step 404. The processor then waits for the next write combine store request, step 401. So long as subsequent stores are to that same address (i.e., WC store falls within the 32-byte range), the store continues to be to the write combining buffer.

Otherwise, if it is determined in step 402 that the store results in a miss to the write combining buffer, then steps 405-410 are performed. In step 405, the processor generates a request for allocation. An allocation refers to the assignment of a new value to a tag. This occurs during a line fill operation, wherein the information is transferred into the cache from the next outer level (e.g., from the write combining buffer to the level 1 cache, level 2 cache, or main memory). The old WC buffer (if any) is marked as being obsolete, step 406. The write combining buffer is allocated with the new address, step 407. The valid bits are all cleared, step 408. Next, steps 403 and 404 described above are executed. Furthermore, following step 406, the old contents of the WC buffer are written to memory, step 409. Thereupon, the old WC buffer can be reused, step 410.
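The flow of FIG. 4 can be modeled in software as follows. This is a simplified sketch under several assumptions: there is a single buffer, lines are aligned to 32-byte boundaries, main memory is stood in for by a dictionary, and the level 1 cache and bus transactions are ignored. The class and attribute names are hypothetical; the step numbers in the comments refer to the steps described above.

```python
LINE_SIZE = 32  # bytes per write combining line, per the text


class WriteCombiningBuffer:
    """Toy single-line model of the write combining procedure of FIG. 4."""

    def __init__(self):
        self.tag = None                      # no line allocated yet
        self.data = bytearray(LINE_SIZE)
        self.valid = [False] * LINE_SIZE
        self.memory = {}                     # stand-in for main memory: addr -> byte

    def store(self, addr, payload):
        """Handle one WC store request (step 401); return True on a buffer hit."""
        tag = addr & ~(LINE_SIZE - 1)
        offset = addr - tag
        if offset + len(payload) > LINE_SIZE:
            raise ValueError("store crosses a line boundary")

        hit = tag == self.tag                # step 402: compare tag with address
        if not hit:
            self._write_back_old_line()      # steps 406 and 409: retire old contents
            self.tag = tag                   # step 407: allocate with the new address
            self.valid = [False] * LINE_SIZE # step 408: clear all valid bits

        self.data[offset:offset + len(payload)] = payload   # step 403
        for i in range(offset, offset + len(payload)):      # step 404
            self.valid[i] = True
        return hit

    def _write_back_old_line(self):
        """Write only the valid bytes of the old line to memory."""
        if self.tag is None:
            return
        for i, is_valid in enumerate(self.valid):
            if is_valid:
                self.memory[self.tag + i] = self.data[i]
```

A short usage example: two stores to the same line are combined, and a third store to a different line evicts the first.

```python
wc = WriteCombiningBuffer()
wc.store(0x12340, b"\x11")        # miss: allocates the line tagged 0x12340
wc.store(0x12346, b"\x22\x33")    # hit: combined with the earlier store
wc.store(0x12380, b"\x44")        # miss: old line is written back first
```

After the third store, only the three bytes that were actually written reach the simulated memory, which mirrors the role of the valid bits during eviction.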

FIG. 5 shows the write combining buffer 302 of FIG. 3 in greater detail. It can be seen that the write combining buffer 302 is relatively small. It is comprised of a single line. In FIG. 5, data is written to bytes 0, 2, . . . 29, and 30, as indicated by the shading. Hence, bits 0, 2, . . . 29, and 30 of the validity field 307 are set. All the other valid bits are not set. The tag field 306 specifies the addresses corresponding to the data stored in data field 305.

In alternative embodiments, multiple write combining buffers may be utilized. Furthermore, the present invention can be applied to non-cached as well as single or multiple cached systems as well as write through and writeback caches.
