Tuesday, February 7, 2017

ELEMENTS OF CACHE DESIGN

    This section provides an overview of cache design parameters and reports some typical results. We occasionally refer to the use of caches in high-performance computing (HPC). Although there are a large number of cache implementations, there are a few basic design elements that serve to classify and differentiate cache architectures. Table 1 lists the key elements.


Table 1 Elements of Cache Design

  • Cache Size
  • Mapping Function (direct, associative, set associative)
  • Replacement Algorithm (LRU, FIFO, LFU, random)
  • Write Policy (write through)
  • Line Size
  • Number of Caches (multilevel; unified versus split)

Cache Size


    The first element, cache size, has already been discussed. We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone, and large enough so that the overall average access time is close to that of the cache alone. Table 4.3 lists the cache sizes of some current and past processors.


Mapping Function


    Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Three techniques can be used: direct, associative, and set associative. For all three cases, the example includes the following elements:
  • The cache can hold 64 Kbytes.
  • Data is transferred between main memory and the cache in blocks of 4 bytes each. This means that the cache is organized as 16K = 2^14 lines of 4 bytes each.
  • The main memory consists of 16 Mbytes, with each byte directly addressable by a 24-bit address (2^24 = 16M). Thus, for mapping purposes, we can consider main memory to consist of 4M blocks of 4 bytes each.
    

    Direct mapping maps each block of main memory into only one possible cache line. Figure 3.1 illustrates the general mechanism. The mapping is expressed as


i = j modulo m

Where i = cache line number
      j = main memory block number
      m = number of lines in the cache




Figure 3.1 Direct-Mapping Cache Organization [HWAN93]

    The effect of this mapping is that blocks of main memory are assigned to lines of the cache as follows:

Cache line        Main memory blocks assigned
0                 0, m, 2m, 3m, ...
1                 1, m + 1, 2m + 1, ...
...
m - 1             m - 1, 2m - 1, 3m - 1, ...

    Figure 3.2 shows our example system using direct mapping. In the example, m = 16K = 2^14 and i = j modulo 2^14.



Figure 3.2 Direct Mapping Example

    The direct mapping technique is simple and inexpensive to implement. Its main disadvantage is that there is a fixed cache location for any given block. Thus, if a program happens to reference words repeatedly from two different blocks that map into the same line, the blocks will be continually swapped in the cache, and the hit ratio will be low.
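    As a concrete illustration (my own sketch, not from the text), the following Python fragment models direct mapping for the example system above: 2 address bits select the byte within a 4-byte block, 14 bits select one of the 16K cache lines, and the remaining 8 bits form the tag that is stored with the line.

    # Direct mapping for the example system:
    #   64-Kbyte cache, 4-byte lines      -> 16K = 2^14 lines
    #   16-Mbyte memory, 24-bit addresses -> 4M  = 2^22 blocks
    WORD_BITS = 2                            # 4 bytes per block
    LINE_BITS = 14                           # 16K lines in the cache
    TAG_BITS = 24 - LINE_BITS - WORD_BITS    # 8-bit tag
    NUM_LINES = 1 << LINE_BITS

    def direct_map(address):
        """Split a 24-bit byte address into (tag, line, word)."""
        word = address & ((1 << WORD_BITS) - 1)
        block = address >> WORD_BITS         # main memory block number j
        line = block % NUM_LINES             # i = j modulo m
        tag = block >> LINE_BITS             # high-order 8 bits of the address
        return tag, line, word

    # Two blocks 64 Kbytes apart map to the same line with different tags,
    # so under direct mapping they can never be cached at the same time.
    print(direct_map(0x000004))              # (0, 1, 0)
    print(direct_map(0x010004))              # (1, 1, 0)

The disadvantage noted above shows up directly: a program alternating between addresses 0x000004 and 0x010004 would miss on every access, because both blocks compete for line 1.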

    Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache. In this case, the cache control logic interprets a memory address simply as a tag and a word field; to determine whether a block is in the cache, the control logic must examine every line's tag for a match. Figure 3.3 shows our example using associative mapping.



Figure 3.3 Associative Mapping Example

    With associative mapping, there is flexibility as to which block to replace when a new block is read into the cache. Replacement algorithms, discussed later in this section, are designed to maximize the hit ratio.
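    As a rough sketch of the lookup (again my own illustration, not the book's), with associative mapping the 24-bit address is treated as a 22-bit tag plus a 2-bit word field, and the tag must be compared against the tag of every occupied line:

    # Fully associative lookup: the tag is the entire 22-bit block number,
    # and a block may occupy any cache line.
    WORD_BITS = 2

    def split_associative(address):
        """Split a 24-bit address into (tag, word) -- there is no line field."""
        word = address & ((1 << WORD_BITS) - 1)
        tag = address >> WORD_BITS           # 22-bit block number
        return tag, word

    def lookup(cache_tags, address):
        """cache_tags holds the tag stored in each line (None if empty)."""
        tag, _ = split_associative(address)
        # Real hardware compares against every line's tag in parallel;
        # here the comparison is modelled as a simple search.
        for line, stored in enumerate(cache_tags):
            if stored == tag:
                return line                  # hit
        return None                          # miss: any line may be replaced

    tags = [None] * 4                        # tiny 4-line cache for illustration
    tags[3] = split_associative(0x000004)[0]
    print(lookup(tags, 0x000004))            # 3 (hit)
    print(lookup(tags, 0x010004))            # None (miss)

The price of this flexibility is the complex circuitry required to examine the tags of all cache lines in parallel.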

    Set associative mapping is a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages. In this case, the cache consists of v sets, each of which consists of k lines. The relationships are


m = v x k
i = j modulo v

Where i = cache set number
      j = main memory block number
      m = number of lines in the cache
      v = number of sets
      k = number of lines in each set

    Figure 3.5 shows our example using set associative mapping with two lines in each set, referred to as two-way set associative.
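    For the example system with two lines per set, the 16K lines form v = 8K = 2^13 sets, so the 24-bit address splits into a 9-bit tag, a 13-bit set field, and a 2-bit word field. A minimal sketch, following the formulas above (my own illustration):

    # Two-way set associative mapping for the example system:
    #   m = 16K lines, k = 2 lines per set  ->  v = m / k = 8K = 2^13 sets
    WORD_BITS = 2
    SET_BITS = 13
    TAG_BITS = 24 - SET_BITS - WORD_BITS     # 9-bit tag
    NUM_SETS = 1 << SET_BITS

    def set_assoc_map(address):
        """Split a 24-bit address into (tag, set, word)."""
        word = address & ((1 << WORD_BITS) - 1)
        block = address >> WORD_BITS         # main memory block number j
        set_no = block % NUM_SETS            # i = j modulo v
        tag = block >> SET_BITS
        return tag, set_no, word

    # Blocks 32 Kbytes apart fall into the same set, but unlike direct
    # mapping the set's two lines allow both blocks to be cached at once.
    print(set_assoc_map(0x000004))           # (0, 1, 0)
    print(set_assoc_map(0x008004))           # (1, 1, 0)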


Replacement Algorithms


    When a new block is brought into the cache, one of the existing blocks must be replaced. A number of algorithms have been tried; we mention four of the most common. Probably the most effective is least recently used (LRU): replace the block in the set that has been in the cache longest with no reference to it.


Figure 3.4 k-Way Set Associative Cache Organization


Figure 3.5 Two-Way Set Associative Mapping Example

    For two-way set associative, this is easily implemented. Each line includes a USE bit. When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in that set is set to 0. When a block is to be read into the set, the line whose USE bit is 0 is used. LRU should give the best hit ratio. First-in-first-out (FIFO): replace the block in the set that has been in the cache longest; FIFO is easily implemented. Least frequently used (LFU): replace the block in the set that has experienced the fewest references. Finally, a technique not based on usage is to pick a line at random from among the candidate lines.
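    The USE-bit scheme for a single two-way set might be modelled as follows (an illustrative sketch, not a hardware description):

    # LRU for one two-way set, using a single USE bit per line.
    class TwoWaySet:
        def __init__(self):
            self.tags = [None, None]   # tag held in each of the two lines
            self.use = [0, 0]          # USE bit: 1 = most recently referenced

        def _touch(self, line):
            # Referencing a line sets its USE bit and clears the other one.
            self.use[line] = 1
            self.use[1 - line] = 0

        def access(self, tag):
            """Return 'hit' or 'miss'; on a miss, replace the LRU line."""
            for line in (0, 1):
                if self.tags[line] == tag:
                    self._touch(line)
                    return "hit"
            victim = 0 if self.use[0] == 0 else 1   # line whose USE bit is 0
            self.tags[victim] = tag
            self._touch(victim)
            return "miss"

    s = TwoWaySet()
    print(s.access("A"), s.access("B"), s.access("A"), s.access("C"))
    # miss miss hit miss -- "C" replaces "B", the least recently used line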


Write Policy


    Before a block that is resident in the cache can be replaced, it is necessary to consider whether it has been altered in the cache but not in main memory. If it has not, then the old block in the cache may be overwritten. A variety of write policies, with performance and economic trade-offs, is possible. There are two problems to contend with. First, more than one device may have access to main memory. A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache. In that case, if a word is altered in one cache, it could conceivably invalidate a word in the other caches.

    The simplest technique is called write through. Using this technique, all write operations are made to main memory as well as to the cache, ensuring that main memory is always valid. Any other processor-cache module can monitor traffic to main memory to maintain consistency within its own cache.
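    The idea can be sketched in a few lines (an illustrative model only, with Python dictionaries standing in for the cache and main memory):

    # Write through: every write updates both the cache and main memory,
    # so main memory always holds valid data (at the cost of extra traffic).
    class WriteThroughCache:
        def __init__(self, main_memory):
            self.main_memory = main_memory       # dict: address -> value
            self.lines = {}                      # cached copies

        def write(self, address, value):
            self.lines[address] = value          # update the cache ...
            self.main_memory[address] = value    # ... and main memory, immediately

        def read(self, address):
            if address not in self.lines:        # miss: fetch from main memory
                self.lines[address] = self.main_memory[address]
            return self.lines[address]

    memory = {0x100: 0}
    cache = WriteThroughCache(memory)
    cache.write(0x100, 42)
    print(memory[0x100])                         # 42 -- memory updated on the write

The main disadvantage of this approach is that it generates substantial memory traffic, since every write goes out to main memory.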

    In a bus organization in which more than one device has a cache and main memory is shared, a new problem is introduced. Even if a write-through policy is used, the other caches may contain invalid data. A system that prevents this problem is said to maintain cache coherency. Possible approaches to cache coherency include the following.
  • Bus watching with write through: Each cache controller monitors the address lines to detect write operations to memory by other bus masters (see the sketch after this list).
  • Hardware transparency: Additional hardware is used to ensure that all updates to main memory via cache are reflected in all caches. 
  • Noncacheable memory: Only a portion of main memory is shared by more than one processor, and this is designated as noncacheable.
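    The first of these approaches might be sketched as follows (a simplified model of snooping with write-through caches; the class and method names are my own):

    # Bus watching with write through: each cache controller snoops writes
    # issued by other bus masters and invalidates its own stale copies.
    class Bus:
        def __init__(self):
            self.caches = []
            self.main_memory = {}

        def broadcast_write(self, writer, address, value):
            self.main_memory[address] = value            # write through
            for cache in self.caches:
                if cache is not writer:
                    cache.snoop(address)                 # invalidate other copies

    class SnoopingCache:
        def __init__(self, bus):
            self.lines = {}                              # address -> value
            self.bus = bus
            bus.caches.append(self)

        def write(self, address, value):
            self.lines[address] = value
            self.bus.broadcast_write(self, address, value)

        def snoop(self, address):
            # Another master wrote this address: our copy is now invalid.
            self.lines.pop(address, None)

    bus = Bus()
    c1, c2 = SnoopingCache(bus), SnoopingCache(bus)
    c2.lines[0x200] = 1                                  # c2 holds a copy
    c1.write(0x200, 99)                                  # c1's write is snooped by c2
    print(0x200 in c2.lines)                             # False -- c2's copy invalidated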

Line Size


    When a block of data is retrieved and placed in the cache, the desired word and some number of adjacent words are retrieved. As the block size increases, the hit ratio will at first increase because of the principle of locality, but it will begin to decrease as the block becomes even larger, for two reasons:
  • Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after they are fetched.
  • As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

Number of Caches


    When caches were originally introduced, the typical system had a single cache. More recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches.
    
    Multilevel Caches
    
    As logic density has increased, it has become possible to have a cache on the same chip as the processor: the on-chip cache. Because of the short data paths internal to the processor, compared with bus lengths, on-chip cache accesses will complete appreciably faster than would even zero-wait-state bus cycles.

    The inclusion of an on-chip cache leaves open the question of whether an off-chip, or external, cache is still desirable. The simplest such organization is known as a two-level cache, with the internal cache designated as level 1 (L1) and the external cache designated as level 2 (L2). If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus. Due to the typically slow bus speed and slow memory access time, this results in poor performance. On the other hand, if an L2 SRAM (static RAM) cache is used, then frequently the missing information can be quickly retrieved. If the SRAM is fast enough to match the bus speed, then the data can be accessed using a zero-wait-state transaction, the fastest type of bus transfer.

    Two features of contemporary cache design for multilevel caches are noteworthy. First, for an off-chip L2 cache, many designs do not use the system bus as the path for transfer between the L2 cache and the processor, but use a separate data path, so as to reduce the burden on the system bus. Second, with the continued shrinkage of processor components, a number of processors now incorporate the L2 cache on the processor chip, improving performance.

    The potential savings due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches. Several studies have shown that, in general, the use of a second-level cache does improve performance (e.g., see [AZIM92], [NOVI93], [HAND98]).
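    One way to see this dependence on both hit rates is a simple average-access-time calculation; the latencies below are illustrative assumptions, not figures from the studies cited:

    # Average memory access time for a two-level hierarchy (simplified model:
    # an L1 hit costs t_l1, an L1 miss costs t_l2 on an L2 hit, else t_mem).
    def avg_access_time(h1, h2, t_l1=1.0, t_l2=5.0, t_mem=60.0):
        """h1: L1 hit rate; h2: L2 hit rate for accesses that miss in L1.
        All times are in processor clock cycles (assumed values)."""
        return h1 * t_l1 + (1 - h1) * (h2 * t_l2 + (1 - h2) * t_mem)

    print(avg_access_time(h1=0.95, h2=0.0))   # no effective L2: 3.95 cycles
    print(avg_access_time(h1=0.95, h2=0.8))   # useful L2:       1.75 cycles

Even a modest L2 hit rate cuts the average access time substantially in this model, because it converts slow main memory accesses into much faster SRAM accesses.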



    Unified Versus Split Caches

    When the on-chip cache first made an appearance, many of the designs consisted of a single cache used to store references to both data and instructions. More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data.

    There are two potential advantages of a unified cache: 

  • For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically. 
  • Only one cache needs to be designed and implemented. 
    Despite these advantages, the trend is toward split caches, particularly for superscalar machines such as the Pentium and PowerPC, which emphasize parallel instruction execution and the prefetching of predicted future instructions. The key advantage of the split cache design is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit. This is important in any design that relies on the pipelining of instructions.

Reference: William Stallings. (2003). Computer Organization & Architecture: Designing for Performance (6th ed.), Elements of Cache Design (pp. 108-119). Upper Saddle River, NJ: Pearson.
