### Computer Architecture

Chapter #5 (b)

# Memory Systems

#### Διονύσης Πνευματικάτος pnevmati@cslab.ece.ntua.gr

5th semester, ΣΗΜΜΥ – Academic Year: 2019-20 <u>Section 3 (ΠΑΠΑΔ-Ω)</u>

http://www.cslab.ece.ntua.gr/courses/comparch/

# **Memory Hierarchy Technology**

- Random Access:
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be "refreshed" regularly
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (until power is lost)
- "Not-so-random" Access Technology:
  - Access time varies from location to location and from time to time
  - Examples: Disk, CDROM, DRAM page-mode access
- Sequential Access Technology: access time linear in location (e.g., Tape)
- The next two lectures will concentrate on random access technology
  - The Main Memory: DRAMs + Caches: SRAMs



### **Main Memory Background**

- Performance of Main Memory:
  - Latency: Cache Miss Penalty
    - Access Time: time between the request and the arrival of the word
    - Cycle Time: time between requests
  - Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is *DRAM* : Dynamic Random Access Memory
  - Dynamic since it needs to be refreshed periodically (every 8 ms)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM : Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM = 4-8
  - Cost/Cycle time: SRAM/DRAM = 8-16



# Random Access Memory (RAM) Technology

- Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on processor chip
    - Tailor on-chip memory to specific needs
      - Instruction cache
      - Data cache
      - Write buffer
- What makes RAM different from a bunch of flipflops?
  - Density: RAM is much denser



# Static RAM Cell



4. Sense amp on column detects difference between bit and bit bar

# **Typical SRAM Organization: 16-word x 4-bit**



# Logic Diagram of a Typical SRAM

- Write Enable is usually active low (WE\_L)
- Din and Dout are combined to save pins:
  - A new control signal, output enable (OE\_L) is needed
  - WE\_L is asserted (Low), OE\_L is deasserted (High)
    - D serves as the data input pin
  - WE\_L is deasserted (High), OE\_L is asserted (Low)
    - D is the data output pin
  - Both WE\_L and OE\_L are asserted:
    - Result is unknown. Don't do that!!!
- Although you could change the VHDL to do what you desire, you must do the best with what you've got (vs. what you need)



# Typical SRAM Timing (Asynchronous Handshake)



#### **Problems with SRAM**

- Six transistors use up a lot of area
- Consider the case where a "zero" is stored in the cell:
  - Transistor N1 will try to pull "bit" to 0
  - Transistor P2 will try to pull "bit bar" to 1
- But bit lines are precharged to high: Are P1 and P2 necessary?
  Select = 1



# **1-Transistor Memory Cell (DRAM)**

- Write:
  - 1. Drive bit line
  - 2. Select row
- Read (destructive):
  - 1. Precharge bit line to Vdd/2
  - 2. Select row
  - 3. Cell and bit line share charge
    - Very small voltage change on the bit line
  - 4. Sense (fancy sense amp)
    - Can detect changes of ~1 million electrons
  - 5. Write: restore the value
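Step 3 of the read sequence (charge sharing) can be sketched numerically. This is a minimal illustration based on charge conservation between the cell capacitor and the precharged bit line. The supply voltage and both capacitance values are assumptions chosen for illustration only; they do not come from the slides or from any particular DRAM part.

```python
def bitline_swing(v_cell, vdd, c_cell, c_bit):
    """Voltage change on a bit line precharged to vdd/2 after charge sharing.

    Charge conservation: the final voltage is the capacitance-weighted
    average of the cell voltage and the precharge voltage.
    """
    v_pre = vdd / 2
    v_final = (c_cell * v_cell + c_bit * v_pre) / (c_cell + c_bit)
    return v_final - v_pre


# All three values below are illustrative assumptions, not slide data.
VDD = 3.3          # supply voltage in volts
C_CELL = 30e-15    # storage capacitor, farads
C_BIT = 300e-15    # bit line capacitance, farads

swing_one = bitline_swing(VDD, VDD, C_CELL, C_BIT)   # cell stores a "1"
swing_zero = bitline_swing(0.0, VDD, C_CELL, C_BIT)  # cell stores a "0"

print(f"stored 1: {swing_one * 1e3:+.0f} mV")
print(f"stored 0: {swing_zero * 1e3:+.0f} mV")
```

Because the bit line capacitance is much larger than the cell capacitance, the swing is a small fraction of Vdd, which is why a sensitive sense amplifier and a restoring write are needed.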

#### Refresh

1. Just do a dummy read to every cell.



# DRAM Organization



# **DRAM logical organization (4 Mbit)**



RAS and CAS each address the square root of the number of bits: 2,048 rows x 2,048 columns for a 4 Mbit array.

### **DRAM physical organization (4 Mbit)**




# Logic Diagram of a Typical DRAM



- Control Signals (RAS\_L, CAS\_L, WE\_L, OE\_L) are all active low
- Din and Dout are combined (D):
  - WE\_L is asserted (Low), OE\_L is deasserted (High)
    - D serves as the data input pin
  - WE\_L is deasserted (High), OE\_L is asserted (Low)
    - D is the data output pin
- Row and column addresses share the same pins (A)
  - RAS\_L goes low: Pins A are latched in as row address
  - CAS\_L goes low: Pins A are latched in as column address
  - RAS/CAS edge-sensitive
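The address multiplexing described above can be sketched as follows. This is a minimal illustration, assuming a 4 Mbit array of 2,048 rows by 2,048 columns (11 row bits and 11 column bits on the shared A pins). Placing the row in the upper address bits is an assumption of this sketch; a real memory controller may map bits differently. The name `split_address` is hypothetical.

```python
ROW_BITS = 11  # assumed: 2,048 rows
COL_BITS = 11  # assumed: 2,048 columns


def split_address(addr):
    """Split a bit address into the (row, column) pair sent over the A pins.

    The row half is latched when RAS_L falls, the column half when CAS_L falls.
    """
    if not 0 <= addr < (1 << (ROW_BITS + COL_BITS)):
        raise ValueError("address out of range for a 4 Mbit array")
    row = addr >> COL_BITS               # upper 11 bits (assumed mapping)
    col = addr & ((1 << COL_BITS) - 1)   # lower 11 bits
    return row, col


row, col = split_address((5 << COL_BITS) | 7)
print(f"row={row}, col={col}")
```

Sharing the pins halves the number of address pins (11 instead of 22 here) at the cost of sending the address in two steps.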

# DRAM Read Timing (Asynchronous Handshake)



# DRAM Write Timing (Asynchronous Handshake)



# **Key DRAM Timing Parameters**

- t<sub>RAC</sub>: minimum time from RAS line falling to the valid data output
  - Quoted as the speed of a DRAM
  - For a fast 4 Mbit DRAM, t<sub>RAC</sub> = 60 ns
- t<sub>RC</sub>: minimum time from the start of one row access to the start of the next
  - t<sub>RC</sub> = 110 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns
- t<sub>CAC</sub>: minimum time from CAS line falling to valid data output
  - 15 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns
- t<sub>PC</sub>: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4Mbit DRAM with a t<sub>RAC</sub> of 60 ns



### **DRAM Performance**

- A 60 ns (t<sub>RAC</sub>) DRAM can
  - perform a row access only every 110 ns (t<sub>RC</sub>)
  - perform column access (t<sub>CAC</sub>) in 15 ns, but time between column accesses is at least 35 ns (t<sub>PC</sub>).
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
  - Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins...
  - 180 ns to 250 ns latency from processor to memory is good for a "60 ns" (t<sub>RAC</sub>) DRAM
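The gap between the quoted "60 ns" speed and the achievable access rate can be made concrete with a short calculation. This sketch uses only the t<sub>RC</sub> and t<sub>PC</sub> values quoted above and ignores the controller and bus overheads just mentioned; the function name is hypothetical.

```python
T_RAC_NS = 60   # quoted speed: RAS falling to valid data
T_RC_NS = 110   # minimum time between the starts of two row accesses
T_PC_NS = 35    # minimum time between the starts of two column accesses


def peak_rate_millions(period_ns):
    """Peak accesses per second, in millions, if one access starts every period_ns."""
    return 1e3 / period_ns


row_rate = peak_rate_millions(T_RC_NS)   # new row on every access
col_rate = peak_rate_millions(T_PC_NS)   # successive columns within one open row

print(f"row accesses:    {row_rate:.2f} M/s")
print(f"column accesses: {col_rate:.2f} M/s")
print(f"ratio:           {col_rate / row_rate:.2f}x")
```

Even at the chip level, the row access rate is set by t<sub>RC</sub> rather than by t<sub>RAC</sub>, and staying within one row is roughly three times faster than opening a new row each time.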

## **Main Memory Performance**



- Simple (a. one-word-wide memory organization):
  - CPU, Cache, Bus, Memory same width (32 bits)
- Wide (b. wide memory organization):
  - CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
- Interleaved (c. interleaved memory organization):
  - CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is *word interleaved*





DRAM Bandwidth



### **Increasing Bandwidth - Interleaving**



### **Main Memory Performance**

#### Timing model

- 1 cycle to send the address
- 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
- Cache block is 4 words
- Miss penalty (M.P.), in cycles:
  - *Simple M.P.* = 4 x (1 + 10 + 1) = 48
  - *Wide M.P.* = 1 + 10 + 1 = 12
  - *Interleaved M.P.* = 1 + 10 + 1 + 3 = 15
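The three miss penalties of the timing model can be reproduced with a short sketch. The function names are hypothetical. The interleaved case assumes there are at least as many banks as words in the block, so all banks are accessed in parallel and only the data transfers are serialized.

```python
SEND_ADDR = 1    # cycles to send the address
CYCLE_TIME = 10  # memory cycle time in cycles
SEND_DATA = 1    # cycles to send one word
BLOCK_WORDS = 4  # cache block size in words


def simple_penalty(words):
    """One-word-wide memory: every word pays the full address/cycle/transfer cost."""
    return words * (SEND_ADDR + CYCLE_TIME + SEND_DATA)


def wide_penalty():
    """Memory and bus as wide as the block: one access fetches everything."""
    return SEND_ADDR + CYCLE_TIME + SEND_DATA


def interleaved_penalty(words):
    """Banks overlap their accesses; the extra words only add transfer cycles."""
    return SEND_ADDR + CYCLE_TIME + SEND_DATA + (words - 1) * SEND_DATA


print("simple:     ", simple_penalty(BLOCK_WORDS))
print("wide:       ", wide_penalty())
print("interleaved:", interleaved_penalty(BLOCK_WORDS))
```

Interleaving comes close to the wide organization's penalty while keeping a one-word bus.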





#### **Independent Memory Banks**

- How many banks?
  - Number of banks ≥ number of clock cycles to access a word in a bank
    - This holds for sequential accesses; with fewer banks, the access stream returns to the original bank before it has the next word ready
- Increasing DRAM sizes => fewer chips => harder to have many banks
  - Growth bits/chip DRAM : 50%-60%/yr
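The rule for the number of banks can be checked with a small simulation. This is a simplified sketch, assuming word interleaving (bank = word address mod number of banks), one new request attempted per clock, and a fixed number of clocks during which a bank stays busy. The function `stall_cycles` and its parameters are hypothetical.

```python
def stall_cycles(num_banks, bank_busy, num_accesses):
    """Count stall cycles for a sequential stream of word addresses.

    A request stalls when its bank is still busy with an earlier access.
    """
    free_at = [0] * num_banks  # clock at which each bank becomes free again
    clock = 0
    stalls = 0
    for addr in range(num_accesses):
        bank = addr % num_banks          # word interleaving (assumed mapping)
        if free_at[bank] > clock:        # bank still busy: wait for it
            stalls += free_at[bank] - clock
            clock = free_at[bank]
        free_at[bank] = clock + bank_busy
        clock += 1                       # next request can issue one clock later
    return stalls


for banks in (2, 3, 4, 8):
    print(banks, "banks:", stall_cycles(banks, bank_busy=4, num_accesses=16), "stall cycles")
```

With a bank busy time of 4 clocks, the stream stalls with 2 or 3 banks and runs without stalls once there are 4 or more.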



#### **Fast Page Mode Operation**



### **Summary:**

- Two Different Types of Locality:
  - Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
  - Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense:
  - Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense:
  - Good choice for providing the user FAST access time.



# Memory Width, Interleaving: Example

Consider a system with the following parameters:

- Cache block size = 1 word, memory bus width = 1 word, miss rate = 3%
- Miss penalty = 32 cycles (4 cycles to send the address, 24 cycles access time per word, 4 cycles to send one word)
- Memory accesses per instruction = 1.2
- Ideal execution CPI (ignoring cache misses) = 2
- Miss rate (block size = 2 words) = 2%
- Miss rate (block size = 4 words) = 1%

- The CPI of the machine with 1-word blocks = $2 + (1.2 \times 0.03 \times 32) = 3.15$
- Increasing the block size to 2 words gives the following CPI:
  - 32-bit bus and memory, no interleaving = $2 + (1.2 \times 0.02 \times 2 \times 32) = 3.54$
  - 32-bit bus and memory, interleaved = $2 + (1.2 \times 0.02 \times (4 + 24 + 8)) = 2.86$
  - 64-bit bus and memory, no interleaving = $2 + (1.2 \times 0.02 \times 1 \times 32) = 2.77$
- Increasing the block size to 4 words gives the following CPI:
  - 32-bit bus and memory, no interleaving = $2 + (1.2 \times 0.01 \times 4 \times 32) = 3.54$
  - 32-bit bus and memory, interleaved = $2 + (1.2 \times 0.01 \times (4 + 24 + 16)) = 2.53$
  - 64-bit bus and memory, no interleaving = $2 + (1.2 \times 0.01 \times 2 \times 32) = 2.77$
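The CPI figures of this example can be recomputed with a short sketch. The function and constant names are hypothetical. The model follows the example's parameters: a non-interleaved memory pays the full 32-cycle penalty once per bus-width transfer, while an interleaved memory pays the address and access time once plus 4 cycles per word transferred.

```python
SEND_ADDR = 4             # cycles to send the address
ACCESS = 24               # cycles of access time per word
SEND_WORD = 4             # cycles to send one word
BASE_CPI = 2.0            # ideal CPI, ignoring cache misses
ACCESSES_PER_INSTR = 1.2  # memory accesses per instruction
MISS_RATE = {1: 0.03, 2: 0.02, 4: 0.01}  # keyed by block size in words


def cpi(block_words, miss_penalty):
    """CPI = ideal CPI + accesses/instruction x miss rate x miss penalty."""
    return BASE_CPI + ACCESSES_PER_INSTR * MISS_RATE[block_words] * miss_penalty


def penalty_plain(block_words, bus_words=1):
    """No interleaving: one full access per bus-width transfer."""
    transfers = block_words // bus_words
    return transfers * (SEND_ADDR + ACCESS + SEND_WORD)


def penalty_interleaved(block_words):
    """Interleaved: accesses overlap, word transfers are serialized."""
    return SEND_ADDR + ACCESS + block_words * SEND_WORD


print(f"1-word block:          {cpi(1, penalty_plain(1)):.2f}")
for words in (2, 4):
    print(f"{words}-word, 32-bit plain:  {cpi(words, penalty_plain(words)):.2f}")
    print(f"{words}-word, interleaved:   {cpi(words, penalty_interleaved(words)):.2f}")
    print(f"{words}-word, 64-bit plain:  {cpi(words, penalty_plain(words, bus_words=2)):.2f}")
```

In this example, interleaving with 4-word blocks gives the lowest CPI, ahead of doubling the bus and memory width.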