CS61C Review Doc

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2021 for UC Berkeley CS61C.

Reference Notice: Material mostly derived from Prof. Wawrzynek & Weaver's lecture slides; some ideas were borrowed from Wikipedia & CS Illustrated Berkeley.

C and Memory Representation

Number Representation

Integer

Convert Number Radices

We have number 159 in decimal, how do we convert this into binary?

$159 = 0 \cdot 2^8 + 1 \cdot 2^7 + 0 \cdot 2^6 + 0 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 128 + 16 + 8 + 4 + 2 + 1$

Thus 159 in binary is 0b010011111, I left the leftmost 0 there to avoid confusion with negative numbers.

Then what is 159 in hexadecimal? I like to view it starting from the binary perspective

159 = 0b1001.1111

You see that I removed the leading 0 because each group of 4 bits corresponds to one hexadecimal character.

159 = 0x9F since 0b1001 = 9 in decimal, and 9 in dec. corresponds to 9 in hex, and 0b1111 = 15 in decimal, and 15 corresponds to F in hex.


Two's Complement

It's represented by:

MSB(bit) as a sign bit ⇒ 0 representing positive and 1 representing negative.

If MSB is 0, then just treat it as unsigned integer

If MSB is 1, then the number is $-(2^{n-1}-k)$ where k is the remaining n-1 bit positions interpreted as an unsigned integer (n being the total number of bits).

This representation can take values from $-2^{n-1}$ to $2^{n-1}-1$

The -1 on the right bound accounts for the existence of 0

For example, if I have a 16-bit integer with a value of 0b1111111111111111 (sixteen 1s),

we would interpret it as $k = 0b111111111111111 = 32767$ (the lower 15 bits), and the value is $val = -(2^{15}-k) = -(32768-32767) = -1$

Two easier ways to negate a two's complement value (e.g. to read off the magnitude of a negative number):

  1. flip the bits first and add 1 to the final result.
  1. minus the number by 1 first and then flip the bits.

The two methods both work!
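A quick worked example (my own, as a sanity check): to get -5 in 8 bits, start from 5 = 0b00000101. Method 1: flip to 0b11111010, then add 1 ⇒ 0b11111011. Method 2: subtract 1 to get 4 = 0b00000100, then flip ⇒ 0b11111011. Both give 0b11111011, which checks out against the formula above: $-(2^{7} - 0b1111011) = -(128 - 123) = -5$.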


Signed Magnitude

MSB(bit) is a sign bit ⇒ 0 representing positive and 1 representing negative

The rest bit positions can be interpreted as unsigned integer representing the magnitude(absolute value).


Bias

Actual value is the binary(unsigned) value plus a fixed bias

bias = -127 ⇒ actual number is bin value with -127 added to it

0b00000000 → -127

0b11111111 → 128

Integer Multiplication

m bits * n bits = m + n bit product
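For example, multiplying two 4-bit numbers may need all 8 bits of the product: 15 × 15 = 225 = 0b11100001.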

Integer Division

Floating Point

$a.bc \times r^{e}$ ⇒ a.bc is called the mantissa, the dot is called the binary (or decimal, depending on the representation of the number) point, r is called the radix, or base, and e is called the exponent

for example, $1.01 \times 2^{-1}$ has a mantissa of 1.01, a radix of 2 and an exponent of -1

The floating point standard (IEEE 754) was invented by Prof. Kahan at UC Berkeley!

We have:

  1. 1 bit for sign (0 ⇒ +, 1 ⇒ -)
  1. e bits for exponent (E) with an exponent bias of $2^{e-1}-1$, so the exponent ranges from $-(2^{e-1}-1)$ to $2^{e-1}$
  1. (n-e-1) bits for fraction(F)

Single Precision Standard(32b) ⇒ 1 bit for sign, 8 bits for exponent, 23 bits for significand, exponent bias of -127

Double Precision Standard(64b) ⇒ 1 bit for sign, 11 bits for exponent, 52 bits for fraction, exponent bias of -1023

We get 1 extra bit of precision because the leading 1 is implicit (because if the leading mantissa bit were 0, we could simply decrement the exponent...)

Common Formula (for single precision):

$(-1)^{S} \times 1.Significand \times 2^{(Exponent - 127)}$

Here we see Exponent is a biased integer representation, while significand is an unsigned integer representation
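To make this concrete, a hand-decoded example of my own: take the 32-bit pattern 0xC0A00000 = 0b1100 0000 1010 0000 0000 0000 0000 0000. S = 1, Exponent = 0b10000001 = 129, Significand = 0b0100000... = 0.25, so the value is $(-1)^1 \times 1.25 \times 2^{129-127} = -5.0$.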

Special Cases

| S | Exponent | Significand | Description |
| --- | --- | --- | --- |
| 0 | 00000000 | 00000000000000000000000 | Positive Zero - the number is too small to represent and is either zero or somewhere between 0 and our smallest number |
| 1 | 00000000 | 00000000000000000000000 | Negative Zero - the number is too small to represent and is either zero or somewhere between 0 and our smallest negative number |
| X | 11111111 | 00000000000000000000000 | +/- Infinity |
| X | 11111111 | Nonzero | NaN |
| X | 00000000 | Nonzero | Denorm |

NaN?

Denorm - Denormalized Numbers

All zero in exponent bit positions, nonzero in significand.

No implied leading 1, but the implicit exponent $= -(2^{e-1}-1) + 1$ ⇒ -126 in 32b, -1022 in 64b ⇒ so the exponent behaves as if the exponent field were 00000001, but the implicit leading 1 is deleted and replaced by an implicit leading 0.

Pointer and Reference

int *p; //variable p is address of an int
p = &y; //assign address of y to p
z = *p; //assign value at address in p to z

addresses are unsigned integer values.

Managing Heap

  1. malloc(size) returns pointer to uninitialized memory
  1. calloc(number, size) returns pointer to zeroed memory
  1. free(pointer) frees allocated memory
  1. realloc(pointer, new_size) returns new pointer to resized memory
    1. Be careful that the new pointer may be different from the old pointer, so if there are other pointers pointing to the same block of memory, you're dead (see the sketch after this list).
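A minimal sketch (my own example, not from the lecture) putting these calls together; note that after realloc you must use the returned pointer:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
	int *arr = malloc(4 * sizeof(int));           // uninitialized memory for 4 ints
	if (arr == NULL) return 1;                    // always check for allocation failure
	for (int i = 0; i < 4; i++) arr[i] = i;

	int *bigger = realloc(arr, 8 * sizeof(int));  // may move the block
	if (bigger == NULL) { free(arr); return 1; }
	arr = bigger;                                 // the old pointer may now be invalid

	int *zeroed = calloc(8, sizeof(int));         // note: (count, size); memory is zeroed
	if (zeroed == NULL) { free(arr); return 1; }

	printf("%d %d\n", arr[3], zeroed[0]);         // prints "3 0"

	free(arr);
	free(zeroed);
	return 0;
}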

Memory Address

Each byte has an unsigned integer address

We commonly think in terms of words: a word is 32b in a 32b architecture and 64b in a 64b architecture; a word is big enough to hold an address

Types

typedef struct {
	int x; //int memory aligned each 4 bytes
	int y; //int memory aligned each 4 bytes
} Point;
Point p1;
Point *paddr = &p1;
//we can use p1.x or paddr->x
union foo{ //provides enough space for the largest element
	int a;
	char b;
	union foo *c;
};

union foo f;
f.a = 0xDEADB33F; //treat f as an integer and store the value
f.c = &f; //treat f as a pointer and store the address of f itself.

Default alignment rules (32b architecture):

  1. char - a byte, no alignment needed
  1. short - 2 bytes, 1/2 word aligned,
  1. int - 4 bytes, word aligned
  1. pointers are same size as ints

Arrays

int ar[2];
int ar[] = {100, 200};

Array variable is simply a "pointer" to the first(0th) element

so ar[0] = *ar, ar[i] = *(ar + i)

Some weird thing:

#include <iostream>

int main(int argc, char** args){
    int arr[2];
    std::cout << arr << std::endl;
    std::cout << &arr << std::endl;
}

prints the same address twice, so &arr and arr compare equal as addresses (even though &arr has type int(*)[2] while arr decays to int*)

Computer Structures

Very Important Idea: Abstraction - make components into black boxes instead of knowing exactly how each of them work!

We will focus on synchronous digital systems this semester.

Synchronous means all operations and communications in the system are coordinated by a central clock. Compared to an asynchronous system, which requires local coordination for communication between components, our synchronous systems will be much easier to design & debug.

Digital means we will represent all values using "discrete" values.

We use Binary in our system because it has good noise immunity, and Moore's law is only possible because of binary representation.

Binary Signals

Hardware signals are combined using primitive operators to implement the capabilities we need for executing ISA instructions

A TT(Truth Table) defines a function ⇒ enumerate each output value for each input combination

| a | b | a op b |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

Bit Operations

$a \cdot b$ = a AND b

a + b = a OR b

$a \oplus b$ = a XOR b

brackets ⇒ determines calculation priorities

!a or ~a or a with a line above = NOT a

Some useful laws of boolean algebra; exhaustive proof (perfect induction) is a good way to prove these
All logic gates, reference: http://www.exclusivearchitecture.com/?page_id=2425

Registers

CMOS register circuits in common use are "edge-triggered", meaning they'll take action based on the rising or falling edge of the clock. We'll assume rising edge for consistency.

1-bit register is called "flip-flop"

Registers "hold on" to their d value until a change in clk causes the register to capture a new d value, hold on to that new value, and output it on q.

We'll define the following properties of register:

  1. setup time ($\tau_{setup}$) - the time that d must be held stable before the clock edge
  1. clock-to-q time / register output delay ($\tau_{clk\text{-}to\text{-}q}$) - the delay from the clk edge until the output value changes
  1. hold time ($\tau_{hold}$) - the time that d must stay stable after the clock edge
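These three parameters set the limits on clocking (standard constraints, stated here from memory rather than copied from the slides): the output of one register must get through the combinational logic and be stable at the next register before its setup window, so the clock period must satisfy $\tau_{clk} \ge \tau_{clk\text{-}to\text{-}q} + \tau_{combinational\ logic} + \tau_{setup}$, and the hold-time requirement is met as long as $\tau_{clk\text{-}to\text{-}q} + \tau_{shortest\ combinational\ path} \ge \tau_{hold}$.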

Physical Limitations (Of Processors)

For CMOS....

  1. They leak when off
  1. They have finite resistance when on
  1. All circuit nodes have capacitance, so to change their voltage level we must displace charge first
  1. So for every logic gate, we'll have delay from input change to output change

Energy

As we saw above, it takes energy to switch each transistor on or off.

We know $P = {dE \over dt}$, and we know that for capacitors, $E_{0\text{-}to\text{-}1} = {1\over2} \cdot C \cdot V_{dd}^2$ and $E_{1\text{-}to\text{-}0} = {1\over2} \cdot C \cdot V_{dd}^2$

So switching power $P_{sw} = {1 \over 2} \cdot \alpha \cdot C \cdot V_{dd}^2 \cdot F$

where:

$\alpha$ is the "activity factor", the average fraction of capacitance switching per cycle (~ number of nodes that switch)

$C$ is the total chip capacitance to be switched

$V_{dd}$ is the operating voltage

$F$ is the clock frequency
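As a rough worked example (numbers are my own, just to check the units): with $\alpha = 0.2$, $C = 1\,nF$, $V_{dd} = 1\,V$ and $F = 1\,GHz$, $P_{sw} = {1\over2} \cdot 0.2 \cdot 10^{-9} \cdot 1^2 \cdot 10^9 = 0.1\,W$.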

We can decrease clock frequency to reduce power, but this doesn't improve energy efficiency, because we would have to run longer to finish our computation

We know $E_{sw}$ is proportional to $V_{dd}^2$: $E_{sw} \propto V_{dd}^2$.

but logic delay grows as $V_{dd}$ is lowered (gates charge their load capacitance more slowly), so a lower supply voltage means a lower maximum clock frequency

We can improve energy efficiency by lowering supply voltage and making up for less performance by using parallelism.

${energy \over program} = {instructions \over program} \cdot {energy \over instruction}$

${energy \over program} \propto {instructions \over program} \cdot C V_{dd}^2$

Capacitance depends on technology, microarchitecture, and circuit details

VddV_{dd} is supply voltage

We want to both reduce capacitance and supply voltage to reduce energy per task

Energy efficiency is super important (key metric) in all computing devices because...

  1. for power-constrained systems (datacenter), need better energy efficiency to get more performance at same power.
  1. for energy-constrained systems (phone), need better energy efficiency to prolong battery life.

$Performance({tasks \over second}) = Power({Joules \over second}) \cdot Energy\ Efficiency({tasks \over Joule})$

In recent years industry hasn't been able to reduce supply voltage much, since reducing it further would mean increasing "leakage power", where transistor switches don't fully turn off

The size of transistors and their capacitance are not shrinking as much as before - we are hitting the "power wall" (功耗墙)

Performance

  1. Latency(延迟) - execution time for each instruction
  1. Throughput(吞吐量) - total number of instructions executed per unit time
  1. Energy Efficiency(能效) - Energy per instruction

Iron Law of Processor Performance:

${time \over program} = {instructions \over program} \cdot {cycles \over instruction} \cdot {time \over cycle}$

  1. ${instructions \over program}$ is determined by
    1. task specification
    1. algorithm (e.g. $O(N^2)$ vs. $O(N)$)
    1. Programming Language
    1. Compiler
    1. Instruction Set Architecture(ISA)
  1. ${cycles \over instruction}$ is determined by
    1. ISA
    1. Processor Implementation (or microarchitecture)
      1. CPI (Cycles Per Instruction)
        1. Pipelined Processors, CPI > 1
        1. Superscalar Processors, CPI < 1
  1. ${time \over cycle}$ is determined by
    1. processor microarchitecture
    1. technology (5nm vs. 14nm)
    1. supply voltage (lower voltage reduces transistor speed, but improves energy efficiency)
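A quick worked example of the Iron Law (my own numbers): a program of $10^9$ instructions with an average CPI of 1.5 on a 1 GHz clock (1 ns per cycle) takes $10^9 \times 1.5 \times 1\,ns = 1.5\,s$.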

Pipelining 指令管线化

It increases throughput(by overlapping execution of multiple instructions) but can never improve latency

But we will eventually run into hazards

  1. Structural Hazard
    1. Two or more instructions in the pipeline compete for the same physical resource
    1. Solved by either
      1. Instructions take turns to use resource (which means some instructions have to stall)
      1. Add more hardware (Yeah, I have money!)
    1. e.g. Regfile Structural Hazard
      1. each instruction can read up to two operands in decode stage and write one value in writeback stage
      1. So avoid structural hazard by having two independent read ports and one independent write port
      1. So reads from one instruction and writes from another can happen simultaneously.
    1. e.g. Memory Access
      1. In DM and IM stage, Instruction and Data Memory are used simultaneously
      1. can be solved by using two separate memories, I$ and D$ (Indeed we would use two separate first-level caches)
    1. RISC ISAs are designed to avoid structural hazards
      1. at most one memory access per instruction
      1. limited operands per instruction
  1. Data Hazard
    1. Register Access
      1. We already have separate ports, but what happens if we write to the same register as we read in the same cycle?
      1. We can exploit high speed of register file (100ps), let WB update value first then ID reads new value ⇒ this is shown by the shading of the pipelining diagram
      1. However, if we're working in high-frequency designs, this might not be possible
    1. ALU Result
      1. Say we add 1 to t0 in first instruction, and add 2 to t0 in second instruction
        1. However, the value of t0 after first instruction would not have been written back by the first instruction until we reach the 5th cycle (when WB stage of the first instruction is finally active)
        1. So basically the second instruction is reading wrong register values when it reaches its register read stage
      1. We can stall the instruction but it reduces performance
      1. Or the compiler can try to arrange code to avoid hazards and stalls, but requires knowledge of the pipeline structure
      1. Or ⇒ We add Data Forwarding!
        1. ALU result from one pipeline to another
        1. This requires modifications in datapath as well as in control logic(See Below)
    1. Load
      1. There are cases when stalls are unavoidable
        1. Slot after a load ⇒ Load Delay Slot
        1. if we use the result of the load in the load delay slot, there's an unavoidable stall (NOP) ⇒ repeat the dependent instruction and forward
        1. but we can move an unrelated instruction into the load delay slot ⇒ no performance loss!
  1. Control Hazard
    1. When a branch statement is executed, the 3 lines of asm code after the branch statement are also loaded into the pipeline, and those statements could be executed regardless of the branch outcome if not turned into NOPs.
    1. Every taken branch in a simple pipeline costs 2 dead cycles
    1. So we use branch prediction(分支预测) to guess which way branch will go
      1. We will keep a branch prediction buffer / cache ⇒ small memory addressed by the lowest bits of PC
      1. During Instruction Decode ⇒ Look up whether branch was taken last time?
        1. If yes, compute PC + offset and fetch that
        1. If no, stick with PC + 4
        1. If branch hasn't been seen before
          1. assume forward branches are not taken, backward branches are taken

Superscalar Processor 超标量处理器

  1. We have multiple pipeline hardwares per stage
    1. Multiple execution units for additional instruction level parallelism
    1. Performance benefit highly code dependent
  1. Start multiple instructions per clock cycle
  1. CPI < 1 (think about this, we have multiple datapath hardwares running at same time)
  1. since CPI < 1, we will use Instructions Per Cycle (IPC)
  1. Out-of-Order Execution 乱序执行
    1. Reorder instructions dynamically in hardware to reduce the impact of hazards, e.g. memory/cache misses.

Memory Cache

It is very important to notice the performance gap between CPU and DRAM: it is usually impossible to buy RAM that is both BIG and FAST for a small amount of money. The solution is to use a cache!

Big Idea is to use Locality 局部性 - The idea that memory access usually happens around the same place in a given period

Principle of Locality states that programs access small portion of address space at any instant of time (spatial locality) and repeatedly access that portion (temporal locality)

We have two types of locality

  1. Temporal Locality 时间局部性
    1. If memory location is referenced, then it will tend to be referenced again soon
  1. Spatial Locality 空间局部性
    1. If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon

So the cache exploits the fact that a program's memory accesses exhibit those locality characteristics...

And it...

  1. Give illusion of speed of fastest memory with size of largest memory
  1. However, if you overwhelm the cache your performance may drop off a cliff.
  1. Now, instead of going to memory and asking for data, the processor goes to the cache and asks for data
    1. Processor asks for data at 0x12F0
    1. Cache checks if it has a copy of the data at address 0x12F0
      1. If yes, return the data to the processor
    1. If not, the cache asks for 0x12F0 from memory and stores that value in the cache
    1. Then the cache returns the data to the processor.

How Cache uses Processor address:

| Tag (MSB … n+o) | Set Index (n+o … o) | Block Offset (o … LSB) |
| --- | --- | --- |
| Remaining portion of processor address | Bit length determined by number of sets | Byte address within block; bit length determined by block size |

Organization of sets and blocks:

  1. Directly Mapped
    1. Associativity = 1
    1. # of Sets = # of Blocks
    1. Requires only 1 comparator
  1. Fully associative
    1. Associativity = # of blocks
    1. One set per cache ⇒ Fetched memory can go anywhere
    1. No index field, 1 comparator per block
  1. N-way Set Associative
    1. Associativity = N, N places for a block (every set contains N blocks)
    1. # of sets = number of blocks / N
    1. N comparators

Total Cache Capacity: $C_{cache} = Associativity \times \#\,of\,sets \times block\_size$
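To make the fields concrete (my own numbers): take 32-bit addresses and a 4 KiB cache that is 2-way set associative with 16-byte blocks. The block offset is 4 bits ($2^4 = 16$), the number of sets is $4096 / (2 \times 16) = 128$, so the set index is 7 bits and the tag is $32 - 7 - 4 = 21$ bits. Sanity check against the capacity formula: $2 \times 128 \times 16 = 4096$ bytes.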

Replacement Policy:

When a miss occurs, which way's block is selected for replacement?

  1. Least Recently Used (LRU): one that has been unused the longest
    1. Must track when each way's block was used relative to other blocks in the set
    1. Example Simple "Pseudo-LRU" Implementation:
      1. A hardware replacement pointer points to one cache entry
      1. Whenever an access is made to the entry the pointer points to, move the pointer to the next entry
      1. Otherwise, don't move the pointer
      1. It's actually a "not-most-recently-used" policy
  1. Random Replacement
    1. Choose a random block and evict it.

Types of Cache Miss

  1. Compulsory (强制性失误), aka cold start / first reference miss
    1. First access to a block
  1. Capacity (空间性失误)
    1. Cache cannot contain all blocks accessed by the program
    1. Misses that would not occur with infinite cache
  1. Conflict (冲突性失误) aka collision miss
    1. Multiple memory locations mapped to same cache set
    1. Misses that would not occur with ideal fully associative cache
  1. Coherency (连贯性失误) ⇒ Only if sharing data between two processor cores
    1. Share a cache line between two processor cores
      1. every time one does a write the other will take a cache miss
      1. Even if writing to different parts of the cache line
    1. If both cores are only reading, it's fine

Write policy:

  1. Cache Hit
    1. Write-through
      1. Write cache and write the memory
        1. Very slow, so include a "write buffer" to allow processor to continue once data is stored in the write buffer.
        1. The write buffer updates memory in parallel while the processor continues
    1. Write-back
      1. Write only to cache (dirty bit = 1) and only back to memory when the block has to be evicted from cache.
  1. Cache miss:
    1. No-write-allocate: only write to main memory
    1. Write-allocate (fetch on write): fetch into cache

Some extra information stored in cache

  1. Valid bit
    1. When a program starts, the cache does not have valid information for this program
    1. Need an indicator whether this tag entry is valid
  1. Dirty-bit (Write-back policy)
    1. If data in this cache has changed
  1. Shared-bit (MOESI policy)
    1. If this data is official / only / shared copy

Cache Coherency Policy - MOESI

Performance Measures

  1. Hit rate: fraction of accesses that hit in the cache
  1. Miss rate: 1 - Hit rate
    1. Global Miss Rate - the fraction of references that miss in some level of a multilevel cache
      1. misses in this cache divided by the total number of memory accesses generated by the CPU
    1. Local Miss Rate - the fraction of references to one level of a cache that miss
  1. Miss penalty: time to replace a block from lower level in memory hierarchy to cache
  1. Hit time: time to access cache memory (including tag comparison)
  1. Average Memory Access Time (AMAT) = $Hit\ Time + Miss\ Rate \times Miss\ Penalty$
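A quick worked example (my own numbers): with a 1-cycle hit time, a 5% miss rate and a 100-cycle miss penalty, $AMAT = 1 + 0.05 \times 100 = 6$ cycles.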

Cache Design Space

Several Interacting Dimensions including cache size, block size, associativity, replacement policy, write policy, etc.

Optimal choice is a compromise and depends on access characteristics ⇒ simplicity often wins

Design Choices

  1. Increasing Associativity
    1. Hit time increases with large step from DM to ≥ 2 ways
      1. Since we need to mux correct way to processor
    1. Hit time slightly increases for further increase in associativity
    1. Miss rate goes down due to reduced conflict misses
      1. But most gain is from 1→2→4 way with limited benefit from higher associativities
    1. Miss penalty mostly unchanged, since replacement policy runs in parallel with fetching missing line from memory
  1. Increasing # of entries
    1. Hit time increases since reading tags and data from larger memory structures
    1. Miss rate goes down due to reduced capacity and conflict misses
      1. Miss rate drops ~2x for every ~4x increase in capacity
    1. Miss penalty unchanged
    1. but at some point, increase in hit time may overcome the improvement in hit rate, yielding a decrease in performance
  1. Increasing block size
    1. Hit time unchanged but might be slightly reduced as number of tags is reduced
    1. Miss rate goes down at first due to spatial locality, then increases due to increased conflict misses (fewer blocks in cache)
    1. Miss penalty rises with larger block size, but the fixed initial latency is amortized over the whole block.
  1. Add a "victim cache"
    1. Small fully associative cache that holds the last few evicted cache lines

Operating System

  1. Runs before any user programs (after BIOS and boot loader) when computer is first turned on, and intermittently runs as long as computer is on.
  1. Finds and controls all I/O devices in the machine in a general way
    1. Relying on hardware-specific "device drivers"
  1. Starts Services(100+)
    1. File System,
    1. Network Stack (Ethernet, WIFI, Bluetooth...)
  1. Loads, runs and manages programs
    1. Multiple programs at the same time (time-sharing)
    1. Isolate programs from each other (isolation)
    1. Multiplex resources between applications (e.g. devices)

Sharing Of Resources

  1. OS gives each process isolation even when multiple processes share the same hardware resources
  1. Each process has the view that it "owns" the whole machine when it is running
  1. Share time on the CPU: Context Switch
    1. Change from one process to another on the CPU Core
    1. Save and restore the state of current process to pick up where it is left off (running status ⇒ runnable status)
  1. Share space in memory: Virtual Memory
    1. Each process has the "illusion" of access to the full address space
    1. One process cannot see what another process has stored in memory
  1. Requires following from hardware
    1. Memory translation
      1. Each running process has a mapping from "virtual" to "physical" addresses that is different for each process
      1. When doing a load/store, the program issues a virtual address, but the actual memory access uses a physical address
    1. Protection and privilege
      1. Split the processor into at least two running modes: "User" and "Supervisor"
      1. Lesser privilege cannot change its memory mapping
      1. But Supervisor can change the mapping for any given program, and also has its own set of virtual ⇒ physical mappings
    1. Traps & Interrupts
      1. A way of going into Supervisor mode on demand
    1. CSR Registers ⇒ "Control and Status Registers"
      1. CSRRW rd rs csr means
        1. read the old value of the specific control and status register and put it into rd
        1. If rs ≠ x0, place the new value in the CSR
      1. They are used to communicate requests to the hardware
      1. The hardware enforces privileges, so program running at User level cannot change Supervisor-level CSRs.

Traps / Interrupts / Exceptions

  1. Interrupt
    1. Caused by an event external to current running program
    1. e.g. Key press, disk I/O
    1. Asynchronous to the current program, so we can handle the interrupt at any convenient instruction
      1. "Whenever it's convenient, just don't wait too long"
  1. Exception
    1. Caused by some event during execution of one instruction of current running program
    1. e.g. Memory Error, Bus Error, Illegal Instruction, Raised Exception
    1. Synchronous
      1. Must handle exception precisely on instruction that caused the exception
      1. "Drop whatever you are doing and act now"
  1. Trap
    1. Action of servicing interrupt or exception by hardware jump to "interrupt or trap handler" code

Trap Handler's View

  1. View of machine state is that every instruction prior to the trapped one has completed, and no instruction after the trap has executed.
  1. Implies that handler can return from an interrupt by restoring user registers and jumping back to interrupted instruction
    1. Interrupt handler software doesn't need to understand the pipeline of the machine, or what program was doing
    1. More complex to handle exception by interrupt
  1. Providing precise traps is tricky in a pipelined superscalar out-of-order processor!

Hardware Action

  1. Let's say we're running program A and all of a sudden, in the MA (memory access) stage, program A does an invalid memory read!
  1. The hardware first flushes the instructions currently in the pipeline (converts them to NOPs or "bubbles")
  1. Then it adjusts the privilege level
  1. Then it disables interrupts
    1. We don't want to get interrupted when handling an interrupt
  1. Write the old program counter into the sepc CSR
    1. It's the PC that triggered the exception (or first instruction that hasn't yet executed if an interrupt)
  1. Write the reason into the scause CSR
  1. Set the PC to the value in the stvec CSR
    1. This is the address of the "trap handler" ⇒ Single function that handles ALL exceptions and interrupts

Software Action

  1. Save all the registers
    1. Intent is to make the previous program think that nothing whatsoever actually happened!
    1. Steps
      1. Supervisor mode has an sscratch CSR
        1. Use it to point to a piece of memory to store things for the trap handler
      1. Swap x1 for sscratch
        1. csrrw x1 x1 sscratch
      1. Now save all the other registers into that location
        1. sw x2 4(x1)
        1. sw x3 8(x1)
        1. ...
      1. Store the PC from the sepc CSR
        1. csrrw x2 x0 sepc
        1. sw x2 124(x1)
      1. finally save x1 and restore sscratch
        1. csrrw x2 x1 sscratch
        1. sw x2 0(x1)
  1. Figure out what the exception or interrupt is
    1. Read the appropriate CSRs and other pieces to do what is necessary
  1. Restore all the registers
    1. Restore the value for sepc
      1. If ECALL, increment by 4 to make it look like a function call
      1. otherwise just redo the instruction that triggered the exception
    1. Swap x1 temporarily using sscratch
    1. if an ECall, set a0 to the returned value
  1. Return to the right point in execution
    1. execute the SRET instruction
    1. back to the hardware

Hardware Action Again

  1. Re-enable interrupts
    1. Now that we're done with the trap handler, we can get interrupted again
    1. Reset back down to user level
    1. Restore the PC to the value of sepc

Now the program continues on like nothing ever happened.

Caches probably got trashed ⇒ for security reasons

Context Switch

  1. Hardware provides the OS an interrupt - "timer interrupt"
    1. At a regular interval
  1. When triggered, the trap handler can execute a context switch
    1. Take those saved registers that were stored in the area pointed to by sscratch
    1. copy them to a bookkeeping data structure for current process(Process Control Block)
    1. copy the satp(table pointer) value to that data structure so we know its memory mapping
  1. Pick some other process's data structure
    1. Determined by the "scheduler" (调度器) ⇒ See CS162
    1. Load the process's registers, satp, sepc, etc.
    1. Tell the caches to flush themselves
      1. Needed for proper isolation
      1. We'd be taking a ton of misses anyway since the new process has no temporal locality with the old process
  1. return with sret

I/O

Options

  1. Special input/output instructions & hardware
  1. Memory mapped I/O
    1. portion of address space dedicated to I/O
    1. I/O device registers there (no memory)
    1. Use load/store instructions

I/O Models

Common I/O devices neither deliver nor accept data matching processor speed

So we have two models(Programmed IO)

  1. Polling
    1. Processor checks status before acting
    1. Device registers generally serve two functions
      1. Control Register - says it's OK to read/write (IO Ready)
      1. Data Register - contains data
    1. Processor reads from Control Register in loop
      1. Waiting for device to set Ready bit in Control Register (0 → 1)
    1. Processor then loads from (input) or writes to (output) data register
  1. I/O Interrupt
    1. Interrupt when IO is ready or needs attention
      1. Interrupt current program
      1. Transfers control to the trap handler in the operating system
    1. If there's a lot of IO,
      1. We are spending a lot on context switches, cache flushes, pipeline flushes, etc.
  1. Both not ideal because
    1. Device speeds don't align well with CPU speeds
    1. Energy cost of using beefy general-purpose CPU where simpler hardware would suffice
    1. So comes Direct-Memory-Access (DMA)

Real World(without DMA):

  1. Low data rate
    1. we should use interrupts because overhead of interrupts ends up being low
    1. but in practice, USB hardware only supports polling
  1. High data rate
    1. Start with interrupts
      1. If there's no data, we don't do anything
    1. Once start getting data
      1. We start polling
      1. Or we use Direct Memory Access (DMA) ⇒ The device just writes the data into memory directly.

DMA:

  1. Contains CSR registers written by CPU
    1. Memory address to write/read data
    1. # of bytes
    1. I/O device #, direction of transfer
    1. unit of transfer, amount to transfer per burst

DMA: Incoming Data

  1. Receive Interrupt from device
  1. CPU takes interrupt, initiates transfer
    1. Instructs DMA engine to place data at certain address
  1. DMA engine handles the transfer
    1. CPU executes other things
  1. Upon completion, Device/DMA engine interrupts the CPU

DMA: Outgoing Data

  1. CPU decides to initiate transfer, confirms that external device is ready
  1. CPU initiates transfer
    1. Instructs DMA engine that data is available at certain address
  1. DMA engine handles the transfer
    1. CPU is free to execute other things
  1. DMA engine interrupts the CPU again to signal completion

Where to place DMA engine?

Since DMA messes around with memory, where in the memory hierarchy do we plug in the DMA engine?

  1. Between L1 and CPU?
    1. Free coherency ⇒ memory and cache will stay consistent
    1. But it trashes the CPU's working set with transferred data
  1. Between last-level cache and main memory
    1. Don't mess with caches
    1. But need to explicitly manage coherency
  1. Or just treat like another node in a multiprocessor
    1. what modern computers do
    1. DMA engine just acts like another processor for the cache coherence mechanisms

Virtual Memory

Supervisor mode alone isn't enough; we need applications to be able to access only their own memory.

Virtual Mem is good because...

  1. Protection & Privacy
    1. Each user has their own private address space and one or more shared address spaces
  1. Demand Paging
    1. provides the ability to run programs larger than primary memory
    1. But the system might start thrashing (repeatedly copy data to and from disk) when your working set exceeds physical memory.

A Processor-generated address can be split into:

| Page Number | Offset |

Space required by the page tables (PT) is proportional to the address space, number of users, ... It is too large to keep in CPU registers

So we'll keep PTs in main memory ⇒ but this requires two references for each memory access, one to retrieve the page base address and another to access the data word

If an instruction references a memory page that isn't in DRAM

  1. We get an exception of type "page fault"
  1. Page fault handler
    1. If no unused page is available, a page currently in DRAM is selected to be replaced
      1. Replaced page is written to disk, PTE that maps this VPN ⇒ PPN is marked with DPN
    1. If the virtual page doesn't yet exist, assign it an unused page in DRAM
    1. If the page exists but is on disk
      1. Initiate transfer of the page contents we're requesting from disk to DRAM, assigning to an unused DRAM page


Size of Linear Page Table

With 32-bit memory addresses and 4KB ($2^{12}$ byte) pages ⇒ ${2^{32} \over 2^{12}} = 2^{20}$ virtual pages per user (process). If we assume 4-byte PTEs, that's $2^{20}$ PTEs = $2^{22}$ bytes: a 4MB page table per process!

You may think that we can make each virtual page larger?

However, larger pages means:

  1. Internal fragmentation (Not all memory in page gets used)
  1. Larger page fault penalty (more time to read from disk)

Thinking about a 64-bit virtual address space, even 1MB pages would require $2^{44}$ 8-byte PTEs (35TB)

However, most processes only use a set of high addresses (stack) and a set of low addresses (instructions, heap)

So we will use a Hierarchical Page Table ⇒ this exploits the sparsity of virtual address space use


Cost of Virtual Memory

There's a cost to virtual memory, namely the price of Address Translation & Memory Protection

Address translation is very expensive: in a two-level page table, each reference becomes several memory accesses.

To account for the inefficiencies of address translation, we introduce Translation Lookaside Buffers (TLB) that caches some translations

  1. On a TLB hit, address translation now costs only a single cycle
  1. For a TLB miss, we will need a Page-Table walk to get the physical address as well as to refill the TLB table.

TLB Reach = Size of largest virtual address space that can be simultaneously mapped by TLB

$Reach_{TLB} = \#_{entries} \times Size_{page} \times {\#_{pages} \over entry}$
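For example (my own numbers): a 64-entry TLB with 4KB pages and one page per entry can map $64 \times 4\,KB = 256\,KB$ of virtual address space at once.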


TLB Designs

  1. Typically 32-128 entries
  1. Each entry maps a large page
    1. So less spatial locality across pages
  1. Sometimes fully associative; larger TLBs (256-512 entries) are 4-8 way set-associative
  1. Larger systems sometimes have multi-level (L1, L2) TLBs
  1. Random or FIFO(First-In-First-Out) replacement policy
  1. Two styles of refill
    1. MIPS style
      1. The TLB is the only translation in the hardware
      1. Whenever you get a TLB miss you jump to the page fault handler
    1. x86
      1. The page table has a defined structure
      1. In the event of a TLB miss the hardware walks the page table
        1. Only if the page is unavailable you jump to the page fault handler
    1. RISC-V prefers the x86 style, but is also compatible with the MIPS style

Finished TLB workflow


Some Virtual Memory Tricks

  1. Copy-On-Write Duplication
    1. Split a process and now have two processes (fork)
    1. Copy the page table and registers
      1. and mark both the original and copy's memory as read-only
    1. Every time either process wants to write a page...
      1. Traps to the protection fault handler
      1. The fault handler copies the page, and updates both page-tables to allow writing
    1. And now we only copy memory when we need to first write it.
  1. Shared Dynamically Linked Libraries
    1. Two virtual PTEs point to the same physical memory
  1. Memory Mapped File
    1. "Load the entire file into a contiguous block of memory"
    1. The system just points the PTE to disk ⇒ when the program actually reads it, a page fault is generated and the OS loads the page into memory
  1. Dirty bit in PTE
    1. When a page needs to be swapped out, it only has to be written back to disk if it is dirty

RISC-V

Simple RISC-V RV32I Datapath Diagram (From CS61C Fall 21 Slides)

Some Good Material...

https://inst.eecs.berkeley.edu/%7Ecs61c/resources/riscvcard.pdf
RISC-V Card (Official Cheatsheet Provided by CS61C Staff)

Assembly Language

Immediates are "sign-extended"

Calling Convention

Two types of registers in calling convention

  1. Callee saved ⇒ The function that gets called saves them at the beginning and restores them before returning
  1. Caller saved ⇒ The function that gets called can do whatever it wants with these registers; the function that calls is responsible for saving them before the call and restoring them after the call.

CALL(Compiler ⇒ Assembler ⇒ Linker ⇒ Loader)

Compiler(CS164) ⇒ Most Computationally Heavy in the CALL chain

Higher-Level Language Code (foo.c) ⇒ Assembly Language Code (foo.s)

Code Matches Calling Convention for the architecture.

Output may contain pseudo-instructions

  1. Lexer: Transforms input ⇒ tokens
  1. Parser: Tokens ⇒ Abstract Syntax Tree
  1. Semantic Analysis and Optimization: Checks for semantic errors (语义错误), may reorganize code to make it better.
  1. Code Generation: Outputs the assembly code

Assembler: dumb compiler for assembly language

Assembly Language Code (foo.s) ⇒ Object Code, Information Tables (foo.o)

| Assembler Directive | Description |
| --- | --- |
| .text | Subsequent items put in user text segment (machine code) |
| .data | Subsequent items put in user data segment (binary rep of data in source file) |
| .globl sym | Declares sym global so it can be referenced from other files |
| .string str | Store the string str in memory and null-terminate it |
| .word w1…wn | Store the n 32-bit quantities in successive memory words |
  1. Reads and uses Directives
  1. Replaces Pseudo-instructions
  1. Produces Machine Language rather than just Assembly Language
  1. Outputs Object File

Tail call optimization

int doSth(){
	// ... lots of code
	return foo(y);
}
  1. For efficiency, evaluate the arguments for foo() and place them in a0-a7
  1. Restore ra, all callee saved registers, and sp
  1. call foo() with j or tail
  1. foo() will return directly to where doSth needs to return to

Forward Reference Problem

Branch instructions can refer to labels that are "forward" in the program

L1:
slt t0, x0, a1
beq t0, x0, L2
addi a1, a1, -1
jal x0, L1
L2:
add t1, a0, a1

Solved by taking 2 passes over the program

  1. First pass remembers positions of labels
  1. Second pass uses label positions to generate code

We solved the forward reference problem (branches), but we might be referencing things outside of the current file (jumps, static data addresses), so how do we account for these?

Symbol Table

Contains a list of "items" in this file that may be used by other files

items include:

  1. Labels for function calling
  1. Data: anything in the .data sections and variables which may be accessed across files

Relocation Table

List of "items" this file needs the address of later

  1. External label jumped to
  1. Any piece of data referenced by static loading, such as la

Object File Format (Standard Format is ELF, except for Microsoft)

  1. Object file header: size and position of the other pieces of the object file
  1. Text segment(.text): the machine code
  1. Data segment(.data): binary representation of the static data
  1. Relocation Information: identifies lines of code that need to be fixed up later
  1. Symbol Table
  1. Debugging Information

Linker

Object code files with information tables (foo.o, libc.o) ⇒ Executable code (a.out)

  1. Combines several object files into a single executable
  1. Enables separate compilation of files so that changes to one file do not require recompilation of the whole program

Steps:

  1. Take text segment from each .o file and put them together
  1. Take data segment from each .o file and put them together, concatenate this to end of text segments
  1. Resolve References
    1. Linker assumes first word of first text segment is at 0x04000000 (virtual memory)
    1. It knows:
      1. length of each text and data segment
      1. ordering of text and data segments
    1. It calculates:
      1. absolute address of each label and each piece of data being referenced
    1. To resolve,
      1. It searches for reference in all "user" symbol tables
      1. If not found, search library files
      1. Once absolute address is determined, fill in the machine code appropriately

Loader

Load program into memory and kickstart the program. (usually implemented by OS)

  1. Reads executable header and determines size of text and data
  1. Creates new address space for program large enough to hold text and data segments, as well as stack segment
  1. Copy instructions and data from file into the new address space
  1. Copies arguments passed to the program onto the stack
  1. Initializes machine registers
    1. Most registers cleared, but stack pointer assigned address of 1st free stack location
  1. Jumps to start-up routine that copies the program's arguments from the stack to registers & sets the PC
  1. If main routine returns, start-up routine terminates program with the exit system call
  1. Also responsible for linking dynamically linked libraries(DLL)

Dynamically Linked Libraries

  1. Storing a program requires less disk space
  1. Executing two programs requires less memory (if they share a library)
  1. At runtime, there's time overhead to do link

Parallelism

Calculating Speedup

$Speedup_{with\_enhancement} = {Time_{without\_enhancement} \over Time_{with\_enhancement}}$

$Time_{with\_enhancement} = Time_{without\_enhancement} \times [(1-F) + F/S]$

$Speedup_{with\_enhancement} = {1 \over (1-F) + F/S}$

F = fraction of the task that got affected

S = Speedup factor for the fraction of the task that is affected

Amdahl's Law

If the portion of the program that can be parallelized is small, then the speedup is limited
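A quick worked example (my own numbers): if 80% of a program can be sped up by a factor of 4 (F = 0.8, S = 4), then $Speedup = {1 \over (1-0.8) + 0.8/4} = {1 \over 0.4} = 2.5$, nowhere near 4x, because the remaining serial 20% dominates.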

Strong and Weak Scaling

To get a good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem

Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem

Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors

It is always important to get things right first and then optimize, because the real cost of a program is the time to code it plus the time to run it on the machine.

Data Level Parallelism

Increases Throughput

Single-Instruction/Multiple-Data Stream ⇒ SIMD

  1. Processes multiple data streams using a single instruction stream
  1. Intel SIMD instruction extensions / GPU
  1. Higher throughput per $
    1. Much simpler control logic
  1. Easy to map to MIMD
  1. Requires less memory since there's less instructions
  1. Less cost per unit
  1. 1 Instruction Decoder
  1. Less Complexity
  1. Latent/Tacit Synchronization

Multiple-Instruction/Multiple-Data Streams ⇒ MIMD

  1. Multiple autonomous processors simultaneously executing different instructions on different data
  1. multicore / warehouse-scale computers
  1. Lower throughput per $
  1. VERY hard to map to SIMD
  1. Requires more memory since there's more instructions
  1. More cost per unit
  1. 2+ Instruction Decoder
  1. More complexity
  1. Accurate or Explicit Synchronization

Loop Unrolling

Optimizing compilers usually perform this job.

  1. Expose data-level parallelism for vector(SIMD) instructions or super-scalar multiple instruction issue
  1. Mix unrelated operations into the pipeline to help reduce hazards (see the sketch after this list)
  1. Reduce loop "overhead"
  1. But also... makes code size larger
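A minimal sketch of what unrolling looks like (my own example, unrolled by 4 and assuming the array length is a multiple of 4; a real compiler would add cleanup code for the remainder):

#include <stddef.h>

// Original loop: one add per iteration, plus loop overhead every time.
void add_scalar(int *dst, const int *src, size_t n, int k) {
	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] + k;
}

// Unrolled by 4: a quarter of the branch/index overhead, and the four
// independent adds expose parallelism for SIMD or superscalar issue.
void add_scalar_unrolled(int *dst, const int *src, size_t n, int k) {
	for (size_t i = 0; i < n; i += 4) {
		dst[i]     = src[i]     + k;
		dst[i + 1] = src[i + 1] + k;
		dst[i + 2] = src[i + 2] + k;
		dst[i + 3] = src[i + 3] + k;
	}
}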

Thread Level Parallelism

Thread ⇒ A sequential flow of instructions that performs some task

Each core provides one or more(superscalar) hardware threads that actively execute instructions

The operating system multiplexes multiple software threads onto the available hardware threads

HyperThreading

Processor resources are expensive and should not be left idle

Hardware switches threads to bring in other useful work while waiting for a cache miss

So we can put in redundant hardware ⇒ don't have to save context on every thread switch

This is attractive for apps with abundant TLP (Thread-level Parallelism)

Data Race

Two memory accesses of the same location from different threads, of which at least one is a write, form a data race

So we'll use "Lock" to grant access to a region (critical section)

Test-and-Set

In a single atomic operation:

  1. Test to see if a memory location is set (contains 1)
  1. Set it (to 1) if it isn't (it contained a zero when tested)
    1. Otherwise indicate that the set failed, so the program can try again (see the sketch after this list)
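A minimal spinlock sketch using C11's atomic_flag, which provides exactly this atomic test-and-set (my own example; the underlying hardware primitive differs per ISA, e.g. an atomic swap on RISC-V):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   // 0 (clear) means unlocked

void acquire(void) {
	// atomic_flag_test_and_set atomically sets the flag to 1 and returns its
	// previous value: if it returns 1, someone else already holds the lock.
	while (atomic_flag_test_and_set(&lock))
		;  // spin and try again
}

void release(void) {
	atomic_flag_clear(&lock);  // back to 0 so another thread can enter
}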

OpenMP

#pragma omp parallel
{
	/* blablabla */
}

#pragma omp parallel for
for(int i=0; i<10; i++){
	/* assigns each i to different threads */
}

int sum = 0;
#pragma omp parallel reduction(+: sum)
{
	/* blablabla */
}

//Possible operators include: + / * / - / min / max / & / && / | / || / ^

Private Variables

#pragma omp parallel private (var1)
{
	/* blablabla */
}

Methods

omp_set_num_threads(x);           // request x threads for parallel regions
num_th = omp_get_num_threads();   // number of threads in the current team
th_id = omp_get_thread_num();     // this thread's id, 0 .. num_th-1
double omp_get_wtime(void);       // wall-clock time in seconds, for timing code

Request-Level Parallelism

Common IO Devices

Disk

Non-volatile storage that is cheap, large, slow

Two Types:

  1. Hard Disk Drives (HDD) 硬盘 - faster, more dense, non-removable
  1. Floppy Disks 软盘 - slower, less dense, removable (replaced by USB "flash drive")

Disk Device Terminology

  1. Several platters, with information recorded magnetically on both surfaces
  1. Bits recorded in tracks, which are in turn divided into sectors (e.g. every 512B can be a sector)
  1. Actuator moves the head over the track ⇒ seek, wait for the sector to rotate under the head, then read or write

The closer the head is to the disk, the smaller the "spot size" and thus the denser the recording

Spot size

  1. Measured in $Gbit/in^2$
  1. ~900 $Gbit/in^2$ is state of the art

Disks are sealed to keep dust out, so heads can fly at around 3-20nm above the surface of the disk; 99.999% of the head/arm weight is supported by the air bearing force (air cushion) between the disk and the head

Disk Access Time = Seek Time + Rotation Time + Transfer Time + Controller Overhead

  1. Seek Time = time to position the head assembly at the proper cylinder
    1. Average # of tracks to move arm = Number of tracks / 3 ⇒ Seek time = # of tracks moved * time to move across one track
  1. Rotation time = time for the disk to rotate to the point where the first sector of the block to access reaches the head
    1. Use average distance of sector from head = 1/2 time of a rotation
  1. Transfer time = time taken by the sectors of the block and any gaps between them to rotate past the head

Many disks have on-disk caches, which are completely hidden from the outside world, so estimates are different in practice.
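A worked example with made-up numbers of my own: a 7200 RPM disk with a 5 ms average seek time, a 50 MB/s transfer rate, 0.1 ms controller overhead and a 4 KB block gives an average rotation time of ${1 \over 2} \times {60 \over 7200}\,s \approx 4.17\,ms$ and a transfer time of $4\,KB / 50\,MB/s \approx 0.08\,ms$, so the access time is roughly $5 + 4.17 + 0.08 + 0.1 \approx 9.4\,ms$.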

Solid State Drive (SSD) / Flash Memory

NMOS transistor with an additional conductor between gate and source/drain which traps electrons.

Memory cells can only withstand a limited number of program-erase cycles. Controllers use a technique called wear leveling to distribute writes as evenly as possible across all the flash blocks.

Bandwidth is similar to a spinning disk's

But there's no seek time, so there is no additional latency for random access vs. sequential access of a block

Networking

OSI 7 Layer Network Model

Political - You shall not encrypt...

Application: TLS + HTTP...

Transport: TCP/UDP

Network: IPv4 / IPv6

Data Link: Ethernet

Considerations in building systems

Dependability

Types of Faults in Digital Designs

  1. Design Bugs (function, timing, power draw)
    1. Detected and corrected at design time through testing and verification
  1. Manufacturing Defects (violation of design rules, impurities in processing, statistical variations)
    1. Post production testing for sorting
    1. spare on-chip resources for repair
    1. Dealing with this in ICs:
      1. Designers provide "test vectors"
        1. Tools help with ATPG(Automatic Test Pattern Generation)
      1. Special on-chip circuits help speed the testing process
        1. BIST (built in self test), Scan-chains
  1. Runtime Failures (physical effects and environmental conditions)
    1. "Hard" faults: aging
    1. "Soft" (transient) faults: electromagnetic interference, cosmic particles
    1. Dealt with:
      1. Redundancy
      1. Measures of dependability
      1. Codes for Error Detection/Correction
      1. Protecting Disk-Drives Against Errors

Dependability(可靠性) Via Redundancy

  1. Spatial Redundancy
    1. Replicated data or extra information / hardware to handle hard and soft failures
  1. Temporal Redundancy
    1. Redundancy in time(retry) to handle soft failures

Dependability Measures

Reliability: Mean Time To Failure (MTTF)

Service Interruption: Mean Time To Repair (MTTR)

Mean Time Between Failures (MTBF) = MTTF + MTTR

Availability(可用性) = ${MTTF \over MTTF+MTTR}$

Improving Availability

  1. Increase MTTF: More reliable hardware/software + Fault Tolerance
  1. Reduce MTTR: improved tools and processes for diagnosis and repair

Annualized Failure Rate (AFR) = Average Number Of Failures Per Year

Dependability Design Principle

No single points of failure

It follows the "barrel effect" (weakest link) ⇒ the dependability of the entire system is limited by the part that has the lowest dependability

Error Correction/Detection Codes(ECC / EDC)

Error Detection Coding - Parity Bit

Each data value is tagged with an extra bit to force the stored word to have even parity (even number of "1"s)

Of course, one can also use an odd parity bit to protect against errors
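A minimal sketch (my own example) of computing and checking an even-parity bit for one byte:

#include <stdint.h>
#include <stdio.h>

// Even-parity bit for an 8-bit value: 1 if the data has an odd number of 1s,
// so that data + parity together contain an even number of 1s.
static uint8_t even_parity(uint8_t data) {
	uint8_t p = 0;
	for (int i = 0; i < 8; i++)
		p ^= (data >> i) & 1;   // XOR of all bits = parity
	return p;
}

int main(void) {
	uint8_t word = 0x9F;                 // 0b10011111 has six 1s
	uint8_t parity = even_parity(word);  // six is already even, so parity = 0
	printf("parity bit = %u\n", parity);

	// A single-bit flip changes the parity, so the error is detected
	// (but a lone parity bit cannot locate or correct it).
	uint8_t corrupted = word ^ 0x04;
	printf("detected error: %s\n",
	       even_parity(corrupted) != parity ? "yes" : "no");
	return 0;
}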

Hamming ECC

Hamming Distance = # of bit positions where words differ

Of course, you will see that 3b1b has a much better explanation than I do...

How to send a self-correcting message (Hamming codes): https://youtu.be/X8jsijhllIA

Hamming codes part 2, the elegance of it all: https://youtu.be/b3NxrZOu_CE

RAID ⇒ Redundant Arrays of Inexpensive Disks

But reliability...

So we introduce RAID!

  1. Files are "striped" across multiple disks
  1. Redundancy yields high data availability
  1. Disks will still fail
    1. But we can reconstruct contents from data redundantly stored in the array
  1. 6 Raid Levels, 0, 1, 5, 6 most common today

Raid 0: Striping

Raid 1: Disk Mirroring / Shadowing (online sparing)

RAID 2 Illustration

Raid 2: Hamming Code for Error Correction

Raid 3: Single Parity Disk (bitwise)

  • Disk drives themselves code data and detect failures
  • Reconstruction of data can be done with single parity disk if we know which disk failed
  • Writes change data disk and P disk

RAID 4: High I/O Rate Parity (blockwise)

  • Interleave data at the sector level (data block) rather than the bit level
    • Permits more parallelism
    • Independent small reads, parallel large reads
  • Reconstruction of data can be done with single parity disk
  • Reading without a fault involves only the data disks
  • Writing involves data + parity disk

RAID 5: High I/O Rate Interleaved Parity

  • Independent writes possible because of interleaved parity
  • No longer the gold standard
    • can experience 1 disk failure and continue operation
    • But disk failures are not independent!

RAID 6: Add another parity block per stripe

Warehouse Computing

Measurements

Power Usage Effectiveness

Energy Efficiency is the primary concern in the design of WSC(Warehouse-Scale Computing)


Power Usage Effectiveness(PUE)

$PUE = {Power_{total\_building} \over Power_{IT\_equipment}}$
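For example (my own numbers): a facility that draws 1.5 MW in total (IT equipment plus cooling, power distribution, lighting, ...) to run 1 MW of IT equipment has $PUE = 1.5$; an ideal building would approach 1.0.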

Cloud Services

MapReduce

Specify the computation in terms of two functions: map and reduce

Fine granularity tasks 细粒度任务: Many more map tasks than machines

MapReduce Process: