Pipelining

Introduction

Stages

Fetch
Decode
Execute
Memory
Writeback

Pipeline adds registers between stages to transfer data

Pipeline characteristics

Pros:
- More efficient hardware use b/c different parts of CPU can be used concurrently
- Large batch of instructions completes more quickly
- Increase throughput
Cons:
- Increase latency on individual instructions (due to pipeline registers)
- Clock has to wait for slowest stage
- Hazards
Splitting into more stages usually introduces shorter clock cycles, thus higher throughput and more efficient in large batches of instructions
Pipelining is most effective when each stages takes similar time
Can:
- Make multiple instructions in-flight at the same time
Cannot:
- Make processor fetch multiple instructions at the same time
- be the only way to achieve hardware parallelism

Latency

Instruction latency
- Sequential: simply add all stage execution times and register delay
- Pipelined: (longest stage + register delay) * number of stages
Retirement latency
- Sequential: same as single instruction
- Pipelined: slowest stage + register delay

Throughput

GIPS (giga instructions per second) = 1000 / retirement latency (in picoseconds)

Pipeline registers

Holds signals for each stage to use

Latching: captures the signals flowing in the pipeline registers and hold them till next clock cycle

Instruction memory and data memory

Each can have error checking (for bad address, invalid instructions). Will make more sense in caching

Fetch

calcPC: calculated PC

Decode

stat: if instruction is valid or not
icode ifun rA rB valC valP

Execute

stat icode ifun valC
New:
valA valB: values from register
dstE dstM: for register E and M

Memory

stat(may change value after memory stage) icode valA dstE dstM
New: valE Cnd(condition codes)

Writeback

stat icode valE dstE dstM
New: valM(value from memory)

Signal naming conventions

S_signal for signal from S stage
- e.g. D_stat: right out of decode
s_signal for signal at the end of s stage
- e.g. m_stat: after the logic that checks memory address in memory phase

Pipeline stalling

addq %rax, %rbx
subq %rbx, %rcx

In this example, the CPU needs to stall 3 cycles (without forwarding)

Counting stalls

Assume the first instruction enters at time T=1 (fetch)
When does the second instruction need the data? say at T=3
When will the first instruction write the data? say at T=5
Number of stalls = \(5+1-3=3\)
- (+1 because the writing stage needs to finish)

Performance

CPI (cycles per instruction): how efficiently a program is running on a hardware

For y86, the CPI is 1 if not stalling; modern hardware has CPI < 1

Remember when counting cycles: the last instruction still needs to go through the pipeline, so we need to add 4 cycles.

Data hazards

Data forwarding

In the addq subq example, the addq instruction has the value valE ready in execute stage, and the subq instruction needs the value in decode stage, so we can wire e_valE into a forwarding logic block, which will pass them to pipeline registers for execute.

Things that don't work: 1. forwarding e_valE into the register file: the value will only be available for reading in the next cycle and cause stalling (cannot read and write to register at the same time) 2. forwarding M_valE into the forwarding logic: M_valE is only available in the next cycle 3. forwarding e_valE into the beginning of the execute stage: forwarding to the same instruction, will create loop

Control hazards

jmp: just forward f_valC into next PC logic and latch into F_calcPC, no hazards
call: same, just use valC, no hazards
ret: requires address from W_valM, hazard, 3 stalls
jxx (conditional jump): need to choose between valC and PC increment, 2 stalls without prediction

Branch predictions

Backward jumps: loop, always taken
Forward jumps: conditionals, never taken

Linux kernel can let you decide which branch is more likely to jump

How to implement?

Forward valC and PC increment into new logic predictor
predictor sends value into F_predPC (used to be F_calcPC)

What happens if mis-predicted?

jxx gets to execute phase, realizes prediction was wrong
Bad news: two wrong instructions in pipeline
Good news: the bad instructions haven't reached memory or writeback stages, so nothing has been changed
Solution: squash the instructions by replacing with nop
Penalty, but no worse than stalling

CPI

CPI when squashing: number of cycles in total (including squashed ones b/c already entered pipeline) / number of instructions after squashing
Good branch prediction can reduce CPI

Other kinds of parallelism

Multiprocessing
Parallel runtimes
Multithreading
Hyper threading: while waiting for memory, work on another thread
- two copies of registers
Distributed systems
- Data centers

Flynn's taxonomy

xIxD

Instruction stream
Data stream
SIMD: GPUs
MIMD: multiprocessing