Pipelining
Introduction
Stages
- Fetch
- Decode
- Execute
- Memory
- Writeback
Pipeline adds registers between stages to transfer data
Pipeline characteristics
- Pros:
- More efficient hardware use b/c different parts of CPU can be used concurrently
- Large batch of instructions completes more quickly
- Increase throughput
- Cons:
- Increase latency on individual instructions (due to pipeline registers)
- Clock has to wait for slowest stage
- Hazards
- Splitting into more stages usually yields a shorter clock cycle, and thus higher throughput and better efficiency on large batches of instructions
- Pipelining is most effective when each stage takes a similar amount of time
- Can:
- Keep multiple instructions in-flight at the same time
- Cannot:
- Make the processor fetch multiple instructions at the same time
- Be the only way to achieve hardware parallelism
Latency
- Instruction latency
- Sequential: simply add all stage execution times and register delay
- Pipelined: (longest stage + register delay) * number of stages
- Retirement latency
- Sequential: same as single instruction
- Pipelined: slowest stage + register delay
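A quick sketch of these formulas with made-up stage delays (all numbers here are assumptions, not from the notes):

```python
# Assumed stage delays in picoseconds for a 5-stage pipeline.
stage_ps = {"fetch": 120, "decode": 90, "execute": 130, "memory": 150, "writeback": 70}
reg_ps = 20  # pipeline-register delay

# Sequential: add all stage execution times plus the register delay.
seq_latency = sum(stage_ps.values()) + reg_ps   # 580 ps

# Pipelined: the clock has to wait for the slowest stage.
clock = max(stage_ps.values()) + reg_ps         # 170 ps
pipe_latency = clock * len(stage_ps)            # 850 ps per instruction (worse!)

# Retirement latency: one instruction finishes per clock cycle.
retire_latency = clock                          # 170 ps (much better)
```

Note how the pipelined latency of a single instruction goes up, while the retirement latency (one instruction per cycle) goes way down.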
Throughput
GIPS (giga instructions per second) = 1000 / retirement latency (in picoseconds)
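For example, assuming a retirement latency of 170 ps (a made-up number):

```python
retire_latency_ps = 170          # assumed retirement latency in picoseconds
gips = 1000 / retire_latency_ps  # giga-instructions per second
```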
Pipeline registers
Holds signals for each stage to use
Latching: captures the signals flowing into the pipeline registers and holds them until the next clock cycle
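The latching behavior can be modeled roughly like this (a toy sketch, not a real hardware description; the class and method names are made up):

```python
class PipeReg:
    """Toy model of a pipeline register: signals driven during a cycle
    are captured on the clock edge and held for the next stage."""
    def __init__(self):
        self.cur = {}   # latched values, visible to the downstream stage
        self.nxt = {}   # values being driven by the upstream stage

    def drive(self, **signals):
        self.nxt.update(signals)          # combinational inputs during the cycle

    def clock(self):
        self.cur, self.nxt = self.nxt, {}  # rising edge: latch and hold
```

Until `clock()` fires, the downstream stage keeps seeing the old values — which is exactly what lets two stages work on different instructions at once.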
Instruction memory and data memory
Each can have error checking (for bad addresses, invalid instructions); this will make more sense when we cover caching
Fetch
calcPC: calculated PC
Decode
stat: whether the instruction is valid
icode, ifun, rA, rB, valC, valP
Execute
stat, icode, ifun, valC
New:
valA, valB: values read from the register file
dstE, dstM: destination register ids for valE and valM
Memory
stat (may change after the memory stage), icode, valA, dstE, dstM
New: valE, Cnd (branch condition, computed from the condition codes)
Writeback
stat, icode, valE, dstE, dstM
New: valM (value read from memory)
Signal naming conventions
- S_signal: signal from the S stage's pipeline register
- e.g. D_stat: right out of decode
- s_signal: signal at the end of stage s
- e.g. m_stat: after the logic that checks the memory address in the memory stage
Pipeline stalling
addq %rax, %rbx
subq %rbx, %rcx
In this example, the CPU needs to stall 3 cycles (without forwarding)
Counting stalls
- Assume the first instruction enters at time T=1 (fetch)
- When does the second instruction need the data? say at T=3
- When will the first instruction write the data? say at T=5
- Number of stalls = \(5+1-3=3\)
- (+1 because the writing stage needs to finish)
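The counting recipe above as a tiny helper (a hypothetical function; cycle numbers as in the example):

```python
def count_stalls(write_cycle, need_cycle):
    # +1 because the writing stage must finish before the value can be read
    return max(0, write_cycle + 1 - need_cycle)

# addq writes the data at T=5 (writeback); subq needs it at T=3 (decode)
stalls = count_stalls(5, 3)   # 3, matching the example above
```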
Performance
CPI (cycles per instruction): how efficiently a program runs on given hardware
For y86, the CPI is 1 if not stalling; modern hardware has CPI < 1
Remember when counting cycles: the last instruction still needs to go through the pipeline, so add 4 extra cycles (for a 5-stage pipeline).
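A sketch of that cycle counting for a 5-stage pipeline (the instruction count is made up):

```python
def total_cycles(n_instructions, stalls=0, stages=5):
    # one cycle per instruction, plus (stages - 1) extra cycles for the
    # last instruction to drain through the pipeline, plus any stalls
    return n_instructions + (stages - 1) + stalls

cycles = total_cycles(100)   # 104
cpi = cycles / 100           # 1.04 -> approaches 1 for large batches
```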
Data hazards
Data forwarding
In the addq/subq example, addq has the value valE ready at the end of the execute stage, and subq needs that value in its decode stage. So we can wire e_valE into a forwarding logic block, which passes it to the pipeline registers feeding execute.
Things that don't work:
1. Forwarding e_valE into the register file: the value would only be readable in the next cycle, still causing a stall (cannot read and write the register file at the same time)
2. Forwarding M_valE into the forwarding logic: M_valE is only available in the next cycle
3. Forwarding e_valE into the beginning of the execute stage: that forwards to the same instruction and creates a loop
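A simplified sketch of the forwarding mux for one source operand, using hypothetical function and parameter names modeled on the signals above (the real PIPE design also forwards memory-read values like m_valM and W_valM, omitted here):

```python
RNONE = 0xF  # "no register" id

def select_operand(src, e_dstE, e_valE, M_dstE, M_valE, W_dstE, W_valE, regs):
    """Pick the newest in-flight value for source register `src`:
    execute beats memory beats writeback beats the register file."""
    if src != RNONE:
        if src == e_dstE:
            return e_valE   # value just computed at the end of execute
        if src == M_dstE:
            return M_valE   # value sitting in the memory pipeline register
        if src == W_dstE:
            return W_valE   # value about to be written back
    return regs.get(src, 0)  # fall back to the register file
```

The priority order matters: the instruction closest behind us in the pipeline holds the most recent write to the register.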
Control hazards
- jmp: just forward f_valC into the next-PC logic and latch it into F_calcPC; no hazards
- call: same, just use valC; no hazards
- ret: requires the return address from W_valM; hazard, 3 stalls
- jXX (conditional jump): need to choose between valC and the PC increment; 2 stalls without prediction
Branch predictions
- Backward jumps: usually loops, predict always taken
- Forward jumps: usually conditionals, predict never taken
The Linux kernel lets developers mark which branch is more likely to be taken (the likely()/unlikely() macros, which hint the compiler)
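The static heuristic above can be sketched as a toy predictor (hypothetical names; addresses are made up):

```python
def predict_taken(branch_pc, target_pc):
    # backward branch (target before the branch) -> likely a loop: predict taken
    # forward branch -> likely a conditional skip: predict not taken
    return target_pc < branch_pc

assert predict_taken(0x100, 0x0F0)       # loop back edge: taken
assert not predict_taken(0x100, 0x140)   # forward skip: not taken
```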
How to implement?
- Forward valC and the PC increment into a new logic block, the predictor
- The predictor sends its chosen value into F_predPC (used to be F_calcPC)
What happens if mis-predicted?
- jXX gets to the execute stage and realizes the prediction was wrong
- Bad news: two wrong instructions are in the pipeline
- Good news: the bad instructions haven't reached memory or writeback stages, so nothing has been changed
- Solution: squash the bad instructions by replacing them with nop
- Penalty, but no worse than stalling
CPI
- CPI when squashing: total number of cycles (including squashed ones, since they already entered the pipeline) / number of instructions actually retired (after squashing)
- Good branch prediction can reduce CPI
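A worked example of that CPI accounting (all counts are made up):

```python
retired = 1000        # instructions that actually completed
mispredicts = 50      # mispredicted branches
squash_penalty = 2    # squashed instructions (wasted cycles) per misprediction

# squashed instructions consumed cycles but don't count as useful work;
# +4 to drain the last instruction through a 5-stage pipeline
cycles = retired + mispredicts * squash_penalty + 4   # 1104
cpi = cycles / retired                                # 1.104
```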
Other kinds of parallelism
- Multiprocessing
- Parallel runtimes
- Multithreading
- Hyper-threading: while one thread waits on memory, work on another thread
- two copies of registers
- Distributed systems
- Data centers
Flynn's taxonomy
xIxD, where each x is S (single) or M (multiple)
- I: instruction stream
- D: data stream
- SIMD: GPUs
- MIMD: multiprocessing