[Musing] Essence of machine learning hardware
12 Mar 2023
Here are a few design considerations for a machine learning ASIC. The design traits below aim at efficiency/effectiveness per watt. (Tacit Ideas)
Design Parameter: ML Model
- Feed-forward ANNs are faster, since the absence of data feedback simplifies the design
- LSTMs require accumulation of partial results, which increases complexity
- The type of neural network determines which hardware simplifications are possible; use domain knowledge.
Design Parameter: Hardware Use
Common Sense Idea: Inference and training hardware are different (most of the time)
- Hardware for training takes a large input dataset plus a cost function, an optimization function, and the model
- A large dataset implies more memory for features (bag of features)
- The cost and optimization functions feed back to improve the model by identifying error; this implies “latency of feedback”
- Support for debugging the model (training errors, generalization errors) implies “profiling interfaces”
- Should have support for training workflows
- Hardware for inference
- Its input is a dataset (not as large as the training dataset) plus the model.
- The model can be compressed based on the domain of use.
Design Parameter: Input Dataset
Common Sense Idea: Use known information to minimize the dataset
- Use of Encoding.
- Use of precise data storage. A smaller register size implies better energy/storage efficiency. E.g., an ML model with human age as an input parameter: naively a 32-bit int, but human age < 100 implies a 7-bit size. After understanding the use case, the model only needs age ranges of width 10, which implies 4-bit data (see the sketch after this list).
- Use of data quantization, e.g. bfloat16 or MSFP (great common sense idea)
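A minimal NumPy sketch of the two storage ideas above, using the hypothetical age feature from the example; the decade-wide buckets and the dtypes are my own illustration, not a fixed design:

```python
import numpy as np

# Hypothetical age feature stored at progressively smaller widths.
ages = np.array([3, 27, 42, 64, 99], dtype=np.int32)  # naive 32-bit storage

# Ages < 128 carry only 7 bits of information (stored here in the next-smallest dtype).
ages_7bit = ages.astype(np.uint8)

# If the model only needs decade-wide ranges (0-9, 10-19, ...), 10 buckets fit in 4 bits.
age_buckets = (ages // 10).astype(np.uint8)

print(ages_7bit)    # [ 3 27 42 64 99]
print(age_buckets)  # [0 2 4 6 9]
```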
Design Parameter: Signal Data Processing
- For vision and audio signals as input, the FFT has saved computation significantly! (Wow!)
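A rough sketch of why this works, with a made-up 1-D signal and a long FIR filter: direct convolution of an N-sample signal with a K-tap kernel costs O(N·K) multiply-adds, while FFT-based convolution costs O(N log N), which wins when K is large.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)  # e.g. an audio frame
kernel = rng.standard_normal(257)   # e.g. a long FIR filter

direct = np.convolve(signal, kernel)  # O(N*K) multiply-adds

# FFT-based linear convolution: O(N log N)
n = len(signal) + len(kernel) - 1
fft_based = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

assert np.allclose(direct, fft_based)
```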
Design Parameter: ALU (PE) Computation
- List all unit operations and reduce them to the bare minimum required; this implies minimum area
- For each operation, minimize register size -> argument datatype
- For each operation, minimize microcode to the basics, e.g. MAC design (multiply and accumulate): a MAC over two N-bit registers produces a 2N-bit product that is accumulated and then reduced back to N bits (see the sketch after this list). Smaller register size and less microcode imply better energy/storage efficiency
- Unit Operations in ALU
- The granularity of unit operations implemented in hardware should maximize parallelization, paying in area now to save time later.
- Examples of Unit Operation
- Convolution (aka dot product), matrix multiply. Learnings from CPU-based MatMul lead to systolic execution of MatMul (MAC arrays) over general MatMul (GEMM).
- Pooling (aka normalization), thresholding. Learnings from comparator design in HDL: build a minimized comparator logic circuit. (Spatial reduction case)
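A minimal sketch of the MAC idea above (my own illustration, not a particular chip's datapath), assuming int8 operands, int16 products, an int32 accumulator, and a simple shift-and-clamp requantization back to 8 bits:

```python
import numpy as np

def mac_dot(a, b):
    """Dot product of two int8 vectors: 8b x 8b -> 16b products, accumulated in 32b."""
    acc = np.int32(0)
    for x, y in zip(a.astype(np.int16), b.astype(np.int16)):
        acc += np.int32(x * y)
    return acc

def requantize(acc, shift):
    """Reduce the wide accumulator back to 8 bits with a right-shift and clamp."""
    scaled = int(acc) >> shift
    return np.int8(np.clip(scaled, -128, 127))

a = np.array([12, -7, 33, 90], dtype=np.int8)
w = np.array([5, 20, -3, 7], dtype=np.int8)
acc = mac_dot(a, w)
print(acc, requantize(acc, shift=3))  # 451 56
```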
Design Parameter: MOV data
- More cycles to get data implies more stalls in the ALU, which implies higher energy
- More frequent access to distant levels of the memory hierarchy explosively increases energy consumption
- Always pipeline data; move data in spatial and temporal ways
- A machine learning programming language needs tensor loops for maximizing reuse in time and space (see the tiling sketch after this list)
- Decoding Cost of MOV
- Reg < Cache < Buffer < DRAM
- A larger register size is better
- Double Buffering is good
- RAM technology: SRAM, DRAM, HBM
- Buffer specialization: read-only and write-only buffers perform better than R/W buffers
- Quantifiable Design Concept
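A sketch of the tensor-loop / tiling idea, assuming a plain matmul and treating NumPy slices as a stand-in for a fast local buffer; the tile size of 32 is arbitrary:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Tile the loops so a small block of A and B is fetched once and reused many times,
    instead of re-reading distant memory for every multiply-add."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # "MOV": bring one tile of A and B into the local buffer (here: slices)
                a_tile = A[i0:i0 + tile, k0:k0 + tile]
                b_tile = B[k0:k0 + tile, j0:j0 + tile]
                # Reuse the tiles many times before fetching the next ones
                C[i0:i0 + tile, j0:j0 + tile] += a_tile @ b_tile
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

Double buffering then overlaps the fetch of the next tile with the compute on the current one.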
Co-designing PE-MEM
- For a typical operation: 2x DDR MOV reads into registers, 1 MAC, 1x DDR MOV write from a register
- Considering the NN type:
- Calculate the order of execution and the data memory access patterns
- Idea check: why not build a simulator that identifies patterns of data movement!
- For the accumulation PE pattern: adder-tree implementation, and systolic accumulation for matrices
- For the non-accumulation pattern: direct-wired multicast, and systolic multicast for matrices
- Keep stationary data in the closest register, without movement if possible
- Idea check: register-allocation feedback based on a spill factor and a movement factor
- Move the input into a register; keep the weights stationary, or keep the partial sums and output stationary (see the dataflow sketch after this list)
- Scale MACs and local memory up
- Re-calculate the order of execution and the data memory access patterns
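An illustrative sketch (not any specific accelerator's dataflow) of the two stationarity choices above for Y = W @ X with a batch of inputs; the loop nests only model which value stays pinned in the nearest register:

```python
import numpy as np

W = np.random.rand(3, 4).astype(np.float32)
X = np.random.rand(4, 8).astype(np.float32)  # batch of 8 input columns

# Weight-stationary: each weight is fetched once and reused across the whole batch.
Y_ws = np.zeros((3, 8), dtype=np.float32)
for i in range(3):
    for j in range(4):
        w = W[i, j]                    # stationary weight in the PE register
        for b in range(8):
            Y_ws[i, b] += w * X[j, b]  # inputs stream in, partial sums stream out

# Output-stationary: each partial sum stays in the closest register until complete.
Y_os = np.zeros((3, 8), dtype=np.float32)
for i in range(3):
    for b in range(8):
        acc = np.float32(0.0)          # stationary partial sum
        for j in range(4):
            acc += W[i, j] * X[j, b]   # weights and inputs stream past
        Y_os[i, b] = acc

assert np.allclose(Y_ws, W @ X) and np.allclose(Y_os, W @ X)
```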
Design Parameter: ISA Decoder
- CPU/GPU decoders are overly complex for ML ops; this adds inefficiency to pipelining operations
- Circuit: short decoder sequences are better; keep them small and efficient, covering only the unit ops
- SIMD decoder vs. VLIW decoder circuit
Design Parameter: Compiler Support
- Input (tensor) slicing at the programming-language and backend levels
- Re-organize loop order, use polyhedral transforms for better memory movement, and split loops into more loops (see the sketch after this list)
- Optimize for Area x Energy
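A sketch of the loop re-organization idea; the interchange and fission below are generic examples of what a loop-transforming backend might emit for a row-major array, not a specific compiler's output:

```python
import numpy as np

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
out = np.zeros_like(A)

# Before: the inner loop strides down columns of a row-major array -> buffer-unfriendly.
for j in range(A.shape[1]):
    for i in range(A.shape[0]):
        out[i, j] = 2.0 * A[i, j]

# After loop interchange: the inner loop walks memory contiguously.
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        out[i, j] = 2.0 * A[i, j]

# "Split loops into more loops" (fission): one loop touching two arrays becomes two loops,
# each with a single, simpler data-movement pattern.
s1 = np.zeros(256)
s2 = np.zeros(256)
for i in range(256):       # fused: interleaves accesses to A and B
    s1[i] = A[i].sum()
    s2[i] = B[i].sum()
for i in range(256):       # fissioned: stream A only ...
    s1[i] = A[i].sum()
for i in range(256):       # ... then stream B only
    s2[i] = B[i].sum()
```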
Design Parameter: Choice of Hardware Transistor Technology
- High frequency and high throughput result in heating
- Low frequency, high throughput, and optimal area serve better
- Dielectric silicon technology (process node in nm)
- Photonic Technology
- Calculate the compute density and the compute-to-data-movement ratio to classify the workload as IO-dominant or compute-dominant (see the sketch below)
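A back-of-envelope sketch of that classification in roofline style; the peak-FLOP/s and bandwidth figures are placeholders, not measurements:

```python
def classify(flops, bytes_moved, peak_flops, mem_bw_bytes_per_s):
    """Compare the compute-to-data-movement ratio (FLOPs per byte) against machine balance."""
    intensity = flops / bytes_moved
    machine_balance = peak_flops / mem_bw_bytes_per_s
    return "compute-dominant" if intensity > machine_balance else "IO-dominant"

# Example: a 1024^3 matmul in fp16 vs. an elementwise add of a 1024x1024 tensor.
M = N = K = 1024
matmul_flops = 2 * M * N * K
matmul_bytes = 2 * (M * K + K * N + M * N)  # fp16 = 2 bytes/element, each array moved once
add_flops = M * N
add_bytes = 2 * 3 * M * N                   # read two operands, write one result

print(classify(matmul_flops, matmul_bytes, peak_flops=100e12, mem_bw_bytes_per_s=1e12))  # compute-dominant
print(classify(add_flops, add_bytes, peak_flops=100e12, mem_bw_bytes_per_s=1e12))        # IO-dominant
```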