[Notes] Machine Learning Hardware Customizations
01 May 2023
Continuation of Essence of Machine Learning Hardware, …
What is customized for each ML App on Hardware?
- Memory hierarchy (only interested in memory near the ALU)
- I/O communication (not interesting enough)
- Data types
- ALUs
Components of an ML program?
- Loop over all dimensions and MAC (multiply-accumulate) the input and weight to produce the output
- This computes the output one row at a time
- We could dispatch a complete row to the compute unit, with a second for loop stepping over N/group_size groups, aka “Scheduling”. Scheduling can include loop splitting, loop fusion, loop reordering, loop unrolling, and pipelining
- Data layout can be partitioned and packed, or restructured with an affine mapping
- The kernel can use a low-bit representation: “quantization”
- During row-by-row dispatch, buffers can be reused in the next iteration to hide latency (PIPO, i.e. ping-pong buffers): “Data Placement”
- ISSUE: if we write it the Xilinx HLS way, every customization requires a rewrite, and the programmer has to manually redo the loop order, quantization, and buffer placement.
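To make the list above concrete, here is a minimal pure-Python sketch of the MAC loop nest, the same nest with the row loop split by `group_size`, and a symmetric low-bit quantization step. The function names (`mac_naive`, `mac_split`, `quantize`) and the example scales are assumptions made for illustration; they do not come from any HLS tool or library.

```python
def mac_naive(x, w):
    """Reference loop nest: out[i][j] += x[i][k] * w[k][j], one row at a time."""
    n, k_dim, m = len(x), len(w), len(w[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(k_dim):
                out[i][j] += x[i][k] * w[k][j]
    return out


def mac_split(x, w, group_size):
    """Same computation, with the row loop split into groups of group_size rows."""
    n, k_dim, m = len(x), len(w), len(w[0])
    out = [[0] * m for _ in range(n)]
    for i_outer in range(0, n, group_size):  # one dispatch per group of rows
        for i in range(i_outer, min(i_outer + group_size, n)):
            for j in range(m):
                for k in range(k_dim):
                    out[i][j] += x[i][k] * w[k][j]
    return out


def quantize(values, scale):
    """Symmetric int8 quantization: divide by scale, round, clamp to [-128, 127]."""
    return [[max(-128, min(127, round(v / scale))) for v in row] for row in values]


# Example data chosen so that every value is exactly representable after scaling.
x_f = [[0.5, 1.0], [1.5, -2.0], [2.5, 0.0]]
w_f = [[0.25, 0.5], [-0.25, 0.75]]
sx, sw = 0.5, 0.25  # hypothetical per-tensor scales

x_q, w_q = quantize(x_f, sx), quantize(w_f, sw)
out_q = mac_split(x_q, w_q, group_size=2)              # integer MACs only
out_deq = [[v * sx * sw for v in row] for row in out_q]  # rescale back to float
```

Changing the schedule (`mac_split`) or the data type (`quantize`) does not change the result here, but each variant still had to be written by hand, which is the manual effort the ISSUE bullet points at.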
How to build a system that supports customization by machine?
TVM approach
- The MAC is specified as an AST plus a shape.
- TVM declarations accept type customization in the Python AST
- TVM uses the AST as the compute definition for the PE and searches for a schedule, instead of applying polyhedral transforms to affine ops
- TVM’s machine-learning-guided search (genetic algorithms) optimizes the MAC operation by evaluating different loop schedules until it finds the first reasonable solution.
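The search idea in the last bullet can be illustrated with a toy sketch. This is not TVM's tuner or API: the cost model (`estimated_cost`), its parameters (`buffer_rows`, `dispatch_overhead`), and the search loop (`search_schedule`) are all hypothetical stand-ins. It keeps the compute definition fixed, treats the row-group size as the only schedule knob, and runs a small genetic-style search that stops at the first candidate that is good enough.

```python
import random


def estimated_cost(n_rows, group_size, buffer_rows=16, dispatch_overhead=10):
    """Hypothetical cost model: fewer dispatches are cheaper, but a group
    larger than the on-chip buffer pays a spill penalty."""
    dispatches = -(-n_rows // group_size)  # ceiling division
    spill = max(0, group_size - buffer_rows) * n_rows
    return dispatches * dispatch_overhead + n_rows + spill


def search_schedule(n_rows, good_enough, generations=20, population=4, seed=0):
    """Genetic-style search over group_size; returns the first candidate whose
    estimated cost is <= good_enough, else the best one seen."""
    rng = random.Random(seed)
    pop = [rng.randint(1, n_rows) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda g: estimated_cost(n_rows, g))
        if estimated_cost(n_rows, pop[0]) <= good_enough:
            return pop[0]  # first reasonable solution, stop searching
        survivors = pop[: population // 2]  # keep the better half
        children = [
            max(1, min(n_rows, s + rng.choice([-2, -1, 1, 2])))  # mutate
            for s in survivors
        ]
        pop = survivors + children
    return min(pop, key=lambda g: estimated_cost(n_rows, g))


best_group = search_schedule(n_rows=64, good_enough=120)
```

The point of the sketch is the separation: the MAC definition never changes, and only the schedule parameter is explored by the machine. A real tuner would measure or predict cost on the target hardware rather than use a hand-written formula.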
References
- Software Defined Reconfigurable Computing