[Notes] Relay IR from Apache TVM
02 Apr 2023
- Designed with ASICs/FPGAs and dynamic models in mind
- Pure functional, statically typed IR to support efficient compilation, portability, and expressiveness
User models are computation graphs representing differentiable computation
- TensorFlow
- Static graphs: easy to optimize, but not as computationally expressive as a high-level language; the dataflow graph has a fixed topology at compile time
- TF has restricted control flow and cannot adapt the compute graph's shape to runtime data (define-then-run model)
- PyTorch
- Builds graphs with dynamic topologies using runtime data and adds imperative computation. Works well on GPUs; deployment with Python is hard on accelerators or FPGAs (define-by-run model)
- In dynamic frameworks, control flow is lost at the software boundary -> fewer optimization chances
- Must re-optimize any time the graph topology changes, costing CPU cycles and the overhead of moving data between the host and accelerators; PyTorch addressed this by adding an IR abstraction (solution: Python -> IR, same as a static graph)
- Expressiveness for the user limits the ability of existing frameworks to optimize user-defined graphs
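The define-then-run vs define-by-run contrast above can be pictured with a toy sketch. This is a hypothetical illustration, not the TensorFlow or PyTorch API: the first half builds a graph whose topology is fixed before any data arrives; the second executes eagerly, so data-dependent control flow stays in Python and is invisible to a graph compiler.

```python
# Hypothetical toy illustration (not TF/PyTorch APIs): the same
# computation under both execution models.

# Define-then-run: build a graph of nodes first, execute later.
class Node:
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs

def run(node, env):
    if node.op == "input":
        return env[node.inputs[0]]
    args = [run(i, env) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul2":
        return args[0] * 2

# Graph topology is fixed before any data is seen -> easy to optimize.
graph = Node("mul2", [Node("add", [Node("input", ["a"]), Node("input", ["b"])])])
print(run(graph, {"a": 3, "b": 4}))  # 14

# Define-by-run: ops execute immediately; control flow lives in Python,
# so the "graph" can change per input, but the compiler never sees it.
def eager(a, b):
    y = a + b          # executes now
    if y > 5:          # data-dependent branch, invisible to a graph compiler
        y = y * 2
    return y

print(eager(3, 4))  # 14
```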
Relay IR
- Pure functional programming language, restricted to differentiable computation
- Functions as first-class objects
- Rich type system with shape-dependent types
- Automatic differentiation operator accepts a lambda function/closure as input
- Adds operators from dense linear algebra, e.g. GEMM and convolution
- Uses algorithmic scheduling rather than the polyhedral approach in Glow or LLVM
- Evaluator for debugging
- Type-specialized operators built by the compiler
Uses “let bindings” from programming languages
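The let-binding idea can be sketched with a hypothetical mini-IR (these are not Relay's actual classes): a `Let` node gives every intermediate value a name, making sharing and evaluation order explicit, so a shared subexpression is computed once.

```python
from dataclasses import dataclass

# Hypothetical mini-IR (not Relay's real node types): let-bindings name
# every intermediate result, making sharing and evaluation order explicit.
@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: float

@dataclass
class Add:
    lhs: object
    rhs: object

@dataclass
class Let:
    var: Var
    value: object
    body: object

def evaluate(expr, env=None):
    env = env or {}
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Add):
        return evaluate(expr.lhs, env) + evaluate(expr.rhs, env)
    if isinstance(expr, Let):
        # Bind the value once, then evaluate the body: the shared
        # subexpression is computed a single time.
        return evaluate(expr.body, {**env, expr.var.name: evaluate(expr.value, env)})

# let t = 1 + 2 in t + t   ->  t is computed once, used twice
prog = Let(Var("t"), Add(Const(1), Const(2)), Add(Var("t"), Var("t")))
print(evaluate(prog))  # 6
```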
- TVM framework [functional IR, type system, synthesis, codegen rewrites, compiler classics]
- Tensorization: the optimizer decomposes and matches programs to the underlying hardware tensor operations (compiler analogue: vectorization)
- Operator fusion (compiler analogue: peephole optimization)
- Layout change: invertible affine transformation (compiler analogue: loop optimizations)
- Memory reuse at graph and operator level (compiler analogue: minimizing buffer/register spills)
- Latency hiding (compiler analogue: interface optimizations)
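Operator fusion, the peephole-style rewrite above, can be illustrated with a hypothetical sketch (not TVM's generated code): an `add` followed by a ReLU is merged into a single loop, removing the intermediate buffer between the two ops.

```python
# Hypothetical sketch of operator fusion: add followed by ReLU.
def add_op(xs, ys):
    return [x + y for x, y in zip(xs, ys)]   # writes an intermediate buffer

def relu_op(xs):
    return [max(x, 0.0) for x in xs]         # reads it back

def unfused(xs, ys):
    return relu_op(add_op(xs, ys))           # two loops, one temp buffer

def fused(xs, ys):
    # One loop, no intermediate buffer: the graph node pair add -> relu
    # is rewritten into a single fused kernel.
    return [max(x + y, 0.0) for x, y in zip(xs, ys)]

a, b = [1.0, -3.0, 2.0], [0.5, 1.0, -5.0]
print(unfused(a, b))  # [1.5, 0.0, 0.0]
print(fused(a, b))    # same result, fewer memory round-trips
```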
- Whole-program representation of deep learning programs enables, at the user level:
- host slicing
- Dynamic Networks
- Change of Layouts
- Latency Hiding
- Parallel and Distributed Scheduling
Frontends
- Text AST written in Python or C++
- Python frontend -> adds the “relay” and “relay_model” decorator annotations
- JSON serialization
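Since the IR is a tree, JSON serialization of an AST amounts to dumping nested objects. A hypothetical round trip (this schema is made up, not Relay's actual serialization format):

```python
import json

# Hypothetical AST-to-JSON round trip (not Relay's actual schema):
# the IR is a tree, so it maps naturally onto nested JSON objects.
ast = {
    "op": "add",
    "args": [
        {"op": "var", "name": "x", "shape": [1, 10]},
        {"op": "const", "value": 1.0},
    ],
}

text = json.dumps(ast)        # serialize
restored = json.loads(text)   # deserialize
print(restored == ast)        # True
```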
Automatic Differentiation
For a function, carry both values (x, x′) — the value and its derivative — through all instructions in the function, computed in a forward pass.
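The (x, x′) pairing above is forward-mode AD, which can be sketched with dual numbers (a toy illustration, not Relay's AD implementation): every value carries its derivative alongside it, and each instruction updates both.

```python
from dataclasses import dataclass

# Forward-mode AD sketch: every value carries (primal, tangent) = (x, x')
# and each instruction updates both, so the derivative is computed
# alongside the forward pass.
@dataclass
class Dual:
    val: float   # x
    dot: float   # x', derivative w.r.t. the chosen input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x):
    return x * x + x        # f(x) = x^2 + x, so f'(x) = 2x + 1

x = Dual(3.0, 1.0)          # seed x' = 1 to differentiate w.r.t. x
y = f(x)
print(y.val, y.dot)         # 12.0 7.0
```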
Scheduling
- Scheduling primitives: loop transformations, inlining, vectorization, …
- All possible transformations create a search space, explored with an ML genetic algorithm:
- Start with an empty policy
- Calculate its fitness score
- Generate a mutation of an existing known policy and calculate its fitness score
- Update the population according to fitness scores; repeat
- FLOPS serves as a proxy measure for time cost
- Search policy
- Empty policy: naive input
- Sketch policy: genetic evolution with a cost model and rules
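The evolutionary search loop above can be sketched as follows. This is a hypothetical toy, not TVM's auto-scheduler: the fitness function is a made-up stand-in for measured throughput, and a "policy" is just a list of knobs such as tile sizes.

```python
import random

# Hypothetical sketch of the evolutionary schedule search in the notes.
# A "policy" is a list of knobs (e.g. tile sizes); fitness stands in for
# measured throughput (higher is better) -- not TVM's real cost model.
random.seed(0)

def fitness(policy):
    # Toy objective: best when every knob equals 8 (an "ideal" tile size).
    return -sum((k - 8) ** 2 for k in policy)

def mutate(policy):
    # Perturb one knob of a known policy.
    p = list(policy)
    p[random.randrange(len(p))] = random.randint(1, 16)
    return p

# 1. Start with a naive policy and score it.
population = [[1, 1, 1]]
for _ in range(200):
    # 2-3. Generate a mutation of an existing policy and score it.
    child = mutate(random.choice(population))
    population.append(child)
    # 4. Keep the population ranked by fitness; repeat.
    population = sorted(population, key=fitness, reverse=True)[:10]

best = population[0]
print(best, fitness(best))
```

Because the best-so-far policy is never dropped from the population, the top fitness score can only improve over the naive starting point.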