[Cpp Notes] Low Latency C++
03 Aug 2023

Metric
- Latency = time taken to deliver a single unit of work
- Throughput = amount of data processed per unit time
Why C++
- Manual memory management on the hot path
- Scalable zero-cost abstractions
- Existing frameworks
Measuring
- CPU - gprof
- Cache - cachegrind
- Pipeline - llvm-mca
- Smaller code size != faster code
- Benchmarking
- Quick bench
- Google Benchmark
Efficient Programming
- Requires knowing
- Lang
- Compiler
- Library
- Hardware
- Avoid
- Copies
- Use References
- Move loop-independent code out of the loop
- Too Many function calls
- Inline
- Alternatives to virtual calls: use CRTP or std::variant
- Compile time stuff
- constexpr
- Metaprogramming
Low-level bit manipulation in C++
- Object lifetime rules
- Aliasing rules
- Alignment rules
- Object representations
- Value representations
- std::bit_cast (since C++20)
- Implicit-lifetime types (since C++20)
- std::start_lifetime_as (since C++23)
Optimizing
- Knowing undefined behavior
- Adding attributes
int divide_by_32(int x) {
[[assume(x>=0)]];
return x/32;
}
- Pipelining
- Avoid Branch Mispredict using attribute
[[likely]] [[unlikely]]
- Rewrite Conditionals
if(x || y) {} // into if(bool(x) + bool(y))
sum += cond ? expr1 : expr2; // into decltype(expr1) temp[] = { expr2, expr1 }; sum += temp[bool(cond)];
- Avoid data hazards
- Autovectorization
- Compiler Dependent
- works for basic types
- Prefer struct of arrays over array of structs
- linear traversal
- A loop autovectorizes when
- Iteration bounds are known at compile time
- No data-dependent breaks
- No conditionals, aliasing, or cross-iteration data dependence
int a[1000], b[1000], c[1000];
// b[i] read in iteration i was written in iteration i-1: a cross-iteration dependence
for (int i = 0; i < 999; ++i) { a[i] += b[i]; b[i+1] += c[i]; }
a[999] += b[999];
// into: remove the dependence by pairing each b[i+1] update with the a[i+1] that consumes it
a[0] += b[0];
for (int i = 0; i < 999; ++i) { b[i+1] += c[i]; a[i+1] += b[i+1]; }
- architecture specific dispatch
__attribute__ ((target ("arch=atom"))) int foo ()
Cache
- Minimize data cache misses
- data locality
- Alignment of data
- Concurrency aspect of sharing data between cores
- Good layout for the prefetcher
- Cache Friendly
- Data structure like flatmap
- Algorithms like binary search
- Minimize instruction cache misses
- Code layout and alignment
- Avoid branches and virtual calls
Deterministic Time for program
- AVOID these in the hot path
- Dynamic Allocation
- Data structures, algorithms, or data types that allocate
- e.g. std::stable_sort allocates
- e.g. std::array, std::optional, std::variant do not allocate
- e.g. std::any, std::function, std::vector allocate
- General-purpose custom allocators (e.g. tcmalloc, rpmalloc) are still not good enough for low latency
- Use pre-allocation
- Monotonic allocator
std::pmr::monotonic_buffer_resource
- Pool Allocator
std::pmr::unsynchronized_pool_resource
- Frame/arena allocators
- LockFree Allocators
- Blocking call
- Don't use coroutines
- Terms
- Atomic = race free
- Lock free = at least one thread will make progress
- Wait free = all threads will make progress
- No mutexes, condition variables, locks, or semaphores!
- Only use
std::atomic
but do not spin on it or use std::atomic<T>::wait/notify
- checking lock free with
static_assert(std::atomic<T>::is_always_lock_free)
- Can use a wait-free queue TODO
- can use try_lock
- can use double buffering
- IO
- Pre-load data and lock the address range in memory (e.g. mlock)
- Exceptions
- Context and Mode switch
- Pin the hot-path thread onto one CPU. No hyperthreading
- No syscalls
- Calling unknown code
- Loops without clear bounds
- Algorithmic complexity above O(1)