The following are a few examples of unconventional low-level tricks and hacks used to improve code performance.

Each technique has its own advantages and trade-offs, and its effectiveness depends on the specific requirements and constraints of your application.

These extreme optimisation techniques require advanced knowledge of computer architecture, low-level programming, and performance analysis. They are typically used in specialised domains where performance is critical enough to justify the additional complexity and effort involved.

  • Duff’s Device: A loop-unrolling technique that interleaves a switch statement with a do-while loop, using case fall-through to jump into the middle of the unrolled body so that the leftover iterations are handled without a separate clean-up loop. It’s named after Tom Duff, who introduced it in 1983.
  • Morton Order (Z-order) Encoding: Morton order is a space-filling curve that interleaves the bits of multiple coordinates, often used in computer graphics and spatial indexing structures to improve cache locality and memory access patterns.
  • Branchless Programming: Writing code without conditional branches can sometimes improve performance by avoiding branch misprediction penalties, particularly when the branch outcome is hard to predict. Techniques such as conditional moves (CMOV) or bitwise operations can be used to achieve this.
  • Loop Interchange: Reordering nested loops can sometimes improve data locality and reduce cache misses, leading to better performance. This technique is particularly useful when optimising matrix operations and other computational kernels.
  • Fused Multiply-Add (FMA): FMA instructions compute a multiplication and an addition (a × b + c) as a single operation with a single rounding step, potentially reducing the number of instructions and improving both performance and accuracy for certain numerical computations.
  • SIMD Parallelisation: Single Instruction, Multiple Data (SIMD) instructions enable parallel processing of multiple data elements in a single instruction, which can be exploited to accelerate certain algorithms such as vector operations and image processing.
  • Memoization: Memoization caches the results of expensive function calls so that repeated calls with the same arguments return the stored result instead of recomputing it. It works best for pure functions whose inputs repeat or whose subproblems overlap.
  • Algorithmic Optimisations: Sometimes, rethinking the algorithm itself can lead to significant performance improvements. For example, using dynamic programming instead of brute force for certain problems, or employing approximation algorithms instead of exact solutions when precision isn’t critical.
  • Bit Twiddling Hacks: These are clever bit-manipulation techniques that use bitwise operators to perform various tasks efficiently, for example counting the number of set bits (popcount) or finding the minimum or maximum of two integers without branching.
  • SWAR (SIMD Within A Register) Algorithms: SWAR algorithms treat an ordinary general-purpose register as a vector of small packed fields and operate on all of them at once with regular integer instructions (shifts, masks, adds), without requiring dedicated SIMD hardware. These algorithms are commonly used in cryptography, image processing, and other areas where parallel processing is beneficial.
  • Unrolling Loops: Unrolling a loop manually replicates the loop body so that each pass does the work of several original iterations. This reduces the number of branches and the loop control overhead, and can expose more instruction-level parallelism, at the cost of larger code.
  • Loop Fusion and Loop Tiling: Loop fusion merges adjacent loops that iterate over the same range into a single loop, reducing loop overhead and improving cache locality because each element is processed while it is still in cache. Loop tiling (also known as loop blocking) splits a loop nest into smaller blocks sized to fit in cache, improving cache utilisation and reducing memory access latency.
  • Cache-conscious Data Structures: Designing data structures around cache behaviour can significantly improve overall program performance. Examples include unrolled linked lists, B-trees, and open-addressing hash tables, all of which keep related data in contiguous memory.
  • Inline Assembly: Inline assembly gives direct access to CPU instructions from within a higher-level language, which can be useful for performance-critical sections where the compiler does not generate the instruction sequence you need.
  • Precomputed Tables: Precomputing and storing values in tables can eliminate the need for costly computations at runtime. This is especially useful for algorithms that require expensive operations such as trigonometric functions or logarithms.
  • Data Compression Techniques: Using data compression techniques such as run-length encoding (RLE), Huffman coding, or arithmetic coding can reduce memory usage and improve cache performance in certain scenarios.
  • Magic Numbers: Similar to the magic constant used in the Fast Inverse Square Root algorithm (0x5f3759df), magic numbers are constants chosen for their unique properties that enable efficient computation. These numbers are often the result of empirical observation or mathematical analysis and can be used to optimise various algorithms.
  • Compiler Intrinsics: Compiler intrinsics are special functions provided by compilers that directly map to specific CPU instructions. By using intrinsics, you can take advantage of low-level CPU features and achieve extreme performance optimisations. Examples include SIMD intrinsics for parallel processing and memory prefetching intrinsics for improving cache performance.
  • Hand-Optimised Assembly Code: Writing performance-critical sections of code in assembly language allows you to fine-tune the instructions and register allocation for maximum efficiency. While this approach requires deep knowledge of the target architecture and is highly platform-dependent, it can yield significant performance gains in some cases.
  • Loop Unrolling with Compiler Pragmas: Some compilers provide pragmas or directives to control loop unrolling, allowing you to manually specify the degree of unrolling for optimal performance. This technique can reduce loop overhead and improve instruction-level parallelism, especially in tight computational loops.
  • SIMD Code Generation Tools: There are specialised tools and libraries available for generating SIMD code automatically from high-level descriptions of algorithms. These tools leverage compiler techniques, code generation algorithms, and domain-specific knowledge to produce highly optimised SIMD code tailored to specific hardware architectures.
  • Kernel-Level Optimisation: In certain scenarios, extreme performance optimisations involve modifying the operating system kernel to customise scheduling policies, memory management strategies, or interrupt handling mechanisms. While this approach is highly advanced and requires expertise in kernel development, it can unlock significant performance improvements for specialised workloads.
  • Custom Hardware Accelerators: For ultra-high-performance computing tasks, designing custom hardware accelerators using Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs) can provide orders of magnitude improvement in performance compared to traditional software-based approaches. Custom hardware designs are tailored to specific algorithms and can exploit parallelism at the hardware level for extreme optimisation.
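
To make the Duff’s Device entry above more concrete, here is a minimal sketch of a byte copy unrolled eight times. The function name `duff_copy` is hypothetical, and the original 1983 code wrote to a fixed output register rather than an incrementing destination, so treat this as an illustration of the fall-through pattern rather than a reproduction of it.

```c
#include <stddef.h>

/* Copy `count` bytes from `src` to `dst`, unrolled eight times.
 * The switch jumps into the middle of the do-while body, so the first
 * pass copies the `count % 8` leftover bytes and every later pass
 * copies a full group of eight. */
void duff_copy(unsigned char *dst, const unsigned char *src, size_t count)
{
    if (count == 0)
        return;                       /* otherwise case 0 would copy 8 bytes */

    size_t passes = (count + 7) / 8;  /* trips through the do-while body */

    switch (count % 8) {
    case 0: do { *dst++ = *src++;     /* fall through */
    case 7:      *dst++ = *src++;     /* fall through */
    case 6:      *dst++ = *src++;     /* fall through */
    case 5:      *dst++ = *src++;     /* fall through */
    case 4:      *dst++ = *src++;     /* fall through */
    case 3:      *dst++ = *src++;     /* fall through */
    case 2:      *dst++ = *src++;     /* fall through */
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Modern compilers usually unroll simple copy loops on their own, so it is worth measuring before assuming this form is faster than a plain loop or `memcpy`.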
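
The bit interleaving described in the Morton order entry can be sketched as follows for two 16-bit coordinates. The helper names `part1by1` and `morton2d` are assumptions chosen for this example; the mask-and-shift sequence is one common way to spread the bits, not the only one.

```c
#include <stdint.h>

/* Spread the low 16 bits of x so that there is a zero bit between
 * each original bit: abcd -> 0a0b0c0d. */
static uint32_t part1by1(uint32_t x)
{
    x &= 0x0000ffffu;
    x = (x | (x << 8)) & 0x00ff00ffu;
    x = (x | (x << 4)) & 0x0f0f0f0fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Interleave x and y into a 32-bit Morton code.
 * Bits of x land in the even positions, bits of y in the odd ones. */
uint32_t morton2d(uint16_t x, uint16_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

For instance, `morton2d(2, 3)` interleaves binary 10 and 11 into binary 1110, which is 14. Points that are close in 2D tend to receive nearby codes, which is where the cache-locality benefit comes from.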
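
As an illustration of the branchless programming and bit twiddling entries, the sketch below selects the minimum or maximum of two integers with a mask instead of an `if`. The function names are hypothetical, and the code assumes two's complement integers, which is what `int32_t` provides.

```c
#include <stdint.h>

/* Minimum of x and y without a conditional branch in the source.
 * (x < y) evaluates to 1 or 0; negating it gives a mask of all ones
 * or all zeros, which selects between x and y via XOR. */
int32_t branchless_min(int32_t x, int32_t y)
{
    int32_t mask = -(int32_t)(x < y);   /* -1 if x < y, else 0 */
    return y ^ ((x ^ y) & mask);        /* x if mask is -1, else y */
}

/* Maximum of x and y, using the same mask with the roles swapped. */
int32_t branchless_max(int32_t x, int32_t y)
{
    int32_t mask = -(int32_t)(x < y);   /* -1 if x < y, else 0 */
    return x ^ ((x ^ y) & mask);        /* y if mask is -1, else x */
}
```

Whether this beats a plain ternary depends on the compiler and the data: an optimising compiler often turns `x < y ? x : y` into a conditional move by itself, so the hand-written form is mainly useful when profiling shows a mispredicted branch.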
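
The loop interchange entry mentions matrix operations; a small sketch of that idea is a square matrix multiply with the loops ordered i, k, j instead of the textbook i, j, k. The function name `matmul_ikj` and the row-major flat-array layout are assumptions made for this example.

```c
#include <stddef.h>

/* Multiply two n-by-n row-major matrices: c = a * b.
 * With the k loop moved outside the j loop, the innermost loop walks
 * b and c along rows (contiguous memory) instead of striding down a
 * column of b, which is friendlier to the cache. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;                         /* clear the output first */

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];      /* reused across the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

The arithmetic is the same as in the i, j, k ordering; only the order in which memory is touched changes, so any speed-up comes from fewer cache misses on larger matrices.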
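
The memoization entry can be illustrated with the Fibonacci sequence, where the naive recursion recomputes the same subproblems many times. This is a minimal sketch: `fib_memo` and `FIB_LIMIT` are hypothetical names, the cache is a static array, and inputs whose result would overflow 64 bits simply return 0 as an assumed error convention.

```c
#include <stdint.h>

#define FIB_LIMIT 94u   /* fib(93) is the largest value that fits in uint64_t */

/* Return the n-th Fibonacci number, caching each result after it is
 * first computed. Returns 0 for n >= FIB_LIMIT (result would overflow). */
uint64_t fib_memo(unsigned n)
{
    static uint64_t cache[FIB_LIMIT];   /* zero-initialised: 0 means "not cached" */

    if (n >= FIB_LIMIT)
        return 0;
    if (n < 2)
        return n;                       /* base cases: fib(0) = 0, fib(1) = 1 */
    if (cache[n] != 0)
        return cache[n];                /* cache hit: no recomputation */

    cache[n] = fib_memo(n - 1) + fib_memo(n - 2);
    return cache[n];
}
```

With the cache each value is computed once, so the work grows linearly with `n` rather than exponentially. The static array is not thread-safe, which is one of the trade-offs to weigh in real code.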
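
For the SWAR and popcount entries, a well-known example is counting the set bits of a 32-bit word by summing adjacent fields in parallel inside one register. The function name `popcount32` is chosen for this sketch; only ordinary integer operations are used.

```c
#include <stdint.h>

/* Count the set bits in v using SWAR: the register is treated as
 * packed fields that are summed pairwise, all fields at once. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* 16 two-bit counts  */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 8 four-bit counts  */
    v = (v + (v >> 4)) & 0x0f0f0f0fu;                  /* 4 eight-bit counts */
    return (v * 0x01010101u) >> 24;                    /* add the four bytes */
}
```

On CPUs with a hardware population-count instruction, a compiler builtin or intrinsic is normally faster; the SWAR version is mainly useful as a portable fallback and as a demonstration of the idea.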
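
The manual loop unrolling entry can be sketched as an array sum unrolled by four, with a separate accumulator per lane so that the additions do not all depend on each other. The name `sum_unrolled` and the unroll factor of four are assumptions for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n 32-bit values, processing four elements per loop pass.
 * Four independent accumulators shorten the dependency chain, and a
 * short clean-up loop handles the last n % 4 elements. */
uint64_t sum_unrolled(const uint32_t *data, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {        /* main unrolled body */
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)                  /* leftover elements */
        s0 += data[i];

    return s0 + s1 + s2 + s3;
}
```

Because integer addition is associative, splitting the sum across accumulators does not change the result. The same transformation on floating-point data can change rounding, which is one reason compilers do not always apply it automatically.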
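
Loop tiling, as described in the list above, can be sketched with a matrix transpose that works on one small block at a time. The function name `transpose_tiled` and the `tile` parameter are hypothetical; in practice the tile size would be tuned to the cache of the target machine.

```c
#include <stddef.h>

/* Transpose an n-by-n row-major matrix from src into dst, one
 * tile-by-tile block at a time. Working block by block keeps the rows
 * of src and the columns of dst being touched small enough to stay in
 * cache. src and dst must not overlap. */
void transpose_tiled(size_t n, size_t tile, const double *src, double *dst)
{
    if (tile == 0)
        tile = 1;                                   /* avoid an endless loop */

    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* Clamp the block edges so partial tiles at the border work. */
            size_t i_end = (ii + tile < n) ? ii + tile : n;
            size_t j_end = (jj + tile < n) ? jj + tile : n;

            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

For small matrices that already fit in cache the tiling brings no benefit and only adds loop overhead, so the gain appears once `n` is large enough for the untiled version to miss the cache on one of the two arrays.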
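
The magic numbers entry refers to the Fast Inverse Square Root constant 0x5f3759df. A minimal sketch of that approximation follows, assuming 32-bit IEEE-754 floats and a positive, finite input; the function name `fast_rsqrt` is hypothetical, and `memcpy` is used for the bit reinterpretation to stay within defined behaviour in C.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1 / sqrt(x) for positive, finite x.
 * Assumes float is a 32-bit IEEE-754 value. */
float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);         /* view the float's bit pattern */
    bits = 0x5f3759dfu - (bits >> 1);       /* magic-constant initial guess */
    memcpy(&y, &bits, sizeof y);

    y = y * (1.5f - 0.5f * x * y * y);      /* one Newton-Raphson refinement */
    return y;
}
```

After one refinement step the result is typically within a fraction of a percent of the true value, for example roughly 0.499 for an input of 4. On current hardware a dedicated reciprocal square root instruction is often both faster and more accurate, so this is best read as a historical illustration.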