AI's FLOPS Show!
FLOPS are fundamental to understanding AI performance, helping us grasp how AI hardware and precision choices impact cost, speed, and efficiency.

AI is undeniably shaping our future and is here to stay. Why FLOP? Because it's FLOPS :)
Well, in this article, we’ll be talking about FLOPS, or 'Floating-point Operations Per Second,' a crucial measure of computational power that underpins AI hardware and performance. Understanding FLOPS helps us grasp the scale and speed needed for the intensive calculations that drive AI, from training complex models to delivering real-time inference.
Background
Traditional performance metrics like CPU clock speed (GHz) and instructions executed per second provided a straightforward way to gauge computational power. However, with the rise of AI and machine learning, these metrics have become insufficient.
FLOPS, or floating-point operations per second, is a measure of computer performance that is particularly useful for scientific workloads dominated by floating-point calculations; for such workloads it is a more meaningful measure than instructions per second. Today, the performance of hardware for AI applications is primarily evaluated in FLOPS.
The shift towards AI has driven the need for a fundamentally different type of processing capability. AI applications involve vast amounts of floating-point calculations, which require high precision to capture the nuances of real-world data. This shift from simple integer-based calculations to sophisticated floating-point operations is why FLOPS has become the gold standard in measuring AI hardware performance.
Floating Point Precision and Its Types
The most common floating-point precision formats are Half-precision (FP16), Single-precision (FP32), and Double-precision (FP64). Floating-point representation follows the IEEE 754 standard. To understand the basics of floating-point arithmetic, I suggest you check this representation.
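If you want to poke at the representation yourself, here is a minimal Python sketch (standard library only) that splits an FP32 value into the sign, exponent, and mantissa fields defined by IEEE 754:

```python
import struct

def fp32_bits(value: float) -> str:
    """Return the IEEE 754 single-precision bit fields of a float."""
    # Pack the float into 4 bytes (big-endian) and reinterpret them as a 32-bit unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"
    sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
    return f"sign={sign} exponent={exponent} mantissa={mantissa}"

print(fp32_bits(0.15625))  # 0.15625 = 1.01b x 2^-3 -> exponent 124, mantissa 0100...
print(fp32_bits(-1.0))     # sign bit set, exponent 127 (bias), mantissa all zeros
```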
Following are the popular precisions (a short code sketch after this list compares them in practice) -
- FP32: In the context of deep learning, FP32 became the default precision for training neural networks, primarily due to its balance between precision and computational efficiency. Early machine learning models required significant numerical accuracy to prevent issues such as exploding or vanishing gradients, which can derail the training process. As a result, FP32 was favored for its ability to represent both small and large numbers with enough fidelity.
- FP16: Half precision allows for faster computation due to its reduced bit width compared to FP32 (single precision). With only 16 bits to represent floating-point numbers, operations can be executed more quickly. This speed increase is particularly beneficial in training large neural networks, where matrix multiplications are prevalent. For instance, many modern GPUs, like NVIDIA's A100 and V100, are optimized to perform FP16 calculations natively, allowing them to execute twice as many operations per clock cycle as FP32.
- TF32: Introduced with NVIDIA’s Ampere architecture, specifically in the A100 GPUs, TF32 (Tensor Float 32) is a floating-point format designed to strike a balance between the precision of FP32 (single precision) and the performance benefits of FP16 (half precision); it keeps FP32's 8-bit exponent but reduces the mantissa to FP16's 10 bits.
- BF16: Brain Floating Point is an innovative floating-point format designed to optimize computation speed while managing precision loss; it keeps FP32's 8-bit exponent (and therefore its dynamic range) while truncating the mantissa to 7 bits. This format has gained popularity in deep learning, particularly for training large language models (LLMs) such as GPT-3 and BERT. Many deep learning frameworks, including TensorFlow and PyTorch, support BF16 natively, making it easier for developers to adopt this format without a complete overhaul of their code.
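As a rough illustration of how these formats differ (a sketch assuming PyTorch is installed; exact printed values may vary by version), the snippet below shows each format's bit width, largest representable value, and machine epsilon, and why FP16's narrow range forces tricks like loss scaling:

```python
import torch

# Compare the dynamic range and granularity of the common training precisions.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d} max={info.max:.3e} eps={info.eps:.3e}")

# FP16 has a narrow range: anything above ~65504 overflows,
# which is one reason FP16 training typically uses loss scaling.
print(torch.tensor(70000.0).to(torch.float16))   # -> inf (overflow)

# BF16 keeps FP32's exponent range, so the same value stays finite,
# at the cost of a much coarser mantissa (only 7 bits).
print(torch.tensor(70000.0).to(torch.bfloat16))  # -> ~70144 (rounded, but finite)
```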
FLOPS and AI Model Cost Estimation
Understanding FLOPs (the total number of floating-point operations a model performs) alongside FLOPS (how many operations per second the hardware can sustain) is critical in assessing the computational needs for training and running AI models. Models with higher FLOP counts demand more computing power, which impacts both cost and processing speed. This is particularly relevant for throughput: a compute-heavy model processes requests more slowly if adequate hardware resources are not allocated.
Model Complexity: FLOP counts also give insight into the complexity of a model. Generally, models with higher FLOP counts tend to be more intricate, with more layers, parameters, or mathematical operations. This helps in selecting appropriate infrastructure and in anticipating performance.
Training and Inference Time: By knowing the FLOPs required and the FLOPS the hardware delivers, one can roughly estimate training duration and inference time. Higher FLOP counts generally mean longer training and inference times, so understanding these values aids in planning both development timelines and deployment readiness.
Example
Let's assume a model with approximately 175 billion parameters, trained on 15T (15 trillion) tokens.
Suppose processing each training token requires around 6 FLOPs per parameter, covering the forward and backward passes (guidance as per a Stanford lecture).
Total FLOPs = FLOPs per parameter per token × Parameters × Tokens
= 6 × 175e9 × 15e12 ≈ 1.575e25
To estimate time, we need to know the FLOPS capability of the hardware. For instance, an NVIDIA A100 GPU delivers approximately 312 TFLOPS (teraflops) of peak FP16 Tensor Core throughput.
Hence, with 10K GPUs, the compute time will be roughly:
Compute Time = Total FLOPs / (per-GPU FLOPS × number of GPUs × seconds per day)
= 1.575e25 / (312e12 × 10e3 × 3600 × 24) ≈ 58 days (at 100% utilization)
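The same back-of-the-envelope estimate can be scripted. This is just the arithmetic above wrapped in Python; the 6 FLOPs-per-parameter-per-token rule, the 10K-GPU cluster, and peak FP16 throughput are all assumptions, and real utilization is typically well below peak, so actual wall-clock time would be longer:

```python
# Back-of-the-envelope training cost, reproducing the numbers above.
parameters = 175e9          # model size
tokens = 15e12              # training tokens
flops_per_param_token = 6   # rule-of-thumb: forward + backward pass

total_flops = flops_per_param_token * parameters * tokens   # ~1.575e25 FLOPs

gpu_flops = 312e12          # assumed per-GPU peak FP16 throughput (A100, dense)
num_gpus = 10_000           # assumed cluster size
seconds_per_day = 3600 * 24

days = total_flops / (gpu_flops * num_gpus * seconds_per_day)
print(f"Total FLOPs: {total_flops:.3e}")            # ~1.575e25
print(f"Estimated compute time: {days:.1f} days")   # ~58 days at 100% utilization
```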
Conclusion
Understanding FLOPs is essential not only for assessing the cost and performance of AI model training and inference but also for selecting the appropriate floating-point precision based on specific use cases. For instance, training a model in FP32 can maximize accuracy, while deploying it for inference in FP16 optimizes performance and reduces resource consumption. Or you can use frameworks that provide mixed-precision support to host your models. Balancing FLOPs with precision choices allows for an efficient model lifecycle that aligns with both computational demands and desired outcomes.
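As an illustration of that last point, here is a minimal mixed-precision training sketch in PyTorch (assuming a CUDA GPU; the model, data, and hyperparameters are placeholders): FP16 is used for the forward and backward math while the master weights stay in FP32.

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()                     # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                         # rescales gradients to avoid FP16 underflow

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")                 # dummy batch
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(x), y)  # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()
    scaler.step(optimizer)                                    # optimizer step on FP32 master weights
    scaler.update()
```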
References
Fabien Sanglard, "Floating Point Visually Explained": http://fabiensanglard.net/floating_point_visually_explained/