How to accelerate LLaMA 3 fine-tuning with FP8
This tutorial was authored by Anjali Shah, Sudhakar Singh, and Santosh Bhavani from NVIDIA, and formatted by the Latitude.sh team.
LLaMA 3 introduces several improvements over LLaMA 2. Trained on a significantly larger dataset (15T tokens vs. 2T tokens), the latest iteration of Meta's large language model also supports a longer context length of 8K tokens, double that of its predecessor.
In this tutorial, you will learn how to fine-tune LLaMA 3 with FP8 precision using Hugging Face Accelerate, PEFT, and Transformer Engine.
For deploying and managing the necessary infrastructure, we'll use the Latitude.sh bare metal platform, which offers automated solutions for AI workloads and predefined templates.
Introduction to Transformer Engine
Transformer Engine (TE) is a library that accelerates Transformer model training on NVIDIA GPUs. It supports FP8 precision on Hopper and Ada GPUs, enhancing performance while maintaining accuracy during training and inference.
TE provides optimized building blocks for popular Transformer architectures and an automatic mixed precision API compatible with major deep learning frameworks.
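As a quick illustration of this API (not part of the tutorial notebook), the sketch below runs a single TE `Linear` layer under `fp8_autocast`; the layer sizes and batch shape are arbitrary placeholders:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe: E4M3 for forward tensors, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# A single TE layer with arbitrary example sizes
layer = te.Linear(4096, 4096, bias=True).cuda()
inp = torch.randn(16, 4096, device="cuda")

# The forward pass runs in FP8 inside the autocast context
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.sum().backward()
```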
Setting up the environment
To launch a JupyterLab notebook on the Latitude.sh platform, follow these steps:
1. Access the Latitude.sh dashboard, and navigate to "Serverless" in the sidebar
2. Select the blueprint
Latitude.sh allows you to deploy containers using blueprints. You can either select an existing blueprint or create a new one tailored to your needs.
To make it easy to use all the required libraries and ensure the best performance and compatibility with NVIDIA GPUs, we'll use the pre-built PyTorch + NGC FP8 blueprint. This launches an NVIDIA DL container for PyTorch which includes the latest optimized versions of PyTorch and Transformer Engine.
To learn more about the container, visit NVIDIA’s NGC Catalog.
3. Deploy the container
Once you click Deploy, your container is launched: the hardware is provisioned and the necessary software is installed as specified in the blueprint. Before deploying, configure the following:
Size: FP8 is currently supported only on Hopper and Ada GPUs. Ensure you select a compatible GPU architecture, such as H100 or L40S.
Environment variables: To be able to SSH into your instance, add your SSH public key as the PUBLIC_KEY environment variable. Refer to the Latitude.sh docs for more details.
4. SSH into the instance
Once your instance is deployed, copy the IP address, open a terminal and enter this command:
```bash
ssh -L 8888:localhost:8888 root@<ip_addr>
```
This creates an SSH tunnel, mapping port 8888 on the remote machine to port 8888 on your local machine, so you can access JupyterLab from your local browser.
5. Launch JupyterLab
Navigate to the directory where you want to work and start JupyterLab by running:
```bash
pip install protobuf==3.20.*
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
jupyter lab
```
This command starts the JupyterLab server. Copy the URL from your terminal into your browser, replacing ‘hostname’ with ‘localhost’ or the IP address of your host server.
Fine-tuning LLaMA 3 with FP8
Now that your environment is set up, let's explore how to accelerate LLaMA 3 fine-tuning using NVIDIA's Transformer Engine (TE) by following the steps outlined in the Jupyter notebook available here.
Note: This notebook depends on the utils.py and te_llama.py Python scripts, which should be downloaded along with the notebook.
Step 1: Import necessary libraries
First, install the required packages, including `transformers`, `accelerate`, and `peft`:
```python
!pip install transformers accelerate peft datasets huggingface_hub ipywidgets
```
Step 2: Load the LLaMA Model and Tokenizer
We will load the LLaMA model and its tokenizer from the Hugging Face Hub. The gated LLaMA checkpoints require authentication, so start by providing your Hugging Face access token:
```python
# hyperparams is defined in the accompanying utils.py; paste your token here
hyperparams.hf_access_token = ""
```
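The notebook itself drives the actual loading through its helper scripts; as a rough sketch, the equivalent plain Transformers calls look like the following (the checkpoint name is a placeholder, substitute the LLaMA 3 variant you have access to):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the LLaMA 3 variant you have access to
model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hyperparams.hf_access_token)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # baseline precision before enabling FP8
    token=hyperparams.hf_access_token,
)
```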
Step 3: Replace Transformer layers with TE layers
To leverage NVIDIA's Transformer Engine, the standard decoder layers in the Hugging Face LLaMA model are replaced with TE layers. In the notebook, this swap is implemented by the accompanying te_llama.py script; conceptually, each Hugging Face decoder layer is replaced by a TE `TransformerLayer` built from the model config (the pretrained weights then still need to be copied into the new layers):

```python
import transformer_engine.pytorch as te

cfg = model.config
for i in range(len(model.model.layers)):
    # Build a TE layer from the LLaMA config; pretrained weights must then be copied in
    model.model.layers[i] = te.TransformerLayer(
        cfg.hidden_size, cfg.intermediate_size, cfg.num_attention_heads,
        num_gqa_groups=cfg.num_key_value_heads,
        normalization="RMSNorm", activation="swiglu", bias=False)
```
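With TE layers in place, FP8 is enabled by running the forward pass inside TE's `fp8_autocast` context. The notebook wires this into its Accelerate-based training loop; the sketch below only illustrates the pattern and assumes `model`, `optimizer`, `train_dataloader`, and `accelerator` have already been prepared:

```python
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

# model, optimizer, train_dataloader, and accelerator are assumed to have been
# prepared already (e.g. via Accelerator.prepare from Hugging Face Accelerate)
for batch in train_dataloader:
    optimizer.zero_grad()
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        loss = model(**batch).loss  # batch contains input_ids, attention_mask, labels
    accelerator.backward(loss)
    optimizer.step()
```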
Step 4: Performance evaluation
After setting up and running your model with TE, it's essential to evaluate its performance. A straightforward way to do this is to measure the average time per training step (or tokens per second) with FP8 enabled and compare it against a BF16 baseline at the same batch size and sequence length.
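A minimal timing helper along those lines might look like the sketch below; `bf16_train_step` and `fp8_train_step` are hypothetical functions that each run one training step in the respective precision:

```python
import time
import torch

def avg_step_time(step_fn, warmup=5, iters=20):
    """Average wall-clock seconds per call of step_fn on the GPU."""
    for _ in range(warmup):        # warm up kernels and FP8 scaling factors
        step_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Example usage with hypothetical step functions for the two configurations:
# print("BF16:", avg_step_time(bf16_train_step))
# print("FP8 :", avg_step_time(fp8_train_step))
```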
Conclusion
In this tutorial, we showed how to speed up fine-tuning of the Hugging Face LLaMA model using NVIDIA’s Transformer Engine. By replacing the standard transformer layers with TE layers and training in FP8, we achieved significant performance gains in training and inference, making this approach well suited to large language models.
Feel free to experiment further with different configurations and datasets to fully harness the power of NVIDIA's Transformer Engine. Happy coding!