Build an AI Inferencing Solution With TensorRT and PyTorch


AI inference workloads are increasingly demanding, requiring low latency, high throughput, and cost-efficiency at scale. Whether working with computer vision or natural language AI models, processing power and efficiency are key; inference workloads must be able to handle real-time predictions while maintaining optimal resource utilization. Choosing the right infrastructure and optimization tools can dramatically impact both performance and operational costs.

This guide shows how to build and benchmark a complete AI inferencing solution using TensorRT and PyTorch on Akamai Cloud’s NVIDIA RTX 4000 Ada GPU instances. NVIDIA RTX 4000 Ada GPU instances are available across global core compute regions, delivering the specialized hardware required for heavy AI workloads. Using the steps in this guide, you can:

  • Deploy an RTX 4000 Ada GPU instance using Akamai Cloud infrastructure
  • Run an AI inference workload using PyTorch
  • Optimize your model with TensorRT for performance gains
  • Measure latency and throughput

The primary AI model used in this guide is a ResNet50 computer vision (CV) model. However, the techniques used can be applied to other model architectures like object detection (YOLO; You Only Look Once) models, speech recognition systems (OpenAI’s Whisper), and large language models (LLMs) like ChatGPT, Llama, or Claude.

GPU Plan Access
In some cases, a $100 deposit may be required to deploy GPU Linodes. This may include new accounts that have been active for less than 90 days and accounts that have spent less than $100 on services. If you are unable to deploy GPU Linodes, contact Support for assistance.

What are TensorRT and PyTorch?

TensorRT

TensorRT is an API and tool ecosystem by NVIDIA that includes inference compilers, runtimes, and deep learning model optimizations. TensorRT works with models trained in all major frameworks and improves inference performance on NVIDIA GPUs using techniques like kernel auto-tuning, dynamic tensor memory management, and multi-stream execution. It integrates directly with PyTorch using the TensorRT Framework Integrations API to achieve up to 6x faster inferencing.

PyTorch

PyTorch is an open-source machine learning framework based on the Torch library and originally developed by Meta AI for training deep learning models. PyTorch is used primarily through its Python interface and integrates with TensorRT through Torch-TensorRT, so developers can optimize PyTorch models with minimal changes to existing codebases. PyTorch also integrates with CUDA (Compute Unified Device Architecture) to take advantage of the parallel computing architectures found in NVIDIA GPUs.

Before You Begin

The following prerequisites are recommended before starting the implementation steps in this tutorial:

  • An Akamai Cloud account with the ability to deploy GPU instances
  • The Linode CLI configured with proper permissions
  • An understanding of Python virtual environments and package management
  • General familiarity with deep learning concepts and models
Sudo Users & Distribution
This guide is written for a non-root user on the Ubuntu 24.04 LTS Linux distribution. Commands that require elevated privileges are prefixed with sudo. If you’re not familiar with the sudo command, see our Users and Groups doc.

Architecture Diagram

Deploy an NVIDIA RTX 4000 Ada Instance

Akamai’s NVIDIA RTX 4000 Ada GPU instances can be deployed using Cloud Manager or the Linode CLI.

Set Up Your Development Environment

Once your GPU instance is fully deployed, connect to it to update system packages and install system dependencies. It is recommended to first follow the steps in our Set up and secure a Linode guide to configure a limited user with sudo access and secure your server.

Update Packages

  1. Log into your instance via SSH. Replace user with your sudo username and IP_ADDRESS with your Linode instance’s IP address:

    ssh user@IP_ADDRESS
  2. Update your system and install build tools and system dependencies:

    sudo apt update && sudo apt install -y \
        build-essential \
        gcc \
        wget \
        gnupg \
        software-properties-common \
        python3-pip \
        python3-venv
  3. Download and install the NVIDIA CUDA keyring so that apt can retrieve the latest stable drivers and toolkits from NVIDIA's repository:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
  4. Update system packages after the keyring is installed:

    sudo apt update

Install NVIDIA Drivers and CUDA Toolkit

  1. Install the cuda package, which provides the CUDA toolkit along with the latest drivers compatible with the RTX 4000 Ada card:

    sudo apt install -y cuda
  2. Reboot your instance to complete installation of the driver:

    sudo reboot
  3. After the reboot is complete, log back into your instance:

    ssh user@IP_ADDRESS
  4. Use the following command to verify successful driver installation:

    nvidia-smi

    This displays basic information about your RTX 4000 Ada instance and its driver version. Your driver and software versions may vary based on release date:

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA RTX 4000 Ada Gene...    On  |   00000000:00:02.0 Off |                  Off |
    | 30%   35C    P8              4W /  130W |       2MiB /  20475MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
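
For scripted checks, the same driver information can be read in a machine-readable form. The following is a minimal sketch rather than part of the original steps: it assumes the --query-gpu and --format=csv,noheader options of nvidia-smi, and the helper names parse_gpu_line and query_gpus are hypothetical.

```python
import shutil
import subprocess

# Query flags for nvidia-smi's CSV output mode
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_line(line):
    """Split one CSV row into name, driver version, and total memory."""
    # rsplit keeps the GPU name intact even if it contains a comma
    name, driver, memory = (field.strip() for field in line.rsplit(",", 2))
    return {"name": name, "driver_version": driver, "memory_total": memory}


def query_gpus():
    """Return one dict per GPU, or an empty list if the driver is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No NVIDIA GPU detected; check the driver installation.")
    for gpu in gpus:
        print(f"{gpu['name']}: driver {gpu['driver_version']}, {gpu['memory_total']}")
```

An empty result usually means the driver is not loaded yet, for example because the instance has not been rebooted since installation.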

Configure Your Python Environment

Set up and use a Python Virtual Environment (venv) so that you can isolate Python packages and prevent conflicts with system-wide packages and across projects.

Create the Virtual Environment

  1. Using the python3-venv package installed during setup, create and activate the Python Virtual Environment:

    python3 -m venv ~/venv
    source ~/venv/bin/activate

    You can confirm you are using your virtual environment when you see (venv) at the beginning of your command prompt:

    (venv) user@hostname
  2. While in your virtual environment, upgrade pip to the latest version to complete the setup:

    (venv)
    pip install --upgrade pip
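
The prompt prefix is a visual cue only. As a hedged alternative, the short sketch below checks from inside Python whether the interpreter is running in a virtual environment; the function name in_virtualenv is hypothetical, and the check relies on the standard sys.prefix and sys.base_prefix attributes.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    state = "active" if in_virtualenv() else "not active"
    print(f"Virtual environment {state}: {sys.prefix}")
```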

Install PyTorch and TensorRT

Remain in your virtual environment to install PyTorch, TensorRT, and dependencies. These are the primary AI libraries needed to run your inference workloads.

(venv)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install requests
pip install nvidia-pyindex
pip install nvidia-tensorrt
pip install torch-tensorrt -U
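
Before moving on, it can help to confirm that the packages are visible to the interpreter and that PyTorch can see the GPU. The following is a minimal sketch under a few assumptions: the import names torch, torchvision, tensorrt, and torch_tensorrt correspond to the packages installed above, and the helper names find_missing and cuda_available are hypothetical.

```python
import importlib.util

# Import names for the packages installed above
REQUIRED = ["torch", "torchvision", "tensorrt", "torch_tensorrt"]


def find_missing(modules):
    """Return the module names that cannot be located, without importing them."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works before PyTorch is installed
    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
    print("CUDA available:", cuda_available())
```

If a package is reported missing, confirm that the virtual environment is still active and rerun the matching pip install command.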

Test and Benchmark the ResNet50 Inference Model

Create and run a Python script using a pre-trained ResNet50 computer vision model. Running this script verifies that the environment is configured correctly and provides a way to evaluate GPU performance using a real-world example. This example script is a foundation that can be adapted for other inference model architectures.

  1. Using a text editor such as nano, create the Python script file. Replace inference_test.py with a script file name of your choosing:

    nano inference_test.py
  2. Copy and paste the following code into the script. The comments describe what each section of code does:

    File: inference_test.py
    
    # import PyTorch, pre-trained models from torchvision and image utilities
    
    import torch
    import torchvision.models as models
    import torchvision.transforms as transforms
    from PIL import Image
    import requests
    from io import BytesIO
    import time
    
    # Download a sample image of a dog
    # You could replace this with a local file or different URL
    
    img_url = "https://github.com/pytorch/hub/raw/master/images/dog.jpg"
    image = Image.open(BytesIO(requests.get(img_url).content))
    
    # Preprocess
    # Resize and crop to match ResNet50's input size
    # ResNet50 is trained on ImageNet, where inputs are 224x224 RGB images
    # Convert the image to a tensor so PyTorch can process it
    # Use unsqueeze(0) to add a batch dimension, since the model expects a batch of images
    # Use cuda() to move the data to the GPU
    
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    input_tensor = transform(image).unsqueeze(0).cuda()
    
    # Load a model (ResNet50) pretrained on the ImageNet dataset containing millions of images
    # The weights argument replaces the deprecated pretrained=True flag
    
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).cuda().eval()
    
    # Warm up the GPU
    # Lets CUDA initialize and select the necessary kernels before the benchmark runs
    
    with torch.no_grad():
        for _ in range(5):
            _ = model(input_tensor)
    
    # Benchmark inference time using an average across 20 inference runs
    # CUDA calls are asynchronous, so synchronize before reading the clock
    
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(20):
            _ = model(input_tensor)
    torch.cuda.synchronize()
    end = time.perf_counter()
    
    print(f"Average inference time: {(end - start) / 20:.4f} seconds")

    When complete, press Ctrl + X to exit nano, Y to save, and Enter to confirm.

  3. Run the Python script:

    python inference_test.py

    If everything works correctly, you should see output similar to the below. Time results may vary:

    Average inference time: 0.0025 seconds

    The script times 20 inference runs and divides the total by 20 to get the average time per inference. This gives you an idea of how quickly your GPU can process input using this model.
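
An average alone can hide slow outliers and does not express throughput. If you record the time of each run individually, a summary like the one below can be derived. This is a minimal sketch with a hypothetical helper named summarize; the sample timings are illustrative values, not measured results, and the nearest-rank method is one of several ways to compute a percentile.

```python
import math
import statistics


def summarize(latencies_s, batch_size=1):
    """Summarize per-run latencies (in seconds) for a fixed batch size."""
    ordered = sorted(latencies_s)
    mean = statistics.fmean(ordered)
    # Nearest-rank 95th percentile: smallest value with at least 95% of runs at or below it
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return {
        "mean_s": mean,
        "median_s": statistics.median(ordered),
        "p95_s": p95,
        # Throughput counts every image in the batch
        "images_per_s": batch_size / mean,
    }


if __name__ == "__main__":
    # Illustrative timings only, not measured results
    sample = [0.25, 0.25, 0.5, 1.0]
    for key, value in summarize(sample, batch_size=8).items():
        print(f"{key}: {value:.3f}")
```

Throughput here is the batch size divided by the mean latency, so larger batches can raise images per second even when the latency of each run increases.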

Next Steps

Try switching out ResNet50 for different model architectures available in torchvision.models, such as:

  • efficientnet_b0: Lightweight and accurate
  • vit_b_16: Vision Transformer model for experimenting with newer architectures

This can help you see how model complexity affects speed and accuracy.
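
The same timing approach can also be used to compare a model before and after TensorRT optimization. The sketch below is an illustration under stated assumptions, not a definitive implementation: it assumes torchvision's models.get_model() and the torch_tensorrt.compile() entry point with its "dynamo" front end, uses a random input tensor instead of a real image, and keeps default precision settings because the available options vary between Torch-TensorRT versions. The helper names time_model and compare_with_tensorrt are hypothetical, and not every architecture is guaranteed to compile.

```python
import importlib.util
import time


def time_model(run_once, sync=lambda: None, warmup=5, runs=20):
    """Return the average seconds per call of run_once()."""
    for _ in range(warmup):
        run_once()
    sync()  # wait for queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    sync()  # wait for the timed runs to finish
    return (time.perf_counter() - start) / runs


def compare_with_tensorrt(model_name="resnet50"):
    """Benchmark a torchvision model before and after Torch-TensorRT compilation."""
    # Imported here so time_model() stays usable without the GPU stack
    import torch
    import torch_tensorrt
    import torchvision.models as models

    if not torch.cuda.is_available():
        print("CUDA device not found; skipping comparison.")
        return

    model = models.get_model(model_name, weights="DEFAULT").cuda().eval()
    example = torch.randn(1, 3, 224, 224, device="cuda")

    # Default precision settings; other options vary by Torch-TensorRT version
    trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=[example])

    with torch.no_grad():
        eager = time_model(lambda: model(example), torch.cuda.synchronize)
        compiled = time_model(lambda: trt_model(example), torch.cuda.synchronize)

    print(f"{model_name} eager:    {eager:.4f} seconds")
    print(f"{model_name} TensorRT: {compiled:.4f} seconds")


if __name__ == "__main__":
    needed = ("torch", "torchvision", "torch_tensorrt")
    if all(importlib.util.find_spec(name) for name in needed):
        compare_with_tensorrt("resnet50")
    else:
        print("PyTorch, torchvision, and Torch-TensorRT are required for this comparison.")
```

Passing a different name, such as efficientnet_b0, to compare_with_tensorrt() repeats the comparison for another architecture. Compilation can take noticeably longer than a single inference run, so it is kept outside the timed section.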

