Hello there, aspiring MLOps engineer or data scientist! Are you ready to take your powerful PyTorch models from local development to a scalable, production-ready environment? Deploying machine learning models can often feel like a complex journey, but with Google Cloud's Vertex AI, it becomes a streamlined and efficient process. This comprehensive guide will walk you through every step of deploying your PyTorch models on Vertex AI, ensuring you have the knowledge and confidence to make your models accessible for real-world applications.
Let's dive in and unlock the potential of your PyTorch creations on the cloud!
How to Deploy PyTorch Models on Vertex AI: A Step-by-Step Guide
Deploying a PyTorch model on Vertex AI typically involves preparing your model artifacts, creating a serving container (or using a pre-built one), uploading the model to the Vertex AI Model Registry, and finally deploying it to an endpoint for predictions.
Step 1: Preparing Your PyTorch Model for Deployment
Before we even think about Vertex AI, we need to ensure our PyTorch model is ready to be served. This is a critical first step that often gets overlooked.
1.1. Saving Your PyTorch Model
The most common way to save a PyTorch model for deployment is to convert it into TorchScript format using `torch.jit.script` or `torch.jit.trace`. TorchScript is a way to create serializable and optimizable models from PyTorch code.
TorchScript with `torch.jit.script`: This is generally preferred for more complex models, as it performs static analysis on your Python code to convert it into a TorchScript graph.

```python
import torch
import torch.nn as nn

# Assume MyModel is your PyTorch model definition
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = MyModel()
# Load your trained weights here if applicable
# model.load_state_dict(torch.load('my_model_weights.pth'))
model.eval()  # Set model to evaluation mode

# Script the model (scripting does not require example inputs)
scripted_model = torch.jit.script(model)
scripted_model.save("my_model.pt")
print("Model saved as my_model.pt (TorchScript)")
```
TorchScript with `torch.jit.trace`: This method records the operations that happen on a given input and constructs a graph. It's simpler, but it might not capture all control flow if your model has dynamic behavior.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

model = MyModel()
model.eval()

example_input = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("my_model_traced.pt")
print("Model saved as my_model_traced.pt (TorchScript traced)")
```
Why TorchScript? TorchScript models are optimized for deployment, can be loaded and run without the original Python source (for example, from C++ via LibTorch), and are often more performant in production environments.
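As a quick sanity check before moving on, you can load the saved TorchScript file in a fresh Python process, without access to the original `MyModel` class, and run a forward pass. A minimal sketch, assuming the `my_model.pt` file produced by the scripting example above:

```python
import torch

# Load the TorchScript archive; the original MyModel class definition is not required
loaded_model = torch.jit.load("my_model.pt")
loaded_model.eval()

# Run a forward pass on a dummy batch with the expected input shape (1, 10)
with torch.no_grad():
    output = loaded_model(torch.randn(1, 10))

print(output)
```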
1.2. Creating a Custom Handler (if needed)
For many PyTorch models, especially those requiring specific preprocessing or postprocessing, you'll need to create a custom handler script. This script tells the serving container how to load your model and process incoming prediction requests. Vertex AI typically uses TorchServe for PyTorch model serving.
Your handler file (e.g., `handler.py`) needs to define a class that inherits from `BaseHandler` (provided by TorchServe) and implement these key methods:
- `initialize(context)`: Loads your model and any other necessary assets (like a tokenizer or vocabulary) from the model directory into memory. This is called once when the model is loaded.
- `preprocess(data)`: Takes the raw input data from the prediction request and prepares it for your model.
- `inference(data)`: Performs the actual inference using your loaded model.
- `postprocess(inference_output)`: Takes the model's raw output and formats it into the desired prediction response.
Here's a simplified example of what a `handler.py` might look like:
```python
# handler.py
from ts.torch_handler.base_handler import BaseHandler
import torch
import json
import logging

# Configure logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


class MyCustomHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.initialized = False

    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu"
        )

        # Load the TorchScript model
        self.model = torch.jit.load(f"{model_dir}/my_model.pt")
        self.model.to(self.device)
        self.model.eval()  # Set to evaluation mode

        # Load any other necessary artifacts (e.g., tokenizer, label mappings)
        # with open(f"{model_dir}/labels.json", 'r') as f:
        #     self.labels = json.load(f)

        logger.info(f"Model loaded successfully on {self.device}")
        self.initialized = True

    def preprocess(self, data):
        # Data is a list of dictionaries, each representing an instance
        inputs = []
        for row in data:
            # Assuming the input is a JSON payload with a 'data' field containing a list of numbers
            input_data = row.get("data") or row.get("body")
            if isinstance(input_data, (bytes, bytearray)):
                input_data = input_data.decode("utf-8")
            if isinstance(input_data, str):
                input_data = json.loads(input_data)
            # Example: convert the parsed list of numbers to a tensor
            # (adjust based on your actual input format)
            try:
                input_tensor = torch.tensor(input_data, dtype=torch.float32)
                inputs.append(input_tensor)
            except Exception as e:
                logger.error(f"Error processing input data: {e}")
                raise
        # Stack all inputs into a single batch tensor
        return torch.stack(inputs).to(self.device)

    def inference(self, data):
        with torch.no_grad():
            output = self.model(data)
        return output

    def postprocess(self, inference_output):
        # Example: Convert tensor output to a Python list
        predictions = inference_output.tolist()
        return predictions
```
1.3. Packaging Model Artifacts with Torch Model Archiver
To deploy your model with TorchServe on Vertex AI, you need to package your model and the handler (if applicable) into a single `.mar` (Model ARchive) file. This is done using the `torch-model-archiver` tool.

First, ensure you have it installed:

```bash
pip install torch-model-archiver
```

Then, create the `.mar` file:
```bash
torch-model-archiver --model-name my_pytorch_model \
    --version 1.0 \
    --model-file my_model.pt \
    --handler handler.py \
    --extra-files "labels.json" \
    --export-path model_artifacts \
    --force
```
- `--model-name`: A name for your model.
- `--version`: Version of your model.
- `--model-file`: Path to your TorchScript model (`.pt` or `.pth`).
- `--handler`: Path to your custom handler script (`handler.py`). If you don't need a custom handler and are using a pre-built TorchServe handler, you might omit this or specify a default one.
- `--extra-files`: Comma-separated list of additional files your handler needs (e.g., `labels.json`, `tokenizer.json`).
- `--export-path`: Directory where the `.mar` file will be saved.
- `--force`: Overwrite an existing `.mar` file.
This will create a `my_pytorch_model.mar` file within the `model_artifacts` directory. This `.mar` file is your deployable unit.
Step 2: Storing Model Artifacts in Google Cloud Storage (GCS)
Vertex AI needs access to your model artifacts. The most common and recommended way to do this is by uploading your `.mar` file to a Google Cloud Storage (GCS) bucket.
2.1. Set Up Your Google Cloud Project and GCS Bucket
Create a Google Cloud Project: If you don't have one, create a new project in the Google Cloud Console.
Enable APIs: Ensure the "Vertex AI API" and "Cloud Storage API" are enabled for your project.
Create a GCS Bucket:
Navigate to "Cloud Storage" -> "Buckets" in the Google Cloud Console.
Click "Create bucket".
Give it a globally unique name (e.g., `your-project-id-pytorch-models`).
Choose a region (e.g., `us-central1`). It's generally best practice to choose a region close to where your Vertex AI endpoint will be deployed.
Select a storage class (Standard is usually fine).
Click "Create".
2.2. Upload Your `.mar` File to GCS
You can upload your `.mar` file using the Google Cloud Console, the `gsutil` command-line tool, or the Google Cloud Client Libraries (Python SDK).
Using `gsutil` (recommended for scripts):

```bash
# Replace with your bucket name
BUCKET_NAME="your-project-id-pytorch-models"
MODEL_ARCHIVE_PATH="model_artifacts/my_pytorch_model.mar"
GCS_MODEL_URI="gs://${BUCKET_NAME}/pytorch_models/my_pytorch_model.mar"

gsutil cp "${MODEL_ARCHIVE_PATH}" "${GCS_MODEL_URI}"
echo "Model uploaded to: ${GCS_MODEL_URI}"
```
Using Google Cloud Console:
Go to your GCS bucket.
Click "Upload files" and select your
my_pytorch_model.mar
file.
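If you prefer the Python client library, here is a minimal sketch using the `google-cloud-storage` package (assumed installed via `pip install google-cloud-storage`; the project, bucket, and object names below are the example values used throughout this guide, so substitute your own):

```python
from google.cloud import storage

PROJECT_ID = "your-gcp-project-id"
BUCKET_NAME = "your-project-id-pytorch-models"

# Upload the local .mar file to gs://<bucket>/pytorch_models/my_pytorch_model.mar
client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob("pytorch_models/my_pytorch_model.mar")
blob.upload_from_filename("model_artifacts/my_pytorch_model.mar")

print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")
```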
Step 3: Creating a Vertex AI Model Resource
Now that your model artifacts are in GCS, you can create a "Model" resource in Vertex AI. This resource essentially points to your model artifacts and specifies the serving container image.
3.1. Choose Your Prediction Container Image
Vertex AI provides pre-built container images for PyTorch, which simplifies deployment significantly. These images come with TorchServe pre-installed. You'll need to select the appropriate image based on your PyTorch version and whether you need GPU support.
Example URIs for pre-built PyTorch prediction images:
- CPU: `us-docker.pkg.dev/vertex-ai/prediction/pytorch-cpu.1-12:latest` (adjust the version as needed)
- GPU: `us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.1-12:latest` (adjust the version as needed)
For the most up-to-date list of pre-built images, refer to the official Google Cloud documentation on Vertex AI pre-built containers.
3.2. Create the Model Resource using Vertex AI SDK for Python
The Vertex AI SDK for Python is a powerful tool for programmatic interaction with Vertex AI.
First, install it: `pip install google-cloud-aiplatform`
Then, use the following Python code:
```python
from google.cloud import aiplatform

# --- Configuration ---
PROJECT_ID = "your-gcp-project-id"              # Replace with your project ID
LOCATION = "us-central1"                        # Choose your region
BUCKET_NAME = "your-project-id-pytorch-models"  # Your GCS bucket
MODEL_DISPLAY_NAME = "my-pytorch-sentiment-model"
MODEL_DESCRIPTION = "PyTorch model for sentiment analysis"
# artifact_uri must point to the GCS directory that contains your .mar file
MODEL_URI = f"gs://{BUCKET_NAME}/pytorch_models/"
PREDICTION_IMAGE_URI = "us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.1-12:latest"  # Or the CPU image

# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Create the Model resource
model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    artifact_uri=MODEL_URI,
    serving_container_image_uri=PREDICTION_IMAGE_URI,
    description=MODEL_DESCRIPTION,
    # You can add more configurations here, e.g., environment variables for your handler
    # serving_container_environment_variables={"MAX_BATCH_SIZE": "8"},
)

print(f"Model uploaded. Resource Name: {model.resource_name}")
print(f"Model ID: {model.name}")
```
Important: Make sure your service account (or user account, if running locally) has the necessary permissions (e.g., Vertex AI User, Storage Object Admin).
You can also create the model resource through the Google Cloud Console:
Navigate to "Vertex AI" -> "Models".
Click "Import".
Select "Import as new model".
Provide a display name.
For "Model settings", choose "Importing as existing custom container".
Specify the "Cloud Storage path to model artifacts" (your
gs://bucket-name/path/to/my_pytorch_model.mar
).Enter the "Docker image name" (your chosen
PREDICTION_IMAGE_URI
).Configure prediction and health routes (default
/predictions
and/ping
for TorchServe).Click "Import".
Step 4: Deploying Your Model to an Endpoint
A Vertex AI Model resource itself doesn't serve predictions. You need to deploy it to an "Endpoint," which provisions the necessary compute resources (VMs, GPUs) and exposes a stable API for inference.
4.1. Create an Endpoint Resource
An endpoint is a dedicated resource that hosts one or more deployed models.
```python
# Create an Endpoint
endpoint = aiplatform.Endpoint.create(
    display_name=f"{MODEL_DISPLAY_NAME}-endpoint",
    description=f"Endpoint for {MODEL_DISPLAY_NAME}",
)

print(f"Endpoint created. Resource Name: {endpoint.resource_name}")
print(f"Endpoint ID: {endpoint.name}")
```
4.2. Deploy the Model to the Endpoint
Now, associate your model with the newly created endpoint. This is where you specify the machine type and accelerator (GPU) configuration for serving.
```python
# --- Deployment Configuration ---
MACHINE_TYPE = "n1-standard-4"        # N1 machine types can have GPUs attached via accelerators
ACCELERATOR_TYPE = "NVIDIA_TESLA_T4"  # Or "NVIDIA_TESLA_P100", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_A100"
ACCELERATOR_COUNT = 1                 # Number of GPUs per replica

# Deploy the model (endpoint.deploy() returns None; it updates the endpoint in place)
endpoint.deploy(
    model=model,
    deployed_model_display_name=f"{MODEL_DISPLAY_NAME}-deployed",
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    min_replica_count=1,  # Minimum number of serving replicas
    max_replica_count=1,  # Maximum number of serving replicas (for auto-scaling)
    sync=True,            # Wait for deployment to complete
)

print("Model deployed to endpoint.")
for deployed_model in endpoint.list_models():
    print(f"Deployed Model ID: {deployed_model.id}")
```
Note on Machine Types and Accelerators: Choose machine types and accelerators that match your model's computational requirements and your budget. GPU instances are significantly more expensive but offer much higher inference throughput for deep learning models.
Scaling: `min_replica_count` and `max_replica_count` allow Vertex AI to auto-scale your endpoint based on traffic. For testing, keeping both at 1 is fine.
Deployment can take 15-20 minutes or even longer, depending on the machine type and model complexity. The `sync=True` parameter makes your script wait until the deployment is complete.
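If you expect variable traffic, recent versions of the Vertex AI SDK also expose autoscaling targets on `deploy()`. The sketch below is illustrative rather than a tuning recommendation; the 60% duty-cycle target and the replica counts are assumptions you should adjust for your workload:

```python
# Let Vertex AI scale between 1 and 3 replicas based on GPU utilization
endpoint.deploy(
    model=model,
    deployed_model_display_name=f"{MODEL_DISPLAY_NAME}-autoscaled",
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    min_replica_count=1,
    max_replica_count=3,
    # Add replicas when the average accelerator duty cycle exceeds roughly 60%
    autoscaling_target_accelerator_duty_cycle=60,
    sync=True,
)
```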
You can also deploy via the Google Cloud Console:
Navigate to "Vertex AI" -> "Endpoints".
Click "Create Endpoint".
Give it a display name.
Once created, click on the endpoint and then click "Deploy Model".
Select your previously uploaded model.
Configure the machine type, accelerator type, and scaling settings.
Click "Deploy".
Step 5: Getting Predictions from Your Deployed Model
Once your model is deployed and the endpoint is active, you can send prediction requests.
5.1. Send Online Prediction Requests
```python
# Prepare your input data for prediction.
# This should match the format your handler expects in its preprocess method.
# Example for a model expecting a list of 10 floats per instance:
instances = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
]

# Send the prediction request
response = endpoint.predict(instances=instances)

print("Predictions:")
for prediction in response.predictions:
    print(prediction)
```
The `instances` list should contain a list of inputs, where each input corresponds to a single inference request that your `preprocess` method can handle. The `response.predictions` field will contain the output from your `postprocess` method.
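You are not limited to the Python SDK on the client side; any HTTP client can call the endpoint's REST `:predict` method with an OAuth access token. The sketch below uses the `google-auth` and `requests` packages (both assumed to be installed) and a placeholder endpoint ID, so treat it as an illustration of the request shape rather than production client code:

```python
import google.auth
import requests
from google.auth.transport.requests import Request

PROJECT_ID = "your-gcp-project-id"
LOCATION = "us-central1"
ENDPOINT_ID = "1234567890"  # Placeholder: the numeric ID printed as endpoint.name

# Obtain an access token from Application Default Credentials
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/endpoints/{ENDPOINT_ID}:predict"
)
body = {"instances": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]}

response = requests.post(
    url,
    json=body,
    headers={"Authorization": f"Bearer {credentials.token}"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["predictions"])
```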
5.2. Perform Batch Predictions (Optional)
For large datasets where real-time inference isn't required, batch prediction is a more cost-effective option. You provide input data in a GCS bucket, and Vertex AI processes it and stores the predictions in another GCS bucket.
Upload input data to GCS: Create a JSON Lines file (e.g., `input.jsonl`) where each line is a JSON object representing an input instance.

```json
{"data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
{"data": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]}
```
Upload this file to a GCS bucket.
Create a batch prediction job:

```python
batch_input_uri = f"gs://{BUCKET_NAME}/batch_input/input.jsonl"
batch_output_uri = f"gs://{BUCKET_NAME}/batch_output/"

batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="my-pytorch-batch-prediction",
    model_name=model.resource_name,
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_source=[batch_input_uri],
    gcs_destination_prefix=batch_output_uri,
    machine_type="n1-standard-4",  # Or a GPU machine type
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    sync=True,  # Block until the job finishes
)

print(f"Batch prediction job created: {batch_job.resource_name}")
print(f"Batch prediction job state: {batch_job.state}")
```
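Once the job finishes, Vertex AI writes the results as JSON Lines files under the output prefix. A minimal sketch for collecting them with `google-cloud-storage` (reusing `PROJECT_ID` and `BUCKET_NAME` from the configuration above; the result file names are generated by Vertex AI, typically `prediction.results-*`, so adjust the filter if your output looks different):

```python
import json
from google.cloud import storage

# PROJECT_ID and BUCKET_NAME as defined in the configuration above
client = storage.Client(project=PROJECT_ID)

predictions = []
for blob in client.list_blobs(BUCKET_NAME, prefix="batch_output/"):
    # Keep only the result files; skip error and metadata files
    if "prediction.results" not in blob.name:
        continue
    for line in blob.download_as_text().splitlines():
        if line.strip():
            predictions.append(json.loads(line))

print(f"Collected {len(predictions)} prediction records")
```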
Step 6: Monitoring and Management
Once your model is deployed, monitoring its performance and resource utilization is crucial.
6.1. Vertex AI Model Monitoring
Vertex AI offers model monitoring capabilities to detect data drift, concept drift, and prediction quality issues. While primarily designed for tabular data, you can integrate custom logging to monitor specific metrics for your PyTorch models.
You can enable model monitoring when deploying your model, or configure it later from the Google Cloud Console under "Vertex AI" -> "Model Monitoring".
6.2. Logging and Metrics
Your serving container automatically sends logs to Cloud Logging. You can use these logs to debug issues and gain insights into your model's behavior. Vertex AI also provides metrics on endpoint usage, latency, and error rates in Cloud Monitoring.
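As a rough illustration, you can also pull recent prediction-container log entries programmatically with the `google-cloud-logging` client. The resource type in the filter below is an assumption based on how Vertex AI endpoints typically appear in Cloud Logging, so inspect an entry in the Logs Explorer first and adjust the filter to match:

```python
from google.cloud import logging as cloud_logging

# PROJECT_ID as defined in the configuration above
client = cloud_logging.Client(project=PROJECT_ID)

# Assumed filter for Vertex AI endpoint logs; narrow it further with your endpoint ID if needed
log_filter = 'resource.type="aiplatform.googleapis.com/Endpoint"'

for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)
```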
6.3. Undeploying and Cleaning Up
To avoid incurring unnecessary costs, remember to undeploy your model from the endpoint when it's no longer needed, and then delete the endpoint and model resources.
```python
# Undeploy the model from the endpoint
endpoint.undeploy_all()
print("Model undeployed.")

# Delete the endpoint
endpoint.delete()
print("Endpoint deleted.")

# Delete the model resource
model.delete()
print("Model resource deleted.")

# Optionally, delete the GCS bucket if it's no longer needed:
# gsutil rm -r gs://your-project-id-pytorch-models
```
10 Related FAQs
How to export a PyTorch model to TorchScript?
To export a PyTorch model to TorchScript, use `torch.jit.script(model)` for models with control flow, or `torch.jit.trace(model, example_input)` for models that are purely sequential. Save the result with `.save("model.pt")`.
How to create a custom TorchServe handler for Vertex AI?
Create a Python file (e.g., `handler.py`) with a class inheriting from `ts.torch_handler.base_handler.BaseHandler`. Implement the `initialize`, `preprocess`, `inference`, and `postprocess` methods to define your model's loading and serving logic.
How to package PyTorch model artifacts for Vertex AI deployment?
Use the `torch-model-archiver` tool to create a `.mar` file. The command typically looks like `torch-model-archiver --model-name <name> --version <version> --model-file <model.pt> --handler <handler.py> --export-path <output_dir>`.
How to upload model artifacts to Google Cloud Storage?
Use the `gsutil cp` command-line tool (e.g., `gsutil cp local/path/to/model.mar gs://your-bucket-name/path/`) or the GCS console to upload your `.mar` file.
How to choose the right pre-built PyTorch container for Vertex AI prediction?
Select the pre-built container image that matches your PyTorch version and whether your model requires a CPU or a GPU for inference (e.g., `us-docker.pkg.dev/vertex-ai/prediction/pytorch-cpu.1-12:latest` or `pytorch-gpu.1-12:latest`). Refer to the Google Cloud documentation for the latest versions.
How to deploy a PyTorch model to a Vertex AI endpoint using the SDK?
First, upload your model using `aiplatform.Model.upload()`, providing the GCS directory containing your `.mar` file as the `artifact_uri` along with the `serving_container_image_uri`. Then, create an endpoint with `aiplatform.Endpoint.create()` and deploy your model to it using `endpoint.deploy()`, specifying the machine type and accelerator.
How to send online prediction requests to a Vertex AI PyTorch endpoint?
After deployment, use the `endpoint.predict(instances=...)` method from the Vertex AI SDK, passing your input data as a list of instances in the format your custom handler's `preprocess` method expects.
How to perform batch predictions with a PyTorch model on Vertex AI?
Upload your input data (e.g., a JSON Lines file) to GCS. Then create a job with `aiplatform.BatchPredictionJob.create()`, pointing to your model, the input GCS URI, and the desired output GCS URI.
How to monitor PyTorch models deployed on Vertex AI?
Utilize Vertex AI Model Monitoring to track data drift and concept drift (primarily for tabular data). For PyTorch, rely on Cloud Logging for container logs and Cloud Monitoring for endpoint metrics like latency and error rates.
How to clean up Vertex AI resources after PyTorch model deployment?
To stop incurring costs, first call `endpoint.undeploy_all()`, then `endpoint.delete()`, and finally `model.delete()`. If you created a dedicated GCS bucket, you can also delete it using `gsutil rm -r gs://your-bucket`.