
Inference Gateway: Migrating from v1alpha2 to v1 API

Introduction

This guide provides a comprehensive walkthrough for migrating your Inference Gateway setup from the v1alpha2 API to the generally available v1 API. It is intended for platform administrators and networking specialists who are currently using v1alpha2 and want to upgrade to v1 to take advantage of the latest features and improvements.

Before you start the migration, ensure you are familiar with the concepts and deployment of the Inference Gateway.


Before you begin

Before starting the migration, it's important to determine if this guide is necessary for your setup.

Checking for Existing v1alpha2 APIs

To check if you are actively using the v1alpha2 Inference Gateway APIs, run the following command:

kubectl get inferencepools.inference.networking.x-k8s.io --all-namespaces
  • If this command returns one or more InferencePool resources, you are using the v1alpha2 API and should proceed with this migration guide.
  • If the command returns No resources found, you are not using the v1alpha2 InferencePool and do not need to follow this migration guide. You can proceed with a fresh installation of the v1 Inference Gateway.

Migration Paths

There are two paths for migrating from v1alpha2 to v1:

  1. Simple Migration (with downtime): This path is for users who can afford a short period of downtime. It involves deleting the old v1alpha2 resources and CRDs before installing the new v1 versions.
  2. Zero-Downtime Migration: This path is for users who need to migrate without any service interruption. It involves running both v1alpha2 and v1 stacks side-by-side and gradually shifting traffic.

Simple Migration (with downtime)

This approach is faster and simpler but will result in a brief period of downtime while the resources are being updated. It is the recommended path if you do not require a zero-downtime migration.

1. Delete Existing v1alpha2 Resources

Option a: Uninstall using Helm.

helm uninstall <helm_alpha_inferencepool_name>

Option b: Manually delete alpha InferencePool resources.

If you are not using Helm, you will need to manually delete all resources associated with your v1alpha2 deployment. The key is to remove the HTTPRoute's reference to the old InferencePool and then delete the v1alpha2 resources themselves.

  1. Update or Delete the HTTPRoute: Modify the HTTPRoute to remove the backendRef that points to the v1alpha2 InferencePool.
  2. Delete the InferencePool and associated resources: You must delete the v1alpha2 InferencePool, any InferenceModel resources that point to it, and the corresponding Endpoint Picker (EPP) Deployment and Service, as shown in the sketch after this list.
  3. Delete the v1alpha2 CRDs: Once all v1alpha2 custom resources are deleted, you can remove the CRD definitions from your cluster.
    kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml
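
For illustration, here is a hedged sketch of the manual cleanup, assuming the alpha release was named vllm-llama3-8b-instruct-alpha and that the EPP resources follow the common <release>-epp naming; substitute the names from your own deployment:

# Delete all alpha InferenceModels in the namespace (adjust if other pools still need them).
kubectl delete inferencemodels.inference.networking.x-k8s.io --all
# Delete the alpha InferencePool itself (name is an assumption).
kubectl delete inferencepools.inference.networking.x-k8s.io vllm-llama3-8b-instruct-alpha
# Delete the Endpoint Picker Deployment and Service (names are assumptions).
kubectl delete deployment vllm-llama3-8b-instruct-alpha-epp
kubectl delete service vllm-llama3-8b-instruct-alpha-epp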
    

2. Install v1 Resources

After cleaning up the old resources, you can proceed with a fresh installation of the v1 Inference Gateway. This involves installing the new v1 CRDs, creating a new v1 InferencePool and corresponding InferenceObjective resources, and creating a new HTTPRoute that directs traffic to your new v1 InferencePool.
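
As a hedged sketch, the sequence looks like the following. The release asset URL assumes v1.0.0 publishes a combined manifests.yaml like the v0.3.0 asset used above, and the Helm release name, label, and provider values are placeholders to replace with your own:

RELEASE=v1.0.0

# Install the v1 CRDs.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$RELEASE/manifests.yaml

# Install a v1 InferencePool with Helm (same chart used in the zero-downtime path below).
helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=<the_label_you_used_for_the_model_server_deployment> \
  --set provider.name=<YOUR_PROVIDER> \
  --version $RELEASE \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

# Create InferenceObjective resources as shown in Stage 1, step 3 of the
# zero-downtime path, then point the HTTPRoute at the new v1 pool
# (this route mirrors the Stage 3 example below).
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
      weight: 100
EOF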

3. Verify the Deployment

After a few minutes, verify that your new v1 stack is correctly serving traffic. You should see a gateway with PROGRAMMED set to True:

kubectl get gateway -o wide
NAME                CLASS               ADDRESS        PROGRAMMED   AGE
inference-gateway   inference-gateway   <IP_ADDRESS>   True         10m

Curl the endpoint and verify a 200 response code:

IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "<your_model>",
  "prompt": "<your_prompt>",
  "max_tokens": 100,
  "temperature": 0
}'

Zero-Downtime Migration

This migration path is designed for users who cannot afford any service interruption. It assumes you are starting from the stack shown in the following diagram:

Inference Gateway Alpha Stage

A Note on Interacting with Multiple API Versions

During the zero-downtime migration, both v1alpha2 and v1 CRDs will be installed on your cluster. This can create ambiguity when using kubectl to query for InferencePool resources. To ensure you are interacting with the correct version, you must use the full resource name:

  • For v1alpha2: kubectl get inferencepools.inference.networking.x-k8s.io
  • For v1: kubectl get inferencepools.inference.networking.k8s.io

The v1 API also provides a convenient short name, infpool, which can be used to query v1 resources specifically:

kubectl get infpool

This guide will use these full names or the short name for v1 to avoid ambiguity.


Stage 1: Side-by-side v1 Deployment

In this stage, you will deploy the new v1 InferencePool stack alongside the existing v1alpha2 stack. This allows for a safe, gradual migration.

After finishing all the steps in this stage, you’ll have the infrastructure shown in the following diagram:

Inference Gateway Migration Stage

1. Install v1 CRDs

RELEASE=v1.0.0
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/$RELEASE/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml

2. Install the v1 InferencePool

Use Helm to install a new v1 InferencePool with a distinct release name (e.g., vllm-llama3-8b-instruct-ga).

helm install vllm-llama3-8b-instruct-ga \
  --set inferencePool.modelServers.matchLabels.app=<the_label_you_used_for_the_model_server_deployment> \
  --set provider.name=<YOUR_PROVIDER> \
  --version $RELEASE \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

3. Create the v1 InferenceObjective

The v1 API replaces InferenceModel with InferenceObjective. Create the new resources, referencing the new v1 InferencePool.

kubectl apply -f - <<EOF
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 1
  poolRef:
    group: inference.networking.k8s.io
    name: vllm-llama3-8b-instruct-ga
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: base-model
spec:
  priority: 2
  poolRef:
    group: inference.networking.k8s.io
    name: vllm-llama3-8b-instruct-ga
---
EOF
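
As a quick sanity check that the new v1 stack registered correctly, you can list the pool (using the v1 short name introduced above) and the new objectives:

kubectl get infpool vllm-llama3-8b-instruct-ga
kubectl get inferenceobjectives.inference.networking.x-k8s.io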

Stage 2: Traffic Shifting

With both stacks running, you can start shifting traffic from v1alpha2 to v1 by updating the HTTPRoute. This example shows a 50/50 split.

1. Update HTTPRoute for Traffic Splitting

kubectl apply -f - <<EOF
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-alpha
      weight: 50
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-ga
      weight: 50
---
EOF

2. Verify and Monitor

After applying the changes, monitor the performance and stability of the new v1 stack. Make sure the inference-gateway Gateway reports PROGRAMMED as True.
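
For example, a minimal check using the resource names from this guide:

kubectl get gateway inference-gateway
kubectl get httproute llm-route -o yaml   # inspect the status conditions for acceptance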


Stage 3: Finalization and Cleanup

Once you have verified that the v1 InferencePool is stable, you can direct all traffic to it and decommission the old v1alpha2 resources.

1. Shift 100% of Traffic to the v1 InferencePool

Update the HTTPRoute to send all traffic to the v1 pool.

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct-ga
      weight: 100
EOF

2. Final Verification

Send test requests to ensure your v1 stack is handling all traffic as expected. Your final infrastructure matches the following diagram:

Inference Gateway GA Stage

You should see a gateway with PROGRAMMED set to True:

kubectl get gateway -o wide
NAME                CLASS               ADDRESS        PROGRAMMED   AGE
inference-gateway   inference-gateway   <IP_ADDRESS>   True         10m

Curl the endpoint and verify a 200 response code:

IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "<your_model>",
  "prompt": "<your_prompt>",
  "max_tokens": 100,
  "temperature": 0
}'

3. Clean Up v1alpha2 Resources

After confirming the v1 stack is fully operational, remove the old v1alpha2 resources: the alpha InferencePool, any InferenceModel resources, the alpha Endpoint Picker (EPP) Deployment and Service, and finally the v1alpha2 CRDs.
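
The cleanup mirrors the simple-migration path above. A hedged sketch, assuming the alpha stack was installed with Helm:

# Uninstall the alpha InferencePool release.
helm uninstall <helm_alpha_inferencepool_name>

# Delete any remaining alpha InferenceModel resources in the namespace.
kubectl delete inferencemodels.inference.networking.x-k8s.io --all

# Remove the v1alpha2 CRDs using the same manifest that installed them.
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml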