Managed Inference Job

A Managed Inference Job runs an open-source model inside a KubeVirt virtual machine instance (VMI) on your cluster and serves it behind an OpenAI-compatible API. It runs one of two runtimes, vLLM for language models or Parakeet for speech-to-text. You send requests the same way you would to any OpenAI-compatible endpoint.

When to use one

A Managed Inference Job fits when you want to call a language model over an API and let CosmicAC run it for you. You pick an open-source model, and CosmicAC serves it.

If you instead want direct control of a GPU to run your code, a GPU Container Job is the better fit. It gives you a machine and a shell, and you set up the environment yourself.

What you get

An OpenAI-compatible API — your existing clients and SDKs work without changes.
Open-source models — served on your cluster with vLLM or Parakeet.
Managed serving — CosmicAC provisions and runs the model server, so you don't set up the runtime or GPU environment yourself.

Runtimes

A Managed Inference Job runs one of two runtimes.

vLLM — serves open-source language models behind an OpenAI-compatible chat endpoint. You call it with chat completions.
Parakeet — serves the NVIDIA Parakeet speech-to-text model nvidia/parakeet-tdt-0.6b-v3 behind an OpenAI-compatible transcription endpoint. You call it by uploading an audio file.

Supported models

A vLLM job serves any model that vLLM supports. You identify the model by its Hugging Face model ID, such as Qwen/Qwen3-32B. Browse the Hugging Face model hub or the vLLM supported models list to find one. For some models, CosmicAC recommends specific serving parameters and hardware. See Recommended model parameters.

A Parakeet job serves one speech-to-text model, nvidia/parakeet-tdt-0.6b-v3.

Model masters

A model master holds the defined data CosmicAC knows about a model, such as its supported runtime image and default serving parameters. It saves you from configuring every field each time you serve that model.

When you select a model while creating a Managed Inference Job, its model master prefills the Serving configuration. Tunable fields appear in the form for you to adjust; other values stay fixed. CosmicAC also merges the master's overrides, such as the root disk size and environment variables, into the job.

An admin maintains one model master per model. You currently manage them with requests to the app-node API. See Create a model master and Recommended model parameters.

How it works

A Managed Inference Job moves through a short lifecycle. When you create it from the CLI or the web UI, CosmicAC schedules it on a GPU node in your cluster. CosmicAC then provisions a VMI that serves the model with the chosen runtime. Once the model is serving, you call it through the OpenAI-compatible endpoint, which authenticates your requests and routes them to the running model. Restarting the job replaces its VMI while keeping its storage and resources. Deleting it removes its VMI, resources, and storage.

For the component-level path a request takes through CosmicAC, see Architecture.

How you connect

You call the model in two ways. You can send requests from any OpenAI-compatible client, or run inference directly with cosmicac-cli. Both authenticate with an API key.

For the steps to connect a client, see Connect to a Managed Inference endpoint (vLLM). For a Parakeet endpoint, you upload an audio file instead. See Transcribe audio with a Parakeet endpoint.