Generate comprehensive, contextual captions for images using state-of-the-art vision-language models. Perfect for accessibility, content management, and automated image analysis workflows.

Example image to be captioned.

Image caption
"A classic, light turquoise Volkswagen Beetle with chrome accents is parked on a cobblestone street, set against a warm yellow stucco wall with rustic brown wooden doors and windows."

Usage Example

import openai

# Initialize the OpenAI client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Caption the image
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate a detailed caption for this image"},
                {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg", "detail": "auto"}},
            ],
        }
    ],
)

# Print the response
print(response.choices[0].message.content)
>> "A classic, light turquoise Volkswagen Beetle with chrome accents is parked on a cobblestone street, set against a warm yellow stucco wall with rustic brown wooden doors and windows."
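Local images can likely be sent with the same request shape if the endpoint follows the OpenAI convention of accepting base64 data URLs in the `image_url` field (an assumption, not confirmed above). The hypothetical helper `image_data_url` below sketches the encoding:

```python
import base64
import mimetypes

def image_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL, the format
    OpenAI-compatible chat APIs accept in `image_url`."""
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Hypothetical usage with the client from the example above
# (requires a real local file and a valid VLMRUN_API_KEY):
# response = client.chat.completions.create(
#     model="vlm-agent-1",
#     messages=[{
#         "role": "user",
#         "content": [
#             {"type": "text", "text": "Generate a detailed caption for this image"},
#             {"type": "image_url", "image_url": {"url": image_data_url("car.jpg")}},
#         ],
#     }],
# )
```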

FAQ

You can simply ask for a more detailed caption by providing a more detailed prompt. In most cases, you can specify the number of words you want, and the model will generate a caption of roughly that length.
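The length hint above can be built into the prompt programmatically. `caption_prompt` is a hypothetical helper (not part of the VLM Run API) that templates the word count into the text portion of the request:

```python
def caption_prompt(word_count: int) -> str:
    """Build a captioning prompt that asks for a target length in words."""
    return (
        f"Generate a detailed caption for this image "
        f"in roughly {word_count} words."
    )

# Hypothetical usage with the client from the example above:
# response = client.chat.completions.create(
#     model="vlm-agent-1",
#     messages=[{
#         "role": "user",
#         "content": [
#             {"type": "text", "text": caption_prompt(100)},
#             {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.caption/car.jpg"}},
#         ],
#     }],
# )
```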
The model can also generate tags for an image, covering categories such as:
  • Common Objects: person, car, truck, bus, bicycle, motorcycle
  • Scenes: street, building, park, forest, beach, etc.
  • Time-of-Day: morning, afternoon, evening, night
  • Weather: sunny, cloudy, rainy, snowing, etc.
Tags are returned as a list of strings.
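Since the tags come back as a list of strings, a small parser makes the model's reply safe to consume programmatically. `parse_tags` is a hypothetical helper that assumes the reply is either a JSON array or a comma-separated string (neither format is guaranteed by the docs above):

```python
import json

def parse_tags(text: str) -> list[str]:
    """Parse a model reply expected to contain a list of tag strings.

    Tries JSON first (e.g. '["car", "street"]'); falls back to
    comma-splitting for plain-text replies (e.g. 'car, street').
    """
    try:
        tags = json.loads(text)
        if isinstance(tags, list):
            return [str(t).strip() for t in tags]
    except json.JSONDecodeError:
        pass
    return [t.strip() for t in text.split(",") if t.strip()]
```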