Detect objects, people, or other entities in images with precise bounding boxes and confidence scores. Ideal for inventory management, quality control, security applications, and automated visual inspection.

Object detections visualized.

[Image: persons detection example showing detected people with bounding boxes]
[Image: faces detection example showing detected faces with bounding boxes]

Example detections of objects, people, and faces.

Usage Example

For object detection, we highly recommend using the Structured Outputs API so that bounding boxes and confidence scores are returned in a structured, validated data format.
The following example also works for face and person detection; the response schema is identical to the object detection case.
import openai
from pydantic import BaseModel, Field

class Detection(BaseModel):
    label: str = Field(..., description="Name of the detected object")
    xywh: tuple[float, float, float, float] = Field(..., description="Normalized bounding box of the detection (x, y, width, height)")

class Detections(BaseModel):
    detections: list[Detection] = Field(..., description="List of detected objects")

# Initialize the client
client = openai.OpenAI(
    base_url="https://agent.vlm.run/v1/openai",
    api_key="<VLMRUN_API_KEY>"
)

# Detect objects in the image
response = client.chat.completions.create(
    model="vlm-agent-1",
    messages=[
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Detect all the donuts in this image"},
            {"type": "image_url", "image_url": {"url": "https://storage.googleapis.com/vlm-data-public-prod/hub/examples/image.object-detection/donuts.png", "detail": "auto"}}
          ]
        }
    ],
    response_format={"type": "json_schema", "json_schema": {"name": "detections", "schema": Detections.model_json_schema()}},
)

# Print the response
print(response.choices[0].message.content)

# Validate the response against the schema
detections = Detections.model_validate_json(response.choices[0].message.content)
print(detections)
# Example output (coordinates are normalized to [0, 1]):
# Detections(detections=[Detection(label="donut", xywh=(0.15, 0.20, 0.18, 0.25)), Detection(label="donut", xywh=(0.42, 0.22, 0.17, 0.24)), Detection(label="donut", xywh=(0.68, 0.21, 0.18, 0.26))])
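Once validated, the detections are plain Python objects that are easy to post-process. As a minimal sketch for an inventory-style tally, the detections can be counted per label (the JSON below is illustrative, standing in for the model's actual response content):

```python
import json
from collections import Counter

# Illustrative JSON, standing in for response.choices[0].message.content
content = json.dumps({
    "detections": [
        {"label": "donut", "xywh": [0.15, 0.20, 0.18, 0.25]},
        {"label": "donut", "xywh": [0.42, 0.22, 0.17, 0.24]},
        {"label": "apple", "xywh": [0.70, 0.60, 0.10, 0.12]},
    ]
})

# Tally detections per label for an inventory-style count
counts = Counter(d["label"] for d in json.loads(content)["detections"])
print(counts)  # Counter({'donut': 2, 'apple': 1})
```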

FAQ

What objects can be detected?

  • Common Objects: person, car, truck, bus, bicycle, motorcycle
  • Animals: dog, cat, bird, horse, cow, sheep
  • Food Items: apple, banana, sandwich, pizza, donut, cake
  • Electronics: laptop, phone, tv, keyboard, mouse
  • Furniture: chair, table, bed, sofa, desk
  • And the rest of the 80 COCO dataset classes
What format are the bounding boxes in?

Bounding boxes use the normalized xywh format, where x and y are the top-left corner of the box and w and h are its width and height. All values lie between 0 and 1: x and w are normalized by the image width, y and h by the image height.
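This convention can be sketched as a small helper that recovers absolute pixel coordinates from a normalized box (the function name and image size here are illustrative):

```python
def xywh_to_pixels(xywh, image_width, image_height):
    """Convert a normalized (x, y, w, h) box to absolute pixel coordinates."""
    x, y, w, h = xywh
    return (
        round(x * image_width),   # x and w scale with image width
        round(y * image_height),  # y and h scale with image height
        round(w * image_width),
        round(h * image_height),
    )

# A normalized box on a 1000x800 image
print(xywh_to_pixels((0.15, 0.20, 0.18, 0.25), 1000, 800))  # (150, 160, 180, 200)
```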
What do the confidence scores mean?

The confidence score is a value between 0 and 1 indicating how confident the model is in a given detection.
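If your response schema includes a confidence field (an assumption here; the schema in the usage example above does not include one), a simple threshold filter discards low-confidence detections:

```python
# Illustrative detections with an assumed "confidence" field
detections = [
    {"label": "donut", "confidence": 0.92},
    {"label": "donut", "confidence": 0.35},
    {"label": "apple", "confidence": 0.81},
]

CONFIDENCE_THRESHOLD = 0.5  # tune per application

# Keep only detections the model is reasonably sure about
kept = [d for d in detections if d["confidence"] >= CONFIDENCE_THRESHOLD]
print([d["label"] for d in kept])  # ['donut', 'apple']
```

A higher threshold trades recall for precision: raise it for applications like security alerts where false positives are costly, lower it for exhaustive inventory counts.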