Visual Language Model Introduction

Use Cases

A visual language model (VLM) accepts both images and text as input. With a VLM, you can combine an image with text instructions so the model can understand visual content together with the surrounding context and respond accordingly. Common scenarios include:

  • interpreting visual content such as objects, text, spatial relationships, colors, and scene mood
  • carrying out multi-turn conversations grounded in images
  • partially replacing traditional OCR and other computer-vision pipelines
  • preparing for more advanced applications such as visual agents and robotics

Usage

To use a VLM, call the /chat/completions endpoint with a message whose content includes an image, supplied either as a public image URL or as a Base64-encoded data URL. The detail parameter controls how the image is preprocessed.

2.1 Image Detail Control

SiliconCloud accepts three values for detail: low, high, and auto. For the currently supported models, omitting detail or setting it to high selects high-resolution mode, while setting it to low or auto selects low-resolution mode.
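The mapping described above can be sketched as a small helper. This is only an illustration of the stated rule. The function name effective_detail is hypothetical, and rejecting unknown values with an error is an assumption, since the text does not say how the service handles them.

```python
from typing import Optional


def effective_detail(detail: Optional[str] = None) -> str:
    """Return the resolution mode that the text above describes for a detail value.

    Hypothetical helper: it mirrors the documented rule, not the service's code.
    """
    # Omitted or "high" -> high-resolution mode.
    if detail is None or detail == "high":
        return "high"
    # "low" and "auto" -> low-resolution mode for currently supported models.
    if detail in ("low", "auto"):
        return "low"
    # Assumption: anything else is treated as a caller error.
    raise ValueError(f"unsupported detail value: {detail!r}")
```

Resolving the mode on the client side this way can make it easier to estimate cost before sending a request.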

2.2 Example Message with an Image URL

json
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
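As a minimal sketch, the message above could be assembled and wrapped in an HTTP request using only the Python standard library. The function names are hypothetical. The top-level "model" and "messages" fields and the Bearer authorization header are assumptions based on the common OpenAI-style chat-completions convention; the base URL, API key, and model name are values you must supply yourself.

```python
import json
import urllib.request


def build_image_url_message(image_url, prompt, detail="high"):
    """Build a user message that pairs one image URL with a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
            {"type": "text", "text": prompt},
        ],
    }


def build_request(base_url, api_key, model, messages):
    """Prepare (but do not send) a POST request to <base_url>/chat/completions.

    Assumption: the body uses OpenAI-style "model" and "messages" fields and
    the service expects an "Authorization: Bearer <key>" header.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and decode the JSON response body.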

2.3 Example Message with a Base64 Image

json
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
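The {base64_image} placeholder above stands for the Base64 text of the image bytes. A minimal sketch of producing such a data URL in Python follows; the helper name to_data_url is hypothetical, and the MIME type should match the actual image format.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in the "url" field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

For a file on disk, the bytes can be read with pathlib.Path("photo.jpg").read_bytes() (the filename here is only an example) and passed to the helper.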

2.4 Multi-Image Input

Each image can use either a public URL or a Base64 data URL.

json
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}

Note: the DeepseekVL2 series is designed for short contexts, so pass at most two images. If more than two images are provided, the model automatically resizes each one to 384 x 384 and ignores the specified detail value.
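A multi-image message like the one above can be built from a list of image sources. The sketch below uses hypothetical helper names. The two-image check simply restates the recommendation in the note and is not an API feature.

```python
def build_multi_image_message(image_urls, prompt, detail=None):
    """Build one user message with several images followed by a text prompt.

    Each entry in image_urls may be a public URL or a Base64 data URL.
    """
    content = []
    for url in image_urls:
        image = {"url": url}
        if detail is not None:
            # Only include "detail" when the caller asked for a specific mode.
            image["detail"] = detail
        content.append({"type": "image_url", "image_url": image})
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def exceeds_recommended_image_count(image_urls, limit=2):
    """Return True when more images are passed than the note recommends."""
    return len(image_urls) > limit
```

Calling exceeds_recommended_image_count before sending lets a client warn the user that a DeepseekVL2 model would downscale the images and ignore detail.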

Billing for Visual Inputs

Visual inputs are converted into tokens and billed together with text. The exact conversion differs by model.

For OpenAI-style image billing:

  • detail: low costs 85 tokens per image
  • detail: high first scales the image to fit within a 2048 x 2048 square while preserving aspect ratio, then scales it so that the shortest side is 768px, and finally counts the 512px x 512px tiles needed to cover the result
  • each tile costs 170 tokens, and a fixed base cost of 85 tokens is added per image

Examples:

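  • 1024 x 1024 with detail: high is resized to 768 x 768, which needs 4 tiles: 170 * 4 + 85 = 765 tokens
  • 2048 x 4096 with detail: high is first fitted to 1024 x 2048, then resized to 768 x 1536, which needs 6 tiles: 170 * 6 + 85 = 1105 tokens
  • 4096 x 8192 with detail: low costs 85 tokens regardless of size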
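The OpenAI-style rule above can be written as a short calculation. This sketch applies the two resize steps literally as described. Whether images smaller than 768px on the short side are scaled up is not stated in the text, so the behavior for such inputs is an assumption. The function name is hypothetical.

```python
import math


def openai_style_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image under the rule described above."""
    if detail == "low":
        return 85  # flat cost, independent of image size
    if detail != "high":
        raise ValueError("this sketch only handles 'low' and 'high'")

    # Step 1: fit inside a 2048 x 2048 square, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: scale so that the shortest side is 768px.
    scale = 768 / min(w, h)
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512px tiles, then add the fixed base cost.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85
```

Running it on the three examples gives 765, 1105, and 85 tokens respectively, matching the figures listed above.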

Limitations

Known limitations include:

  • medical images: not suitable for interpreting CT scans or other specialized medical material
  • non-Latin text: recognition quality can drop for scripts such as Japanese or Korean
  • small text: enlarging the relevant area improves readability
  • rotation: upside-down or rotated content can be misread
  • visual elements: charts or diagrams that rely on line-style differences, such as solid versus dashed lines, may be misinterpreted
  • spatial reasoning: tasks requiring exact positional reasoning can be unreliable
  • accuracy: the model may produce incorrect descriptions or captions
  • panoramas and fisheye images: performance is weaker
  • metadata: original filenames and metadata are not used
  • counting: only approximate counts should be expected
  • CAPTCHA: CAPTCHA submissions are blocked for safety reasons

FAQ

Can GPT-4 with Vision generate images? No. Use image-generation models such as dall-e-3 for image creation, and use gpt-4o, gpt-4o-mini, or gpt-4-turbo for image understanding.

What file types are supported? PNG, JPEG, JPG, WEBP, and non-animated GIF.

Is there an upload size limit? Yes. The current upload limit is 20 MB per image.
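The file-type and size answers above can be checked on the client before uploading. The sketch below is a hypothetical pre-flight check. It judges the type by file extension only, does not detect animated GIFs, and assumes that 20 MB means 20 * 1024 * 1024 bytes, which the text does not specify.

```python
from pathlib import Path

# Extensions for the formats listed above (animated GIFs are not detected here).
SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> list:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("exceeds 20 MB limit")
    return problems
```

For a real file, the size can be obtained with Path(filename).stat().st_size.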

Can uploaded images be deleted? They are deleted automatically after processing.

Where can I learn more? See the GPT-4 with Vision system card and the provider’s official documentation.