Visual Language Model Introduction

Use Cases

A visual language model (VLM) accepts both images and text as input. With a VLM, you can combine an image with text instructions so the model can understand visual content together with the surrounding context and respond accordingly. Common scenarios include:

interpreting visual content such as objects, text, spatial relationships, colors, and scene mood
carrying out multi-turn conversations grounded in images
partially replacing traditional OCR and other computer-vision pipelines
preparing for more advanced applications such as visual agents and robotics

Usage

To use a VLM, call the /chat/completions endpoint with a message whose content includes an image, supplied either as an image URL or as a Base64-encoded data URL. Use the detail parameter to control image preprocessing.
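As a minimal sketch of such a call, the Python snippet below builds a request body and posts it using only the standard library. The base URL, the model ID, and the API_KEY environment variable are placeholders assumed for illustration; none of them comes from this page, so substitute the values for your own account.

```python
import json
import os
import urllib.request

# Assumed placeholders; replace with your provider's base URL and a VLM model ID.
BASE_URL = "https://api.example.com/v1"
MODEL = "your-vlm-model-id"


def build_payload(image_url, prompt, detail="high", model=MODEL):
    """Return a /chat/completions request body with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send(payload, api_key):
    """POST the payload to /chat/completions and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_payload("https://example.com/cat.png", "Describe this image.")
    print(json.dumps(body, indent=2))
    # Uncomment to send for real (requires a valid key in the API_KEY variable):
    # print(send(body, os.environ["API_KEY"]))
```

Keeping payload construction separate from the network call makes it possible to inspect or log the request body without sending anything.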

2.1 Image Detail Control

SiliconCloud provides three detail options: low, high, and auto. For currently supported models, omitting detail or setting it to high uses high-resolution mode, while low or auto uses low-resolution mode.
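The mapping described above can be written down as a small lookup. The helper name effective_detail is hypothetical and not part of any API, and rejecting other values is an assumption of this sketch, since the page does not say how the service treats unknown values.

```python
def effective_detail(detail=None):
    """Return the resolution mode that the given detail setting resolves to."""
    # Omitted or "high" selects high-resolution mode.
    if detail is None or detail == "high":
        return "high"
    # "low" and "auto" both select low-resolution mode for currently supported models.
    if detail in ("low", "auto"):
        return "low"
    # Assumption: anything else is treated as a caller error in this sketch.
    raise ValueError(f"unsupported detail value: {detail!r}")
```

Note that auto resolves to low-resolution mode here, so set detail to high explicitly (or omit it) when fine detail matters.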

2.2 Example Message with an Image URL

json

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}

2.3 Example Message with a Base64 Image

json

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
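The {base64_image} placeholder above stands for the Base64 text of the image bytes. One possible way to produce the full data URL in Python is sketched below; the helper names are hypothetical, and the MIME type is guessed from the file extension, which is an assumption that may not hold for files with unusual names.

```python
import base64
import mimetypes


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL usable in the image_url.url field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path):
    """Read an image file and return it as a data URL."""
    # Guess the MIME type from the extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    with open(path, "rb") as image_file:
        return to_data_url(image_file.read(), mime_type)
```

The returned string goes into the url field exactly as an ordinary image URL would.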

2.4 Multi-Image Input

Each image can use either a public URL or a Base64 data URL.

json

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}

Note: the DeepseekVL2 series is designed for short contexts, so pass at most two images. If more than two images are provided, the model automatically resizes them to 384 x 384 and ignores the specified detail value.
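A multi-image message like the one above can be assembled programmatically. The sketch below is illustrative only: both function names are hypothetical, and the default limit of two reflects the DeepseekVL2 recommendation, not a general rule for other models.

```python
def build_multi_image_message(image_urls, prompt, detail=None):
    """Build one user message holding several images followed by a text prompt."""
    content = []
    for url in image_urls:
        image = {"url": url}  # a public URL or a Base64 data URL
        if detail is not None:
            image["detail"] = detail
        content.append({"type": "image_url", "image_url": image})
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def exceeds_recommended_images(message, limit=2):
    """Return True if the message carries more images than the given limit."""
    images = [part for part in message["content"] if part["type"] == "image_url"]
    return len(images) > limit
```

Checking exceeds_recommended_images before sending lets a caller trim the image list when targeting a DeepseekVL2 model.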

Billing for Visual Inputs

Visual inputs are converted into tokens and billed together with text. The exact conversion differs by model.

For OpenAI-style image billing:

detail: low costs a flat 85 tokens per image, regardless of image size
detail: high first scales the image to fit within a 2048 x 2048 square while preserving aspect ratio, then scales it so that the shortest side is 768 px, and finally counts the 512 x 512 px tiles needed to cover the result
each tile costs 170 tokens, and a fixed 85-token base cost is added per image

Examples:

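1024 x 1024 with detail: high is resized to 768 x 768, which needs 4 tiles: 170 * 4 + 85 = 765 tokens
2048 x 4096 with detail: high is resized to 1024 x 2048 and then to 768 x 1536, which needs 6 tiles: 170 * 6 + 85 = 1105 tokens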
4096 x 8192 with detail: low costs 85 tokens
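The arithmetic above can be reproduced with a short function. This is a sketch of the OpenAI-style rule as described here, not any provider's billing code; the function name is hypothetical, and it assumes that both resize steps only shrink an image and never enlarge it, which the text does not state.

```python
import math


def image_tokens(width, height, detail="high"):
    """Estimate the token cost of one image under the OpenAI-style rule."""
    if detail == "low":
        return 85  # flat cost, independent of image size
    if detail != "high":
        raise ValueError(f"unsupported detail value: {detail!r}")
    # Step 1: shrink to fit within a 2048 x 2048 square, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is 768 px (assumption: never upscale).
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512 px tiles needed to cover the resized image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85


print(image_tokens(1024, 1024, "high"))  # 765
print(image_tokens(2048, 4096, "high"))  # 1105
print(image_tokens(4096, 8192, "low"))   # 85
```

Because the conversion differs by model, treat the result as an estimate and rely on the usage figures returned by the API for actual billing.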

Limitations

Known limitations include:

medical images: not suitable for interpreting CT scans or other specialized medical material
non-Latin text: recognition quality can drop for scripts such as Japanese or Korean
small text: enlarging the relevant area improves readability
rotation: upside-down or rotated content can be misread
visual elements: charts or diagrams with line-style differences may be hard to understand
spatial reasoning: tasks requiring exact positional reasoning can be unreliable
accuracy: some descriptions or titles may be incorrect
panoramas and fisheye images: performance is weaker
metadata: original filenames and metadata are not used
counting: only approximate counts should be expected
CAPTCHA: blocked for safety reasons

FAQ

Can GPT-4 with Vision generate images? No. Use image-generation models such as dall-e-3 for image creation, and use gpt-4o, gpt-4o-mini, or gpt-4-turbo for image understanding.

What file types are supported? PNG, JPEG, JPG, WEBP, and non-animated GIF.

Is there an upload size limit? Yes. The current upload limit is 20 MB per image.

Can uploaded images be deleted? They are deleted automatically after processing.

Where can I learn more? See the GPT-4 with Vision system card and the provider’s official documentation.
