Visual Language Model Introduction
Use Cases
A visual language model (VLM) accepts both images and text as input. With a VLM, you can combine an image with text instructions so the model can understand visual content together with the surrounding context and respond accordingly. Common scenarios include:
- interpreting visual content such as objects, text, spatial relationships, colors, and scene mood
- carrying out multi-turn conversations grounded in images
- partially replacing traditional OCR and other computer-vision pipelines
- preparing for more advanced applications such as visual agents and robotics
Usage
To use a VLM, call the /chat/completions endpoint with a message whose content includes either an image URL or a Base64-encoded image. Use the detail parameter to control how the image is preprocessed.
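As a rough illustration of such a call, the sketch below assembles a POST request for /chat/completions using only the Python standard library. The base URL, the model name, and the Bearer-token header are assumptions made for this example and are not stated in this document; substitute the values from your own account. The request is only built, not sent.

```python
import json
import urllib.request

# Assumptions for illustration only: replace with your real values.
BASE_URL = "https://api.example.com/v1"   # hypothetical base URL
MODEL = "your-vlm-model-name"             # hypothetical model identifier


def build_chat_request(api_key, image_url, prompt, detail="high"):
    """Build (but do not send) a /chat/completions request with one image."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url, "detail": detail},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Bearer authentication is assumed here, as in OpenAI-style APIs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; the response format is not covered here.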
2.1 Image Detail Control
SiliconCloud provides three detail options: low, high, and auto. For currently supported models, omitting detail or setting it to high uses high-resolution mode, while low or auto uses low-resolution mode.
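The mapping described above can be written down as a small helper. This is only a sketch of the rule as stated for currently supported models; the function name effective_detail is hypothetical and not part of any API.

```python
def effective_detail(detail=None):
    """Return the resolution mode applied for a given detail value.

    Mirrors the rule stated in the text: omitted or "high" means
    high-resolution mode; "low" or "auto" means low-resolution mode.
    """
    if detail is None or detail == "high":
        return "high"
    if detail in ("low", "auto"):
        return "low"
    raise ValueError(f"unsupported detail value: {detail!r}")
```

For example, effective_detail("auto") returns "low", which is worth keeping in mind when estimating token usage.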
2.2 Example Message with an Image URL
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
2.3 Example Message with a Base64 Image
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
2.4 Multi-Image Input
Each image can use either a public URL or a Base64 data URL.
{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "text-prompt here"
    }
  ]
}
Note: the DeepseekVL2 series is intended for short contexts, so pass at most two images. If more than two images are provided, the model automatically resizes them to 384 x 384 and ignores the specified detail value.
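The message shapes shown in this section can be produced programmatically. The sketch below encodes raw image bytes as a Base64 data URL and builds a user message from any mix of public URLs and data URLs. The helper names (to_data_url, image_part, build_user_message) are hypothetical, and the MIME type defaults to image/jpeg only because the examples above use it.

```python
import base64


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL usable in the "url" field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(url, detail=None):
    """Build one image_url content part; detail is omitted when None."""
    image_url = {"url": url}
    if detail is not None:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


def build_user_message(image_urls, prompt, detail=None):
    """Build a user message with the images first and the text prompt last."""
    content = [image_part(url, detail) for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}
```

A local file could be included with to_data_url(open(path, "rb").read()); for the DeepseekVL2 series, keeping the image_urls list to two entries avoids the automatic resize mentioned in the note.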
Billing for Visual Inputs
Visual inputs are converted into tokens and billed together with text. The exact conversion differs by model.
For OpenAI-style image billing:
- detail: low costs 85 tokens per image
- detail: high first scales the image to fit within a 2048 x 2048 square while preserving aspect ratio, then scales it so that the shortest side is 768px, and finally counts the 512 x 512px tiles needed to cover the result
- under detail: high, each tile costs 170 tokens, and a fixed 85-token base cost is added per image
Examples:
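- 1024 x 1024 with detail: high is resized to 768 x 768, which needs 2 x 2 = 4 tiles, so it costs 170 * 4 + 85 = 765 tokens
- 2048 x 4096 with detail: high is first fitted to 1024 x 2048, then resized to 768 x 1536, which needs 2 x 3 = 6 tiles, so it costs 170 * 6 + 85 = 1105 tokens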
- 4096 x 8192 with detail: low costs 85 tokens
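The OpenAI-style calculation above can be sketched as a short function. It follows the steps exactly as written in this section, including resizing the shortest side to 768px even for small images; actual providers may differ (for instance by never upscaling), so treat the result as an estimate. The function name is hypothetical.

```python
import math


def openai_style_image_tokens(width, height, detail="high"):
    """Estimate the token cost of one image under the rules described above."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: shrink to fit within 2048 x 2048, preserving aspect ratio.
    fit = min(1.0, 2048 / max(width, height))
    w, h = width * fit, height * fit

    # Step 2: scale so that the shortest side is 768 px.
    scale = 768 / min(w, h)
    w, h = round(w * scale), round(h * scale)

    # Step 3: count the 512 x 512 tiles needed to cover the image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


print(openai_style_image_tokens(1024, 1024))          # 765
print(openai_style_image_tokens(2048, 4096))          # 1105
print(openai_style_image_tokens(4096, 8192, "low"))   # 85
```

The three printed values match the worked examples listed above.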
Limitations
Known limitations include:
- medical images: not suitable for interpreting CT scans or other specialized medical material
- non-Latin text: recognition quality can drop for scripts such as Japanese or Korean
- small text: enlarging the relevant area improves readability
- rotation: upside-down or rotated content can be misread
- visual elements: charts or diagrams that rely on line-style differences (for example solid, dashed, or dotted lines) may be hard to interpret
- spatial reasoning: tasks requiring exact positional reasoning can be unreliable
- accuracy: some descriptions or titles may be incorrect
- panoramas and fisheye images: performance is weaker
- metadata: original filenames and metadata are not used
- counting: only approximate counts should be expected
- CAPTCHAs: submissions are blocked for safety reasons
FAQ
Can GPT-4 with Vision generate images? No. Use image-generation models such as dall-e-3 for image creation, and use gpt-4o, gpt-4o-mini, or gpt-4-turbo for image understanding.
What file types are supported? PNG, JPEG, JPG, WEBP, and non-animated GIF.
Is there an upload size limit? Yes. The current upload limit is 20 MB per image.
Can uploaded images be deleted? They are deleted automatically after processing.
Where can I learn more? See the GPT-4 with Vision system card and the provider’s official documentation.
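The file-type and size limits listed in this FAQ can be checked before uploading. The sketch below is a minimal client-side check; it assumes 20 MB means 20 * 1024 * 1024 bytes, judges the type by filename extension only, and does not detect animated GIFs. The function name check_image is hypothetical.

```python
import os

# Extensions taken from the FAQ answer on supported file types.
ALLOWED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
MAX_BYTES = 20 * 1024 * 1024


def check_image(filename, size_bytes):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is larger than {MAX_BYTES} bytes")
    return problems
```

For a file on disk, check_image(path, os.path.getsize(path)) runs both checks in one call.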