Visual Language Model Introduction
Use Cases
A visual language model (VLM) is a large model that can accept both visual (image) and language (text) inputs. With a VLM, you can provide images and text together, and the model can understand both the image content and the surrounding context to follow instructions. For example:
Visual interpretation: ask the model to explain the information in an image, such as objects, text, spatial relationships, colors, and mood.
Carry out multi-turn conversations based on the visual content and context.
Partially replace traditional computer-vision models such as OCR.
As model capabilities improve, VLMs may also be used in areas such as visual agents and robotics.
Usage
For VLMs, call the /chat/completions endpoint with a message that contains either an image URL or a Base64-encoded image. Use the detail parameter to control image preprocessing.
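As a minimal sketch of the two image forms, the helpers below build a user message from either a plain URL or raw image bytes. The names to_data_url and build_user_message are hypothetical and not part of any SDK; the message layout follows the examples in this document, and the MIME type defaults to image/jpeg as an assumption.

```python
import base64
from typing import Optional


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a Base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(image_url: str, prompt: str,
                       detail: Optional[str] = None) -> dict:
    """Build one user message: an image part followed by a text part.

    image_url may be an http(s) URL or a data URL from to_data_url().
    The detail key is only added when a value is given.
    """
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    if detail is not None:
        image_part["image_url"]["detail"] = detail
    return {
        "role": "user",
        "content": [image_part, {"type": "text", "text": prompt}],
    }
```

For a local file, something like build_user_message(to_data_url(open("photo.jpg", "rb").read()), "Describe this image.", detail="low") would produce the Base64 form; the file name here is only a placeholder.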
2.1 Parameter description for image-detail control
SiliconCloud provides three detail options: low, high, and auto. For currently supported models, omitting detail or setting detail=high uses high-resolution mode, while detail=low or detail=auto uses low-resolution mode.
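The mapping from the detail value to the resolution mode can be written out as a small lookup. This is only a sketch of the rule as stated above for currently supported models; resolution_mode is a hypothetical name, and rejecting unknown values is an assumption of this sketch rather than documented API behavior.

```python
from typing import Optional


def resolution_mode(detail: Optional[str] = None) -> str:
    """Return "high" or "low" for a given detail value (hypothetical helper)."""
    if detail is None or detail == "high":
        # Omitted detail and detail=high both use high-resolution mode.
        return "high"
    if detail in ("low", "auto"):
        # detail=low and detail=auto both use low-resolution mode.
        return "low"
    # Assumption: values other than low, high, and auto are treated as errors.
    raise ValueError(f"unsupported detail value: {detail!r}")
```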
2.2 Example message format that includes an image
Using an image URL
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png",
                "detail": "high"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}
Base64 format
# base64_image is the Base64-encoded image content as a string
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}",
                "detail": "low"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}
2.3 Multi-image format, where each image can use either of the two forms above
Note: the DeepseekVL2 series is designed for short-context processing, and at most two images are recommended. If more than two images are provided, the model automatically resizes them to 384 x 384 and the specified detail parameter is ignored.
{
    "role": "user",
    "content": [
        {
            "type": "image_url",
            "image_url": {
                "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png"
            }
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}
Billing for Visual Inputs
For images and other visual inputs, the model converts them into tokens and uses them alongside text as context for generation, so they are also billed. Different models convert visual inputs differently.
Image inputs are metered and billed in tokens just like text. The token cost depends on two factors: image size and the detail option on each image_url block. With detail: low, each image costs 85 tokens. With detail: high, the image is first scaled to fit within a 2048 x 2048 square while preserving aspect ratio, then resized so the shortest side is 768px. The system then counts how many 512px tiles are needed. Each tile costs 170 tokens, and another 85 tokens are always added to the total.
Below are some examples illustrating the rules above.
A 1024 x 1024 square image with detail: high costs 765 tokens. 1024 is less than 2048, so no initial resize is needed. The shortest side is 1024, so the image is resized to 768 x 768. Four 512px tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
A 2048 x 4096 image with detail: high costs 1105 tokens. The image is resized to 1024 x 2048 to fit within the 2048 square. The shortest side is 1024, so it is further resized to 768 x 1536. Six 512px tiles are needed to represent the image, so the final token cost is 170 * 6 + 85 = 1105.
A 4096 x 8192 image with detail: low costs 85 tokens. Low-detail images always have a fixed cost regardless of input size.
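The rules and worked examples above can be sketched as a short function. This is an estimate under stated assumptions, not the billing implementation: image_tokens is a hypothetical name, dimensions are rounded to whole pixels after each resize, and the sketch assumes images are only ever scaled down, never up, which the text does not specify for images whose shortest side is already below 768px.

```python
import math

LOW_DETAIL_TOKENS = 85   # fixed cost with detail: low
TOKENS_PER_TILE = 170    # cost of each 512px tile with detail: high
BASE_TOKENS = 85         # always added with detail: high
MAX_SIDE = 2048          # image must fit within a 2048 x 2048 square
SHORT_SIDE = 768         # target length of the shortest side
TILE_SIZE = 512


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the token cost of one image (hypothetical helper)."""
    if detail == "low":
        return LOW_DETAIL_TOKENS
    if detail != "high":
        raise ValueError(f"unsupported detail value: {detail!r}")

    # Step 1: scale down to fit within 2048 x 2048, preserving aspect ratio.
    longest = max(width, height)
    if longest > MAX_SIDE:
        scale = MAX_SIDE / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale so the shortest side is 768px.
    # Assumption: smaller images are left as they are (no upscaling).
    shortest = min(width, height)
    if shortest > SHORT_SIDE:
        scale = SHORT_SIDE / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512px tiles needed to cover the image.
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return TOKENS_PER_TILE * tiles + BASE_TOKENS


print(image_tokens(1024, 1024, "high"))  # 765
print(image_tokens(2048, 4096, "high"))  # 1105
print(image_tokens(4096, 8192, "low"))   # 85
```

The three calls reproduce the worked examples above, which makes the function a quick way to sanity-check an expected bill before sending a request.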
Limitations
Although vision-capable GPT-4 is powerful and useful in many scenarios, understanding its limitations is important. Here are some known limitations:
Medical images: the model is not suitable for interpreting specialized medical images such as CT scans and should not be used for medical advice.
Non-English text: when processing images containing non-Latin scripts such as Japanese or Korean, the model may not perform optimally.
Small text: enlarge text in the image to improve readability, but avoid cropping important details.
Rotation: the model may misinterpret rotated or upside-down text and images.
Visual elements: the model may struggle with graphics or text that vary by color or style, such as solid, dashed, or dotted lines.
Spatial reasoning: the model struggles with tasks requiring precise spatial positioning, such as identifying chessboard squares.
Accuracy: the model may produce incorrect descriptions or titles in some cases.
Image shape: the model struggles with panoramic and fisheye images.
Metadata and resizing: the model does not process original filenames or metadata, and images are resized before analysis, affecting their original dimensions.
Counting: the model may give only approximate counts of objects in an image.
CAPTCHA: for security reasons, we block CAPTCHA submissions.
FAQ
Can I use GPT-4 to generate images? No. Use dall-e-3 to generate images, and use gpt-4o, gpt-4o-mini, or gpt-4-turbo to understand images.
What file types can I upload? We currently support PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif).
Is there a size limit for uploaded images? Yes. We limit each uploaded image to 20 MB.
Can I delete the images I upload? No action is needed. After the model processes the image, we delete it automatically.
Where can I learn more about GPT-4 with Vision? You can find details about our evaluations, preparedness, and mitigations in the GPT-4 with Vision system card. We have also implemented a system that blocks CAPTCHA submissions.
How do the rate limits for GPT-4 with Vision work? We process images at the token level, so every image counts toward your tokens-per-minute (TPM) limit. See the Billing for Visual Inputs section for the formula used to determine the token count of each image.
Can GPT-4 with Vision understand image metadata? No. The model does not receive image metadata.
What if my image is unclear? If an image is blurry, the model will do its best to interpret it, but the result may be less accurate. A useful rule of thumb is that if a person cannot clearly see the information in the image at low or high resolution, the model probably cannot either.
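The file-type and size limits from the FAQ can be checked on the client before uploading. The sketch below is illustrative only: check_image is a hypothetical helper, 20 MB is assumed to mean 20 * 1024 * 1024 bytes, and the check looks only at the file extension, so it cannot tell an animated GIF from a non-animated one.

```python
from pathlib import Path
from typing import List

# Extensions listed in the FAQ: PNG, JPEG, WEBP, and non-animated GIF.
SUPPORTED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".gif"}
# Assumption: the 20 MB limit is interpreted as 20 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def check_image(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"file is larger than 20 MB: {size_bytes} bytes")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a file at once.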