---
title: Enhancing Multimodal Interaction with Visual Recognition Models
description: >-
  Explore how LobeChat integrates visual recognition capabilities into large
  language models, enabling multimodal interactions for enhanced user
  experiences.
tags:
  - Visual Recognition
  - Multimodal Interaction
  - Large Language Models
  - LobeChat
  - Custom Model Configuration
---

# Visual Model User Guide

A growing number of large language models support visual recognition. Beginning with `gpt-4-vision`, LobeChat now supports a range of these models, which gives it multimodal interaction capabilities.

<Video alt={'Visual Model Usage'} src={'https://github.com/user-attachments/assets/1c6b4975-bfc3-4470-a934-558ff7a16941'} />

## Image Input

If the model you are currently using supports visual recognition, you can add an image by uploading a file or by dragging it directly into the input box. The model recognizes the image content and responds based on your prompt.

<Image alt={'Image Input'} src={'https://github.com/user-attachments/assets/e6836560-8b05-4382-b761-d7624da4b0f1'} />
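
For readers curious about what an image-plus-prompt request can look like at the API level, here is a minimal sketch that builds a user message in the OpenAI-style chat format, where `content` is a list of text and `image_url` parts. The type names and the `buildVisionMessage` helper are hypothetical and used only for illustration, and the example URL is a placeholder. This is not LobeChat's internal code, and other providers may expect a different shape.

```typescript
// Hypothetical types modelled on the OpenAI-style chat message format.
// They are illustrative only and are not LobeChat's internal types.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image_url'; image_url: { url: string } };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Build a single user message that carries a prompt plus any number of images.
function buildVisionMessage(prompt: string, imageUrls: string[]): UserMessage {
  const imageParts = imageUrls.map(
    (url): ImagePart => ({ type: 'image_url', image_url: { url } }),
  );
  // The text part comes first, followed by one part per image.
  return { role: 'user', content: [{ type: 'text', text: prompt }, ...imageParts] };
}

// Placeholder URL; a base64 data URL could be used in the same field.
const message = buildVisionMessage('What is in this picture?', [
  'https://example.com/cat.png',
]);
console.log(JSON.stringify(message, null, 2));
```

Keeping the prompt and the images in one message lets the model read them together, which matches the behavior described above where the response depends on both the image and your prompt.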

## Visual Models

In the model list, a `👁️` icon next to a model's name indicates that the model supports visual recognition. Once you select such a model, you can send image content.

<Image alt={'Visual Models'} src={'https://github.com/user-attachments/assets/fa07a326-04c8-4744-bb93-cef715d1d71f'} />

## Custom Model Configuration

To add a custom model that is not in the list and that explicitly supports visual recognition, enable the `Visual Recognition` feature in `Custom Model Configuration` so that the model can work with images.

<Image alt={'Custom Model Configuration'} src={'https://github.com/user-attachments/assets/c24718cc-402b-4298-b046-8b4aee610cbc'} />