
LLM-based Visual Recognition

How It Works

  • Intent Understanding: When a command is given, whether it is "walk forward two steps and then wave" or "What do you see?", the system first sends it (as voice or text) to a Large Language Model (LLM).
  • Task Decomposition & Tool Selection: We have pre-defined, for the LLM, all the "tools" the robot can use, i.e., its various capabilities such as visual recognition, executing specific actions, and singing. The LLM acts like an intelligent commander, analyzing your instruction and breaking it down into one or more steps the robot can execute.
  • Generating Structured Commands: The LLM does not return a plain text chat response. Instead, it generates a structured command that the machine can precisely understand and dispatches it to the corresponding functional modules (such as motion control, visual analysis, etc.) for execution.

This approach gives the robot unprecedented flexibility and understanding, enabling it to handle ambiguous and complex natural language commands.
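
To make the idea concrete, here is a minimal sketch of the tool-calling flow, assuming the DashScope Python SDK and a JSON-only system prompt; the tool names (walk_forward, wave, describe_scene), the model choice (qwen-turbo), and the JSON schema are illustrative, not the actual OriginMan interface.

```python
# Minimal sketch of the tool-calling flow: the LLM turns a natural-language
# instruction into a JSON list of steps that the robot's modules can execute.
# Tool names and the JSON schema here are illustrative, not the real interface.
import json
import os

import dashscope

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

SYSTEM_PROMPT = """You control a humanoid robot. Available tools:
- walk_forward(steps: int)
- wave()
- describe_scene()
Reply ONLY with a JSON array of steps, for example:
[{"tool": "walk_forward", "args": {"steps": 2}}, {"tool": "wave", "args": {}}]"""


def plan(instruction: str) -> list:
    """Ask the LLM to decompose the instruction into structured steps."""
    response = dashscope.Generation.call(
        model="qwen-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        result_format="message",
    )
    # Assumes the model follows the JSON-only instruction in the system prompt.
    return json.loads(response.output.choices[0].message.content)


if __name__ == "__main__":
    for step in plan("walk forward two steps and then wave"):
        # In the real system each step is dispatched to motion control,
        # visual analysis, etc. rather than printed.
        print(step)
```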

Core Features Explained

Natural Language Interaction (Voice & Text)

The robot can understand what you say and respond in a natural voice, enabling smooth, conversational interaction.

Implementation

  • Speech-to-Text (ASR): When you click the microphone on the web interface and speak, OriginMan captures your voice. It sends the recorded audio to a large model, which efficiently and accurately converts your speech into text.
  • Text-to-Speech (TTS): When the robot needs to reply or give a prompt, a large model generates the corresponding text and produces high-quality, expressive speech, which is played through the speaker, as sketched below.
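
A rough sketch of the voice round trip, assuming DashScope's Paraformer (ASR) and Sambert (TTS) models; the class names, model IDs, and result accessors below are taken as assumptions and should be checked against the installed SDK version.

```python
# Rough sketch of the voice round trip. The DashScope class names, model IDs,
# and result accessors are assumptions to verify against the SDK version in use.
import os

import dashscope
from dashscope.audio.asr import Recognition
from dashscope.audio.tts import SpeechSynthesizer

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]


def speech_to_text(wav_path: str):
    """Transcribe one recorded utterance (16 kHz mono WAV assumed)."""
    recognition = Recognition(
        model="paraformer-realtime-v1",
        format="wav",
        sample_rate=16000,
        callback=None,
    )
    result = recognition.call(wav_path)
    # The recognized text is inside the returned sentence structure.
    return result.get_sentence()


def text_to_speech(text: str, out_path: str = "reply.wav") -> str:
    """Synthesize the robot's reply and save it for playback on the speaker."""
    result = SpeechSynthesizer.call(model="sambert-zhichu-v1", text=text, sample_rate=48000)
    with open(out_path, "wb") as f:
        f.write(result.get_audio_data())
    return out_path
```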

Visual Understanding

The robot is more than just an executor; it has a pair of "eyes" to observe and understand the world around it.

Implementation

  • Continuous Observation: The robot's camera captures images at a fixed frequency and publishes them to the ROS network, providing the system with real-time visual input.
  • Combining Images and Questions: When you ask a visual question, such as "What do you see?", the robot retrieves the latest camera image and sends it along with your text question to a multimodal large model.
  • Generating Descriptive Answers: The large model understands the relationship between the image content and the text question and generates a detailed, human-like description as a response, for example, "I see a red apple on the table..." A sketch of this flow is shown below.
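
The flow can be sketched roughly as follows, assuming a Qwen-VL model served through DashScope and a camera topic named /image_raw; the topic name, model name, and temporary file path are assumptions for illustration only.

```python
# Sketch of the vision question flow: keep the newest camera frame from ROS and,
# when a question arrives, send it together with the frame to a Qwen-VL model.
# Topic name (/image_raw), model name, and temp file path are illustrative.
import os

import cv2
import dashscope
import rclpy
from cv_bridge import CvBridge
from rclpy.node import Node
from sensor_msgs.msg import Image

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]


class VisionQA(Node):
    def __init__(self):
        super().__init__("vision_qa")
        self.bridge = CvBridge()
        self.latest_frame = None
        self.create_subscription(Image, "/image_raw", self.on_image, 1)

    def on_image(self, msg: Image):
        # Always keep only the most recent frame.
        self.latest_frame = self.bridge.imgmsg_to_cv2(msg, "bgr8")

    def ask(self, question: str) -> str:
        cv2.imwrite("/tmp/frame.jpg", self.latest_frame)
        response = dashscope.MultiModalConversation.call(
            model="qwen-vl-plus",
            messages=[{
                "role": "user",
                "content": [{"image": "file:///tmp/frame.jpg"}, {"text": question}],
            }],
        )
        # The answer text sits inside the first content element of the reply.
        return response.output.choices[0].message.content[0]["text"]


def main():
    rclpy.init()
    node = VisionQA()
    # Wait until at least one camera frame has arrived, then ask a question.
    while node.latest_frame is None:
        rclpy.spin_once(node, timeout_sec=0.5)
    print(node.ask("What do you see?"))
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```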

Action Execution & Singing

Whether it's simple movements or complex dances, the robot can accurately execute your commands.

Implementation

  • Action Command Parsing: After receiving your natural language command, the robot asks a large model to parse it. The model has already been given the list of all action names the robot can execute.
  • Generating Action Sequences: Based on your wording, the LLM generates a precise, structured command. The robot-side code then parses out the action name and repetition count and calls the underlying hardware control library to drive the servos and complete the corresponding physical action, as sketched below.
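
A minimal sketch of that dispatch step, with a placeholder action table and a stub run_action() helper standing in for the actual hardware control library:

```python
# Sketch of the dispatch step: the LLM's structured reply names an action and a
# repetition count; the robot-side code validates it and runs the action the
# requested number of times. The action table and run_action() stub stand in
# for the actual hardware control library.
KNOWN_ACTIONS = {"walk_forward", "wave", "bow", "dance"}


def run_action(name: str) -> None:
    # Placeholder: the real implementation drives the servos for one named action.
    print(f"executing action: {name}")


def execute(command: dict) -> None:
    name = command.get("action")
    repeat = int(command.get("repeat", 1))
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name}")
    for _ in range(repeat):
        run_action(name)


# e.g. the parsed result of "walk forward three steps"
execute({"action": "walk_forward", "repeat": 3})
```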

Running the Example

Interacting with the OriginMan robot is very simple:

  1. Apply for an API Key: Request an API Key on the Alibaba Cloud Bailian official website.
  2. Start the System: Run `ros2 launch originman_vision robot_interaction.launch.py` to launch the relevant nodes.
  3. Access the Web Interface: In your browser, open `http://<robot_ip_address>:5000` to access the interface shown in the screenshot below.

    [Screenshot: OriginMan web interaction interface]

  4. Configure the API Key: In the input box at the top of the page, enter your Alibaba Cloud DashScope API Key and click "Save and Distribute API Key." All AI functions depend on this key.

    Note: If you do not need to change the key frequently, you can set a fixed key in /root/.bashrc by adding the line `export DASHSCOPE_API_KEY="sk-xxxxxx"`, replacing the part in quotes with the actual key you applied for.

  5. Start Interacting:

    • Text Input: Type a command in the text box at the bottom, such as "Hello," "Please walk forward three steps and then wave," or "What do you see?", and click "Send."
    • Voice Input: Click the microphone icon; it will turn red, indicating it is recording. Speak your command, then click the icon again to stop recording. The system will automatically recognize your speech and execute the command.
    • Observe and Interact: Watch the robot's actions and listen to its voice feedback in the real world. You can also view the complete interaction history in the chat log on the web interface. Enjoy exploring the fun of conversing with an AI robot!
