
Voice Interaction and Control

The previous sections covered speech recognition, speech synthesis, and text interaction with a large language model. This section combines the three to achieve voice interaction, and then adds voice control so that the robot performs actions in response to spoken commands.

Voice Interaction Running Example

Turn on the OriginMan power switch and enter the following command in its terminal:

ros2 launch originman_llm_chat originman_llm_chat.launch.py

Once the nodes have started, the terminal shows output like the following. Say something to OriginMan, for example: "Hello, who are you?"

root@ubuntu:~# ros2 launch originman_llm_chat originman_llm_chat.launch.py 
[INFO] [launch]: All log files can be found below /root/.ros/log/2025-02-28-13-31-04-432778-ubuntu-38680
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [text_to_speech_node-1]: process started with pid [38693]
[INFO] [llm_chat_node-2]: process started with pid [38695]
[INFO] [asr_node-3]: process started with pid [38697]
[text_to_speech_node-1] [INFO] [1740720667.721629199] [text_to_speech_node]: Usage example: ros2 topic pub /tts_input std_msgs/msg/String "data: 'Please tell me today's weather.'"
[text_to_speech_node-1] [INFO] [1740720667.724207952] [text_to_speech_node]: Text-to-speech node has started, waiting for input on /tts_input topic...
[llm_chat_node-2] [INFO] [1740720668.992896394] [llm_chat_node]: OpenAI interaction node has started, waiting for input on /text_input topic...
[llm_chat_node-2] [INFO] [1740720668.995342605] [llm_chat_node]: Usage example: ros2 topic pub --once /text_input std_msgs/msg/String "data: 'Hello, how is the weather today?'"
[asr_node-3] [INFO] [1740720670.683385618] [asr_node]: ASR node has started, starting recording and recognition...
[asr_node-3] [INFO] [1740720670.690166792] [asr_node]: Starting recording and recognition...
[asr_node-3] [INFO] [1740720670.694668880] [asr_node]: Received listening status, starting to listen...
[asr_node-3] [INFO] [1740720673.730837496] [asr_node]: Recording audio: test.wav
[asr_node-3] [INFO] [1740720673.756912233] [asr_node]: Sampling rate: 16000, duration: 3.0 seconds
[asr_node-3] [INFO] [1740720674.881400056] [asr_node]: Recognized user voice: 'Hello, who are you?'
[asr_node-3] [INFO] [1740720674.884098559] [asr_node]: Published text to /text_input: 'Hello, who are you?'
[llm_chat_node-2] [INFO] [1740720674.885190268] [llm_chat_node]: Received text: 'Hello, who are you?'
[llm_chat_node-2] 2025-02-28 13:31:15,772 - INFO - HTTP Request: POST http://59.110.158.57:3000/api/v1/chat/completions "HTTP/1.1 200 OK"
[asr_node-3] [INFO] [1740720675.891859882] [asr_node]: Starting recording and recognition...
[asr_node-3] [INFO] [1740720675.892257591] [asr_node]: Received listening status, starting to listen...
[llm_chat_node-2] 2025-02-28 13:31:16,562 - INFO - AI response: I'm OriginMan, a handsome humanoid robot. Feel free to ask me any questions, and I'll answer you with humor!
[llm_chat_node-2] [INFO] [1740720676.566284933] [llm_chat_node]: Sent TTS message: 'I'm OriginMan, a handsome humanoid robot. Feel free to ask me any questions, and I'll answer you with humor!'
[text_to_speech_node-1] [INFO] [1740720676.567079142] [text_to_speech_node]: Received text: 'I'm OriginMan, a handsome humanoid robot. Feel free to ask me any questions, and I'll answer you with humor!'
[text_to_speech_node-1] [INFO] [1740720676.569749478] [text_to_speech_node]: Starting to play audio...
[asr_node-3] [INFO] [1740720676.570593437] [asr_node]: Received playback status, speaking, pausing recording...
[text_to_speech_node-1] [INFO] [1740720676.573927483] [text_to_speech_node]: Text split into 2 sentences
[text_to_speech_node-1] [INFO] [1740720676.582458200] [text_to_speech_node]: Starting to synthesize text: 'Feel free to ask me any questions, and I'll answer you with humor.'
[text_to_speech_node-1] [INFO] [1740720676.583595951] [text_to_speech_node]: Starting to synthesize text: 'I'm OriginMan, a handsome humanoid robot.'
[text_to_speech_node-1] 2025-02-28 13:31:16,993 - INFO - Websocket connected
[text_to_speech_node-1] 2025-02-28 13:31:16,996 - INFO - Websocket connected
[text_to_speech_node-1] [INFO] [1740720678.380090152] [text_to_speech_node]: Playing audio segment 1
[asr_node-3] [INFO] [1740720678.945347378] [asr_node]: Recording audio: test.wav
[asr_node-3] [INFO] [1740720679.005670400] [asr_node]: Sampling rate: 16000, duration: 3.0 seconds
[asr_node-3] [INFO] [1740720679.829108107] [asr_node]: No content recognized
[asr_node-3] [INFO] [1740720680.839732554] [asr_node]: Playing, pausing recording...
[asr_node-3] [INFO] [1740720681.850556125] [asr_node]: Playing, pausing recording...
[text_to_speech_node-1] [INFO] [1740720682.299188975] [text_to_speech_node]: Playing audio segment 2
[asr_node-3] [INFO] [1740720682.861295195] [asr_node]: Playing, pausing recording...
[asr_node-3] [INFO] [1740720683.872126640] [asr_node]: Playing, pausing recording...
[asr_node-3] [INFO] [1740720684.882228874] [asr_node]: Playing, pausing recording...
[text_to_speech_node-1] [INFO] [1740720685.698440778] [text_to_speech_node]: All audio segments played
[text_to_speech_node-1] [INFO] [1740720685.704874118] [text_to_speech_node]: Playback ended, entering listening state...
[asr_node-3] [INFO] [1740720685.706222703] [asr_node]: Received listening status, starting to listen...
[asr_node-3] [INFO] [1740720685.904607204] [asr_node]: Received listening status, starting to listen...
[asr_node-3] [INFO] [1740720685.905819122] [asr_node]: Starting recording and recognition...

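At this point, OriginMan has completed the full chain: ASR (speech recognition), LLM text generation, and TTS (speech synthesis). In the log, asr_node publishes the recognized text to /text_input, llm_chat_node sends its reply to text_to_speech_node, and asr_node pauses recording while the reply is played, then resumes listening once playback ends.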
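The turn-taking visible in the log (pause recording during playback, split the reply into sentences, resume listening afterwards) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code of the OriginMan nodes: the names `ListenGate` and `split_sentences` and the status strings "playing" and "listening" are hypothetical, and the real nodes exchange these states over ROS 2 topics.

```python
import re
from typing import List, Optional


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


class ListenGate:
    """Half-duplex gate: recognized text is forwarded only while listening."""

    def __init__(self) -> None:
        self.listening = True

    def on_status(self, status: str) -> None:
        # "playing" and "listening" are hypothetical status values.
        if status == "playing":
            self.listening = False
        elif status == "listening":
            self.listening = True
        else:
            raise ValueError(f"unknown status: {status!r}")

    def accept(self, recognized: str) -> Optional[str]:
        """Return the text to forward to the LLM, or None to drop it."""
        text = recognized.strip()
        if self.listening and text:
            return text
        return None


if __name__ == "__main__":
    gate = ListenGate()
    print(gate.accept("Hello, who are you?"))   # forwarded while listening
    gate.on_status("playing")                   # playback starts
    reply = "I'm OriginMan, a handsome humanoid robot. Feel free to ask me any questions!"
    for index, sentence in enumerate(split_sentences(reply), start=1):
        print(index, sentence)
    print(gate.accept("echo of the robot's own voice"))  # dropped during playback
    gate.on_status("listening")                 # playback ended
```

Dropping input during playback keeps the robot from recognizing its own voice as a new question, which is presumably why asr_node logs "Playing, pausing recording..." while the audio segments are played.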

Voice Control Running Example

OriginMan can also perform actions in response to voice commands. Run the following command:

ros2 launch originman_audio_control audio_control.launch.py

Now you can give OriginMan commands like: laugh heartily, bow, do sit-ups...

Taking "laugh heartily" as an example:

root@ubuntu:~# ros2 launch originman_audio_control audio_control.launch.py 
[INFO] [launch]: All log files can be found below /root/.ros/log/2025-02-28-13-34-55-172981-ubuntu-39392
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [asr_node-1]: process started with pid [39405]
[INFO] [audio_control_node-2]: process started with pid [39407]
[audio_control_node-2] [INFO] [1740720896.542396669] [audio_control_node]: Action control node has started, executing initial standing action...
[audio_control_node-2] [INFO] [1740720897.048742408] [audio_control_node]: Executing action group: stand
[asr_node-1] [INFO] [1740720900.783398244] [asr_node]: ASR node has started, starting recording and recognition...
[asr_node-1] [INFO] [1740720900.791614501] [asr_node]: Starting recording and recognition...
[asr_node-1] [INFO] [1740720900.792100168] [asr_node]: Received listening status, starting to listen...
[audio_control_node-2] [INFO] [1740720902.056489656] [audio_control_node]: Action control node is ready, waiting for voice command...
[asr_node-1] [INFO] [1740720915.662410345] [asr_node]: Starting recording and recognition...
[asr_node-1] [INFO] [1740720918.716207620] [asr_node]: Recording audio: test.wav
[asr_node-1] [INFO] [1740720918.749889649] [asr_node]: Sampling rate: 16000, duration: 3.0 seconds
[asr_node-1] [INFO] [1740720919.839121135] [asr_node]: Recognized user voice: 'Laugh heartily.'
[asr_node-1] [INFO] [1740720919.841752929] [asr_node]: Published text to /text_input: 'Laugh heartily.'
[audio_control_node-2] [INFO] [1740720919.843013013] [audio_control_node]: Received voice command: 'Laugh heartily.'
[audio_control_node-2] [INFO] [1740720919.846050474] [audio_control_node]: Fuzzy match successful: 'Laugh heartily.' -> 'Laugh heartily'
[asr_node-1] [INFO] [1740720920.865806773] [asr_node]: Received listening status, starting to listen...
[audio_control_node-2] [INFO] [1740720928.365244782] [audio_control_node]: Executing action group: chest

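At this point, you can see OriginMan "laughing heartily"! In the log, the recognized text 'Laugh heartily.' is fuzzy-matched to the command 'Laugh heartily', and the node then executes the action group named chest.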
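The fuzzy-matching step in the log can be approximated with Python's standard `difflib` module. This is a minimal sketch, not the matching logic of audio_control_node, which is not shown here. Only the pairing of "laugh heartily" with the action group chest comes from the log; the other action group names, the `match_command` helper, and the 0.6 cutoff are assumptions made for illustration.

```python
import difflib
import string
from typing import Optional

# Command phrase -> action group name.
# Only "laugh heartily" -> "chest" appears in the log above;
# the other two group names are hypothetical placeholders.
ACTION_GROUPS = {
    "laugh heartily": "chest",
    "bow": "bow",
    "do sit-ups": "sit_ups",
}


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation."""
    return text.strip().strip(string.punctuation).strip().lower()


def match_command(text: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the action group for the closest command, or None if no match."""
    key = normalize(text)
    if not key:
        return None
    candidates = difflib.get_close_matches(
        key, list(ACTION_GROUPS), n=1, cutoff=cutoff
    )
    return ACTION_GROUPS[candidates[0]] if candidates else None


if __name__ == "__main__":
    for phrase in ("Laugh heartily.", "laugh heartly", "What time is it?"):
        print(repr(phrase), "->", match_command(phrase))
```

Normalizing before matching lets the trailing period that the recognizer appends ('Laugh heartily.') match the stored phrase, and the similarity cutoff tolerates small recognition errors while rejecting unrelated speech.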

Attention

A network connection must be set up before running these examples, because the LLM and speech synthesis requests shown in the logs go over the network. For the configuration steps, refer to "Network Configuration and Remote Development Methods".
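Before launching the nodes, a quick reachability check can confirm that the robot is online. The following is a minimal sketch using only the Python standard library. The `can_connect` helper is hypothetical, and the host and port in the usage comment are the ones that appear in the HTTP request line of the log above, so they may differ in your setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # Address taken from the log above; adjust it to your own configuration.
    print("LLM service reachable:", can_connect("59.110.158.57", 3000))
```

A successful TCP connection only shows that the service is reachable, not that the API accepts requests, so treat it as a first diagnostic step.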
