Speech Recognition

Speech recognition, also known as Automatic Speech Recognition (ASR), is a technology that allows computers to "understand" human speech and convert it into text or execute corresponding commands.

Principles

  • Feature Extraction: Preprocess the input speech signal, such as noise reduction and framing, and then extract parameters that can represent speech features, such as Mel-Frequency Cepstral Coefficients (MFCC), converting the speech signal into a sequence of feature vectors that can be processed by a computer.
  • Acoustic Model: Using a large amount of speech data and machine learning algorithms, an acoustic model is established. It can learn the statistical relationships between speech features and corresponding phonemes or syllables, and can calculate the most likely sequence of phonemes or syllables based on the input speech feature vectors.
  • Language Model: The language model is used to describe the grammar, semantics, and statistical rules of the language. It can predict the next likely word based on the already recognized words or phrases, thereby correcting and supplementing the output of the acoustic model, improving recognition accuracy.
  • Decoding: Combine the information from the acoustic model and the language model, and use search algorithms to find the path that best fits both models among all possible results, completing the conversion from speech to text.
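The feature-extraction step above can be sketched in plain NumPy. Real systems typically compute MFCCs (e.g. with librosa); this minimal sketch only shows the pre-emphasis, framing, windowing, and log-spectrum stages that precede mel filtering, with frame sizes chosen for a 16 kHz signal:

```python
# Sketch of the feature-extraction stage: pre-emphasis, framing,
# windowing, and log power spectrum. Frame/hop sizes assume 16 kHz audio.
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping, Hamming-windowed frames."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - 0.97 * x[n-1]
    pre = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples per hop
    n_frames = 1 + max(0, (len(pre) - frame_len) // hop_len)
    frames = np.stack([pre[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def log_power_spectrum(frames, n_fft=512):
    """Per-frame log power spectrum -- the raw material for mel/MFCC features."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(spec + 1e-10)

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
feats = log_power_spectrum(frame_signal(signal))
print(feats.shape)  # (98, 257): one 257-bin feature vector per 10 ms frame
```

In a full pipeline these log spectra would pass through a mel filterbank and a DCT to yield MFCC vectors, which the acoustic model then consumes.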

Running Example

The ASR implementation used by OriginMan first denoises the recorded audio, then sends it to a large-model gateway, whose model transcribes the audio into text.

Based on Python

First, you can experience ASR with a Python example.

cd /userdata/dev_ws/src/originman/originman_pydemo/audio
python3 audio_asr.py

In audio_asr.py, the audio recognized by default is test.wav in the same directory. If you have not recorded any audio yourself, you can simply change

input_audio_file = 'test.wav'  --> input_audio_file = 'prompt.wav'

At this point, it will recognize the content of prompt.wav.

You will see the following terminal information:

root@ubuntu:/userdata/dev_ws/src/originman/originman_pydemo/audio# python3 audio_asr.py 
Sampling Rate: 16000, Duration: 0.84 seconds
Recognition Result:
I am.

You can also play the audio back to verify its content. test_audio.py records and then plays back prompt.wav and test.wav.

python3 test_audio.py
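The "Sampling Rate / Duration" line in the output comes from inspecting the WAV header. A minimal sketch with Python's standard-library wave module shows how (the actual audio_asr.py may use a different library; the silent demo clip below is only there to make the sketch self-contained):

```python
# Sketch: read a WAV header and report rate/duration, as in the demo output.
import wave

def describe_wav(path):
    """Return (sample_rate, duration_seconds) from a WAV file's header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return rate, duration

# Write a half-second, 16 kHz silent clip so the sketch runs on its own;
# point describe_wav at your own prompt.wav or test.wav instead.
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 8000)

rate, duration = describe_wav("demo.wav")
print(f"Sampling Rate: {rate}, Duration: {duration:.2f} seconds")
# Sampling Rate: 16000, Duration: 0.50 seconds
```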

Based on ROS2 Topics

Since the recognizer can be called from Python, it can of course be wrapped in ROS2 as well. The recognized text can be published on a topic.

Open a terminal and execute the following command:

ros2 run originman_llm_chat asr_node

At this point, you will see log output similar to the following:

ros2 run originman_llm_chat asr_node 
[INFO] [1740719779.381260749] [asr_node]: ASR node started, beginning recording and recognition...
[INFO] [1740719779.389336500] [asr_node]: Received listening status, starting to listen...
[INFO] [1740719779.390696125] [asr_node]: Starting recording and recognition...
[INFO] [1740719782.430083504] [asr_node]: Recording audio: test.wav
[INFO] [1740719782.454768675] [asr_node]: Sampling Rate: 16000, Duration: 3.0 seconds
[INFO] [1740719783.690654553] [asr_node]: Recognized user speech: 'Hello, who are you?'
[INFO] [1740719783.693256470] [asr_node]: Published text to /text_input: 'Hello, who are you?'

Open another terminal and subscribe to the topic to view the recognized text:

ros2 topic echo /text_input

Attention

You need to connect the robot to the network first. For the networking steps, refer to Network Configuration and Remote Development Methods.