ue0705/Speech-To-Text

Speech-To-Text

這是一個 Python 程式,他的功能是語音轉文字 (speech to text),原始的代碼是由 GPT-4o 產生後再加以修改的結果,由電腦的線性輸入孔(或者麥克風輸入孔)來輸入後辨識,傳統的語音辨識著重在"字",所以常常會前後對不上,當有了人工智慧的輔助,當他可以理解你的內容後,辨識的程度與錯誤率就會大幅地降低,並且將辨識的結果傳送到 Line APP 上面,一方便作為雲端儲存,另一方面他是台灣人用最普遍的訊息軟體,至於 WhatsApp/Telegram/Signal/FB-Message 以及 iMessage(Apple) 與 Google RCS(Android phone)則視情況增加此功能,並不困難。

然後我將聲音的輸入改成無線電訊號,在無比吵雜的環境還能夠辨識的清楚才是一個完整的功能,您可以看到裡面有一個關鍵字檢查(ex:mayday),目的在遇到緊急情況時,可以自動的產生額外的警示功能,之後會移植到 Raspberry Pi 上執行,並且拓展成為多幾系統。

另一個問題是,我絕對不是世界上第一個想到這樣做的人,所以我在 Google Search 後有找到這個 Youtube,但他的問題是輸入來源太過乾淨,無線電訊號是無比的吵雜,非常難以辨識(可以參考Audio目錄下的錄音範例,所有AI幾乎難以辨識),但例如航空機師他們天天在聽,所以幾乎都沒問題,這裡也有一個有趣的現象,除了聲音之外,還有監控系統,如果這個人是你認識且熟悉的,出現在監控影像中,你可以輕易的從外型/走路姿勢/穿著/行為,就輕易的判斷出他是誰,即使影像很模糊,但人工智慧卻只能辨識清楚的人臉,無法像真正的人類那樣判斷,所以這也就是為什麼無線電困難的地方。

Radio to text reference: https://www.youtube.com/watch?v=rZfNMtbRpYQ

Whisper speech to text tool: https://github.com/openai/whisper

This is a Python program for speech to text. The original code was generated by GPT-4o and then modified. Audio comes in through the computer's line-in jack (or microphone jack) and is then recognized. Traditional speech recognition focuses on individual words, so the output is often inconsistent from one part of a sentence to the next. With the help of artificial intelligence that can understand the content, recognition accuracy improves and the error rate drops substantially. The recognition results are then sent to the Line app. On one hand this serves as cloud storage; on the other, Line is the most widely used messaging app in Taiwan. Support for WhatsApp, Telegram, Signal, Facebook Messenger, iMessage (Apple), and Google RCS (Android phones) may be added as needed, and doing so would not be difficult.
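As an illustration of the "send the result to Line" step, here is a minimal sketch that builds a push request for the LINE Messaging API using only the standard library. This README does not say which Line API the program actually uses, so the endpoint choice, the function name `build_line_push_request`, and the `user_id` and `channel_token` parameters are all assumptions made for this example.

```python
import json
import urllib.request

# Push endpoint of the LINE Messaging API (an assumption: the repository
# may use a different Line API).
LINE_PUSH_URL = "https://api.line.me/v2/bot/message/push"


def build_line_push_request(text, user_id, channel_token):
    """Build (but do not send) a request that pushes one text message.

    All three parameters are hypothetical placeholders for this sketch:
    text is the recognized transcript, user_id is the Line recipient ID,
    and channel_token is the channel access token.
    """
    payload = {"to": user_id, "messages": [{"type": "text", "text": text}]}
    return urllib.request.Request(
        LINE_PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {channel_token}",
        },
        method="POST",
    )
```

To actually send the message, the returned object could be passed to `urllib.request.urlopen(...)`. Building the request separately from sending it keeps the network call in one place and makes the payload easy to inspect.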

I then changed the audio input to a radio signal. The function is only complete if speech can still be recognized clearly in an extremely noisy environment. The code includes a keyword check (e.g. "mayday") whose purpose is to automatically trigger additional alerts in an emergency. Later it will be ported to Raspberry Pi and expanded to multiple systems.
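The keyword check described above could look roughly like the following. This is a sketch, not the code in this repository: the function name `find_emergency_keywords` and the default keyword tuple are hypothetical, and only "mayday" comes from the README itself.

```python
import re

# Only "mayday" is named in the README; extend this tuple as needed.
EMERGENCY_KEYWORDS = ("mayday",)


def find_emergency_keywords(transcript, keywords=EMERGENCY_KEYWORDS):
    """Return the keywords that occur as whole words in the transcript.

    Matching is case-insensitive and uses word boundaries, so a keyword
    embedded inside a longer word is not reported.
    """
    normalized = transcript.lower()
    hits = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, normalized):
            hits.append(keyword)
    return hits
```

A caller might run this on each recognized transcript and trigger the extra alert whenever the returned list is non-empty. Whole-word matching is a design choice for this sketch; noisy radio transcripts may need fuzzier matching.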

Another point is that I am certainly not the first person in the world to think of doing this. After a Google search I found the YouTube video linked above as the radio-to-text reference, but its input source is too clean. Real radio signals are extremely noisy and very hard to recognize (see the sample recordings in the Audio directory, which almost no AI can recognize). Airline pilots, however, listen to this kind of audio every day, so they have almost no trouble with it. There is an interesting parallel in surveillance systems. If someone you know well appears in surveillance footage, you can easily tell who it is from their build, gait, clothing, and behavior, even when the image is blurry. Artificial intelligence, by contrast, can only recognize clear faces and cannot judge the way a real human does. That is why radio audio is difficult.
