Transcribing Speech to Text with Python and Google Cloud Speech API

This tutorial walks through using the Google Cloud Speech API to transcribe a large audio file.

All code and sample files can be found in the speech-to-text GitHub repo.

Sample Results

This approach works, but I found that results vary greatly based on the quality of the input.

Transcribing a Reading by My Wife

I asked my wife to read something out loud for about 1.5 minutes, as if she were dictating to Siri. She is a native English speaker, and we recorded using the microphone on an iPhone 6s.



Which resulted in the following transcript:

00:00:00 this Dynamic Workshop aims to provide up to date information on pharmacological approaches, issues, and treatment in the geriatric population to assist in preventing medication-related problems, appropriately and effectively managing medications and compliance. The concept of polypharmacy parentheses taking multiple types of drugs parentheses will also be discussed, as the
00:00:30 is a common issue that can impact adverse side effects in the geriatric population. Participants will leave with a knowledge and considerations of common drug interaction and how to minimize the effects that limit function. Summit professional education is approved provider of continuing education. This course is offered for 6
00:01:00 . this course contains a Content classified under the both the domain of occupational therapy and professional issues.

I think that Google Cloud Speech API did an amazing job, getting over 95% of the content right, especially considering that this was not a professional recording and that you can hear my kid saying something in the background 🙂

Transcribing a Radio Broadcast with a Few Different Voices

A reader sent me the following audio file, recorded from the 95.5 Sports Hub radio station (broadcast on January 26th, 2018) during the Toucher & Rich morning show. This, too, turned out better than I expected.

00:00:00 announced that there was going to be a new XXX FL it was going to start in two years and here’s what he had to say that you accept kickoff in 2020 quite frankly we’re going to give the game of football back to fans I’m sure everyone has a lot of questions for me but I also have a lot of questions for you in fact we’re going to ask a lot of questions and listen to players coaches
00:00:30 call experts technology executive members of the media and anyone else who understands and loves the game of football but most importantly we’re going to be listening to someone ask that the will the question of what would you do if you can reimagine the game of professional football would you frenchtons eliminate halftime would you have if you were commercial breaks but the game of foot
00:01:00 I’ll be faster when the rules be simpler can you ask Chef elevated fan Centric with all the things you like to see in the last of the things you don’t and no doubt a lot of Innovations along the way we will put you at a shorter faster-paced family-friendly and easier to understand game don’t get me wrong it’s still football but it’s professional football reimagined Sims 4 launching a 20
00:01:30 hey we have two years which is plenty of time to really get it right so aside from family friendly which I just think means that you have to stand for the national anthem I have no idea because the other one was very sex. That’s why is it either it was the cheerleaders with the super tight outfits and stuff cheerleaders were dressed and I stripped it sounds like a very good idea sounds like he has he has no plan no he does he’s taking everything he does have
00:02:00 and it said all the teams are going to be owned by the same entity he knows that they’re starting with a team and that they’re going to be shorter games with maybe no halftime with inferior Talent no not necessarily interior Town there’s already a saturation of football as is that is the biggest thing that people been complaining about the game what is he thinking you know what he said you ate yesterday you said we’re going to make it short and then we want your ideas no gimmicks all the things that God was just playing around
00:02:30 this does feel like a guy who’s had enormous prefer

Transcribing a Speech by Winston Churchill

I wanted to challenge the script further, so I decided to run it on a famous speech by Winston Churchill, titled The Threat of Nazi Germany.

Here is the audio file:



Which resulted in the following transcript:

00:00:00 many people think that the best way to escape War if the dwelling and then print them DVD for the younger generation they plump the grizzly photographs Before Their Eyes they feel that they dilate of generals and admirals they do not fit the crime I didn’t think they’d father
00:00:30 human strife how old is teaching in preventing us from attacking or invading any other country with the do so how would it help if we were attacked or invaded on stove that is a question we have to ask what did they does contempt of the Lord Beaverbrook
00:01:00 I’ll listen to the impassioned the field by George would they agree to meet that famous South African general identity I have bone responsibilities for the safety of this country in grievance time
00:01:30 we could convince and persuade them to go back play my play it seems to me you are rich we are what we are hungry it would be in Victoria’s we have been defeated you have valuable, we have not you have your name you have had the phone
00:02:00 set up pencil future about all I see are they would say you are weak and we are strong after all my friend your nephew all the way by that railing for nation of nearly 70 million the most educated industrial scientific discipline people in the world loving cup from childhood
00:02:30 all Epic Gloria Texas iron and death in battle at the noblest face for men yeah I need the nation we could have been done in order to augment its Collective Strength yeah definition of a group of preaching a gospel of intolerance and unrestrained by the wall by Parliament
00:03:00 public opinion in that country all packages speeches or morbid Wahlberg off of getting off the press I’m down you cable of Columbus they have a meeting dial shalt not kill it is the plenty of photos and or both now
00:03:30 play Ariana me with the upload speed I’m ready to that end lamentable weapon Javier against which all Navy is no defense and before which women and children so weak and frail capacity of the warriors on the front-line trenches all live equal adding partial patio
00:04:00 play with you but with the new weapon, new method of compelling the submission of racing bike terrorizing and torturing population and worst of all the more
00:04:30 the ball in cricket the structure of its social and economic life some more of those who may make it there praying love you too fat Grim despicable fact and invasive affect ionic again what are we to do

The result is an order of magnitude worse than my wife’s recording. Most likely this is caused by the poor audio quality. In addition, Churchill used a lot of words that are no longer commonly used.

If you are still reading, let’s get started.

1. Sign Up for a Free Tier Account

Google Cloud offers a Free Tier plan, which is what this tutorial uses. An account is required to get an API key. Note that Google may still ask for credit card details in order to activate the free tier.

2. Generate an API Key

Follow these steps to generate an API key:

  1. Sign in to the Google Cloud Console
  2. Click “APIs & Services”
  3. Click “Credentials”
  4. Click “Create Credentials”
  5. Select “Service Account Key”
  6. Under “Service Account”, select “New service account”
  7. Name the service (whatever you’d like)
  8. Select Role: “Project” -> “Owner”
  9. Leave the “JSON” option selected
  10. Click “Create”
  11. Save the generated API key file
  12. Rename the file to api-key.json

If you plan to test this code, make sure to move the key into the cloned speech-to-text repo.
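The downloaded key is plain JSON, and the recognizer later consumes it as a raw string. As a quick sanity check before going further (a sketch of my own, not part of the tutorial scripts; the load_credentials helper is hypothetical), you can confirm the file actually parses:

```python
import json

def load_credentials(path="api-key.json"):
    """Read the service-account key and fail fast if it isn't valid JSON."""
    with open(path) as f:
        contents = f.read()
    json.loads(contents)  # raises json.JSONDecodeError if the file is bad/truncated
    return contents

# The same check, shown on an inline sample instead of a real key file:
sample = '{"type": "service_account", "project_id": "my-project"}'
parsed = json.loads(sample)
print(parsed["type"])  # service_account
```

A surprising number of setup problems (e.g. saving the wrong file, or an HTML error page instead of the key) show up as a JSON parse error here rather than later inside the API call.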

3. Convert Audio File to WAV Format

I ran into issues when trying to convert my audio file via command line tools. Instead, I used Audacity (an open source audio editing tool) to convert my file to WAV format. Audacity is great and I highly recommend it.

The steps to convert:

  1. Open the file in Audacity
  2. Click the “File” menu
  3. Click “Save Other”
  4. Click “Export as WAV”
  5. Export with the default settings

4. Break up audio file into smaller parts

The Google Cloud Speech API only accepts files no longer than 60 seconds. To be on the safe side, I broke my files into 30-second chunks. To do that, I used an open source command line tool called ffmpeg. It can be downloaded from its site; on a Mac, I installed it with Homebrew via brew install ffmpeg.

Here is the command I used to break up my file:

# Clean out old parts if needed via rm -rf parts/*
ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav

Here, source/genevieve.wav is the name of the input file, and parts/out%09d.wav is the format for the output files. %09d indicates that the file number will be zero-padded to 9 digits (i.e. out000000001.wav), allowing the files to be sorted alphabetically. This way the ls command returns files in the right order. Note that the parts/ directory must already exist; ffmpeg will not create it for you.
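You can see why the zero padding matters with a quick experiment (the filenames here are made up for illustration):

```python
# Zero-padded names sort correctly as plain strings:
padded = ["out000000010.wav", "out000000002.wav", "out000000001.wav"]
print(sorted(padded))
# ['out000000001.wav', 'out000000002.wav', 'out000000010.wav']

# Unpadded names do not: "10" sorts before "2" lexicographically.
unpadded = ["out10.wav", "out2.wav", "out1.wav"]
print(sorted(unpadded))
# ['out1.wav', 'out10.wav', 'out2.wav']
```

The scripts below rely on this: they call sorted() on the directory listing to stitch the chunks back together in order.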

5. Install required Python modules

I added a requirements.txt to the example repo with all the needed libraries. It can be used to install everything via:

pip3 install -r requirements.txt

The real hero on this list is SpeechRecognition. It does most of the heavy lifting.

The rest of the libraries come with the official google-api-python-client package.

I also used the tqdm module to show progress in the slower version of the script.

6. Running the Code

Finally, we can run the Python script to get the transcript. For example python3 fast.py.

The slow version

Here is the GitHub link.

This script:

  1. Loads the API key from step 2 into memory
  2. Gets a list of audio files (chunks)
  3. For every file, calls the speech-to-text API endpoint
  4. Adds the result to a list
  5. Combines all results, adding a timestamp every 30 seconds
  6. Saves the result to transcript.txt

import os
import speech_recognition as sr
from tqdm import tqdm

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

all_text = []

for f in tqdm(files):
    name = "parts/" + f
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text)

transcript = ""
for i, t in enumerate(all_text):
    total_seconds = i * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)

The code works, but it does take a while on longer source files.

Faster version

To speed things up, I added threading to my slow version. I describe the method used in detail in Simple Python Threading Example post.

Here is the GitHub link.

The main difference is that I moved the processing into a function and added logic at the end to sort the processed results into the right order.

import os
import speech_recognition as sr
from tqdm import tqdm
from multiprocessing.dummy import Pool
pool = Pool(8) # Number of concurrent threads

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

def transcribe(data):
    idx, file = data
    name = "parts/" + file
    print(name + " started")
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    print(name + " done")
    return {
        "idx": idx,
        "text": text
    }

all_text = pool.map(transcribe, enumerate(files))
pool.close()
pool.join()

transcript = ""
for t in sorted(all_text, key=lambda x: x['idx']):
    total_seconds = t['idx'] * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'])

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)

Conclusion

Results may vary, but there is utility even in poor transcriptions. For example, I had an hour-and-a-half audio recording from a hand-over meeting with my former co-worker. I remembered that he mentioned something at some point, but I was dreading listening through the 1.5-hour audio file to find it. I ran the recording through this script, quickly found the keywords I needed, and the timestamp pointed me to the right part of the audio file.
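That kind of keyword search takes only a few lines of Python over the generated transcript. This sketch (my own illustration, not one of the tutorial scripts; the transcript contents and the find_keyword helper are invented) relies on the "HH:MM:SS text" line format that transcript.txt uses:

```python
# Invented sample in the same "HH:MM:SS text" format that transcript.txt uses.
transcript = """\
00:00:00 welcome to the hand-over meeting
00:00:30 the deploy credentials are in the team vault
00:01:00 next quarter we migrate the billing service
"""

def find_keyword(text, keyword):
    """Return the timestamps of every 30-second chunk mentioning the keyword."""
    hits = []
    for line in text.splitlines():
        timestamp, _, words = line.partition(" ")
        if keyword.lower() in words.lower():
            hits.append(timestamp)
    return hits

print(find_keyword(transcript, "credentials"))  # ['00:00:30']
```

Even with a mediocre transcript, matching a distinctive keyword is usually enough to land within 30 seconds of the right spot in the audio.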

For native English speakers like my wife, the Google Cloud Speech API can easily replace a professional transcribing service, at a fraction of the cost.

78 thoughts on “Transcribing Speech to Text with Python and Google Cloud Speech API”

  1. Google API is not free you still need to enter the CC details in order to use the 60mins for free/month. You should’ve mentioned it in the beginning so no one will try to find out it’s not gonna work.

  2. What if the file contains 4 minutes of audio? I think its gonna be bit messy; instead breaking them in to smaller parts, is there anyway to break them 4 minutes each for 8 minutes audio file? Why does Google Cloud Speech API only accepts files no longer than 60 seconds?If Google Cloud Speech API works to transcribe a large audio file in one shot instead splitting them then it could have been easier for us. Since we are not a tech geek though we have caliber to learn a bit of coding.

    Does this API app really help me to transcribe both small n larger audio files into the text format? Since I am a Transcriber

  3. Hi Alex! Thank you for this article, excelent!!!;

    I tried to run the script to slice the audio and got the following error:
    SyntaxError: invalid syntax
    [Finished in 0.9s with exit code 1]
    [shell_cmd: python3 -OO -u “/Users/SilvinoDiaz/Desktop/speech-to-text-master/untitled.py”]
    [dir: /Users/SilvinoDiaz/Desktop/speech-to-text-master]
    [path: /Users/SilvinoDiaz/opt/anaconda3/bin:/Users/SilvinoDiaz/opt/anaconda3/condabin:/Users/SilvinoDiaz/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/opt/X11/bin:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands]

    The IDLE is ST3
    I don’t know if it has to do with the installation of ‘anconda’ which causes the failure.
    Any idea?
    Thank you very much.

  4. Hi, Thanks for this code. For more than 10 minutes, the chunk number 11 and 12 appears as the second oaragraph and this part of the text becomes misplaced. My question is why is this happening?

  5. Alex, when I try and run ffmpeg to break up the audio file, it keeps giving me an error saying that it couldn’t segment and write the headers, how would I change the command so that ffmpeg creates each wav file as it goes??

  6. Alex, I am getting this error when I try and use ffmpeg to break up my audio file:

    C:\Users\hmkur\Desktop\Python\Transcribing_Audio>ffmpeg -i source/valve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
    ffmpeg version 4.2.1 Copyright (c) 2000-2019 the FFmpeg developers
    built with gcc 9.1.1 (GCC) 20190807
    configuration: –enable-gpl –enable-version3 –enable-sdl2 –enable-fontconfig –enable-gnutls –enable-iconv –enable-libass –enable-libdav1d –enable-libbluray –enable-libfreetype –enable-libmp3lame –enable-libopencore-amrnb –enable-libopencore-amrwb –enable-libopenjpeg –enable-libopus –enable-libshine –enable-libsnappy –enable-libsoxr –enable-libtheora –enable-libtwolame –enable-libvpx –enable-libwavpack –enable-libwebp –enable-libx264 –enable-libx265 –enable-libxml2 –enable-libzimg –enable-lzma –enable-zlib –enable-gmp –enable-libvidstab –enable-libvorbis –enable-libvo-amrwbenc –enable-libmysofa –enable-libspeex –enable-libxvid –enable-libaom –enable-libmfx –enable-amf –enable-ffnvcodec –enable-cuvid –enable-d3d11va –enable-nvenc –enable-nvdec –enable-dxva2 –enable-avisynth –enable-libopenmpt
    libavutil 56. 31.100 / 56. 31.100
    libavcodec 58. 54.100 / 58. 54.100
    libavformat 58. 29.100 / 58. 29.100
    libavdevice 58. 8.100 / 58. 8.100
    libavfilter 7. 57.100 / 7. 57.100
    libswscale 5. 5.100 / 5. 5.100
    libswresample 3. 5.100 / 3. 5.100
    libpostproc 55. 5.100 / 55. 5.100
    [wav @ 0000015fe3028d80] Discarding ID3 tags because more suitable tags were found.
    Guessed Channel Layout for Input Stream #0.0 : stereo
    Input #0, wav, from ‘source/valve.wav’:
    Metadata:
    title : valve
    encoder : Lavf58.20.100 (libsndfile-1.0.24)
    Duration: 00:06:47.20, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
    [segment @ 0000015fe3461640] Opening ‘parts/out000000000.wav’ for writing
    [segment @ 0000015fe3461640] Failed to open segment ‘parts/out000000000.wav’
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    How can I change the code so that it creates a new wav file everytime it needs to??

  7. Alex, when I run ffmpeg to try and break up my audio file, it is giving me this error:

    C:\Users\hmkur\Desktop\Python\Transcribing_Audio>ffmpeg -i source/valve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
    ffmpeg version 4.2.1 Copyright (c) 2000-2019 the FFmpeg developers
    built with gcc 9.1.1 (GCC) 20190807
    configuration: –enable-gpl –enable-version3 –enable-sdl2 –enable-fontconfig –enable-gnutls –enable-iconv –enable-libass –enable-libdav1d –enable-libbluray –enable-libfreetype –enable-libmp3lame –enable-libopencore-amrnb –enable-libopencore-amrwb –enable-libopenjpeg –enable-libopus –enable-libshine –enable-libsnappy –enable-libsoxr –enable-libtheora –enable-libtwolame –enable-libvpx –enable-libwavpack –enable-libwebp –enable-libx264 –enable-libx265 –enable-libxml2 –enable-libzimg –enable-lzma –enable-zlib –enable-gmp –enable-libvidstab –enable-libvorbis –enable-libvo-amrwbenc –enable-libmysofa –enable-libspeex –enable-libxvid –enable-libaom –enable-libmfx –enable-amf –enable-ffnvcodec –enable-cuvid –enable-d3d11va –enable-nvenc –enable-nvdec –enable-dxva2 –enable-avisynth –enable-libopenmpt
    libavutil 56. 31.100 / 56. 31.100
    libavcodec 58. 54.100 / 58. 54.100
    libavformat 58. 29.100 / 58. 29.100
    libavdevice 58. 8.100 / 58. 8.100
    libavfilter 7. 57.100 / 7. 57.100
    libswscale 5. 5.100 / 5. 5.100
    libswresample 3. 5.100 / 3. 5.100
    libpostproc 55. 5.100 / 55. 5.100
    [wav @ 0000015fe3028d80] Discarding ID3 tags because more suitable tags were found.
    Guessed Channel Layout for Input Stream #0.0 : stereo
    Input #0, wav, from ‘source/valve.wav’:
    Metadata:
    title : valve
    encoder : Lavf58.20.100 (libsndfile-1.0.24)
    Duration: 00:06:47.20, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
    [segment @ 0000015fe3461640] Opening ‘parts/out000000000.wav’ for writing
    [segment @ 0000015fe3461640] Failed to open segment ‘parts/out000000000.wav’
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    It is saying that it failed to open segment, it seems that though this might mean that and empty .wav needs to be waiting for each segment??? How can I change the code so that it creates a .wav file when it needs to?

  8. Hi Alex,

    I am using your code to convert some voice commands to text, but run into this error when I run the ‘fast.py’ script.


    File “/Users/Tony/anaconda3/lib/python3.7/site-packages/speech_recognition/init.py”, line 937, in recognize_google_cloud
    if “results” not in response or len(response[“results”]) == 0: raise UnknownValueError()

    UnknownValueError

    I think I’ve followed all the steps correctly, except for step 4, as my files are already smaller than 30 seconds. Have very little coding experience, any insight on this would be greatly appreciated! 🙂

    Kind regards,

    Tony

  9. Hi, have you thought about implementing a self-hosted audio transcribe server. This would be a great addition to the community as I agree that many of the professional services costs too much for individuals who uses it occasionally (like me!). Thanks for the insightful article.

    • I have, it would still need Google Cloud Auth, unless I wanted to pay for it myself. I think it would be fairly simple for somebody to do using Google Cloud API as outlined in this article, but ultimately I didn’t feel like I wanted to make a business out of it and didn’t have time to work on it as a side project (my free time is fairly limited since I have two little kids).

  10. Alex, probably a duplicate reply here, didn’t save first, my bad. I have made a fork and a couple of enhancements without over engineering and didn’t know if you want “forks” or “contributions to new branch or master. Sent a Tweet as well.

  11. Hi Alex,

    FYI – First, love it, great example of how to get off the ground! Thank you so much for what you have produced and shared!
    QUESTION / ACTION REQUESTED: I have a couple of DCR/Issues I found and I have made changes to address them and wanted to know how you would propose integrating them?

    My proposals
    2a. a new git hub project branched from yours since it is reference for the article
    2b. You determine and establish collaboration guidelines on your github project and I and others like MP below create issues and code check-ins against them (with maybe dev tests 🙂 ) on a separate branch which you can review and decide if they warrant inclusion in your project based on your goal and scope and release as a new version
    2c. Something better you or MP or others come up with.

    Cheers!

    • Sorry, I don’t think I ever got notified of this. I just changed jobs, and it’s possible that I overlooked it.

      I think it’s a great idea and I am happy to make you a co-owner of that if you are interested. Can you ping me on Twitter again or drop me a line here https://techtldr.com/contact/ and we can continue the discussion via email.

        • Now that I think about it, I can just move the article version into a branch and make master a living thing. The repo already has 69 stars, so it would be a shame to give it up 🙂

        • I also faced the same error. It’s because of the ‘google-api-python-client’ version. Install the google-api-python-client as:

          pip install google-api-python-client==1.6.4

  12. So my previous post I’ve solved all the issues that came about and reading over the comments the following function may help others too. I found that reducing the silence blocks much like what would be useful for podcasts solved all issues with returning null transcripts.

    Silence how-to https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/

    remove_silence () {
        tempfile=$(date '+%Y%m%d%H%M%S')

        # Removes short periods of silence
        sox $1 $tempfile.wav silence -l 1 0.1 1% -1 2.0 1%

        # Shortens long periods of silence, ignoring noise bursts
        sox $1 $tempfile.wav silence -l 1 0.3 1% -1 2.0 1%

        mv -v $1 $tempfile'_original_'$1
        mv -v $tempfile.wav $1
    }

  13. Hi Alex, I’ve been updating the components of processing larger files and the fast and slow scripts are pausing on seemingly kosher wav files, and the fast script seems to bring down the network even when I bring down the threads, I was wondering if there were any thoughts on writing out the transcription files more often so that the whole batch of queries is not lost? And has anyone updated the script to work a little more failsafey over a say 10 hour audio chunk? Thanks a bunch its nice to have something to use to bring down the cost of online transcription services!

  14. Hi Alex,

    I am using a shorter version of the code on a single file:

    ##############
    import speech_recognition as sr

    r = sr.Recognizer()

    with open("api-key.json") as f:
        GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

    test_audio = sr.AudioFile('C://users//me//desktop//page2.wav')
    with test_audio as source:
        audio = r.record(source)

    r.recognize_google_cloud(audio, language='es-MX',
                             credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    ##############

    but I am getting two error messages for this snippet. The first is ModuleNotFoundError: No module named ‘oauth2client’. I have pip installed oauth2client as well as oauthlib and google auth.

    The second related error is:
    RequestError: missing google-api-python-client module: ensure that google-api-python-client is set up correctly.

    I haven’t been able to solve these issues despite troubleshooting at length. Do you have any idea how to fix this?

  15. Hi Alex,

    First off, thank you so much for this code! Now, I don’t know if the below error is an issue from my side or GCloud is being messy, but I would love any help you and this community can provide. Here is my error –

    Traceback (most recent call last):
    File “C:\Python36\lib\site-packages\speech_recognition__init__.py”, line 930, in recognize_google_cloud
    response = request.execute()
    File “C:\Python36\lib\site-packages\oauth2client_helpers.py”, line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
    File “C:\Python36\lib\site-packages\googleapiclient\http.py”, line 842, in execute
    raise HttpError(resp, content, uri=self.uri)
    googleapiclient.errors.HttpError:

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “fast.py”, line 28, in
    all_text = pool.map(transcribe, enumerate(files))
    File “C:\Python36\lib\multiprocessing\pool.py”, line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
    File “C:\Python36\lib\multiprocessing\pool.py”, line 608, in get
    raise self._value
    File “C:\Python36\lib\multiprocessing\pool.py”, line 119, in worker
    result = (True, func(*args, **kwds))
    File “C:\Python36\lib\multiprocessing\pool.py”, line 44, in mapstar
    return list(map(*args))
    File “fast.py”, line 21, in transcribe
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    File “C:\Python36\lib\site-packages\speech_recognition__init__.py”, line 932, in recognize_google_cloud
    raise RequestError(e)
    speech_recognition.RequestError:

    I’ve waited for 10 minutes after enabling the API and tried again, but no luck.

    Thanks in advance.
    Regards,
    Rashmil.

    • Hi Alex and Rashmil,
      Have you found any solution to this issue. I have the same issue and dont know how to proceed.
      Thanks in advance
      Best
      Ali

      • Hi Alex,
        After changing the sound file I had better results. Still if google.cloud could not recognize some parts of the audio an error pops. So is there any way to tell google client to ignore if some parts of the audio not clear.

  16. Thank you so much for providing this code. I would like to run the code for 100 audio file. How would that be possible?

    • Not sure, I think if you look at the pull requests in the repo somebody automated file conversion (although I haven’t merged that in yet). From there you may be able to automate it further.

  17. Hi Alex, thanks for sharing your code. I managed to run it as it is and also used different mp3 audio files, which I converted to wav using Audacity. Works perfectly! I will trying using a microphone as an audio source.

    Once more many thanks.

    Gideon

  18. Thank you for this grate work. I follow your steps, but I faced this error:
    “C:\Program Files (x86)\Python37-32\python.exe” C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py
    0%| | 0/3 [00:00<?, ?it/s]
    Traceback (most recent call last):
    File “C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition__init__.py”, line 885, in recognize_google_cloud
    try: json.loads(credentials_json)
    File “C:\Program Files (x86)\Python37-32\lib\json__init__.py”, line 348, in loads
    return _default_decoder.decode(s)
    File “C:\Program Files (x86)\Python37-32\lib\json\decoder.py”, line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    File “C:\Program Files (x86)\Python37-32\lib\json\decoder.py”, line 355, in raw_decode
    raise JSONDecodeError(“Expecting value”, s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py”, line 19, in
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    File “C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition__init__.py”, line 886, in recognize_google_cloud
    except Exception: raise AssertionError(“credentials_json must be None or a valid JSON string”)
    AssertionError: credentials_json must be None or a valid JSON string

    Process finished with exit code 1

    Please help

  19. Luke, your last audio file is crashing the code because there is no speech to transcribe, listen to your last file, if it is just music and no voice, delete it and it should work.

  20. Hey Alex,

    Thanks for putting together the comprehensive tutorial and code – I’ve managed to transcribe some of my own audio but am running into problems with other files.

    I have a collection of files, all of which I’m converting to mono @ 48000hz (doing this to remove variables for debugging) and then running through fast.py.

    The problem I’m encountering appears to occur when attempting to process the final 30s audio chunk in the ‘parts’ folder. For example, my current file has been split into 74 parts – all of which were successfully processed apart from #74.

    This is the traceback I’m getting:

    Traceback (most recent call last):
    File “fast.py”, line 28, in
    all_text = pool.map(transcribe, enumerate(files))
    File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py”, line 253, in map
    return self.map_async(func, iterable, chunksize).get()
    File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py”, line 572, in get
    raise self._value
    speech_recognition.UnknownValueError

    Do you have any suggestions why this might be the case?

    Unsure why it’s working fine for some files, but not for others.

    Thanks
    Luke

  21. Very good job. Thank you.
    I tried your code for my country France (World champion ;=)). Excellent
    Change in fast.py
    1/ text = r.recognize_google_cloud(audio_data=audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="fr-FR")
    2/ transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'].encode('utf8'))
    and it works: the text comes out in French.
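
    For anyone else localizing fast.py, the timestamp formatting from change 2 can be isolated and checked on its own. The helper below is a hypothetical sketch, not the exact tutorial code; the recognizer call from change 1 still needs valid credentials, so it is only shown commented out:

```python
# Hypothetical helper isolating the transcript-line formatting from fast.py;
# the recognized text (t['text']) is passed in as a plain string here.
def format_line(seconds, text):
    """Build one timestamped transcript line, e.g. '00:01:30 bonjour'."""
    h = seconds // 3600
    m = (seconds % 3600) // 60
    s = seconds % 60
    return "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, text)

# The recognizer call from change 1 would then be (not run here, needs credentials):
# text = r.recognize_google_cloud(audio_data=audio,
#                                 credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS,
#                                 language="fr-FR")
```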

  22. Hi Alex,
    Your code is very helpful… can you tell me what the code would be to add punctuation at the end of each line?

    Please share…

    Regards,
    Milan

  23. Hi,

    I am getting the below error:

    "Sync input too long. For audio longer than 1 min use LongRunningRecognize with a 'uri' parameter."

    Which I understand is due to the length of the audio file (more than 1 min). I googled the error and found the suggestion at this link:

    https://stackoverflow.com/questions/44835522/why-does-my-python-script-not-recognize-speech-from-audio-file?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

    The above link ultimately leads to the sample code below:

    =======================
    def transcribe_gcs(gcs_uri):
        """Asynchronously transcribes the audio file specified by the gcs_uri."""
        from google.cloud import speech
        from google.cloud.speech import enums
        from google.cloud.speech import types
        client = speech.SpeechClient()

        audio = types.RecognitionAudio(uri=gcs_uri)
        config = types.RecognitionConfig(
            encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
            sample_rate_hertz=16000,
            language_code='en-US')

        operation = client.long_running_recognize(config, audio)

        print('Waiting for operation to complete...')
        response = operation.result(timeout=90)

        # Each result is for a consecutive portion of the audio. Iterate through
        # them to get the transcripts for the entire audio file.
        for result in response.results:
            # The first alternative is the most likely one for this portion.
            print(u'Transcript: {}'.format(result.alternatives[0].transcript))
            print(u'Confidence: {}'.format(result.alternatives[0].confidence))

    So does this mean I will have to re-write the code using a different set of modules, or can we plug the long_running_recognize call into your code somewhere?

    amitesh

  24. Hi Alex, does the Google Speech-to-Text API support multi-speaker recognition while transcribing? Also, does it output timestamps for each word or sentence as well? Sorry for shooting so many questions, but my final question is: does it have an offline version that one can use? Thanks.

  25. Hello Alex, I tried to generate an API key and it says that I have to create a billing account, which requires credit card information. So how does it work? Is it free? Do I need to pay to get the script to work? Thanks.

    • Yes, unfortunately a credit card is required to register, but they do offer a free tier, so you shouldn’t be charged anything.

  26. How can we use this Google API to convert streaming speech to text? What should our code look like?

  27. Hello Alex,

    I am at the very early stage of this activity, i.e. I have installed all the libraries you mentioned. I am using Windows 10.

    I wanted to generate the API key, but I guess I need to pay for that, right? Second, I couldn’t locate “API Manager” in the Google Cloud console. All I could see was 3 tiles.

    • I am not sure. You should be able to do it under the free trial. Re the UI, maybe they redesigned it. It seems like other people were able to get it to work. I’ll have to check it out later. If anybody knows, please comment.

  28. The ffmpeg command "ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav" doesn’t work when I try to run it:
    Guessed Channel Layout for Input Stream #0.0 : mono
    Input #0, wav, from 'source/genevieve.wav':
    Duration: 00:01:10.33, bitrate: 768 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s
    [segment @ 0000021e48be0dc0] Opening 'parts/out000000000.wav' for writing
    [segment @ 0000021e48be0dc0] Failed to open segment 'parts/out000000000.wav'
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    I don’t know how to fix it or what I am doing wrong.
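
A guess based on the error text above: ffmpeg’s segment muxer reports “No such file or directory” when the output directory doesn’t exist, and it won’t create parts/ on its own. Creating the directory first may be all that’s needed:

```shell
# Create the output directory first; ffmpeg's segment muxer won't make it for you.
mkdir -p parts
# Then re-run the split exactly as in the tutorial:
# ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
```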

  29. Found a way to avoid breaking up a long audio file:
    1. Convert the audio file to FLAC (downmix from stereo to mono); Audacity can export to FLAC. Make note of the bitrate.
    2. Upload the FLAC file to Google Cloud Storage; create a new bucket if need be, no need to make it public.
    3. Edit transcribe_async.py: find the bitrate for the FLAC and change it accordingly; also update the timeout value to 600 (10 min).
    4. Run the command: python transcribe_async.py gs://bucketname/filename.flac

  30. Hello Alex, thank you very much for your collaboration.
    Alex, if I wanted to change the language of the API, for example with the parameter language_code='es-CO', where should I do it? Thank you

  31. Here’s something I tried. I already had WAV recordings obtained from an MP3 player.
    Hence, I decided to skip the MP3->WAV conversion step.
    I ran into multiple errors, mainly due to format inconsistency with the native WAV type.
    And so, I’m posting this.
    I’ve used “VOICE001.wav” as an example. It works well with MP3 inputs as well.
    For MP3, skip step 1.

    Converting to the right WAV format

    1. Check your WAV file’s properties.
    ffprobe VOICE001.wav
    # Input #0, wav, from 'VOICE001.wav':
    #   Duration: 00:01:16.54, bitrate: 128 kb/s
    #   Stream #0:0: Audio: adpcm_ima_wav ([17][0][0][0] / 0x0011), 32000 Hz, 1 channels, s16p, 128 kb/s
    2. Convert and replace the WAV file with the native type using Audacity, then check again.
    ffprobe VOICE001.wav
    # Input #0, wav, from 'VOICE001.wav':
    #   Duration: 00:01:16.28, bitrate: 512 kb/s
    #   Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, 1 channels, s16, 512 kb/s
    3. For the remaining WAV files, use the native format details for conversion with ffmpeg.
    ffmpeg -i VOICE001.wav -acodec pcm_s16le -ar 32000 VOICE001-win.wav
    # Output #0, wav, to 'VOICE001-win.wav':
    #   Metadata:
    #     ISFT : Lavf58.3.100
    #   Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, mono, s16, 512 kb/s
    #   Metadata:
    #     encoder : Lavc58.9.100 pcm_s16le
    # size= 4768kB time=00:01:16.28 bitrate= 512.0kbits/s speed= 246x
    # video:0kB audio:4768kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001598%

    * Here, the Audio Codec & Sampling Rate fields have been altered to fit the native format settings.

  32. I tried using the code with the source files that you provided (genevieve.wav), however I get the following error:

    ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format

    I did not change any code. Any ideas on what I’m doing wrong here?

    • Did you generate parts with ffmpeg?

      I just re-ran it fresh and it worked for me. I am using Python 3 on macOS.

      What system are you on, and at what point does it fail?

    • Hi,

      Like @Jamshed, I’m getting that same error when I run on genevieve.wav : ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format

      It also includes this in the result: wave.Error: file does not start with RIFF id.

      I check the file:

      $ file out000000002.wav
      out000000002.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz

      $ file -i out000000002.wav
      out000000002.wav: regular file

      $ mediainfo out000000002.wav
      General
      Complete name : out000000002.wav
      Format : Wave
      File size : 966 KiB
      Duration : 10 s 302 ms
      Overall bit rate mode : Constant
      Overall bit rate : 768 kb/s
      Writing application : Lavf56.36.100

      Audio
      Format : PCM
      Format settings : Little / Signed
      Codec ID : 1
      Duration : 10 s 302 ms
      Bit rate mode : Constant
      Bit rate : 768 kb/s
      Channel(s) : 1 channel
      Sampling rate : 48.0 kHz
      Bit depth : 16 bits
      Stream size : 966 KiB (100%)

      So I’m wondering if something is wrong with my ffmpeg install? Any advice appreciated, and thank you for sharing all this.

        • I solved it. It seemed to be conflicting packages in my Python install. I set up a fresh Python 3 environment, re-installed ffmpeg, etc., and it works really well now. Thanks!
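
For anyone else hitting the “could not be read as PCM WAV” / “file does not start with RIFF id” errors above: before blaming the install, one quick sanity check is to open a chunk with Python’s standard-library wave module, which raises wave.Error on the same malformed files that speech_recognition rejects. A minimal sketch:

```python
# Open a chunk with the stdlib wave module; a non-RIFF or non-PCM file
# raises wave.Error here, just as it fails inside speech_recognition.
import wave

def check_pcm_wav(path):
    """Return (channels, sample_rate, duration_seconds), or raise wave.Error."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return w.getnchannels(), rate, w.getnframes() / float(rate)
```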

  33. Hi Alex

    One issue I found: if the number of files in the parts folder exceeds the number of pool workers (say you have 20 files in the parts folder and pool = Pool(8)), only the first 8 files are processed in order, and after that all remaining files are processed out of sequence. I tried a few things but it’s still not working. Even though the map function is supposed to keep the sort order, for some reason the order is only kept for the first 8 files.

      • Using an AWS EC2 Amazon Linux instance with Python 3.6.
        I have a wav file of about 60 MB; I partition it into 55 or 60 second chunks, which generates about 57 files in the parts folder. With a pool size of 8, the first 8 files are in order, but the remaining ones are all in mixed order.
        I tried sorting the list first and confirmed it’s in order, but after the first 8 files the order is lost. Trying the Google async approach but it’s not working yet.

        • Reading over the code, I see that I am taking an extra step to sort by idx. So the only thing I can think of is that those ids come in the wrong order.

          Can you confirm that when you call os.listdir the files show up in the right order?

          • No, they are not, and what I did was apply a sort: files = sorted(os.listdir('parts/')). If I don’t use the sort, the entire transcript is all over the place, meaning the beginning of the wav file could be transcribed in the middle of the text, and so on. I applied sorted(os.listdir('parts/')), confirmed in the shell that all the files are sorted, then ran the script and confirmed that ONLY the first batch of the pool (in this case the first 8 files) is ordered correctly; the next pool worker loses the sort again. Do you know what I mean?

            Here is the listdir output without the sort:

            import os
            files = os.listdir('parts/')
            files
            ['0039.wav', '0048.wav', '0029.wav', '0007.wav', '0025.wav', '0013.wav', '0030.wav', '0020.wav', '0041.wav', '0016.wav', '0010.wav', '0037.wav', '0012.wav', '0017.wav', '0028.wav', '0044.wav', '0038.wav', '0009.wav', '0000.wav', '0024.wav', '0031.wav', '0022.wav', '0023.wav', '0045.wav', '0043.wav', '0036.wav', '0026.wav', '0018.wav', '0014.wav', '0003.wav', '0008.wav', '0005.wav', '0046.wav', '0002.wav', '0033.wav', '0042.wav', '0027.wav', '0011.wav', '0004.wav', '0040.wav', '0019.wav', '0001.wav', '0021.wav', '0032.wav', '0006.wav', '0015.wav', '0047.wav', '0034.wav', '0035.wav']

            Here is the listdir output with the sort:

            import os
            files = sorted(os.listdir('parts/'))
            files
            ['0000.wav', '0001.wav', '0002.wav', '0003.wav', '0004.wav', '0005.wav', '0006.wav', '0007.wav', '0008.wav', '0009.wav', '0010.wav', '0011.wav', '0012.wav', '0013.wav', '0014.wav', '0015.wav', '0016.wav', '0017.wav', '0018.wav', '0019.wav', '0020.wav', '0021.wav', '0022.wav', '0023.wav', '0024.wav', '0025.wav', '0026.wav', '0027.wav', '0028.wav', '0029.wav', '0030.wav', '0031.wav', '0032.wav', '0033.wav', '0034.wav', '0035.wav', '0036.wav', '0037.wav', '0038.wav', '0039.wav', '0040.wav', '0041.wav', '0042.wav', '0043.wav', '0044.wav', '0045.wav', '0046.wav', '0047.wav', '0048.wav']

            But still, for some reason, only the first batch of pool workers is in the right order in the transcript file; from 0009.wav onwards the transcript is no longer in order, even though the map function is supposed to keep the order.
            Strange.

          • Even if map doesn’t keep them sorted, this line, sorted(all_text, key=lambda x: x['idx']), should re-sort them back into order.

            Try to debug this sort/idx and see if something funky happens around there.
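
To illustrate what that line does, here is a standalone sketch with made-up idx values: even if the mapped results came back shuffled, the sort by 'idx' restores chunk order before the transcript is joined. (Note that Pool.map itself is documented to preserve input order, so a shuffled transcript usually points at the file listing, not the pool.)

```python
# Made-up results, shaped like what fast.py's transcribe() returns,
# deliberately out of order to show the re-sort.
all_text = [
    {"idx": 2, "text": "third chunk"},
    {"idx": 0, "text": "first chunk"},
    {"idx": 1, "text": "second chunk"},
]

# The re-sort from fast.py: order by the original chunk index.
ordered = sorted(all_text, key=lambda x: x["idx"])
transcript = "\n".join(t["text"] for t in ordered)
```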

          • I am having the same problem as daz… I added the sort too and it is not sorting correctly (on the fast version).

            I am testing the slow (unthreaded) version to see if it is the threading that is causing the ordering problem.

            files = sorted(os.listdir('parts/'))

            parts/out0000.wav started
            parts/out0002.wav started
            parts/out0006.wav started
            parts/out0010.wav started
            parts/out0014.wav started
            parts/out0008.wav started
            parts/out0004.wav started

    • I just didn’t know that was an option. Thanks for the tip, I’ll have to investigate. Maybe it was just a limitation of the library I was using.

    • Tried the Google async example but it fails halfway through. Do you have a working example using Google async to convert a wav file to text?

      Thanks

  34. Is there a way to overcome the 30-second limitation so I can do the whole file in one try? Or, if I have to break up the file, would it be possible to have the transcript numbered? Like if the input wave files are wave01.wav and wave02.wav, the output would be transcript0102.txt? Thanks for the great script.

      • Here is the use case: I have multiple wav files: Alex.wav, Vida.wav, Jim.wav. I’d like to modify the program so that it reads the inputwav folder containing all the wav files and runs them through the Python program to output alex_transcript.txt, vida_transcript.txt, jim_transcript.txt. But I am having difficulty getting it to work, so I ran each file individually. Thanks, Alex

        • Ah, I see. Yes, then it goes back to figuring out a way to convert the file to a proper wav programmatically and then calling the split-files command (and probably adding a clean-up step later).

          I didn’t get this far.

          Another idea I didn’t get to is splitting the file by silence around the 30-second mark, instead of a hard 30-second split, which can cut mid-sentence/word.

          Good luck! Let me know if you figure any of this out.
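
One possible shape for that batch step, sketched with hypothetical names (inputwav/, parts_<stem>/, and <stem>_transcript.txt are made up for illustration; the ffmpeg split command is only constructed here, not run):

```python
import os

def split_command(src, parts_dir):
    """Build the tutorial's 30-second ffmpeg split command for one file."""
    return ["ffmpeg", "-i", src, "-f", "segment", "-segment_time", "30",
            "-c", "copy", os.path.join(parts_dir, "out%09d.wav")]

def plan_batch(input_dir):
    """Pair each .wav in input_dir with its own parts folder and output name."""
    plans = []
    for name in sorted(os.listdir(input_dir)):
        if not name.lower().endswith(".wav"):
            continue
        stem = os.path.splitext(name)[0].lower()
        plans.append((os.path.join(input_dir, name),
                      "parts_" + stem,
                      stem + "_transcript.txt"))
    return plans
```

Each planned entry would then go through the existing transcription loop, with its parts folder cleaned up afterwards.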

  35. "ffmpeg -i input.mp3 output.wav" converts the mp3 file to a wav file without any compression.
    It is better to have a command do the task instead of a new piece of software if we are automating a task.

    • Unfortunately, something was off about this type of wav, which I did not dig into. Transcription did not work with a wav created like this. Maybe it was just something local to my Mac.

      • I tried the same thing, but for some reason I think it read the wav file backwards, meaning it starts transcribing from the end of the file. Thanks, Alex, for pointing this out. I’ll go back to using Audacity.

      • Thanks for writing all this up! It’s been super helpful. Not sure if it’s still an issue, but I had the same problem. It seems like ffmpeg ignores the format when you’re doing the segmentation… Running it in two lines works for me, though there was probably a better way to actually fix the problem. ffmpeg -i db/foo.m4a -c:a pcm_s16le db/stage1.wav

Comments are closed.