Moving up the stack – time to learn Python – part 11

In the eleventh article in this series, we continue our journey into learning Python. I am happy you are still with me. We finished our last article with a working speech-to-text transcriber. Truthfully, this application is a little limited: whilst capturing your speech and turning it directly into a text document is useful, it is a simple use case. Today we begin upgrading the functionality of the program to accept audio files and YouTube videos for transcribing.

To refresh your memory on our earlier articles, you can read them at the links shown below:

As said, our last article left you with a program that captures audio directly from your microphone and creates a text document from it. Today, we take that application and add the ability to take an audio file as input and feed it through our cloud service to create a transcription. As we move down this path of increasing functionality, hopefully we will end up with something useful.

Why are we doing this and how is Python going to help us?

There are multiple ways to convert an audio file to a text file, but the vast majority of these are commercial products or are limited in the number of words or the length of audio they accept, usually 5 minutes. With the rising number of online meetings, transcription can be slow if done manually, or expensive if you use an online service. Hence our quick and dirty method of rolling your own, to quickly create minutes and actions, or to get the gist of what a meeting was about without the need for 100% accuracy. Couple this with the need to get more done in less time, and automating tasks has become a necessity rather than a luxury. And finally, as we are in a cost-of-living crisis, just throwing money at the problem may not be an option.

What this script will do is utilize Azure to convert audio to text. Microsoft Azure has a Speech to Text conversion service, and we will be using this as the core conversion functionality. Today we will take our simple speech-to-text conversion and expand it to accept audio files, rather than just direct speech from the default microphone, as the source for the transcription.


The majority of the prerequisites were installed during the previous post.  But there will be a few more packages that need to be installed as we move through this article adding features.

Modifying our Python script

In the first article we built on a script from Microsoft Learn. The first thing we are going to do today is alter the script from accepting input from the microphone to accepting a file as the source input. One of the things I noticed when using the original Python script is that it only captured speech until the first pause in the flow. This is because we used the simple “recognize_once” method in our script, which only captures a single continuous stream of text, stopping at the first pause in the conversation. That is not particularly good for transcribing an audio file of unknown length or word count, or even a conversation. So it seems a major rewrite of the core logic is necessary for the script to be useful.

The only change to the imports from the original script is that we have imported the “time” module. This is a built-in module, so there is no need to install it with the “pip” command.

import os
import time
import azure.cognitiveservices.speech as speechsdk

Next we create our function, “transcribe_audio_to_text”. This function takes “audio_file_path” and “output_text_file_path” as arguments. You will recognise the “speech_key” and “service_region” variables from the previous script. They should be the same, unless you have changed your key.

We then create a number of additional variables and objects: “speech_config”, “audio_input”, “speech_recognizer”, “done” and “all_results”. The “speech_config” variable holds an instance of the SpeechConfig class, which includes information about your subscription, such as your key and associated location/region. The next line creates “audio_input”, which uses the AudioConfig class to set the input source of the audio stream. In this case we are using a test speech audio file called “harvard.wav”; this can be downloaded from several locations and is free from royalties, so it is an excellent test file. I downloaded my version from here. Both objects are then used to build the “speech_recognizer” variable, which is the object used later in the function to produce its output. Finally, “done” is a flag that starts out as “False”, and the last line creates an empty list called “all_results”.

def transcribe_audio_to_text(audio_file_path, output_text_file_path):
    # Replace with your own subscription key and region identifier from Azure.
    speech_key, service_region = "subscription key", "azure subscription region"
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    audio_input = speechsdk.AudioConfig(filename=audio_file_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
    done = False
    all_results = []
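Hard-coding the key is fine for a quick test, but since the script already imports “os”, one option is to read both values from environment variables instead. The sketch below is only an illustration: the helper name “load_speech_settings” and the variable names “SPEECH_KEY” and “SPEECH_REGION” are assumptions of mine, not part of the original script.

```python
import os


def load_speech_settings(env=None):
    """Return (speech_key, service_region) read from environment variables.

    SPEECH_KEY and SPEECH_REGION are assumed names; use whatever you exported.
    """
    env = os.environ if env is None else env
    speech_key = env.get("SPEECH_KEY")
    service_region = env.get("SPEECH_REGION")
    if not speech_key or not service_region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running the script.")
    return speech_key, service_region
```

Inside “transcribe_audio_to_text” the hard-coded line could then become “speech_key, service_region = load_speech_settings()”, which keeps the key out of the source file.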

Next, we create a second function. This is our first outing of a nested function; a nested function is only available within the function that contains it. This snippet of code defines a function named “stop_cb” that takes one parameter, “evt”. The function is intended to be used as a callback that stops continuous recognition upon receiving the event “evt”. It prints the message ‘CLOSING on {evt}’ to the standard output, where “{evt}” is replaced by the value of the parameter. Then, the function uses the “nonlocal” keyword to declare that the variable “done” is not local to the function but belongs to the nearest enclosing scope, i.e. the “transcribe_audio_to_text” function, where its value was set to “False”. The function then assigns the value “True” to “done”, which changes the variable in the outer scope. The “nonlocal” keyword is new to us; it is used inside nested functions when a variable being assigned belongs to the enclosing function rather than to the inner one.

Now you may be wondering where the value of the “evt” parameter actually comes from. It is supplied by the “speech_recognizer” object. This particular function is triggered by the recognizer's “canceled” or “session_stopped” events; when either fires, the function sets “done” to “True” and writes out the message 'CLOSING on {}'.format(evt), where the curly brackets are replaced by the details of the event that fired.

    def stop_cb(evt):
        """callback that stops continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True
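To see what “nonlocal” changes, here is a small stand-alone sketch that has nothing to do with Azure. The function names “run_job”, “stop_without_nonlocal” and “stop” are made up for this illustration; only the “done” flag mirrors our script.

```python
def run_job():
    done = False  # lives in the enclosing function, like `done` in our script

    def stop_without_nonlocal():
        # Without `nonlocal`, this assignment creates a brand new local
        # variable; the outer `done` is left untouched.
        done = True

    def stop():
        nonlocal done  # rebind the variable from the enclosing scope
        done = True

    stop_without_nonlocal()
    before = done  # still False
    stop()
    after = done  # now True
    return before, after


print(run_job())  # (False, True)
```

The first inner function silently does nothing useful, which is exactly the trap “nonlocal” lets us avoid in “stop_cb”.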

Next, we create our final nested function, “handle_final_result(evt)”; this is the meat and two veg of our new function. This function takes the event object “evt” as its parameter. In this case the event it handles is “recognized”.

The event object contains the result of the recognition, and the function is called for each recognized phrase until the stream of captured text terminates and “done” changes from “False” to “True”.

The next line populates the “all_results” list with the text captured by the Azure “speech_recognizer”. The function adds a new element to the end of the list, and that element is the string value of “evt.result.text”.

    def handle_final_result(evt):
        all_results.append(evt.result.text)
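The callback can be tried out without calling Azure at all. In the sketch below the events are simulated with “SimpleNamespace” objects that only model the “result.text” attribute, and the phrases are placeholders rather than real transcription output. It also shows why “all_results” needs no “nonlocal”: appending changes the existing list rather than assigning a new value to the name.

```python
from types import SimpleNamespace

all_results = []


def handle_final_result(evt):
    # Appending mutates the existing list, so no `nonlocal`/`global` is needed.
    all_results.append(evt.result.text)


# Simulated events: only the `result.text` attribute is modelled here.
for phrase in ["first sentence.", "second sentence."]:
    fake_evt = SimpleNamespace(result=SimpleNamespace(text=phrase))
    handle_final_result(fake_evt)

print(all_results)  # ['first sentence.', 'second sentence.']
```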

This is the main part of the code, the final act so to speak. It uses the “SpeechRecognizer” instance we created earlier, the main class for recognizing speech, which is stored in a variable called “speech_recognizer”.

It connects the “handle_final_result” function to the recognized event of the “speech_recognizer”. This means that every time the recognizer recognizes some speech, it will call the function and pass the event object to it.

It also connects the “stop_cb” function to two other events of the “speech_recognizer”: “session_stopped” and “canceled”. This function, defined above, handles the cases where the recognition session is stopped or cancelled for some reason.

It calls the “start_continuous_recognition” method of the “speech_recognizer”, which starts a continuous recognition session. This means that the recognizer will keep listening to the audio source and recognizing speech until it is stopped or cancelled.

It enters a while loop that checks the “done” variable, which was defined at the top of our function and indicates whether the recognition session has finished. Inside the loop, it calls the “time.sleep” function with a parameter of 0.5, which pauses execution for half a second. This avoids consuming too much CPU while waiting for the recognition results.

After the loop ends, it prints the contents of the “all_results” list, which contains all the recognized text from the audio source. It also opens the file whose path is stored in the “output_text_file_path” parameter and writes all the recognized text to it.

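    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognized.connect(handle_final_result)
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)
    # Start continuous speech recognition
    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)
    print("Transcription: {}".format(all_results))
    with open(output_text_file_path, 'w') as f:
        for result in all_results:
            f.write(result + '\n')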
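If the connect, start and wait pattern feels abstract, the sketch below imitates it with plain Python. “FakeSignal” and “FakeRecognizer” are hypothetical stand-ins written for this illustration, not Azure SDK classes; they only mimic the idea of events you can connect callbacks to, and of recognition running in the background while the main code waits on “done”.

```python
import threading
import time
from types import SimpleNamespace


class FakeSignal:
    """Hypothetical stand-in for an SDK event: stores callbacks and calls them."""

    def __init__(self):
        self._callbacks = []

    def connect(self, callback):
        self._callbacks.append(callback)

    def fire(self, evt):
        for callback in self._callbacks:
            callback(evt)


class FakeRecognizer:
    """Hypothetical recognizer that 'recognizes' canned phrases on a worker thread."""

    def __init__(self, phrases):
        self._phrases = phrases
        self.recognized = FakeSignal()
        self.session_stopped = FakeSignal()

    def start_continuous_recognition(self):
        # Returns immediately; the work happens in the background.
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        for phrase in self._phrases:
            time.sleep(0.05)  # pretend to process some audio
            self.recognized.fire(SimpleNamespace(result=SimpleNamespace(text=phrase)))
        self.session_stopped.fire(SimpleNamespace())


def transcribe(recognizer):
    done = False
    all_results = []

    def stop_cb(evt):
        nonlocal done
        done = True

    def handle_final_result(evt):
        all_results.append(evt.result.text)

    recognizer.recognized.connect(handle_final_result)
    recognizer.session_stopped.connect(stop_cb)
    recognizer.start_continuous_recognition()
    while not done:
        time.sleep(0.01)  # the main thread just waits for the stop callback
    return all_results


print(transcribe(FakeRecognizer(["first phrase", "second phrase"])))
# ['first phrase', 'second phrase']
```

The point of the sleep in the loop is the same as in our script: the main code has nothing to do until the stop callback flips “done”, so it naps instead of spinning.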

The final line actually calls our function, naming the input file and the output file.

transcribe_audio_to_text('harvard.wav', 'output.txt')

However the first time I ran this script I received the following error:

Unexpected output
This is not expected: why have we got a session cancelled error when the transcript was successful?

This is not what I expected; the transcription worked, as shown by the contents of the Transcription section. So let's fix that logic issue so that we have a clean exit. My first thought was that the issue lay in our use of “start_continuous_recognition”, which keeps the link active until it is actually stopped or cancelled.

Fixing our error

It seems that the actual issue is in the “stop_cb” function: it has limited logic, and was printing its message no matter what the outcome. Adding an if statement, so that error details are only printed when the event reports a cancellation, fixed the issue.

    def stop_cb(evt):
        """callback that stops continuous recognition upon receiving an event `evt`"""
        nonlocal done
        done = True
        # Only canceled events carry cancellation details; session_stopped events do not.
        details = getattr(evt, 'cancellation_details', None)
        # Reaching the end of the audio file is not an error, so stay quiet for it.
        if details is not None and details.reason == speechsdk.CancellationReason.Error:
            print('Error details: {}'.format(details.error_details))

Once again, a simple if statement to the rescue. Re-running the script now returns just the transcribed text.

expected output
That’s better, we now just see the transcription.


In this, the second part of our application creation articles, we moved our application on from taking input only from the default microphone to accepting an audio file. We found that we had to rewrite our application to use a different speech recognition option, this time “start_continuous_recognition”. We also showed how to use an audio file as the input, run it through the Azure Speech to Text service and obtain a transcription; we then had to solve an unexpected output. In our next article we will merge the speech-to-text code that takes its input from a microphone with this script, to create a program that can choose its input based on a decision. We will also wrap it up in a “nice” graphical interface.

