
Moving up the Stack – time to learn Python – part 10

As we enter double figures on our journey into Python, a quick recap: we finished our last article by learning how to use PyInstaller to convert our nifty little graphical password generator into a standalone application that bundles all the files needed to run on any Windows machine. I thank you for continuing the journey.

To refresh your memory on our earlier articles, you can read them at the links shown below:

As said, our last article left you with a completed, compiled application. Today we look at another, more complicated application, this time something potentially useful: an audio-to-text converter. Obviously this will not be a fully featured audio-to-text converter; it is more of a proof of concept.

What exactly are we going to do and how is Python going to help us?

There are multiple ways to convert an audio file to a text file, but the vast majority of these are commercial products or are limited in the number of words or the length of audio they will process, usually to five minutes. With the rising number of online meetings, transcription can be slow if done manually, or expensive if done through an online service. Hence the quick and dirty method of rolling your own to quickly create minutes and actions, or to get the gist of what a meeting was about without the need for 100% accuracy. Couple this with the need to get more done in less time, and automating the task becomes a necessity rather than a luxury. And as we are in a cost of living crisis, just throwing money at the problem may not be an option.

What this script will do is utilize Azure, which has a Speech to Text conversion service. We will be using this as the core conversion functionality. To keep it simple to start with, we will create a simple speech-to-text conversion. Don’t worry, this resource will be used again later.


Before we start this process, there are a number of prerequisites that need to be in place. Quite obviously we need an Azure account. We also need a Speech resource and our keys. So before we start on Python, let’s carry out these processes. It is a working assumption that you already have an Azure subscription; if not, follow the process to create a free account here.

Creating your Speech resource

Microsoft AI Speech to text setup 1

For the purpose of this article, we are going to create a Speech to Text resource using the Free F0 tier. This will give you 60 minutes of conversion. If you need to move to the paid tier, the pricing implications can be found here. We could click Review and Create here and the resource would be created; however, curiosity killed the cat, so they say, so let’s click Next.

Microsoft Speech to text network set up

The first thing we can see is that we are asked about network connectivity. In a production environment we could set this to disabled and configure a private endpoint to protect it. However, this is a test environment, it is on the free tier, and it will be destroyed after this article, so I am leaving it at the default.

Microsoft Speech to text Identity setup

Again, we are going to leave these at their defaults, as shown above.

Microsoft Speech to text Tag setup

No tags, so let’s just click Review and Create.

Microsoft Speech to text Review before create

Click Create and the resource will deploy. On a successful deployment you will receive a response similar to the one below.

Microsoft Speech to text verification

Click on Go to resource and locate the “Keys and Endpoint” section to recover your keys and region. Store these for later use.

Microsoft Speech to Text Key recovery.

Note: If this were a production environment you would also create a key-cycling process that rotates keys to protect your service.

Setting up the environment

Now that we have set up the backend Azure environment, we need to configure our Python environment. As per our previous projects, we need to install a third-party module; this time it is the “azure-cognitiveservices-speech” module, which can be found here on pypi.org. Once again, install it using “pip”:

pip install azure-cognitiveservices-speech

A successful installation will result in a response similar to the following:

python pip install azure-cognitiveservices-speech

Next, if you are using a Windows environment, you will need to verify that you have the Microsoft Visual C++ Redistributable installed.

Setting environment variables

To use this Python script in anger you must be able to authenticate to Azure to access the AI service. As we know, it is not recommended to have access keys embedded directly in your scripts or code, so we will set a couple of environment variables to store your key and region. We will call these “sp_key” and “sp_region”, for Speech Recognition key and Speech Recognition region respectively.

To set these in Windows, enter the following commands using one of the two keys you saved earlier:

setx sp_key  ad7c7XXXXXXXXXX697f30e92159c58f5
setx sp_region uksouth

Successful completion will result in the following responses:

configure the necessary environment variables using setx

You will need to restart your terminal once you have issued these commands, because setx only makes the variables available to new sessions.
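Before moving on, it can be worth confirming that Python can actually see the two variables. The following is a minimal sketch and not part of the Microsoft sample; the helper name `load_speech_settings` and its error message are our own assumptions, and it uses only the standard library.

```python
import os


def load_speech_settings(env=os.environ):
    """Return (key, region) read from sp_key and sp_region, or raise if one is missing."""
    # Hypothetical helper for illustration; it is not part of the Azure Speech SDK.
    key = env.get("sp_key")
    region = env.get("sp_region")
    missing = [name for name, value in (("sp_key", key), ("sp_region", region)) if not value]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return key, region


if __name__ == "__main__":
    try:
        key, region = load_speech_settings()
        # Never print the full key; the last four characters are enough to identify it.
        print("Key ending in", key[-4:], "for region", region)
    except RuntimeError as err:
        print(err)
```

If the script reports a missing variable straight after you have run setx, the most likely cause is that the terminal was opened before the variables were set.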

Creating our Python script

Sometimes it is easier to build from scratch; sometimes it is easier to borrow. This is one of the times to borrow. The script below is taken from Microsoft Learn. We will be using this as our primer script and expanding it with new code. However, as normal, we will explain what each section does.

import os
import azure.cognitiveservices.speech as speechsdk

As with all Python scripts, we start by importing the external modules, starting with “os”. This module provides a portable way of using operating-system-dependent functionality within your script. The second line imports the “azure.cognitiveservices.speech” module as “speechsdk”. Next we have created a function called “recognize_from_microphone()”. This function is split into several sections, the first of which sets up the speech configuration class:

speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('sp_key'), region=os.environ.get('sp_region'))

“speechsdk.SpeechConfig(…)” is a class provided by the Azure Speech SDK (accessed through our “speechsdk” alias); it holds the settings used to connect to the Azure service. Within the brackets we pass in our two environment variables (sp_key and sp_region). Starting with “subscription=os.environ.get('sp_key')”, you will notice that we are now using our first imported module, “os”, to grab the value. The same process is used to read the “sp_region” environment variable. Our final entry sets the recognition language to “en-US”:

speech_config.speech_recognition_language="en-US"

The second section of the function configures where we obtain our input, setting the script to take its input from the system’s default microphone.

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

The next line creates an instance of the “SpeechRecognizer” class, which is again provided by the Azure Speech SDK through our imported “speechsdk” module. This “SpeechRecognizer” takes our previously created “speech_config” and “audio_config” objects and uses them to perform speech recognition.

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

This section is the meat and two veg of the function; prior to this we have been creating our inputs. Now we set the magic in motion. Our first line is a simple print to screen of “Speak into your microphone.” Nothing interesting here, but it does sort of focus the mind on the task at hand.

print("Speak into your microphone.")

The next line of code is where the actual speech recognition happens.

speech_recognition_result = speech_recognizer.recognize_once_async().get()

“speech_recognizer.recognize_once_async()”: this method is provided by the “SpeechRecognizer” object created in the previous section, and it starts the recognition operation. The “recognize_once_async” method recognizes the speech obtained from the configured audio source (in this case, our default microphone) just once (hence the “once” in the name). The “async” part of the name means that it runs asynchronously, so it can run in the background without blocking (or stopping) the rest of your program. The “.get()” is called on the result of “recognize_once_async()”. Because “recognize_once_async()” is an asynchronous operation, it returns a “future” object that represents the result of the operation, but the result isn’t immediately available. What the “.get()” call does is wait for the operation to finish and then return the actual result.
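If the idea of a “future” is new to you, the same pattern can be tried without Azure at all. The sketch below is only an analogy built on Python’s standard “concurrent.futures” module; “slow_recognition” is a made-up stand-in for the real recognition call, and the standard library names its waiting method “result()” where the Speech SDK uses “get()”.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_recognition():
    """Stand-in for a recognition call; it just waits briefly and returns some text."""
    time.sleep(0.2)
    return "hello world"


with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns straight away with a Future, much as recognize_once_async() does.
    future = executor.submit(slow_recognition)
    print("Work started; the program is free to do other things here.")
    # result() blocks until the work finishes, playing the role of .get() in the SDK.
    text = future.result()

print("Recognized:", text)
```

Calling “.get()” immediately, as our script does, means we wait for the answer on the spot; the asynchronous form only pays off once there is other work to do in between.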

Our final section handles the speech recognition results. The following lines form an if/elif decision tree based on the different possible outcomes of the microphone input. Our first branch runs if the speech was recognized and captured successfully:

if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(speech_recognition_result.text))

The condition “speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech” checks the reason for the result from the speech recognition. The reason attribute of “speech_recognition_result” tells us why we got this result. If it equals “speechsdk.ResultReason.RecognizedSpeech”, it means that the speech was successfully recognized. We then print the captured speech to the screen using “print("Recognized: {}".format(speech_recognition_result.text))”. The “{}” is a placeholder that gets replaced with “speech_recognition_result.text”, which is the text that was recognized from the speech and returned from the Azure service.
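To see the placeholder at work on its own, here is a small sketch that needs no Azure connection. The variable “recognized_text” is a made-up stand-in for “speech_recognition_result.text”; the f-string line shows an equivalent spelling available in Python 3.6 and later.

```python
recognized_text = "Hello from the microphone."  # stand-in for speech_recognition_result.text

# str.format() replaces each {} with the matching argument, in order.
with_format = "Recognized: {}".format(recognized_text)

# An f-string produces the same text with the value written inline.
with_fstring = f"Recognized: {recognized_text}"

print(with_format)
print(with_format == with_fstring)
```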

The next branch runs if no speech was recognised:

elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))

Just like the logic in the previous branch, the condition “speech_recognition_result.reason == speechsdk.ResultReason.NoMatch” checks the reason for the result from the speech recognition operation. The reason attribute of “speech_recognition_result” again informs us why we got this result, and if it equals “speechsdk.ResultReason.NoMatch”, it means that the Azure service could find no recognizable speech in the audio input. The next line, “print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))”, will print out a message saying “No speech could be recognized”, followed by the details of why no match was found; again the “{}” is a placeholder that gets replaced with the information provided by “speech_recognition_result.no_match_details”.

The final “elif” of this section will run if the recognition was cancelled:

elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = speech_recognition_result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
        print("Did you set the speech resource key and region values?")

Again we see similar logic in the line “speech_recognition_result.reason == speechsdk.ResultReason.Canceled”, which checks the reason for the result from the speech recognition operation. If the reason equals “speechsdk.ResultReason.Canceled”, it means that the speech recognition operation was cancelled. We then drop into a nested if statement that checks the reason for the cancellation in more detail. The line “cancellation_details = speech_recognition_result.cancellation_details” grabs the details about why the operation was cancelled; these are stored in the “cancellation_details” attribute of “speech_recognition_result”. The reason is then printed to screen by the line “print("Speech Recognition canceled: {}".format(cancellation_details.reason))”. As per the previous lines, the “{}” is a placeholder that gets replaced with “cancellation_details.reason”. The next line, “if cancellation_details.reason == speechsdk.CancellationReason.Error”, checks whether the cancellation was caused by an error. If it was, the error message is printed to standard out with the line “print("Error details: {}".format(cancellation_details.error_details))”. The final line, “print("Did you set the speech resource key and region values?")”, is effectively a catchall; it prints out a message asking if the user set the key and region values for the speech resource. This is a common reason for errors when using the Azure Speech SDK, so this message is a helpful hint for troubleshooting.
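The three-way branching can be tried out without a microphone or an Azure key. The sketch below is an illustration only: the “ResultReason” enum is a hypothetical stand-in that mirrors the three SDK values discussed above, and “describe” is our own helper name, not something the Speech SDK provides.

```python
from enum import Enum, auto


class ResultReason(Enum):
    """Hypothetical stand-in for the three speechsdk.ResultReason values used above."""
    RecognizedSpeech = auto()
    NoMatch = auto()
    Canceled = auto()


def describe(reason, text="", error_details=""):
    """Return the message the script would print for a given outcome."""
    if reason == ResultReason.RecognizedSpeech:
        return "Recognized: {}".format(text)
    elif reason == ResultReason.NoMatch:
        return "No speech could be recognized"
    elif reason == ResultReason.Canceled:
        message = "Speech Recognition canceled"
        if error_details:
            # Only cancellations caused by an error carry extra details.
            message += "; error details: {}".format(error_details)
        return message
    return "Unexpected result: {}".format(reason)


for outcome in (
    describe(ResultReason.RecognizedSpeech, text="testing one two"),
    describe(ResultReason.NoMatch),
    describe(ResultReason.Canceled, error_details="invalid key"),
):
    print(outcome)
```

Returning the message instead of printing it inside each branch is a design choice made here so the logic can be checked in isolation; the Microsoft sample prints directly.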

Our final line calls the function “recognize_from_microphone()”, which starts the whole process. Running this at the terminal will hopefully result in a response similar to the one shown below:

run the python script


This is the first part of a multi-article process in which we create an application that will take input from either the default microphone or a supplied audio file. Today we did a lot of the groundwork: we created our backend service on Azure, and borrowed the basis of our application along with its core conversion function. In our next article we will modify the script to accept input from an audio file.

