Introducing the HiPSTAS Audio Toolkit Workflow: Audio Labeling

By Tanya Clement and Steve McLaughlin

Audio preservation and access present a significant resource management issue for libraries and archives. Digitizing sound is a task that can be partly automated, but describing a recording so its contents are discoverable requires a much more labor-intensive workflow. Imagine you have found an unlabeled cassette. You put it into your boom box, you press play, you turn on your auto-magic digitizer (i.e., your phone’s recorder) because you’d rather have an MP3, and then you get up and leave the room. The machine still plays the music and your phone still records it, but the sound is too indistinct for Siri or Shazam to recognize it. You end up with 90 minutes of audio and no idea what’s in it. Is it your grade school piano recital, a bootleg of that U2 concert you went to in 1988, or, quite possibly, the blank hum of white noise? You’d have to listen to every minute of it to find out. Similarly, librarians and archivists lack tools that can automatically generate descriptive metadata for unheard audio files, especially recordings that include background noise, non-speech sounds, or poorly documented languages, all of which lie beyond the reach of sound identification and automatic transcription software.

So why not use machine learning tools?

This is the first blog post in a series that will present some DIY techniques for using freely available machine learning algorithms to help label these “unheard” recordings. The HiPSTAS Audio Tagging Toolkit, a Python package designed by Steve McLaughlin, is the result of many years’ work, initially funded by the NEH (in 2012 and 2013) and currently supported by a multi-year IMLS grant in collaboration with the Pop Up Archive, the WGBH Educational Foundation, and the American Archive of Public Broadcasting. Along with the interactive Audio Labeler application, the Audio ML Lab virtual environment, and the Speaker Identification for Archives (SIDA) pipeline, we present a set of resources that make it possible for anyone with a laptop to start automatically generating metadata for mixed-sound audio collections.

The Audio Tagging Toolkit builds on several open source audio processing tools, including FFmpeg, Librosa, and aubio, to support a workflow for training and applying audio machine learning classifiers. The toolkit is designed to be accessible for programming novices, offering several readable, modifiable modules that expedite common tasks in an audio annotation workflow.

Upcoming posts will share information about the workflow and examples for speaker identification as part of the HiPSTAS project with WGBH and the Pop Up Archive, but this workflow can be adapted for other sounds as well. The workflow includes the following steps, which are explained more thoroughly below and in future blog posts:

  1. Select a set of audio files (MP3 or WAV) as a training corpus, typically several dozen or several hundred episodes of a radio or TV show. Your primary speaker(s) of interest should appear fairly often in these recordings.
  2. Launch the Audio Labeler application and apply labels to a series of randomly selected 1-second clips. For identifying a speaker of interest, 500–1000 labels is a good target for a training set.
  3. The next three steps take place in the SIDA preprocessing template:
    1. First, extract individual WAV files for each 1-second clip you just labeled.
    2. Next, extract vowel segments from these WAV clips using Audio Tagging Toolkit, creating a large collection of very short audio files. (On systems with limited processor speed and/or memory, this can help save time in the long run. If a batch process is interrupted, for instance, you can start again where you left off.)
    3. Finally, extract training features (MFCCs + deltas + delta-deltas) for each vowel clip and write each clip’s features to a separate CSV file (see the feature-extraction sketch after this list).
  4. Switch to the SIDA train-and-classify template and load your saved features from CSV for each class you plan to use for training. (Reading data from CSVs is much faster than extracting features directly from audio, making trial-and-error experimentation easier.)
  5. Download pre-extracted features from the AAPB Universal Background Model dataset and add them to your non-speaker-of-interest training data. With several thousand speakers chosen at random from the AAPB’s collection of public radio and TV broadcasts, the AAPB-UBM will help you make your speaker classifier more robust.
  6. Train a classifier model using scikit-learn and save the model to disk as a Python pickle (“.pkl”) file. A simple multi-layer perceptron classifier (i.e., a shallow neural network) is a good starting point: scikit-learn’s MLPClassifier is versatile, fast, and easy to train (see the training sketch after this list).
  7. Load your saved model from the pickle file. (If you want to use the same classifier in the future, you can start from this step.)
  8. Run the classifier on a new audio file by breaking it into equal-sized windows (somewhere in the range of 1- to 5-second resolution), then averaging the model’s output across each window. For better accuracy, detect vowel segments in each window first, then discard classification values for the non-vowel portions of the audio.
  9. Write classifier output for each unseen audio file to a CSV. Depending on your needs, you may want to apply a rolling average to these values first and/or choose a cutoff threshold to convert decimal values to binary classes. Audio Tagging Toolkit includes several handy functions for these cleanup tasks (see the post-processing sketch after this list).
  10. To view your results, open an audio file in Sonic Visualiser, then load its CSV-formatted classification values as a region layer, which will display your data as colored bars overlaid on the audio’s waveform. If you wish, you can correct and adjust your machine labels by hand or simply use them as a guide for a new annotation layer.
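
To make step 3.3 more concrete, a minimal feature-extraction sketch using Librosa might look something like the following. The directory names and the one-CSV-per-clip layout are illustrative placeholders, not the Audio Tagging Toolkit’s or SIDA’s exact conventions.

```python
# Minimal sketch: extract MFCCs + deltas + delta-deltas for each short WAV clip
# and write one CSV of per-frame features per clip. Paths are placeholders.
import os
import csv

import librosa
import numpy as np

clip_dir = "labeled_clips"     # hypothetical directory of short WAV clips
feature_dir = "features_csv"   # output: one CSV per clip
os.makedirs(feature_dir, exist_ok=True)

for filename in os.listdir(clip_dir):
    if not filename.lower().endswith(".wav"):
        continue
    y, sr = librosa.load(os.path.join(clip_dir, filename), sr=None)

    # 13 MFCCs per frame, plus first- and second-order deltas (39 values total).
    # Very short vowel clips may need a smaller delta width, e.g. width=3.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.vstack([mfcc, delta, delta2]).T  # one row per frame

    out_path = os.path.join(feature_dir, filename[:-4] + ".csv")
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(features.tolist())
```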
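
Steps 4 through 7 might be sketched along these lines with scikit-learn. The CSV file names are hypothetical stand-ins for features you have already saved (including any AAPB-UBM rows added to the non-speaker class), and the network size is just a starting point.

```python
# Minimal sketch: train a shallow MLP on speaker vs. non-speaker feature rows,
# save it as a pickle, and reload it later. File names are placeholders,
# and the CSVs are assumed to contain plain numeric rows with no header.
import pickle

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical feature tables: one 39-value row (MFCC + deltas + delta-deltas) per frame
speaker = np.loadtxt("speaker_of_interest_features.csv", delimiter=",")
other = np.loadtxt("other_speakers_and_sounds_features.csv", delimiter=",")

X = np.vstack([speaker, other])
y = np.concatenate([np.ones(len(speaker)),   # class 1: speaker of interest
                    np.zeros(len(other))])   # class 0: everything else

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)  # shallow neural network
clf.fit(X, y)

with open("speaker_model.pkl", "wb") as f:   # step 6: save to disk
    pickle.dump(clf, f)

with open("speaker_model.pkl", "rb") as f:   # step 7: reload for later use
    clf = pickle.load(f)
```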
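
Finally, a post-processing sketch for steps 8 and 9, minus the vowel-detection refinement: break an unseen file into fixed windows, average the model’s per-frame probabilities within each window, smooth with a rolling mean, apply a cutoff, and write the result to CSV. The two-second window, three-window rolling mean, and 0.5 threshold are arbitrary starting values, and the file names are placeholders.

```python
# Minimal sketch: classify an unseen recording in fixed-size windows and write
# smoothed, thresholded results to CSV.
import pickle

import librosa
import numpy as np
import pandas as pd

def frame_features(samples, sr):
    """MFCCs + deltas + delta-deltas for one window, one row per frame."""
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)
    return np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)]).T

with open("speaker_model.pkl", "rb") as f:
    clf = pickle.load(f)

audio, sr = librosa.load("unseen_episode.wav", sr=None)  # hypothetical input
window = int(2.0 * sr)  # 2-second windows

rows = []
for start in range(0, len(audio) - window, window):
    probs = clf.predict_proba(frame_features(audio[start:start + window], sr))[:, 1]
    rows.append({"start_sec": start / sr, "score": probs.mean()})

df = pd.DataFrame(rows)
df["smoothed"] = df["score"].rolling(window=3, center=True, min_periods=1).mean()
df["is_speaker"] = (df["smoothed"] > 0.5).astype(int)  # cutoff threshold
df.to_csv("unseen_episode_classified.csv", index=False)
```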

A more thorough description of those first steps in the workflow, audio labeling, is the focus of the remainder of this post.

Labeling Audio for Machine Learning

The first steps in our audio machine learning workflow include assembling a set of labeled training data, i.e., examples of the categories we want our machine learning algorithm to recognize. For instance, we might train a binary classifier with labels for “music” vs. “speech,” or “Speaker A” vs. “Not Speaker A.” This technique, in which we begin with labels applied by humans, is called supervised learning, and the data we use to create a new classifier model is known as the training set. For examples, see previous machine learning applications in the HiPSTAS project, which have included finding applause in poetry performances and identifying instances of changing genres (speech, song, and instrumental) in field recordings. 

The goal of the project we discuss here is to train a model that identifies a single speaker’s voice. One approach would be to collect a handful of recordings that contain the speaker, then mark every point where that speaker is heard. Because listening is time-consuming, an individual labeler may only be able to get through five or ten hours of material for a single speaker. Unfortunately, models trained this way often perform poorly: in most cases they end up overfit to the training set, meaning they’re too narrowly tailored to the five or ten hours of audio they’re drawn from and cannot identify the same voice in other recording settings, or as the speaker ages and their voice changes. To create more robust classifiers, we need a broader, more varied set of examples.

The workflow we introduce here facilitates building a training set by labeling audio clips selected at random from a corpus of several dozen or several hundred recordings. As we developed our current workflow, we focused on creating models that would identify two speakers in particular: Terry Gross, host of NPR’s Fresh Air, and Marco Werman, host of the news show The World, co-produced by WGBH and the BBC World Service. We chose Fresh Air because it posed few technical challenges: Nearly every guest is recorded in a professional studio, so the audio quality is consistent and free of background noise. And NPR.org hosts thousands of downloadable Fresh Air episodes spanning the past three decades (although before the mid-2000s, only selected programs are available). The World was appealing precisely because it seemed challenging: It includes speakers from many countries, with many accents, recorded under wildly heterogeneous conditions. A given program may contain several phone interviews, stories filed by far-flung correspondents using a range of recording gear, and the varied background sounds of noisy street environments. We used 20 episodes of Fresh Air and 100 from The World, all chosen at random to reflect changes in the hosts’ voices over time.

To expedite the random labeling process, McLaughlin created the Audio Labeler application, which provides a browser-based interface for labeling one-second clips chosen at random from a set of audio files provided by the user, displaying a waveform as a visual aid and offering the following four seconds of audio as context to help settle ambiguities. We chose one second as the default label duration because speech segments shorter than a second can be difficult to identify on the first listen, while longer clips are more likely to include multiple speakers and multiple sounds, making them useless for training classifiers to find unique voices or sounds.

As a rough rule of thumb, we find we need at least 400–500 one-second labels for a given speaker of interest, which for our example required labeling 4,000 clips each from Fresh Air and The World. To employ a binary classifier (Is it this or that?) or a multi-class classifier (Is it this or that or this other thing?) to identify sounds of interest, you also need counterexamples to the voice you are seeking to find. In other words, if the machine is deciding between this and that, it’s necessary to provide examples of that. In labeling clips to train a classifier to identify Terry Gross across episodes of Fresh Air, for example, we labeled clips of Terry Gross from different episodes but also clips marked “Music,” “Background speaker,” and “Silence”: https://github.com/hipstas/podcast-speaker-labels/blob/master/Fresh_Air/Terry_Gross_labels_randomized.csv. In labeling clips to train a classifier to identify Marco Werman across episodes of The World, we labeled snippets of Marco Werman but also of other speakers, whom this time we marked “Male,” “Female,” “Carol Hills,” and “Multiple Speakers” (instead of simply “Background Speaker”), along with other sounds such as “Music” and “Silence”: https://github.com/hipstas/aapb-speaker-labels/blob/master/speaker_labels_randomized/The_World_WGBH_labels_100_episodes.csv. Once we’ve trained a classifier, the model will examine a previously unseen segment of audio and assign it to one of these labels or “bins.” A decision tree shows this part of the process.
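
For a quick sense of how labels like those linked above are distributed across classes, a few lines of pandas will do. We are assuming here that the downloaded CSV has a column holding the label text named "label"; the actual column names in the repository files may differ.

```python
# Minimal sketch: load a label CSV and count clips per class.
# The local file path and the "label" column name are assumptions.
import pandas as pd

labels = pd.read_csv("Terry_Gross_labels_randomized.csv")
print(labels.head())                    # inspect the columns actually present
print(labels["label"].value_counts())   # e.g., "Terry Gross", "Music", "Silence", ...
```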

Finally, state-of-the-art speaker identification systems include thousands of individual speakers in their training sets, with those data known collectively as a “universal background model” (UBM). By contrast, the labels we generated above yield relatively few speakers in our “Background speaker” set. A typical episode of Fresh Air, for instance, might include four or five speakers in addition to Terry Gross. With 20 episodes in our training corpus, that comes to no more than 100 speakers for our model to consider: a good start, but far from representing the vast range of “Not Terry Gross” speakers who have appeared on other Fresh Air episodes over the years and who would likely turn up in the “unheard” portions of Fresh Air where we apply the model to find Terry Gross. To help address the problem of too few examples of “other” voices in training sets for identifying speakers in the American Archive of Public Broadcasting, we selected a 4000-hour subset of the AAPB’s media collection (6547 audio and video files in all), then extracted two ten-second clips at random from each file. McLaughlin, along with Ryan Blake, a Master’s student at UT Austin’s School of Information, then used the Audio Labeler application to apply 1-second labels within this cross-section of the AAPB. As a result, the AAPB-UBM currently contains 3700 usable 1-second speech clips, each labeled by apparent gender. In addition to the raw WAV audio, we have also posted a set of extracted features (MFCCs + deltas + delta-deltas) from vowel segments in these clips, which can be imported directly into the Speaker Identification for Archives (SIDA) pipeline. We are sharing this AAPB-UBM dataset so that it can be combined with users’ own background speaker labels to create more robust classifiers than would otherwise be possible with limited time and training data.
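
The random clip sampling described above can be approximated with FFmpeg and a short script like the one below. The paths, output naming, and ffprobe duration check are assumptions for illustration, not the actual scripts used to build the AAPB-UBM.

```python
# Minimal sketch: pull two ten-second excerpts at random start times from each
# media file in a directory, using ffprobe/ffmpeg. Paths are placeholders.
import os
import random
import subprocess

def duration_seconds(path):
    """Ask ffprobe for a file's duration in seconds."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", path,
    ])
    return float(out.decode().strip())

src_dir = "aapb_media"   # hypothetical directory of audio/video files
out_dir = "aapb_clips"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(src_dir):
    src = os.path.join(src_dir, name)
    total = duration_seconds(src)
    for i in range(2):  # two random 10-second clips per file
        start = random.uniform(0, max(total - 10, 0))
        dest = os.path.join(out_dir, f"{os.path.splitext(name)[0]}_{i}.wav")
        subprocess.run([
            "ffmpeg", "-y", "-ss", f"{start:.2f}", "-i", src,
            "-t", "10", "-ac", "1", dest,   # 10-second mono WAV excerpt
        ], check=True)
```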

In our next post we will introduce the Speaker Identification for Archives (SIDA) pipeline, including a Jupyter notebook that walks through the rest of our speaker identification workflow: feature extraction, model training, and classification.

Machine-labeled segments of Marco Werman’s speech on The World.