
A glimpse of Audio Content Identification

The amount of multimedia data being produced in the digital world is overwhelming and, with the increasing degree of interconnection between software and hardware components (read Internet of Things), we can expect it to skyrocket in the coming years. This Big Data has given rise to a series of challenging new problems, such as extracting useful information and retrieving specific content from raw multimedia data. The analysis of audio data and the automatic identification of audio content is one of the challenges that has kept researchers quite busy in recent years. Also, the widespread use of automation in basically every industrial sector requires more and more intelligent machines capable of automatically identifying content from raw audio/visual signals. In this article I will give a quick overview of Audio Content Identification, a highly interdisciplinary field that, along with speech recognition, will lead to the realization of intelligent “listening machines” capable of recognizing and understanding high-level content.

1. Introduction

Audio Content Identification/Recognition (ACI or ACR) commonly refers to the process of analyzing audio signals in order to recognize specific high-level content such as songs, movies, TV shows, commercials, or anything else that is identifiable by means of known information describing the content (metadata). Its most widespread use case is probably in Music Information Retrieval, where unknown audio signals are matched to known music recordings in a database in order to retrieve specific data connected with the matching recording. But it can also be used more generally to categorize content according to some label commonly understood by a human listener, such as classifying audio as music, speech, noise, etc.

ACI is a sub-field of Computer Audition (aka Machine Listening), which aims at emulating the way humans perceive and understand sounds by means of software models. As with Computer Vision, modelling the auditory human cognitive process poses big challenges due to the complexity of such processes, often coupled with our limited knowledge of them. While recognizing specific content that can be objectively defined is generally easier (such as specific songs), some other concepts that are more prone to subjectivity pose additional challenges to ACI. For example, analyzing audio content to match the mood or genre of music may be a very subjective activity, depending on what a person perceives as “melancholy”, “raging”, “rock”, “pop”, etc.

Another example is the recognition of different renditions of a song (covers). An artist’s reinterpretation can be so elaborate that the cover matches the original only in the lyrics (semantic matching) while being perceptually very different. While a human would probably recognize them as the same content (by matching the lyrics in the worst case), an ACI system as intelligent as a human would have to perform a very complex perceptual and semantic analysis in order to search for a match, involving techniques such as Auditory Scene Analysis, voice/speech recognition and NLP.

As can be seen, the recognition of audio content can be very arduous depending on what is being considered as matching features, and in fact (to date) this is still a very active research field with a lot of room for improvement and many issues yet to be addressed.

2. Sound similarity

ACI is a complex AI problem that may involve several disciplines, such as DSP, Machine Learning, Information Retrieval, NLP, Psychology and Neuroscience. Finding similarities in sounds is not defined by a closed-form rule but, as pointed out in the introduction, is more of a fuzzy operation. Nonetheless, there are some definitions that can be given in order to describe the problem more specifically. Here we’ll consider the case of perceptual matching regardless of high-level semantic content. Under this assumption, a definition commonly used in audio processing is that two sounds are similar if they are perceived in the same way by a human listener. The concept of “in the same way” is quite difficult to define formally. In the human brain this similarity function is some sort of pattern recognition operation where characteristics (features) of the sound match some learned models or, at a higher level, a semantic analysis where information is turned into a “concept” with a high degree of confidence.

The process of audio recognition starts in the ear, where sound waves from the external world are captured, transduced and turned into neural spikes sent to the brain, where they are converted into some internal symbolic representation. The scientific literature abounds with studies that try to find the features most likely to be used by the auditory system to build its internal symbolic models [1].

While many aspects of the higher levels of the auditory system are still unknown, the pre-processing stages of signal acquisition, transduction and transmission are very well documented and mostly understood. This is the reason why most approaches in audio recognition are based on mathematical models representing the early stages of the auditory system (particularly the inner ear), where sound is decomposed into its frequency components and converted into a pattern of some form of energy evolving in time, which roughly mimics the firing activity of the auditory neurons, as for example in [2] and other similar approaches.

The most used representation is the Discrete Fourier Transform (more specifically its fast version, the FFT), usually coupled with a filterbank of some sort, but other representations are possible, such as the cochleogram, chromagram, DWT, etc. Information extracted from these representations is then compared to previously built models and used for recognition. The main problem when performing audio recognition (but it’s true for other signals as well) is the choice of the feature space that best represents the perceptual characteristics of the sound. This is ideally a low-dimensional vector space of highly discriminative features with a good metric. Unfortunately there is no universal rule for choosing the features that best describe the information content of a signal, and it usually requires a thorough investigation based on a good understanding of the process, statistical analysis and even its psychological implications.
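To make the idea of a time-frequency representation a bit more concrete, here is a minimal sketch in Python (standard library only) of a short-time Fourier analysis followed by a crude filterbank. The function names, frame size, hop size, number of bands and the equal-width band layout are all assumptions made for brevity, and the naive DFT is only there to keep the example self-contained; a real system would use an FFT library and perceptually spaced bands.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative frequency bins of a naive O(n^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def band_energies(signal, frame_size=64, hop=32, n_bands=8):
    """Return one row of band energies per analysis frame."""
    window = hann(frame_size)
    n_bins = frame_size // 2 + 1
    # Equal-width bands (an assumption for simplicity); practical systems
    # often use logarithmic or perceptual (mel/Bark) spacing instead.
    edges = [round(b * n_bins / n_bands) for b in range(n_bands + 1)]
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_size], window)]
        mags = dft_magnitudes(frame)
        rows.append([sum(m * m for m in mags[lo:hi])
                     for lo, hi in zip(edges, edges[1:])])
    return rows
```

Each row of the result is a coarse spectral snapshot of one frame, and the sequence of rows is the kind of "energy pattern evolving in time" from which features are typically extracted.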

Several methods have been proposed to match the information extracted from an unknown audio sample to the reference models, from simple metrics (Euclidean distances) in vector space-based models to more complex tools such as statistical classifiers (HMM, GMM, SVM, etc.) and connectionist models (ANN, CNN, etc.). The efficacy of these tools for sound similarity depends on the application, but in general vector space-based models are simpler to implement and in many cases can lead to good performance when used to match specific instances of a class of objects (for example specific songs, speech, sounds). However, they don’t fully represent the matching process taking place in the human brain and their generalization power is quite limited. In the more general case where a class of objects has to be identified, statistical and connectionist models may be more appropriate to classify sounds, although their implementation is more complex as it requires training, testing and validation of the models.
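As an illustration of the simplest option mentioned above, the following sketch matches a query feature vector against a small set of reference vectors using the Euclidean distance. The function name `best_match`, the dictionary layout and the optional rejection threshold `max_distance` are assumptions for the sake of the example, not part of any specific system.

```python
import math


def best_match(query, references, max_distance=None):
    """Return (label, distance) of the closest reference vector.

    `references` maps a label to a feature vector with the same length as
    `query`. If `max_distance` is given and the closest reference is farther
    away than that, the label is None (no confident match).
    """
    label, distance = min(
        ((name, math.dist(query, vec)) for name, vec in references.items()),
        key=lambda pair: pair[1],
    )
    if max_distance is not None and distance > max_distance:
        return None, distance
    return label, distance


references = {
    "song_a": [0.1, 0.9, 0.3],
    "song_b": [0.8, 0.2, 0.5],
}
label, distance = best_match([0.15, 0.85, 0.3], references)
```

This kind of exhaustive comparison works well for recognizing specific instances, but it says nothing about classes of sounds, which is where trained classifiers come in.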

3. Audio Fingerprinting

Audio Fingerprinting is a widely used technique in audio content identification and has gained a lot of commercial interest in recent years due to the increasing number of applications that benefit from this technology. The concept of a fingerprint has been around for centuries as a means to uniquely identify people. Generally speaking, a fingerprint is a piece of information that identifies an entity by means of its unique features, with the purpose of recognizing the original entity from unknown data. In the context of ACI the aim is to find a suitable representation of the auditory scene (Rs) and apply some kind of transformation F(Rs) to obtain a fingerprint.

There is a plethora of methods in the literature that use different Rs and F in order to obtain a compact and robust fingerprint model. The representation Rs is usually in a frequency domain and F(Rs) is a model in a vector space with a much lower dimensionality than that of the original audio signal, but it may also be a graph or some other mathematical structure. Common models are based on spectral features extracted from the DFT of the audio, such as spectral flatness, spectral centroid, MFCCs, energy gradients, etc. Some other approaches rely on high-level features that are commonly understood by human listeners, such as harmonicity, melodicity, rhythm, mood, genre. For example, one of the methods using high-level features in music identification is based on the chord progression, which can be obtained by analyzing the harmonic content using a chromagram as a representation of the auditory scene. There are also hybrid methods that use a combination of both low- and high-level features, and that is probably the closest model to the inner workings of the auditory system, which uses both kinds of features along the pathways to the brain.
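Two of the low-level spectral features mentioned above are easy to sketch. The code below computes the spectral centroid and the spectral flatness from the magnitude spectrum of one frame, using their usual textbook definitions (magnitude-weighted mean frequency, and ratio of geometric to arithmetic mean of the power spectrum). The function names and the small `eps` constant used to avoid taking the logarithm of zero are assumptions of this example.

```python
import math


def spectral_centroid(mags, sample_rate, n_fft):
    """Magnitude-weighted mean frequency (Hz) of one frame's spectrum.

    `mags` holds the magnitudes of bins 0..n_fft/2 of an n_fft-point DFT.
    """
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n_fft * m for k, m in enumerate(mags)) / total


def spectral_flatness(mags, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Close to 1 for noise-like (flat) spectra, close to 0 for tonal ones.
    """
    power = [m * m + eps for m in mags]
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic
```

Computed frame by frame, values like these form the feature vectors from which a fingerprint F(Rs) can then be derived.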

Regardless of the features used, the fingerprint models found in the literature can be broadly categorized into two classes: Summary fingerprint and Streamed fingerprint.

Summary fingerprint

This is probably the simplest model and is composed of one single vector that summarizes the whole audio scene. In this model the audio is analyzed for perceptually relevant features and a sequence of feature vectors is computed. From this sequence of vectors a multidimensional fingerprint vector is built, where each component represents the summary of a particular feature along the whole audio scene (e.g. means and variances of a k-filter-bank output, average BPM, average ZCR, harmonic key, etc.). The resulting fingerprint vector thus represents a summary of the entire audio scene. The main advantages of this model are its simplicity of implementation, low storage requirements and computational efficiency. On the downside, it is not robust to distortions, as these are integrated into the summary and can heavily affect the global fingerprint; it is not time-translation invariant, so it can’t identify cropped audio; and it cannot be used for real-time applications. This model is mainly suitable for file identification, which aims at detecting an audio recording by analyzing the content of an entire audio file. It was used in the first commercial audio fingerprinting services (MusicIP/MusicDNS used a similar model).
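A summary fingerprint of the kind described above could be sketched as follows. The input is assumed to be a sequence of per-frame feature vectors (for example filter-bank outputs), and the choice of mean and variance as the only summary statistics is an assumption of this example; it is not the model used by any particular service.

```python
from statistics import fmean, pvariance


def summary_fingerprint(feature_frames):
    """Collapse per-frame feature vectors into one fixed-length vector.

    The result contains the mean of every feature over the whole audio
    scene, followed by the (population) variance of every feature.
    """
    if not feature_frames:
        raise ValueError("need at least one feature frame")
    columns = list(zip(*feature_frames))  # one tuple per feature
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means + variances


frames = [[1.0, 10.0], [3.0, 10.0]]
fingerprint = summary_fingerprint(frames)
```

Because every frame contributes to every component, dropping or distorting part of the audio shifts the whole vector, which is exactly the weakness with cropped or degraded audio noted above.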

Streamed fingerprint

In this model the audio scene is divided into blocks and for each block a set of sub-fingerprints (or local fingerprints) is computed using a specific fingerprint model algorithm. The sequence of these sub-fingerprints represents the fingerprint of the whole audio that’s stored in the database. The definition of a “block” is implementation-specific. Some approaches use fixed-size blocks, while others model the audio stream as a sequence of events (musical events in the case of songs) so that a block coincides with an event. These events are usually not directly defined and known a priori, but are captured and modeled using statistical tools and other machine learning approaches, as for example in [3].

The streamed fingerprint approach has several advantages over the summary fingerprint. It is more robust to noise and it is time-translation invariant, which allows the identification of incomplete audio scenes and makes it suitable for real-time applications. There is of course a downside: it requires far more storage space and may quickly give rise to scalability issues in large-scale applications.
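The sketch below illustrates the fixed-size block variant. It is loosely inspired by the energy-difference idea in [2] but heavily simplified, so it should be read as an illustration and not as that system's actual algorithm. The input `band_rows` is assumed to be a list of per-frame band energies, and the function names and bit layout are hypothetical.

```python
def sub_fingerprints(band_rows):
    """Derive one small integer sub-fingerprint per pair of adjacent frames.

    A bit is set when the energy difference between two neighbouring bands
    increased with respect to the previous frame.
    """
    fps = []
    for prev, cur in zip(band_rows, band_rows[1:]):
        bits = 0
        for b in range(len(cur) - 1):
            diff = (cur[b] - cur[b + 1]) - (prev[b] - prev[b + 1])
            bits = (bits << 1) | (diff > 0)
        fps.append(bits)
    return fps


def best_offset(query, reference):
    """Slide the query over the reference sub-fingerprint sequence.

    Returns (offset, bit_errors) for the alignment with the fewest differing
    bits, or None if the query is longer than the reference.
    """
    best = None
    for offset in range(len(reference) - len(query) + 1):
        errors = sum(bin(q ^ r).count("1")
                     for q, r in zip(query, reference[offset:]))
        if best is None or errors < best[1]:
            best = (offset, errors)
    return best
```

Since matching is done on short runs of sub-fingerprints, a cropped excerpt can still be located inside the full recording, and a few flipped bits caused by noise only raise the error count instead of changing the whole fingerprint. The exhaustive sliding search is only viable for small collections.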

4. Computer Vision and Information Retrieval in Audio Content Identification

It is interesting to see how many computer vision techniques have been successfully applied to audio identification problems. Notably, the works of Ke [4] and Baluja [5] have shown the efficacy of popular computer vision approaches used for visual recognition in solving music identification problems. This idea is motivated by the fact that basically all audio processing problems involve some kind of 2D representation of the original one-dimensional audio signal, which can be treated as an image. The STFT, cochleogram, chromagram, etc. are typical examples of such representations, and the idea of transferring techniques from the 2D world of computer vision to the audio domain to solve very similar problems makes sense. In our audio identification engine Audioneex [6] we’ve also successfully used machine vision models together with machine learning methods, which confirms the validity of this approach.

Implementing an audio fingerprint model is, however, only part of the story. A typical audio identification system has the architecture shown in Figure 1. There are many good methods to design robust audio fingerprint models based on perceptual features that are highly discriminative and give excellent accuracy, but that is not enough to make the whole audio identification system efficient. A system using a fingerprint model that is highly accurate but so computationally expensive that it cannot be used for real-time applications is seriously hindered, especially in a fast-paced world where real-time is the norm. The chosen model must also take into account the algorithm used to search the fingerprint space and find the best match to the unknown audio query.

Figure 1

While a full search over the fingerprint space coupled with a suitable similarity function may be reasonable for limited data sets, this is usually not the case in real-world applications, especially in the modern digital era where huge amounts of multimedia data are being produced every day.

For large data sets, a fast and efficient search algorithm is paramount to keep response times within acceptable limits, and this is where techniques from the Information Retrieval (IR) field come in very handy. There is a vast literature on methods to implement efficient strategies for document retrieval in large data sets that can be used in the audio content identification domain. Some of the most popular use word-based or vector-based models coupled with N-gram Decomposition, Locality Sensitive Hashing (LSH) or other nearest neighbor techniques aimed at reducing the dimensionality of the search space. Any of these methods can be used by adapting the objects to be retrieved (usually represented by some kind of descriptor) to the paradigm used by the specific IR technique, that is, by mapping typical IR concepts such as term, vocabulary, word, n-gram, etc. to something equivalent in the audio model being used.
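One possible mapping of these IR concepts to audio is sketched below: sub-fingerprints play the role of "words", runs of consecutive sub-fingerprints are the n-grams, and an inverted index maps each n-gram to the recordings and positions where it occurs. The class name `FingerprintIndex`, the n-gram length and the voting scheme are assumptions of this example. It relies on exact n-gram matches, so a noise-tolerant system would need something more forgiving, such as LSH.

```python
from collections import Counter, defaultdict


def ngrams(seq, n):
    """All runs of n consecutive items of seq, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


class FingerprintIndex:
    """Inverted index from sub-fingerprint n-grams to (track, position)."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(list)

    def add(self, track_id, sub_fps):
        for pos, gram in enumerate(ngrams(sub_fps, self.n)):
            self.postings[gram].append((track_id, pos))

    def query(self, sub_fps):
        """Return (track_id, offset, votes) of the best candidate, or None.

        Every matching n-gram votes for a (track, time offset) pair, so
        matches that are consistent in time reinforce each other.
        """
        votes = Counter()
        for qpos, gram in enumerate(ngrams(sub_fps, self.n)):
            for track_id, pos in self.postings.get(gram, ()):
                votes[(track_id, pos - qpos)] += 1
        if not votes:
            return None
        (track_id, offset), count = votes.most_common(1)[0]
        return track_id, offset, count


index = FingerprintIndex(n=3)
index.add("a", [5, 9, 12, 3, 7, 14, 1, 8])
index.add("b", [2, 4, 6, 8, 10, 12, 14, 0])
result = index.query([3, 7, 14, 1])
```

The lookup cost now depends on the number of n-grams in the query and the length of their posting lists, not on the total size of the fingerprint database, which is the whole point of borrowing from IR.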

5. Use cases

When the first audio identification systems came out there was quite a bit of skepticism in the industry about their practical usefulness in real-world applications other than music recognition (for which almost all of them had been developed). But, as is usual in technology, progress eventually creates new services and products that greatly expand the field of application of an existing technology, so now there are many examples of real-world uses of ACI systems in several fields, some of which are described below.

  • Automatic content discovery – Finding information about unknown media contents is probably the most common use case. Initially aimed at music, nowadays ACI is used to discover new content from any media to which an audio track is attached (movies, commercials, TV/Radio programs, etc.).
  • Multimedia content management – For restructuring and organizing multimedia libraries ACI can be used to get complete metadata for audio/visual contents that are untagged or incorrectly tagged.
  • Piracy detection – ACI technology is widely used to detect illegal content broadcast or shared over private (campus, workplace, etc.) and public (internet) networks.
  • Copyright management – Monitoring broadcast content (radio, TV, Web) to identify copyrighted material and manage royalty collection is a common use case of ACI.
  • Audio triggers – Applications can use ACI technology to detect audio contents and synchronize specific audio/video events to in-app functionalities (opening another app, sending a message, controlling another device, etc.).
  • Multi screen – Smart devices can use ACI technology to identify multimedia contents and present the user multiple screens with related content (for example while watching a movie trailer an app can identify it and open a screen listing nearby cinemas where it is showing, a social media fan page, DVD stores, etc.).
  • Audience Metering – Marketing specialists can use ACI to get insights into audience habits (what people are listening to or watching) to develop more efficient marketing strategies.
  • Safety & Law enforcement – Typical use cases may be in audio surveillance, where ambient audio is monitored in an environment and actions are taken when specific audio is detected (for example sending an alarm signal, starting a procedure, etc.), and sensitive content identification (for example by law enforcement authorities to identify leaked classified audio material).

6. Conclusions

Audio Content Identification has surely come a long way from its early stages, finding applications in more than just entertainment. It is an interdisciplinary field that, despite probably not getting as much buzz as its visual counterpart, has seen significant improvement and has found its place in our cloud-based computing world, where automated analysis of massive amounts of data is becoming the norm. The progress in robotics and automation in general will probably expand its field of application even more, and it will be a fundamental component in the creation of artificial intelligence capable of understanding anything it hears.



[1] D. Mitrović, M. Zeppelzauer, C. Breiteneder. “Features for Content-Based Audio Retrieval”

[2] J. Haitsma and T. Kalker. “A highly robust audio fingerprinting system”

[3] P. . “Robust Sound Modeling for Song Detection in Broadcast Audio”

[4] Y. Ke, D. Hoiem, R. Sukthankar. “Computer Vision for Music Identification”

[5] S. Baluja, M. Covell. “Content Fingerprinting Using Wavelets”

[6] Audioneex Audio Content Identification Engine
