
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Musically-Informed Adaptive Reverberation

João Paulo Caetano Pereira Carvalheira Neves

Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Supervisor: Professor Rui Luis Nogueira Penha
Co-Supervisor: Dr. Gilberto Bernardes de Almeida


Resumo

Nesta dissertação será apresentado um novo efeito de reverberação digital capaz de adaptar a sua saída ao contexto harmónico de uma performance musical ao vivo (ou seja, em tempo real). A reverberação proposta está ciente do conteúdo musical do sinal de entrada e filtra a saída de acordo com esta informação. Por outras palavras, o sinal que resulta da reverberação é “adaptado” para coadunar com o conteúdo musical da performance. Ao contrário das reverberações tradicionais (não adaptativas), que simulam o fenómeno físico da reflexão de ondas sonoras num espaço fechado [1] – tratando igualmente o conteúdo harmónico de entrada – a nossa implementação abrange um comportamento dinâmico, no contexto dos desenvolvimentos recentes em Adaptive Digital Audio Effects (A-DAFx) [2], e assim evita a nuvem sonora típica de um efeito de reverberação tradicional [3].

A reverberação adaptativa proposta foi implementada como um efeito para performance de guitarra em tempo real. Foi criado um pedal que age como controlador do software desenvolvido. Este pedal controla o nível de harmonicidade da saída resultante. Quando o pedal está totalmente levantado, apenas as notas musicais que têm uma relação mais próxima com o conteúdo musical do sinal de entrada serão permitidas no sinal de saída do algoritmo. Quando o pedal se encontra totalmente pressionado, o algoritmo comporta-se como uma reverberação tradicional, deixando todas as notas passar para o sinal de saída. Entre estes dois extremos, é oferecido ao artista um amplo espectro de possibilidades criativas. Ao fornecer a músicos e compositores um efeito de reverberação de áudio que interpreta o conteúdo musical do sinal de entrada e ajusta o seu funcionamento conforme a informação que é obtida, procuramos fomentar novas experiências criativas.

Para o desenvolvimento deste sistema foi adoptado o Tonal Interval Space [4], um espaço tonal onde as notas musicais estão organizadas de acordo com a sua relação percetual na música ocidental. Este espaço tonal permite a representação do conteúdo musical de um sinal de áudio. A partir de um sinal de áudio de entrada são calculados os chroma vectors, nos quais a energia do conteúdo harmónico do sinal é comprimida numa única oitava. Posteriormente, calculamos os Tonal Interval Vectors (TIVs), que resultam de uma transformada discreta de Fourier dos chroma vectors. Os coeficientes da transformada de Fourier são multiplicados por uma função que atribui um peso a cada um destes coeficientes de modo a que os TIVs estejam de acordo com valores empíricos de consonância [5]. Ao distorcer o espaço dos chroma vectors recorrendo a esta transformada de Fourier ponderada, fazemos com que o Tonal Interval Space seja uma representação percetualmente relevante. No espaço de 12 dimensões resultante, as distâncias euclidianas e angulares entre TIVs indicam níveis de proximidade percetual. Por outras palavras, dois TIVs que distarem pouco neste espaço tonal irão também estar percetualmente próximos; por exemplo, neste espaço a nota Dó está mais próxima de Sol do que de Dó sustenido, apesar de as duas primeiras notas estarem separadas por 7 semitons e de Dó e Dó sustenido estarem separados por apenas 1 semitom.

Usa-se o Tonal Interval Space como uma base para desenvolver um algoritmo que identifica eficientemente, em intervalos discretos de tempo, o "contexto" harmónico de uma performance contínua. Depois de representar os TIVs da entrada de áudio, suavizamos a trajetória harmónica do sinal de entrada fazendo a média dos últimos Nadapt TIVs, para evitar mudanças súbitas no comportamento do filtro. De seguida, calculamos a distância das 12 notas da escala cromática ao TIV que resulta da média dos últimos Nadapt TIVs. O número de TIVs que são usados para calcular o TIV médio, Nadapt, é um parâmetro do sistema controlado pelo utilizador. As distâncias resultantes representam a distância percetual entre o conteúdo musical das últimas Nadapt análises e as 12 notas da escala cromática. O nível do pedal é então utilizado para definir quantas notas musicais irão estar presentes no sinal de saída.


Abstract

We present a novel digital reverberation effect capable of adapting its output to the harmonic context of a live (i.e., real-time) music performance. The proposed reverberation effect is aware of the harmonic content of an audio input signal, and filters the resulting output accordingly. In other words, the signal outputted by the reverberation is “tuned” to the harmonic content of the performance. In contrast to traditional (non-adaptive) reverberation effects, which emulate the physical phenomena of sound waves reflecting on enclosed space surfaces [1] — treating all input harmonic content equally — our implementation encompasses a dynamic behaviour, in the context of the recent research line of Adaptive Digital Audio Effects (A-DAFx) [2], and thus avoids the sonic clutter typical of a traditional reverberation effect [3].

The proposed adaptive reverberation is implemented as a live guitar effect, with a hardware interface pedal, which acts as a control of the harmonicity level of the resulting output. When the pedal is fully released, only a few harmonic partials related to the input signal are allowed in the output signal; when it is fully pressed, the effect acts as a traditional reverberation. Within these two extremes, a wide spectrum of creative possibilities is offered to the user. In providing musicians and composers with a content-aware audio reverberation effect whose controlled output blends with the harmonicity of an ongoing performance, we aim to foster new creative experiences.

We adopt the perceptually-inspired Tonal Interval Space [4] as an audio-based representation of the harmonic content of an ongoing performance. From an input audio signal, we compute chroma vectors, in which the energy of the signal harmonic content is collapsed into a single octave. Afterwards, we compute Tonal Interval Vectors (TIVs), as the discrete Fourier transform (DFT) of the chroma vectors, whose coefficients are further weighted by empirical consonance ratings [5]. By distorting the chroma space using a weighted DFT, we make the Tonal Interval Space a perceptually relevant representation. In the resulting 12-dimensional space, Euclidean and angular distances between audio input TIVs indicate levels of perceptual proximity. In other words, small distances between two sonorities express their levels of perceptual relatedness (e.g., the C note is closer to G than to B, even though the first interval spans 7 semitones and the second only 1 semitone).

We use the Tonal Interval Space as a framework to develop an algorithm that can efficiently identify, at given intervals of time, the harmonic “context” of an ongoing performance. After representing the audio input TIVs, we smooth their trajectory as the mean of the last Nadapt vectors to avoid sudden changes in the filter’s behaviour. The number of TIVs used to calculate the mean TIV, Nadapt, is a user-defined parameter of our system. We then compute the distance of the 12 notes of the chromatic scale from the moving-averaged TIV. The resulting distances are finally ranked in increasing order of the perceptual proximity of the 12 chromatic notes to the input signal. The level of pedal depression is then used as a threshold to define which pitches are allowed in the output reverberation.


Agradecimentos

Dedico este espaço para agradecer a todos os que contribuíram, de qualquer forma, para a elaboração desta dissertação. Quero assim começar por agradecer ao Gilberto Bernardes todos os dias, horas e minutos que investiu neste trabalho e em mim. Ele foi o grande alicerce deste trabalho, tanto pela ajuda em melhor entender conceitos fulcrais para esta dissertação como na motivação para querer sempre fazer algo melhor do que já tinha sido feito. Em segundo lugar agradeço ao professor Rui Penha por me ter encaminhado para estas vertentes mais criativas do processamento de sinal e por ter promovido a realização deste trabalho.

Não posso, de forma alguma, deixar de agradecer a todos os que durante estes cinco anos, que findam com esta dissertação, me ajudaram e marcaram o meu percurso. Agradecer todas as noites e dias que dedicamos ao que gostamos e que serão sempre inexplicáveis. Só o sabe quem o vive.

À minha mãe agradeço tudo.

O interesse que tenho pela música devo-o principalmente ao meu pai, que sempre me proveu com o material necessário para poder fazer qualquer conquista musical. Aqui está a primeira!

Agradeço também ao João Bandeira e ao Ricardo Santos por serem uma constante e serem os amigos que qualquer ser humano quer ter.

Por fim, e como os últimos são os primeiros, deixo aqui escrito um enorme obrigado à Joana por ser uma santa e me aturar todos os dias e mesmo assim continuar a gostar de mim. És essencial.

João Paulo Caetano Pereira


“After silence that which comes nearest to expressing the inexpressible is music.”

Aldous Huxley


Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Document Structure

2 An Overview of Audio Effects
  2.1 Digital Signal Processing
  2.2 Digital Audio Effects
    2.2.1 Reverberation
    2.2.2 Schroeder's Reverberation
    2.2.3 Moorer's Reverberation
    2.2.4 Dattorro's Reverberation
    2.2.5 Filtering Digital Audio
  2.3 ADAFx
  2.4 Describing Sound
  2.5 Tonal Spaces

3 Tonal Interval Space
  3.1 Chroma Vectors
  3.2 Tonal Interval Vectors
  3.3 Properties of the Tonal Interval Space
    3.3.1 Consonance
    3.3.2 Perceptual Relatedness

4 MusikVerb
  4.1 System Overview
  4.2 Tonal Filtering
  4.3 Adaptation
  4.4 Resynthesis
  4.5 User Interface

5 Evaluation
  5.1 Objective Evaluation
    5.1.1 Guitar Input for the Tonal Interval Space
    5.1.2 Spectral Analysis
    5.1.3 MusikVerb's Impact on Dissonance
  5.2 Subjective Evaluation

6 Conclusions and Future Work
  6.1 Original Contribution
  6.2 Future Work

A Pure Data Code

B BELA Implementation
  B.1 Chroma calculation
  B.2 Tonal Interval Space
  B.3 Harmonic Filtering
    B.3.1 Note Selection
    B.3.2 Harmonics Table
  B.4 Resynthesis


List of Figures

2.1 Some of the areas of influence of DSP. Extracted from [6].
2.2 Ideal low-pass filter.
2.3 Delay implementations.
2.4 Amplitude modulation with modulation factor α = 0.8 of a sine wave with frequency 440 Hz by a 12 Hz wave.
2.5 Relationship between the input level and the output level of a compressor. Extracted from [7].
2.6 The typical simplified block diagram of a DAFx.
2.7 Schroeder's all-pass filter reverberator.
2.8 Schroeder's comb filter reverberator.
2.9 Schroeder's reverberation block diagram.
2.10 Block diagram representing Moorer's reverberation. Extracted from [7].
2.11 All-pass filter in a lattice topology.
2.12 Tank topology proposed by Dattorro.
2.13 A fourth-order feedback delay network.
2.14 Example of a window with overlap of two blocks.
2.15 Window functions for N = 512.
2.16 The typical simplified block diagram of an ADAFx.
2.17 Tonal Spaces.
2.18 A C major chord represented in Harte et al.'s 6-dimension tonal space.
2.19 A C major chord represented in Bernardes et al.'s Tonal Interval Space. Extracted from [4].
3.1 Weighting window.
3.2 C major chord represented in the TIS. Extracted from [5].
3.3 The empirical consonance order of triads following [8].
3.4 The average spectrum of orchestral instruments. Extracted from [9].
3.5 Representation of the Euclidean (deuc) and angular (dang) distance between two vectors.
4.1 Block diagram of the implemented system.
4.2 2-dimensional illustration of the Tonal Filtering algorithm in the Tonal Interval Space.
4.3 Harmonics Table for different values of Nharm and Nnotes.
4.4 The hardware interface of the system.
4.5 The graphical user interface.
4.6 Details of the system's user interface.
5.1 Spectrogram of a guitar phrase input (top), this input processed by the Moorer reverberation (middle) and the MusikVerb.
5.2 Close-up view of the spectrograms of Figure 5.1.
5.3 Mean roughness for an input of major and minor chords.
5.4 Mean roughness for a set of guitar phrases.
5.5 Mean roughness for a set of guitar phrases with different values for Nadapt.
A.1 The system's graphical user interface.
A.2 Pure data code for the generation of harmonics given the desired notes and Nharms.
A.3 The Pure data patch that calculates the average of Nadapt TIVs.
A.4 The Pure data patch that performs the FFT of the signal, modifies its frequency-domain representation and outputs the result of the IFFT of the modified signal.
A.5 Pure data code relative to the computation of the cosine distance between an average TIV and the TIVs representing the twelve pitch classes of the twelve-tone tempered scale.
A.6 Pure data code relative to the computation of chroma vectors from an audio input using the HPCP algorithm.
A.7 Pure data code relative to the filtering of input samples that have insignificant amplitudes.


List of Tables

2.1 Table retrieved from [10] containing a taxonomy of audio descriptors.
3.1 C major chord's representation in the chroma space.
3.2 Composite consonance ratings of dyads. Extracted from [11].
3.3 Consonance ranking of common triads retrieved from empirical data, psychoacoustic models and the Tonal Interval Space. Extracted from [12].
3.4 Normalized cosine distance between dyads in the TIVs and chroma vectors c[n] of common dyads. Adapted from [12].
4.1 Frequency resolution relative to number of FFT bins, considering a sampling frequency of 44100 Hz.
5.1 Huron's empirical consonance values, from [11], and Tonal Interval Space calculated perceptual distances between dyads, with their respective rankings between braces, using the weights wa. The presented Spearman correlation values result from a correlation between the values of each column and the empirical consonance values.


Abreviaturas e Símbolos

ADAFx  Adaptive Digital Audio Effects
DAFx   Digital Audio Effects
DAW    Digital Audio Workstation
dB     Decibel
DFT    Discrete Fourier Transform
DSP    Digital Signal Processing
FIR    Finite Impulse Response
GUI    Graphical User Interface
HPCP   Harmonic Pitch Class Profile
Hz     Hertz, the unit of frequency of the International System
IIR    Infinite Impulse Response
m      Metre, the base unit of distance of the International System
MIDI   Musical Instruments Digital Interface
MIR    Music Information Retrieval
ms     A thousandth part of a second
TIV    Tonal Interval Vector


Chapter 1

Introduction

1.1 Context

Music and technology have an intertwined history [13]; for instance, instruments like the piano or the violin result from this interesting collaboration between music and technology [14]. The way audio is processed by the environment around it influenced the way music developed through the centuries [15], marking the difference between styles of music around the world. In places where people lived in caves, where sound could resonate, music evolved to be more harmonic, hence leading to the creation of musical scales. On the other hand, music that was developed in open spaces has a more rhythmic nature. The difference between these two realities lies in the presence of reverberation in the places where humans lived [16].

Since the construction of ancient Greek theatres [17], where the acoustic space of the theatre was designed to improve vocal clarity and sound level, signal processing techniques have been applied to the design of spaces. The construction of cathedrals also influenced the way music was composed as, until the invention of audio recording techniques, music was highly dependent on the place where it was being played [18]. The reverberation of the cathedrals and churches was part of the compositional process and had great influence on the music that composers produced. These early sound effect implementations might not have had a creative intent, but they are part of the historic roots of audio effects [16].

In the early 20th century, along with the progress in electronics, new instruments started to be developed. Starting with electromechanical instruments, such as the Telharmonium (1897) by Thaddeus Cahill or the Ondes Martenot (1928) by Maurice Martenot, and then with electronic instruments, such as the Thérémin (1924) or the Hammond Organ (1935) [19], these advances in technology allowed new sound realms to be explored. After the second world war, two main research groups were created, one in Paris and another in Cologne. The Parisian group was led by Pierre Schaeffer and was based in the Radiodiffusion Television Francaise studios. The group from Cologne had its main figure in Karlheinz Stockhausen and conducted its research in the Nordwestdeutscher Rundfunk. These two groups used the technology that became available from the war to create new electronic instruments and new ways to process sound. This link between electronics and creativity allowed for new audio effects to be created, such as the phaser, the distortion and the wah-wah [20]. This bond between technological advances and audio effects continued through the years, providing the creative world with new tools to explore new sounds.

The invention of digital systems has enabled audio to be analysed and information to be retrieved from the audio signal. Music information retrieval (MIR) is a research area that focusses on obtaining information carried by musical signals (e.g. pitch, tempo) [21]. MIR results from the combination of different research fields, like digital signal processing (DSP), musicology and computer science. It is growing due to the need to perceptually organize ever-growing music collections [23], a consequence of the proliferation of music through the internet. Currently, there is also research on applying MIR to DAFx, resulting in more interactive effects [24]. These content-aware DAFx are called ADAFx [2].

ADAFx use the information contained in the input audio to control parameters of a time-varying effect [2]. This content-driven way of working is the basis of the ADAFx. The main goal of this type of effect is to create a communicational creative space between the user and the algorithm so that one influences the other. This symbiosis results in a more natural tool that helps artists achieve more expressive performances. The first implementations of ADAFx were effects driven by the dynamic variation of the sound (e.g. compressor, limiter, gate) [2] but, theoretically, any DAFx can be adapted to become an ADAFx. Of course, an additional computational step is necessary for this adaptation, as information needs to be retrieved from the input signal.

1.2 Motivation

Digital signal processing, namely digital processing of musical audio, is an interesting field that joins creativity and technology. This is an ever-growing research field, as new technologies are always arising, providing creativity with the means to explore new ways of processing digital audio. Within the field of DAFx, new effects are constantly being developed, as musicians always want to explore new sonic spaces, thus expanding their creative boundaries.

Perceptually-driven ADAFx make it possible to create new and more interactive DAFx, with more organic interactions in the way the effect works. This type of effect is more flexible and can explore new sonic spaces, as it takes the content of the input signal as an input to the DSP algorithm. To date there have not been many commercial implementations of ADAFx, except for dynamics processors, so this is a field ripe for new products.

1.3 Goals

In this dissertation, we aim to develop a novel adaptive reverberation in which the tonal properties of an input signal shape a tonal filtering algorithm, with the aim of reducing the clutter that traditional reverberations present. We aspire to create an ADAFx that is a new creative tool, one that shapes itself to the musical content of a performance. Our main goal is to create an interactive, real-time system that is specially tuned for guitar performances. As our implementation is intended to work with a guitar, we aim to create a guitar effect pedal that implements the developed system. We set as a secondary goal the adaptation of the system to a compact unit that would fit on any guitarist's pedalboard.

1.4 Document Structure

In this dissertation we start, in chapter 2, by analysing and reviewing the state of the art of digital audio effects, giving a quick overview of Music Information Retrieval, and later in this chapter we present an analysis of some of the tonal spaces that are at the basis of this work.

In chapter 3 we present the different characteristics of the Tonal Interval Space, presented by Bernardes et al. [4]. We start by explaining the steps needed to transform musical audio into a perceptually relevant representation (Tonal Interval Vectors), and finally we overview the properties of this tonal space.

In chapter 4, the system developed during the course of this dissertation is detailed. We start by describing the system and then discuss its components.

The system is evaluated in chapter 5. This evaluation is divided into two different phases, namely the objective evaluation, where we compare the use of our system with other similar systems, and the subjective evaluation, which comprises the evaluation of our system performed by two experienced guitarists.

Finally, in chapter 6 we present the conclusions that were obtained during the course of this work.


Chapter 2

An Overview of Audio Effects

An audio effect is a signal processing technique that changes the characteristics of an audio signal [25]. Audio effects are used in a wide range of areas under the umbrella of the creative industries, such as broadcasting, television, film, games, music production and creation [26], where they are adopted to enhance the quality and expressiveness of sound and music signals.

In this chapter, we provide an overview of audio effects in their digital manifestation, as most commonly adopted in recent times [1]. To this end, we start by providing a short overview of Digital Signal Processing (DSP) techniques and then we introduce recent advances in Digital Audio Effects (DAFx) and Adaptive Audio Effects (ADAFx). Next, we detail several digital reverberations and filters, at the core of our work. The chapter concludes with an overview of musical information retrieval techniques used to extract attributes from audio signals, and in particular musically-informed tonal spaces and metrics, used as a basis to drive Adaptive Audio Effects (ADAFx) in our work.

2.1 Digital Signal Processing

Digital Signal Processing is the mathematics, the algorithms, and the techniques used to manipulate signals after they have been converted into a digital form [6]. This research area started with the first digital computers and has been fundamental to the development of technology over the past few decades across multiple domains, as shown in Figure 2.1.

To apply DSP algorithms, we need to convert a continuous analogue signal into a digitized form. To obtain this digital manifestation of an audio signal, we discretize its amplitude across the temporal dimension, by sampling its amplitude value at regular intervals of time. The frequency of the sampling is called the sampling rate, Fs, and is expressed in Hz. Every 1/Fs seconds we sample the amplitude of the continuous signal and assign a discrete amplitude value to it. This process is called quantization, and the amplitude values that can be assigned to a given sample depend on the quantization step ∆, as shown in Eq. 2.1 in the case of a uniform quantizer. This value depends on the number of bits that are used to sample the signal and the reference amplitude, Vref (in volts), of the quantizer.


Figure 2.1: Some of the areas of influence of DSP. Extracted from [6]

∆ = Vref / 2^Nbits        (2.1)
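To make Eq. 2.1 concrete, the sketch below applies a uniform quantizer to a sampled sine wave. It is a minimal illustration in Python/NumPy; the function and parameter names (uniform_quantize, n_bits, v_ref) are ours and not part of any library.

import numpy as np

def uniform_quantize(x, n_bits=16, v_ref=1.0):
    # Quantization step from Eq. 2.1: delta = v_ref / 2^n_bits
    delta = v_ref / 2 ** n_bits
    # Round every sample to the nearest multiple of delta
    return delta * np.round(x / delta)

# Example: a 440 Hz sine sampled at Fs = 44100 Hz, quantized with 8 bits
fs = 44100
t = np.arange(fs) / fs
x = 0.9 * np.sin(2 * np.pi * 440 * t)
x_quantized = uniform_quantize(x, n_bits=8)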

A digital audio manifestation of a given signal creates the possibility to apply digital processing techniques to change its characteristics. In music, this operation broadly falls under digital audio effects, which we detail next.

2.2 Digital Audio Effects

Digital audio effects are software tools that take an input audio signal and modify it using algorithms [1]. Audio effects were made possible with the evolution of electronics and, particularly, with the invention of the modern (digital) computers [7]. With modern computation, a new paradigm of audio processing, digital audio, emerged. The exponential advancements of computation were followed by an increment of digital audio effects, mainly due to the portability and scalability that the digital manifestation fosters. What once needed different hardware systems to work can now be squeezed into one machine. This made effects easily accessible, as most DAFx can use the processing power of the CPU and be adopted in common platforms for processing audio in offline and online scenarios, such as Digital Audio Workstations (DAWs) or guitar effects pedals. By using digital signal processing, some of the limitations that existed for analogue audio effects [27] can be surpassed, thus pushing the boundaries of audio effects further. The non-linear behaviour of analogue electronics distorted the input signal in unintended ways in some audio effects. This issue was settled by digital audio, as it grants high fidelity, dynamic range and robustness [28].


As audio effects are a creative technology, they present an extensive diversity. By using an audio effect one can alter an input audio signal in a multitude of ways. While some effects explore the time properties of the signal, others process their amplitude or their frequency. Zolzer [1] divides DAFx into the following ten categories:

• Filters (e.g., low-pass filters, wah-wah)
• Delays (e.g., echo, multitap delay)
• Modulators (e.g., tremolo, ring modulator)
• Non-linear processing (e.g., compressor, distortion)
• Spatial effects (e.g., reverberation, stereo enhancement)
• Time-segment processing (e.g., pitch shifter, time-stretcher)
• Time-frequency processing (e.g., de-noising, robotization)
• Source-filter processing (e.g., formant changer, pitch shifter with formant preservation)
• Spectral processing (e.g., spectral filtering, harmonizer)
• Time and frequency warping (e.g., sound morphing, inharmonizer)

By inspecting the categories above we notice the various fields of application of DAFx, as they can either be used to enhance the quality of the signal or to creatively explore the signal's properties. Next, we present a brief overview of some of the stated DAFx categories.

Filters: a digital audio representation of a signal carries basic frequency and amplitude information, which can be manipulated using digital filters. To this end, this category of effects processes sound partials in order to eliminate, retain or emphasize particular frequencies [1]. In Figure 2.2, we show an ideal low-pass filter. This type of filter eliminates all frequencies above fc from the audio signal. This category of DAFx is at the basis of many types of audio effects [7] and will be explained in greater detail in Section 2.2.5.

Figure 2.2: Ideal low-pass filter.


Delays: sound reflection occurs when a sound wave intersects an obstacle (e.g. a wall). If the distance between the sound-emitting device and the obstacle is longer than approximately 17 m, the reflection is perceived as a repetition of the original sound, dubbed an echo. DAFx commonly explore this physical phenomenon using two basic signal processing algorithms: Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) comb filters.

A FIR comb filter is a delay network topology that simulates a single delay by imposing a time, and possibly amplitude, difference between the input sound and its playback. The effect is only perceived when the input source sound is heard, or played back by the system.

IIR comb filter based delays produce a succession of repetitions of the input audio, as opposed to the single repetition of a FIR comb filter. The repetitions of an IIR comb filter fade according to the multiplication of the amplitude of the signal by a defined gain at the delay block output (see Figure 2.3b). If the gain is higher than 1, the signal repetitions will continue indefinitely and their amplitude will increase over time. If the gain is lower than 1, the system will gradually reach silence after a number of repetitions.

With combinations of these two topologies it is possible to create other delay systems, like the slapback delay, a FIR filter with a small delay time, typically between 60 ms and 150 ms, or the multitap delay, which uses a combination of FIR or IIR comb filters with different delay times, where each of the delay units contributes to the output result [7], creating a sensation of sound arriving at different times to the listener.

(a) FIR Comb Filter. (b) IIR Comb Filter.

Figure 2.3: Delay implementations.
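The two comb-filter topologies of Figure 2.3 can be sketched in a few lines. The code below is an illustrative Python/NumPy rendering of one common formulation; the function names, delay times and gains are ours, not taken from the diagrams above.

import numpy as np

def fir_comb(x, delay, gain):
    # Single echo: y[n] = x[n] + gain * x[n - delay]
    y = x.astype(float).copy()
    y[delay:] += gain * x[:-delay]
    return y

def iir_comb(x, delay, gain):
    # Repeating echoes: y[n] = x[n] + gain * y[n - delay]
    # gain < 1 makes the repetitions decay; gain > 1 makes them grow indefinitely
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= delay:
            y[n] += gain * y[n - delay]
    return y

# Slapback-style single echo of 100 ms at a 44.1 kHz sampling rate
fs = 44100
dry = np.random.randn(fs)
slapback = fir_comb(dry, delay=int(0.1 * fs), gain=0.7)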

Modulators: these are processes where one or more properties of a signal, called the carrier signal, are varied by another (modulating) signal [7]. Any parameter of a sound can be modulated by another signal. One of the most common modulations is amplitude modulation, which results from the multiplication of an input audio signal by another signal carrying the amplitude variation to be imposed on the original signal over time. Different effects are produced by this DAFx category. One thing that distinguishes them is the frequency of the modulator signal. We will first present a case where the frequency of the modulator is not audible (i.e., less than 20 Hz), and then an audio effect resulting from a modulator signal with a frequency higher than 20 Hz. The strength of the modulation, which defines how much the signal will vary, is given by the modulation factor, α.

A classic DAFx example of the use of amplitude modulation is the tremolo. This effect makes the input audio signal amplitude vary constantly over time. Thus, a long flat sound without amplitude changes will result in an undulating sound. To achieve this effect we perform the operation that is described in Equation 2.2.

y[n] = x[n](1 + αm[n]) (2.2)

The modulator signal, m[n], can be of any type. The most common modulating waves are the square wave (Figure 2.4b), which will produce a sound where the audio signal is switched on and off (without fading), and the sinusoidal wave (Figure 2.4a), which will result in an output audio signal where the sound goes on and off smoothly.

(a) Amplitude modulation using a sine wave as modulator m[n]. (b) Amplitude modulation using a square wave as modulator m[n].

Figure 2.4: Amplitude modulation with modulation factor α = 0.8 of a sine wave with frequency 440 Hz by a 12 Hz wave.

When the modulating signal is audible (i.e., when its frequency is greater than 20 Hz), the effect obtained with amplitude modulation is a ring modulator. This effect is perceived as a timbral modification of the input sound; when the input sound x[n] is a sinusoidal wave and the modulator is also a wave of this type, the resulting output signal will be perceived as the sum and the difference of the x[n] and m[n] frequencies.
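As a sketch of Equation 2.2, the snippet below implements a tremolo with a sub-audio sinusoidal modulator and, for contrast, a ring modulator with an audible-rate modulator. It is illustrative Python/NumPy code; names and default values are ours.

import numpy as np

def tremolo(x, fs, mod_freq=12.0, alpha=0.8):
    # Eq. 2.2: y[n] = x[n] * (1 + alpha * m[n]), with a 12 Hz (sub-audio) sine as m[n]
    n = np.arange(len(x))
    m = np.sin(2 * np.pi * mod_freq * n / fs)
    return x * (1.0 + alpha * m)

def ring_modulator(x, fs, mod_freq=440.0):
    # Same multiplication, but with an audible-rate modulator (> 20 Hz)
    n = np.arange(len(x))
    return x * np.sin(2 * np.pi * mod_freq * n / fs)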

Non-linear processing: all audio effects that apply a non-linear transformation to the input sound fall into this DAFx category. The distortion and the compressor are examples of non-linear DAFx.

The compressor is a widely used DAFx that reduces the dynamic range of an input sound. It applies a gain, g, to an audio signal dynamically over time, whenever the amplitude of the signal is higher than a user-defined threshold. The level of compression is defined by a compression ratio. For example, a 3:1 ratio means that the input audio has to increase by 3 dB for the output to increase by 1 dB [7]. The relationship between the input and output signal of a compression DAFx, defined by the compression ratio, is shown in Figure 2.5. The time that the compressor takes to apply the determined gain to the input signal is called the attack time. The time the compression algorithm takes to bring the gain back up is called the release time.

As shown in Figure 2.5, in the extreme compression ratio of ∞:1 the resulting transformation is called a limiter. This effect hard-limits the input signal level. It can be used to provide control over the peaks of an audio signal, by reducing the dynamics of the input.


Figure 2.5: Relationship between the input level and the output level of a compressor. This Figure was extracted from [7].
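The input/output characteristic of Figure 2.5 reduces to a simple gain computation. The following sketch (our own naming, with attack and release smoothing omitted) returns the gain, in dB, that a compressor with a given threshold and ratio would apply to a measured input level.

import numpy as np

def compressor_gain_db(level_db, threshold_db=-20.0, ratio=3.0):
    level_db = np.asarray(level_db, dtype=float)
    # Amount by which the input level exceeds the threshold (0 when below it)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    # Above the threshold the output only rises by over_db / ratio
    output_db = level_db - over_db * (1.0 - 1.0 / ratio)
    # Gain to apply (negative means attenuation); ratio = np.inf yields a limiter
    return output_db - level_db

# A -5 dB input with a -20 dB threshold and a 3:1 ratio is attenuated by 10 dB
print(compressor_gain_db(-5.0))   # -10.0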

Other amplitude transformations can be applied to the signal, such as the ones that induce distortion. Distortion, overdrive and fuzz effects are widely used in music production and became highly popular in rock music [1]. These types of effects change the timbre of the audio input by adding higher harmonics to it [7]. Many implementations of these effects exist, particularly as guitar pedals, which help guitar players achieve idiosyncratic timbres. These three effects process the sound according to non-linear functions, such as [1]:

f(x) = 2x                      for 0 ≤ x < 1/3
f(x) = (3 − (2 − 3x)²) / 3     for 1/3 ≤ x < 2/3
f(x) = 1                       for 2/3 ≤ x ≤ 1        (2.3)

Overdrive is the "softest" of these three effects. It applies an almost linear transformation to the signal; the non-linear processing only occurs at high amplitude values of the input audio signal, as detailed in Equation 2.3.

Distortion and fuzz effects are two other DAFx which typically process an audio input by a non-linear function. The different shades of timbre resulting from distortion effects range from the emulation of a tube amplifier to a buzz-saw effect [1]. Equation 2.4, presented in [1], proposes such a distortion system.

f(x) = (x / |x|) · (1 − e^(x² / |x|))        (2.4)
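The overdrive curve of Equation 2.3 is defined for inputs in [0, 1]; a common way to apply it to a bipolar audio signal is to shape the absolute value and restore the sign, as in the illustrative Python/NumPy sketch below (our own implementation, not taken from [1]). The three branches meet continuously at |x| = 1/3 and |x| = 2/3, which is what gives the effect its "soft" character.

import numpy as np

def overdrive(x):
    # Symmetric soft clipper following Eq. 2.3, applied to |x| (input assumed in [-1, 1])
    a = np.abs(x)
    y = np.where(a < 1/3, 2 * a,                        # linear region
        np.where(a < 2/3, (3 - (2 - 3 * a) ** 2) / 3,   # quadratic knee
                 1.0))                                  # hard limit
    return np.sign(x) * y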

Spatial Effects: our brain can locate the origin of an acoustic source as a result of how an audio signal reaches our ears [29]. We can apply DAFx to impose spatial changes on the position of an audio signal. Most microphones record one audio signal, but most recorded songs are presented to the listener in stereo sound, meaning two synchronized audio streams. This category of DAFx is commonly used by audio producers to map audio signals to different locations in the (virtual) space. The spatial mapping of audio is also a common practice in sound design for multimedia.



Figure 2.6: The typical simplified block diagram of a DAFx.

Although the example given refers only to two audio channels, audio spatialization can be done in more complex audio systems. An example of this is Ambisonics, which comprises methods for both sound recording and reproduction in three dimensions. Into this category also fall the different types of reverberation effects, to which the next section is entirely devoted, due to the relevance of the topic for this study.

2.2.1 Reverberation

Reverberation effects simulate the acoustic response of an enclosed space using digital signal processing. In greater detail, they replicate, for example, the shape, size and texture of a room in which the signal is virtually placed [1]. Reverberation is one of the most commonly used DAFx [7]. Digital audio provides systems with a higher signal-to-noise ratio, low distortion and a flat frequency response, which minimizes the problems that electronic reverberators have. Additionally, digital audio offers the possibility to implement more complex reverberation systems that can better recreate the desired acoustic response of a room (e.g., convolution with the room's impulse response and feedback delay networks).

2.2.2 Schroeder’s Reverberation

Schroeder presented one of the first artificial reverberation systems [27], which relies on two main components: the all-pass filter and the comb filter.

Figure 2.7: Schroeder's all-pass filter reverberator.


The all-pass filter, shown in Figure 2.7, simulates sound reflections that do not change the timbre of the input sound [7]. By connecting an arbitrary number of all-pass filters, we obtain a high-density reverberation [27] that does not change the timbre of the sound. To calculate the reverberation time, trev, which depends on the gain, g, and the delay time, τ, we can use the following equation [27]:

trev = (3 / log|1/g|) · τ        (2.5)

Given that the frequency response of a room is not flat, Schroeder added a parallel array of comb filters to "colourize" the audio output. The block diagram of the comb filter used by Schroeder is presented in Figure 2.8.


Figure 2.8: Schroeder’s comb filter reverberator.

Combining four parallel comb filters with two all-pass filters, we obtain Schroeder's reverberation. The design of this reverberation does not include early reflections, thus neglecting important attributes of the acoustic response of a room. The block diagram of Schroeder's reverberation is shown in Figure 2.9.

Figure 2.9: Schroeder's reverberation block diagram.
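A direct, non-optimized rendering of the structure in Figure 2.9 is sketched below: four parallel feedback comb filters followed by two all-pass filters in series. It is illustrative Python/NumPy code; the delay times and gains are values of our own choosing, not Schroeder's original settings.

import numpy as np

def feedback_comb(x, delay, gain):
    # y[n] = x[n - delay] + gain * y[n - delay]
    y = np.zeros(len(x))
    for n in range(delay, len(x)):
        y[n] = x[n - delay] + gain * y[n - delay]
    return y

def schroeder_allpass(x, delay, gain):
    # y[n] = -gain * x[n] + x[n - delay] + gain * y[n - delay]
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = -gain * x[n]
        if n >= delay:
            y[n] += x[n - delay] + gain * y[n - delay]
    return y

def schroeder_reverb(x, fs):
    # Four parallel comb filters "colourize" the response ...
    comb_delays_s = (0.0297, 0.0371, 0.0411, 0.0437)   # illustrative delay times
    wet = sum(feedback_comb(x, int(fs * d), 0.85) for d in comb_delays_s) / 4.0
    # ... and two series all-pass filters increase the echo density
    for d_s, g in ((0.005, 0.7), (0.0017, 0.7)):
        wet = schroeder_allpass(wet, int(fs * d_s), g)
    return wet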


2.2.3 Moorer’s Reverberation

Following Schroeder, the next main breakthrough in reverberation design was presented by Moorer [30]. This system replicates the acoustics of a room with higher fidelity, as it simulates the absorption of the reverberation by the air and walls of an enclosed space, which is perceived as an attenuation of the higher frequencies of the reverberation [7]. Figure 2.10 illustrates this reverberation system.

Figure 2.10: Block diagram representing Moorer's reverberation. Extracted from [7].

Moorer's reverberation has two different stages: the early reflections stage and the late reflections stage. The former consists of a number of FIR filters with variable delay times, which simulate the early impulse response of a room. The latter stage consists of a bank of comb and low-pass filters, which model the high-frequency attenuation of the room and increase the density of the reverberation.
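The key ingredient of Moorer's late-reflection stage is a comb filter with a low-pass filter inside its feedback loop, so that higher frequencies decay faster. A minimal Python/NumPy sketch of such a low-pass comb filter is shown below; the one-pole damping filter and all coefficient values are our own illustrative choices.

import numpy as np

def lowpass_comb(x, delay, gain=0.8, damp=0.4):
    # Feedback comb filter with a one-pole low-pass filter in the loop:
    #   lp[n] = (1 - damp) * y[n - delay] + damp * lp[n - 1]
    #   y[n]  = x[n] + gain * lp[n]
    y = np.zeros(len(x))
    lp = 0.0
    for n in range(len(x)):
        if n >= delay:
            lp = (1.0 - damp) * y[n - delay] + damp * lp
        y[n] = x[n] + gain * lp
    return y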

2.2.4 Dattorro's Reverberation

In 1997 [31], Dattorro described a reverberation system based on the work of Griesinger [32]. This reverberation implementation grants a natural-sounding decay that approximately follows an exponential curve [31].

There are two main types of processing blocks in Dattorro's reverberation: the input diffusers and the tank. The former consists of a number of all-pass filters in the topology of a lattice (see Figure 2.11). The algorithm uses four of these blocks to decorrelate the input signal before it is


fed to the tank block [31]. The latter aims to reduce the strong amplitude peaks of the input so that the algorithm's output remains stable.

Figure 2.11: All-pass filter in a lattice topology.

The tank block, depicted in Figure 2.12, defines the amplitude envelope of the decay of the reverberation. It traps the sound using recirculating diffusers arranged in a figure-of-eight topology, creating an exponential-like decay in the sound that is output from this block.


Figure 2.12: Tank topology proposed by Dattorro

In [31], Dattorro proposes a stereo output configuration for this algorithm that results in the simulation of a plate reverberation, i.e., a type of reverberation where a metal plate is used to produce the desired reverb sound. Yet, multiple output configurations can be obtained from this algorithm, as Dattorro states in [31].


2.2.4.1 Feedback Delay Networks

First proposed by Puckette and Stautner [33], the feedback delay network reverberation consists of a set of parallel delay lines in a feedback loop with a gain matrix. The input of each delay line is the sum of the input audio and the outputs of selected delay lines, which conform to a gain matrix. Equation 2.6 presents an example of a gain matrix, proposed in [33]. The values of g are limited to −1 ≤ g ≤ 1 to ensure the stability of the system. This matrix works as a router, linking the outputs of the delay lines to particular input channels. It also applies a gain, gj,k, in this feedback process. The delay, mj, of each line defines the size of the emulated room [33]. In Figure 2.13, we show an example of a fourth-order feedback delay network.

        |  0   1   1   0 |
G = g · | −1   0   0  −1 |
        |  1   0   0  −1 |
        |  0   1  −1   0 |   ,   with g ≤ 1/√2        (2.6)

In [34], Jot extended Puckette and Stautner's reverberation by adding filters and showing how to control the poles of the system to better simulate the long-term response of a room.


Figure 2.13: A fourth-order feedback delay network.
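The fourth-order feedback delay network of Figure 2.13 can be sketched as four circular delay lines whose outputs are routed back through the gain matrix of Equation 2.6. The code below is an illustrative Python/NumPy implementation; the delay times and the simple input/output taps are choices of our own.

import numpy as np

def fdn_reverb(x, fs, delays_ms=(29.7, 37.1, 41.1, 43.7), g=0.6):
    # Feedback matrix of Eq. 2.6; g <= 1/sqrt(2) keeps the loop stable
    A = g * np.array([[ 0,  1,  1,  0],
                      [-1,  0,  0, -1],
                      [ 1,  0,  0, -1],
                      [ 0,  1, -1,  0]], dtype=float)
    lengths = [max(1, int(fs * d / 1000.0)) for d in delays_ms]
    buffers = [np.zeros(m) for m in lengths]     # circular delay-line buffers
    idx = [0, 0, 0, 0]
    y = np.zeros(len(x))
    for n in range(len(x)):
        # Read the delayed output of each line before overwriting it
        outs = np.array([buffers[i][idx[i]] for i in range(4)])
        y[n] = outs.sum()                        # simple output tap: sum of all lines
        feedback = A @ outs                      # route the outputs back to the inputs
        for i in range(4):
            buffers[i][idx[i]] = x[n] + feedback[i]
            idx[i] = (idx[i] + 1) % lengths[i]
    return y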

2.2.4.2 Impulse Response Reverberation

From a DSP background, we can consider the interaction between an audio signal, a room and a listener as a system. In fact, this system is constituted by an input signal, a processing block and an output signal. The room is a static system defined by its size and the materials used in its construction; therefore, the room can be considered a linear time-invariant (LTI) system [35], thus a convolution between the impulse response of the room and the input signal will result in the output signal. This method is the one that can best simulate the sound of a room [1]. Measuring the impulse response of a room is, in itself, a research topic [36] but, as we are not going to replicate the sonic behaviour of a room, this will not be a major concern of this work.

Having the discrete impulse response h[n], we simply convolve it with the input signal x[n] and obtain an output signal y[n] [32]. This operation is shown in Equation 2.7.

y[n] = Σk x[k] · h[n − k]        (2.7)

Another, more efficient, method to perform a convolution is using the Discrete Fourier Transform (DFT) [37]. This method performs the convolution in the frequency domain, which corresponds to a multiplication between the frequency representation of the input signal and the frequency representation of the room's impulse response. By using the Fast Fourier Transform (FFT) algorithm, it is possible to design a real-time impulse response convolution reverberation system, as seen in [37].
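The frequency-domain convolution described above can be sketched in a few lines with NumPy's FFT. This offline version processes the whole signal at once; a real-time implementation, as in [37], would instead use block-wise (partitioned) convolution with overlap-add. Function and variable names are ours.

import numpy as np

def convolution_reverb(x, h):
    # Full convolution length and next power-of-two FFT size
    n = len(x) + len(h) - 1
    n_fft = 1 << (n - 1).bit_length()
    # Multiplication in the frequency domain is equivalent to Eq. 2.7
    X = np.fft.rfft(x, n_fft)
    H = np.fft.rfft(h, n_fft)
    y = np.fft.irfft(X * H, n_fft)[:n]
    return y / (np.max(np.abs(y)) + 1e-12)       # normalize to avoid clipping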

2.2.5 Filtering Digital Audio

A filter is a tool that is used to select certain elements of a larger set of elements [1]. Applying this concept to the frequency domain of audio, a filter selects some of the frequencies that are present in the original sound. With DSP it is possible to implement filters that have very good performance [6]. Filters have been implemented in both analogue and digital audio. The main types of audio filters are the following [7]:

High-pass Filter: This category of filters removes the frequencies that are lower than a defined cutoff frequency fc.

Low-pass Filter: Low-pass filters work as the opposite of the first filter class described. They remove frequencies that are higher than fc from the input signal.

Band-pass Filter: This filter type filters out the frequencies that are distant from a central frequency, fc. The band of frequencies ∆f that will not be filtered can be defined by the quality factor Q (which can be calculated using Equation 2.8) of the filter implementation.

Band-stop Filter: Band-stop filters process the signal by removing the frequencies that are within a band around fc, which can be described by the quality factor Q of the filter.

Q = fc / ∆f  ⇔  ∆f = fc / Q        (2.8)

In our implementation we are interested in selecting multiple frequencies from the input signal, so it is important to find a filtering implementation that easily allows for this.

2.2.5.1 Spectral Filtering

This filtering technique is a particular case of the phase vocoder [38]. It works by changing the magnitude of the coefficients of the Fourier transform of the input signal.


By using the Fourier transform to represent the input signal in the frequency domain, we are then able to filter frequencies from an audio input signal in a relatively simple way, and then perform the inverse transform to obtain a time-domain representation of the filtered signal, though there are some idiosyncrasies to this method that we will address next.

To perform this operation, the audio input signal is segmented into discrete blocks and its frequency-domain representation is calculated by performing the short-time Fourier transform block by block. To ensure a better final result, each block can contribute to more than one window of the transform. In Figure 2.14, we see an example where the window size is M and the hop size, H, is half of the window size.


Figure 2.14: Example of a window with overlap of two blocks.

To remove unwanted artefacts at the ends of a window, due to the abrupt cut of the input blocks, each window is multiplied by a window function. In Figure 2.15 we see two examples of this type of function.

(a) Hamming window. (b) Hann window.

Figure 2.15: Window functions for N = 512.

After the FFT of size M is performed, we obtain M coefficients that represent the frequency content of the signal. By changing one of the M coefficients we change the presence of the frequency that it represents in the signal. After all of the processing is done, the inverse transform must be performed to create a new version of the signal that no longer contains the selected frequency [7].
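A minimal sketch of this spectral filtering procedure is given below: the input is windowed with 50% overlap, selected FFT bins are set to zero, and the blocks are overlap-added back into a time-domain signal. It is illustrative Python/NumPy code (our own names and parameters), not the Pure Data implementation used later in this work.

import numpy as np

def spectral_filter(x, remove_bins, n_fft=1024):
    hop = n_fft // 2                       # hop size H = M / 2 (50% overlap)
    window = np.hanning(n_fft)             # Hann analysis window
    y = np.zeros(len(x))
    for start in range(0, len(x) - n_fft, hop):
        frame = x[start:start + n_fft] * window
        spectrum = np.fft.rfft(frame)
        spectrum[list(remove_bins)] = 0.0  # suppress the selected frequencies
        y[start:start + n_fft] += np.fft.irfft(spectrum)   # overlap-add resynthesis
    return y

# Remove bin 10, i.e. roughly 431 Hz for n_fft = 1024 at a 44.1 kHz sampling rate
fs = 44100
filtered = spectral_filter(np.random.randn(fs), remove_bins=[10])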



Figure 2.16: The typical simplified block diagram of an ADAFx.

2.3 ADAFx

Users can change the behaviour of a DAFx over time by controlling its parameters. A simple example of this type of control is the wah-wah pedal, whose filter frequency, used to impose a given effect on the sound, is controlled by the pressure the user applies to the pedal. A recent subclass of DAFx, in which the input audio controls the parameters of the effect, has been flourishing within the community. This class of effects is commonly referred to as Adaptive Audio Effects (ADAFx) [2]. This type of DAFx requires the extraction of information from the signal, such as the amplitude envelope, the fundamental frequency, the dynamics, the musical notes and the tempo.

This type of DAFx enables the user to control the effect using audio signals [39]. The resulting algorithm is one that provides an organic interaction and integration between the performed music and the effect parameters [24]. Hence, using an ADAFx algorithmic architecture, presented in Figure 2.16, we can revisit existing effects by studying mappings between retrieved sound attributes and effect parameters. An example of the adaptation of a common DAFx to an ADAFx is the Adaptiverb by Zynaptiq. This effect is an ADAFx that extends a common reverberation, creating a new concept of reverberation. In [24], Verfaille and Arfib propose several categories for ADAFx, namely, effects on the sound level (e.g., compressor, limiter), effects on time duration (e.g., adaptive time stretching), effects on pitch (e.g., adaptive pitch-shift, adaptive vibrato), effects on timbre (e.g., adaptive filtering effects) and effects on panoramisation (e.g., adaptive spatialization).

As has been noted in this section, ADAFx can be implemented by adding a music information retrieval and mapping step to a regular DAFx, thus creating countless possibilities for the creation of innovative effects. In the next section we will address how to retrieve relevant information from sound that can be used in the adaptive stage of ADAFx.
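As a toy illustration of the ADAFx architecture of Figure 2.16, the sketch below extracts a low-level feature from the input (its frame-by-frame RMS envelope) and maps it to a parameter of a simple effect (the tremolo rate). The mapping is arbitrary and chosen only for illustration; all names are ours, and this is not the system developed in this dissertation.

import numpy as np

def rms_envelope(x, frame=1024, hop=512):
    # Low-level feature extraction: RMS amplitude per analysis frame
    return np.array([np.sqrt(np.mean(x[i:i + frame] ** 2))
                     for i in range(0, len(x) - frame, hop)])

def adaptive_tremolo(x, fs, frame=1024, hop=512):
    # Feature-to-parameter mapping: louder passages get a faster tremolo (1 to 10 Hz)
    env = rms_envelope(x, frame, hop)
    env = env / (env.max() + 1e-12)
    rates = 1.0 + 9.0 * env
    rate_per_sample = np.repeat(rates, hop)
    if len(rate_per_sample) < len(x):
        rate_per_sample = np.pad(rate_per_sample,
                                 (0, len(x) - len(rate_per_sample)), mode='edge')
    rate_per_sample = rate_per_sample[:len(x)]
    # Integrate the time-varying rate to obtain the modulator phase
    phase = 2 * np.pi * np.cumsum(rate_per_sample) / fs
    return x * (1.0 + 0.8 * np.sin(phase)) / 1.8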

2.4 Describing Sound

Describing sound involves the use of algorithms, techniques and information that have been developed in different research areas (e.g. digital signal processing and musicology) [40]. Sound descriptors have been applied across multiple MIR tasks [40], such as database management (e.g. search by similarity in applications such as Spotify2), music creation and production (to create new sounds and to more easily edit some sound features), music identification (e.g. Shazam3 or Soundhound4), or even copyright protection.

The current state of the art in sound description commonly adopts descriptors which can be grouped into three levels of abstraction: low, mid and high (or semantic) level descriptors. This division is commonly related to the complexity of the model used to extract them. Low-level descriptors are computed directly from a temporal or spectral representation of the input signal in a derivative fashion (e.g., instantaneous amplitude and fundamental frequency [40]). Mid-level descriptors present information extracted from the signal which relate to perceptual attributes of the signal, as well as concepts commonly adopted by music experts, such as tonality, chords progressions, and tempo [41]. High level descriptors detail semantic attributes of the audio, such as those discussed by non-experts in the field, e.g. the genre or the mood of the music [42].

A large array of sound attributes have been mathematically and algorithmically defined. Pe-teers [10] provides a useful taxonomy of the descriptors that can be obtained from a sound without further processing and groups them into two main categories: global and time varying. The for-mer category use the whole duration of the sound to extract features. The latter group extracts information from the signal on a time frame basis. To achieve higher level representations of the informations more than one of these low level, either global or time varying, descriptors needs to be associated.

The way sound descriptors are extracted also plays a major role in the information they carry. Two main signal representations are used: the time domain, f(t), and the frequency domain, F(s). Some features of the sound may be retrieved from its time-amplitude representation, f(t) (e.g., the attack, the effective duration and the amplitude envelope), although this representation is not as perceptually adapted as other techniques [39], since the main perceptual features of the sound (pitch, timbre) are more closely related to frequency than to the time domain. Frequency-domain representations will be discussed later in this section.

In Table 2.1, we show some low-level sound descriptors divided by the two mentioned classes. By combining these two basic groups of sound features we can obtain higher-level descriptors [43] that represent more abstract characteristics of the sound, such as consonance or warmth. These higher-level features may be easier to understand for users who are not familiar with the more technical characteristics of sound, as they often reflect more common ways to describe it [40]. It is therefore important to define these high-level descriptors in order to build more user-friendly and interactive systems.

The frequency representation F(s) of the signal carries, for example, information about the note that is currently being played or the timbre of the instruments playing. This information is, of course, important to build higher-level sound descriptors.



Table 2.1: Table retrieved from [10] containing a taxonomy of audio descriptors.

The most common method to obtain the frequency-domain representation of a sound is the Discrete Fourier Transform (DFT). This mathematical operation acts on the sound like a filter bank: for each considered frequency, it measures how strongly that frequency is present in the windowed signal. The windowed signal is obtained by dividing the signal into time portions of a defined length; the Fourier transform is then computed for each of these time frames.
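A minimal sketch of this windowed analysis (the short-time Fourier transform) is shown below. The frame size, hop size and Hann window are illustrative choices made for the example, not the analysis parameters used elsewhere in this work.

```python
import numpy as np

def stft(x, frame=2048, hop=512):
    """Minimal STFT: window each frame (Hann) and take its DFT.

    Each column of the result tells how strongly each analysis frequency is
    present in the corresponding time frame, which is the 'filter bank' view
    of the DFT described in the text.
    """
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    spectra = np.empty((frame // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        segment = x[m * hop: m * hop + frame] * window
        spectra[:, m] = np.fft.rfft(segment)
    return spectra

# Example: a 440 Hz sine at fs = 44.1 kHz peaks near bin 440 * 2048 / 44100 ~ 20.
fs = 44100
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 440 * t))
print(np.abs(X[:, 0]).argmax())   # -> 20
```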

The DFT is a very useful and widely used strategy in sound synthesis/analysis [44] and in MIR [10], two of the main topics of this work. In this work we will focus on the retrieval of musical notes, which we consider a mid-level descriptor.

2.5 Tonal Spaces

A number of tonal pitch representations have been presented in the literature to model human perception [45]. These spaces distort the common frequency scale of musical notes (i.e., the logarithmic frequency scale of the chromatic scale) to obtain a representation where the notes are located according to their perceptual relatedness. To this end, Riemann developed the Tonnetz [46], also called the harmonic network, shown in Figure 2.17a. This tonal space was the foundation for many other tonal spaces, such as Chew's Spiral Array [47], Harte et al.'s 6-dimensional Tonal Centroid Space [46], and Bernardes et al.'s Tonal Interval Space [5].



Chew's Spiral Array is a 3-dimensional representation of the harmonic network [47], obtained by wrapping the horizontal interval sequence of fifths (see Figure 2.17a) around a helix, thus forming a spiral, as shown in Figure 2.17b. In Chew's space, chords and keys are represented in the interior of the helix as points that result from a linear combination of their composite notes. This results in a space in which the most perceptually related pitch combinations in Western music correspond to small Euclidean distances. Chew's Spiral Array has been successfully applied to problems such as pitch spelling [48] and key estimation [47].

(a) The Tonnetz tonal space, extracted from [4].
(b) Chew's Spiral Array, extracted from [47].

Figure 2.17: Tonal Spaces.

Harte et al.'s Tonal Centroid Space follows the line of Chew's work and proposes a space that projects pitch configurations, encoded as 12-element chroma vectors, as 6-dimensional points that can be visualized as three circles. Distances between pitch classes in the 6-dimensional space mirror the spatial arrangement of the perfect fifths, major thirds and minor thirds of the Tonnetz, weighted in a similar fashion to Chew's Spiral Array so as to favour perfect fifths and minor thirds over major thirds. In contrast to Chew's work, Harte et al. consider enharmonic equivalences (notes that are written differently, such as A flat and G sharp, but sound the same), allowing for the representation of harmonic information in a single octave [5]. Following Chew, the projection of a group of notes in the Tonal Centroid Space results from the linear combination of the positions of the notes that compose it. Hence, the distances between pitch configurations emphasise perceptual relations. This space has been successfully applied in several MIR tasks, such as automatic chord recognition from audio [46] and structural segmentation [49].

Recently, Bernardes et al. presented the Tonal Interval Space [4], following the lineage of the Tonnetz, the Spiral Array and the Tonal Centroid Space. Figure 2.19 shows the Tonal Interval Space, a 12-dimensional space that extends previous tonal spaces in four fundamental aspects. First, it represents and relates in the same space tonal pitch at three fundamental levels of Western tonal music, namely pitches, chords and keys. Second, it computes the space by means of the efficient discrete Fourier transform. Third, by design, it allows the computation of a tonal pitch consonance indicator. Fourth, it maps pitch configurations that have different representations in the chroma vector to unique locations in the space, thus expanding Harte et al.'s 6-dimensional space to include all possible intervallic relations.




Figure 2.18: A C major chord represented in Harte et al.'s 6-dimensional tonal space.



Figure 2.19: A C major chord represented in Bernardes et al.'s Tonal Interval Space. Extracted from [4].

In the context of this dissertation, we adopt the Tonal Interval Space as a framework to extract relevant harmonic information from audio signals, due to its higher flexibility in processing audio signals (namely when compared to Chew's space), as well as the enhanced perceptual relevance of the space in comparison with Harte et al.'s Tonal Centroid Space [4].


Chapter 3

Tonal Interval Space

In this chapter, we detail the Tonal Interval Space [4]. Tonal pitch configurations, manifested as either symbolic or audio inputs, are mapped in the Tonal Interval Space as twelve-dimensional Tonal Interval Vectors (TIVs) that result from the discrete Fourier transform (DFT) of normalized chroma vectors.

The resulting space distorts the (logarithmic) frequency scale into a perceptually relevant representation from which tonal pitch attributes, such as consonance and perceptual relatedness, can be computed.

In this chapter we first describe the computation of chroma vectors from an input audio signal, then we explain the procedure to transform chroma vectors into the perceptually relevant representation of Tonal Interval Vectors. At the end of the chapter we explore some useful properties of the Tonal Interval Space.
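Before going into these steps in detail, the following sketch summarises the chroma-to-TIV mapping numerically. It is a simplified illustration only: the normalisation used here is a plain division by the vector's sum, and the weights applied to the DFT coefficients are placeholders set to 1, whereas the actual weights used in [4] are chosen to reflect empirical consonance ratings.

```python
import numpy as np

def tonal_interval_vector(chroma, weights=np.ones(6)):
    """Sketch of the chroma -> TIV mapping: normalise a 12-bin chroma vector,
    take its DFT over the 12 pitch classes and keep the (weighted)
    coefficients k = 1..6, i.e. 6 complex values = 12 real dimensions."""
    chroma = np.asarray(chroma, dtype=float)
    chroma = chroma / chroma.sum()        # normalised chroma vector (placeholder scheme)
    spectrum = np.fft.fft(chroma)         # DFT of the chroma vector
    return weights * spectrum[1:7]        # TIV: weighted coefficients k = 1..6

# Example: a C major triad (pitch classes C = 0, E = 4, G = 7).
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0
print(np.round(tonal_interval_vector(c_major), 3))
```

Distances between such vectors can then be compared, which is the basis for the perceptual relatedness and consonance measures explored at the end of this chapter.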

3.1 Chroma Vectors

The equally-tempered scale has been the most commonly used tuning system in Western music [50] since the 18th century [51]. This scale is obtained by dividing one octave into twelve equal parts. Musical notes can then be defined by their frequency, given by:

f[n] = f_{ref} \cdot 2^{n/12}    (3.1)

The reference frequency, f_{ref}, commonly adopts the tuning standard (i.e., A4 = 440 Hz), and n is the number of semitones separating the desired note, f[n], from the reference. For example, C5 is separated by 3 semitones from A4, the reference frequency.
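As a quick numerical check of Equation 3.1 (assuming the A4 = 440 Hz standard mentioned above):

```python
def note_frequency(n_semitones, f_ref=440.0):
    """Equation 3.1: frequency of the note n_semitones above (negative
    values give notes below) the reference frequency f_ref."""
    return f_ref * 2.0 ** (n_semitones / 12.0)

print(round(note_frequency(3), 2))    # C5, 3 semitones above A4 -> 523.25 Hz
print(round(note_frequency(-12), 2))  # A3, one octave below     -> 220.0 Hz
```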

Equation 3.1 clarifies one of the most distinguishable properties of this tuning system, i.e., all adjacent notes are separated by the same interval. Although musical notes correspond to frequency values, frequency has never been the standard for their representation. Different representations for musical notes have been used [52] (e.g., MIDI note numbers, chroma), and these can carry more or less information about the performance. Often used to characterize the pitch classes present in an audio signal [51], chroma vectors are a useful tool that enables an easy representation of the musical notes that are
