
CONCLUSION AND FUTURE WORK


In summary, we examined the use of a parametrized neural network front-end to learn the spectro-temporal modulations optimal for different behavioral tasks. This front-end, the Learnable STRFs, yielded performance close to the published state of the art obtained with engineering-oriented neural networks for Speaker Verification, Urban Sound Classification, and Zebra Finch Call Type Classification, and obtained the best results on two datasets for Speech Activity Detection. As our front-end is fully interpretable, we found markedly different spectro-temporal modulations as a function of the task, showing that each task relies on a specific set of modulations. These task-specific modulations were globally congruent with previous work based on three approaches: spectro-temporal analysis of different audio signals (Elliott and Theunissen, 2009), analysis of trained neural networks (Schädler et al., 2012), and analysis of the auditory cortex (Hullett et al., 2016; Santoro et al., 2017). In particular, for the Speech Activity Detection task, we observed the same modulation distributions as those found directly in the human auditory cortex while listening to naturalistic speech (Hullett et al., 2016). The modulations also displayed generic characteristics across tasks, namely a predominance of low-frequency spectral and temporal modulations and a high degree of 'starriness' and 'separability', reflecting the fact that filters tend to remain close to either the temporal or the spectral axis, with little occupation of the joint spectro-temporal region. This is consistent with Singh and Theunissen (2003). To encourage reproducible research, the developed Learnable STRFs layer, the learned STRFs modulations, and the recipes to replicate the results are available in an open-source package.1
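To make the parameterization concrete, the following Python sketch illustrates the kind of two-dimensional Gabor kernel that a Learnable STRF can implement, together with a simple SVD-based separability measure. The function names, the kernel grid, and the exact parameterization are illustrative assumptions and not the released implementation.

# Illustrative sketch (not the released implementation): a spectro-temporal
# receptive field modeled as a 2D Gabor kernel over a time-frequency patch,
# parameterized by modulation frequencies (omega_t, omega_f) and Gaussian
# envelope widths (sigma_t, sigma_f). All names here are assumptions.
import numpy as np

def gabor_strf(omega_t, omega_f, sigma_t, sigma_f, n_t=32, n_f=32):
    t = np.arange(n_t) - n_t // 2
    f = np.arange(n_f) - n_f // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((T / sigma_t) ** 2 + (F / sigma_f) ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * (omega_t * T + omega_f * F))
    return envelope * carrier  # complex-valued kernel

def separability(kernel):
    # Ratio of the first singular value to the sum of all singular values:
    # 1.0 means the kernel factorizes exactly into a temporal profile times
    # a spectral profile.
    s = np.linalg.svd(np.real(kernel), compute_uv=False)
    return s[0] / s.sum()

# A purely temporal modulation (omega_f = 0) is fully separable.
print(separability(gabor_strf(omega_t=0.05, omega_f=0.0, sigma_t=4.0, sigma_f=6.0)))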

Several avenues of extension are possible for this work, based on what is known in auditory neuroscience. First, this work only modelled the final outcome of plasticity after each task has been fully learned, starting from a random initialization. Yet, the same model could be used to address a range of issues relevant to changes occurring during task learning (top-down plasticity) or due to modifications of the distribution of the audio input (bottom-up plasticity). Recent work (Bellur and Elhilali, 2015) has investigated the adaptation of modulations in analytical models and reported several improvements in terms of engineering performance, suggesting that this is also an interesting avenue in terms of behavioral modeling. Second, the analyses from Hullett et al. (2016) showed that neurons not only have spectro-temporal selectivity but are also topographically distributed along the posterior-to-anterior axis in the superior temporal gyrus. In future work, it would be interesting to reproduce such topography by using an auxiliary self-organizing map objective in addition to the task-specific loss function for the STRFs. Finally, despite their wide use in auditory neuroscience, spectro-temporal modulations do not provide a complete picture of computations in the auditory cortex (Williamson et al., 2016). A potential extension of our work would be to add an extra layer able to express co-occurrences and anti-occurrences of pairs of spectro-temporal receptive fields, as in Młynarski and McDermott (2018, 2019). Such an extra layer would provide a learnable and interpretable extension to spectro-temporal representations.
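As a rough illustration of the self-organizing-map idea mentioned above, the hypothetical sketch below adds a topographic penalty to the task loss, encouraging filters that are neighbours on a one-dimensional grid (standing in for the posterior-to-anterior axis) to share similar modulation parameters. The grid, the weighting, and all names are assumptions, not part of the present work.

# Hypothetical auxiliary objective: a SOM-like topographic penalty added to the
# task loss. Filters are assigned fixed positions on a 1D grid, and neighbouring
# filters are penalized for having dissimilar STRF parameters.
import numpy as np

def topographic_penalty(strf_params, neighborhood_sigma=2.0):
    # strf_params: (n_filters, n_params) array of learned STRF parameters,
    # ordered by their position on the grid.
    n = strf_params.shape[0]
    idx = np.arange(n)
    # Gaussian neighbourhood weights on the grid, as in a self-organizing map.
    weights = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / neighborhood_sigma) ** 2)
    # Pairwise squared distances between filter parameter vectors.
    dists = np.sum((strf_params[:, None, :] - strf_params[None, :, :]) ** 2, axis=-1)
    return float(np.sum(weights * dists) / (n * n))

def total_loss(task_loss, strf_params, lam=0.1):
    # Combined objective: task-specific loss plus the topographic term.
    return task_loss + lam * topographic_penalty(strf_params)

print(total_loss(task_loss=0.42, strf_params=np.random.randn(64, 4)))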

To conclude, we emphasize that neuroscience-inspired parametrized neural networks can provide models that are both efficient in terms of behavioral tasks and interpretable in terms of auditory signal processing.

ACKNOWLEDGMENTS

This work is funded in part by the Agence Nationale pour la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute). ACBL was funded through Neuratris, and ED in his EHESS role by Facebook AI Research (Research Gift) and CIFAR (Learning in Minds and Brains). The university (EHESS, CNRS, INRIA, ENS Paris-Sciences Lettres) obtained the datasets reported in this work.

APPENDIX A:

1. Importance of the representation of the Learnable STRFs


FIG. 6. Gaussian envelopes (σt, σf) of the Learned STRFs for Speech Activity Detection on the AMI dataset (Speech AMI) and on the CHIME5 dataset (Speech CHIME5), Speaker Verification on VoxCeleb (Speaker), Urban Sound Classification on Urban8k (Urban), and Zebra Finch Call Type Classification (Bird). Only a subset of the Learned STRFs is displayed for the Bird and Urban tasks for clarity. We also plotted the bivariate distributions using kernel density estimation for each task.
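For readers who wish to reproduce a figure like FIG. 6 from their own learned parameters, a minimal plotting sketch is given below; the placeholder array sigmas stands in for the learned (σt, σf) pairs of one task and is an assumption.

# Minimal sketch of a bivariate kernel density estimate over learned Gaussian
# envelope widths (sigma_t, sigma_f), as in FIG. 6. The data are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sigmas = rng.lognormal(mean=1.5, sigma=0.5, size=(2, 128))  # rows: sigma_t, sigma_f

kde = gaussian_kde(sigmas)
t_grid, f_grid = np.mgrid[0:30:100j, 0:30:100j]
density = kde(np.vstack([t_grid.ravel(), f_grid.ravel()])).reshape(t_grid.shape)

plt.contourf(t_grid, f_grid, density, levels=20)
plt.scatter(sigmas[0], sigmas[1], s=8, c="white")
plt.xlabel("sigma_t")
plt.ylabel("sigma_f")
plt.title("Gaussian envelopes of the Learned STRFs (one task)")
plt.show()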

We performed an additional analysis on Speech Activity Detection of the choice of the representation Z from the Learnable STRFs used in the subsequent neural network. The performances of the real part, the imaginary part, and the absolute value of the filter output are compared. The results are presented in Table V. In comparison with the concatenation of the real and imaginary parts, the performances obtained for each individual part were in the same range on the AMI dataset, but were below on the CHIME5 dataset. As in Schädler et al. (2012), this indicates that the phase information jointly contained in the real and imaginary parts is important for the Learnable STRFs.

TABLE V. Speech Activity Detection results for the different uses of Z from the Learnable STRFs. CL stands for contraction layer, a convolution layer reducing the size of the tensor dimension after the convolution (Learnable STRFs). Each input front-end is then fed to a 2-layer BiLSTM and 2 feed-forward layers. The best scores for each metric overall are in bold. MD stands for missed detection rate, FA for false alarm rate, and DetER for detection error rate. For all metrics, lower is better.

Database                                   AMI                     CHIME5
Metric                                     DetER   MD    FA        DetER   MD    FA
Input front-end (Learnable STRFs + CL)
Real part ℜ(Z)                             5.9     2.4   3.5       20.1    2.6   17.5
Imaginary part ℑ(Z)                        5.9     2.2   3.7       22.1    1.0   21.1
Magnitude |Z|                              5.9     2.2   3.7       19.8    3.1   16.7
Concatenation [ℜ(Z), ℑ(Z)]                 5.8     2.4   3.4       19.2    3.1   16.1
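The following sketch shows, under assumed array shapes and names, how the four inputs compared in Table V could be derived from a complex-valued Learnable STRFs output Z; it is an illustration of the representation choices, not the released code.

# Illustrative computation of the Table V representation choices from a
# complex-valued Learnable STRFs output Z of shape (time, frequency, filters).
import numpy as np

def build_representations(Z):
    return {
        "real": np.real(Z),        # Re(Z)
        "imag": np.imag(Z),        # Im(Z)
        "magnitude": np.abs(Z),    # |Z|
        # Concatenation keeps both quadratures, hence the phase information.
        "concat": np.concatenate([np.real(Z), np.imag(Z)], axis=-1),
    }

Z = np.random.randn(200, 64, 32) + 1j * np.random.randn(200, 64, 32)
reps = build_representations(Z)
print({name: rep.shape for name, rep in reps.items()})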

1 https://github.com/bootphon/learnable-strf

Alekseev, A., and Bobe, A. (2019). “Gabornet: Gabor filters with learnable parameters in deep convolutional neural network,” pp. 1–4, doi: 10.1109/EnT47717.2019.9030571.

Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G. et al. (2016). “Deep speech 2: End-to-end speech recognition in english and mandarin,” in ICML, pp. 173–182.

Arnault, A., Hanssens, B., and Riche, N. (2020). “Urban sound classification: striving towards a fair comparison,” arXiv preprint arXiv:2010.11805 .

Barker, J., Watanabe, S., Vincent, E., and Trmal, J. (2018). “The fifth ‘chime’ speech separation and recognition challenge: Dataset, task and baselines,” in Proceedings of the 19th Annual Conference of the International Speech Communication Association (INTERSPEECH 2018), Hyderabad, India.

Bellur, A., and Elhilali, M. (2015). “Detection of speech tokens in noise using adaptive spectrotemporal receptive fields,” in 2015 49th Annual Conference on Information Sciences and Systems (CISS), IEEE, pp. 1–6.

Bredin, H. (2017). “pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, http://pyannote.github.io/pyannote-metrics.

Bredin, H., Yin, R., Coria, J. M., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., and Gill, M.-P. (2020). “pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 7124–7128.

Chang, S.-Y., and Morgan, N. (2014). “Robust cnn-based speech recognition with gabor filter kernels,” in Fifteenth annual conference of the international speech communication association.

Cheuk, K. W., Anderson, H., Agres, K., and Herremans, D. (2020). “nnaudio: An on-the-fly gpu audio to spectrogram conversion toolbox using 1d convolutional neural networks,” IEEE Access, in press.

Chi, T., Ru, P., and Shamma, S. A. (2005). “Multiresolution spectrotemporal analysis of complex sounds,” The Journal of the Acoustical Society of America 118(2), 887–906.

Chung, J. S., Nagrani, A., and Zisserman, A. (2018). “Voxceleb2: Deep speaker recognition,” Proc. Interspeech 2018, 1086–1090.

Coria, J. M., Bredin, H., Ghannay, S., and Rosset, S. (2020). “A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification,” in Statistical Language and Speech Processing, edited by L. Espinosa-Anke, C. Martín-Vide, and I. Spasić, Springer International Publishing, Cham, pp. 137–148.

Cuturi, M. (2013). “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems 26, 2292–2300.

Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2010). “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing 19(4), 788–798.

Depireux, D. A., Simon, J. Z., Klein, D. J., and Shamma, S. A. (2001). “Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex,” Journal of neurophysiology 85(3), 1220–1234.

Edraki, A., Chan, W.-Y., Jensen, J., and Fogerty, D. (2019). “Improvement and assessment of spectro-temporal modulation analysis for speech intelligibility estimation,” in Interspeech 2019, Annual Conference of the International Speech Communication Association, ISCA, pp. 1378–1382.

Efron, B., and Tibshirani, R. J. (1994). An introduction to the bootstrap (CRC press).

Elhilali, M., Chi, T., and Shamma, S. A. (2003). “A spectro-temporal modulation index (stmi) for assessment of speech intelligibility,” Speech communication 41(2-3), 331–348.

Elie, J. E., and Theunissen, F. E. (2016). “The vocal repertoire of the domesticated zebra finch: a data-driven approach to decipher the information-bearing acoustic features of communication signals,” Animal cognition 19(2), 285–315.

Elliott, T. M., and Theunissen, F. E. (2009). “The modulation transfer function for speech intelligibility,” PLoS computational biology 5(3).

Espi, M., Fujimoto, M., Kinoshita, K., and Nakatani, T. (2015). “Exploiting spectro-temporal locality in deep learning based acoustic event detection,” EURASIP Journal on Audio, Speech, and Music Processing 2015(1), 1–12.

Ezzat, T., Bouvrie, J., and Poggio, T. (2007). “Spectro-temporal analysis of speech using 2-d gabor filters,” in Eighth Annual Conference of the International Speech Communication Association.

Flamary, R., and Courty, N. (2017). “Pot python optimal transport library,” https://pythonot.github.io/.

Francis, N. A., Elgueda, D., Englitz, B., Fritz, J. B., and Shamma, S. A. (2018). “Laminar profile of task-related plasticity in ferret primary auditory cortex,” Scientific reports 8(1), 1–10.

Fritz, J., Shamma, S., Elhilali, M., and Klein, D. (2003). “Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex,” Nature neuroscience 6(11), 1216–1223.

Gabor, D. (1946). “Theory of communication. part 1: The analysis of information,” Journal of the Institution of Electrical Engineers-Part III: Radio and Communication Engineering 93(26), 429–441.

Hullett, P. W., Hamilton, L. S., Mesgarani, N., Schreiner, C. E., and Chang, E. F. (2016). “Human superior temporal gyrus organization of spectrotemporal modulation tuning derived from speech stimuli,” Journal of Neuroscience 36(6), 2014–2026.

Imperl, B., Kačič, Z., and Horvat, B. (1997). “A study of harmonic features for the speaker recognition,” Speech communication 22(4), 385–402.

Jääskeläinen, I. P., Ahveninen, J., Belliveau, J. W., Raij, T., and Sams, M. (2007). “Short-term plasticity in auditory cognition,” Trends in neurosciences 30(12), 653–661.

Kell, A. J., and McDermott, J. H. (2019). “Deep neural network models of sensory systems: windows onto the role of task constraints,” Current opinion in neurobiology 55, 121–132.

Kell, A. J., Yamins, D. L., Shook, E. N., Norman-Haignere, S. V., and McDermott, J. H. (2018). “A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy,” Neuron 98(3), 630–644.

Kingma, D. P., and Ba, J. (2014). “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 .

Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., and Plumbley, M. D. (2020). “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894.

Koumura, T., Terashima, H., and Furukawa, S. (2019). “Cascaded tuning to amplitude modulation for natural sound recognition,” Journal of Neuroscience 39(28), 5517–5533.

Lei, H., Meyer, B. T., and Mirghafori, N. (2012). “Spectro-temporal gabor features for speaker recognition,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4241–4244.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. (2019). “On the variance of the adaptive learning rate and beyond,” in International Conference on Learning Representations.

Massoudi, R., Van Wanrooij, M. M., Versnel, H., and Van Opstal, A. J. (2015). “Spectrotemporal response properties of core auditory cortex neurons in awake monkey,” PLoS One 10(2), e0116118.

McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V. et al. (2005). “The ami meeting corpus,” in Proceedings of Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research, Noldus Information Technology, pp. 137–140.

McCracken, K. G., and Sheldon, F. H. (1997). “Avian vocalizations and phylogenetic signal,” Proceedings of the National Academy of Sciences 94(8), 3833–3836.

McDermott, J. H. (2018). “Audition,” Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience 2, 1–57.

Mesgarani, N., Slaney, M., and Shamma, S. A. (2006). “Discrimination of speech from non-speech based on multiscale spectro-temporal modulations,” IEEE Transactions on Audio, Speech, and Language Processing 14(3), 920–930.

Meyer, A. F., Williamson, R. S., Linden, J. F., and Sahani, M. (2017). “Models of neuronal stimulus-response functions: elaboration, estimation, and evaluation,” Frontiers in systems neuroscience 10, 109.

Młynarski, W., and McDermott, J. H. (2018). “Learning midlevel auditory codes from natural sound statistics,” Neural computation 30(3), 631–669.

Młynarski, W., and McDermott, J. H. (2019). “Ecological origins of perceptual grouping principles in the auditory system,” Proceedings of the National Academy of Sciences 116(50), 25355–25364.

Nagrani, A., Chung, J. S., and Zisserman, A. (2017). “Voxceleb: a large-scale speaker identification dataset,” Telephony 3, 33–039.

Ondel, L., Li, R., Sell, G., and Hermansky, H. (2019). “Deriving spectro-temporal properties of hearing from speech data,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 411–415.

Peyré, G., Cuturi, M. et al. (2019). “Computational optimal transport: With applications to data science,” Foundations and Trends® in Machine Learning 11(5-6), 355–607.

Pillow, J., and Sahani, M. (2019). “Editorial overview: Machine learning, big data, and neuroscience” .

Ravanelli, M., and Bengio, Y. (2018). “Speaker recognition from raw waveform with sincnet,” in 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, pp. 1021–1028.

Saddler, M. R., Gonzalez, R., and McDermott, J. H. (2020). “Deep neural network models reveal interplay of peripheral coding and stimulus statistics in pitch perception,” bioRxiv .

Salamon, J., and Bello, J. P. (2017). “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters 24(3), 279–283.

Salamon, J., Jacoby, C., and Bello, J. P. (2014). “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia, pp. 1041–1044.

Santoro, R., Moerel, M., De Martino, F., Valente, G., Ugurbil, K., Yacoub, E., and Formisano, E. (2017). “Reconstructing the spectrotemporal modulations of real-life sounds from fmri response patterns,” Proceedings of the National Academy of Sciences 114(18), 4799–4804.

Schädler, M. R., Meyer, B. T., and Kollmeier, B. (2012). “Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition,” The Journal of the Acoustical Society of America 131(5), 4134–4151.

Shamma, S. A. (1996). “Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method,” Network: Computation in Neural Systems 7(3), 439–476.

Singh, N. C., and Theunissen, F. E. (2003). “Modulation spectra of natural sounds and ethological theories of auditory processing,” The Journal of the Acoustical Society of America 114(6), 3394–3411.

Snyder, D., Chen, G., and Povey, D. (2015). “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484.

Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 5329–5333.

Stevens, S. S., Volkmann, J., and Newman, E. B. (1937). “A scale for the measurement of the psychological magnitude pitch,” The Journal of the Acoustical Society of America 8(3), 185–190.

Tanno, R., Arulkumaran, K., Alexander, D., Criminisi, A., and Nori, A. (2019). “Adaptive neural trees,” in International Conference on Machine Learning, PMLR, pp. 6166–6175.

Thoret, E., Andrillon, T., Léger, D., and Pressnitzer, D. (2020). “Probing machine-learning classifiers using noise, bubbles, and reverse correlation,” BioRxiv.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022 .

Vuong, T., Xia, Y., and Stern, R. M. (2020). “Learnable Spectro-Temporal Receptive Fields for Robust Voice Type Discrimination,” in Proc. Interspeech 2020, pp. 1957–1961, doi: 10.21437/Interspeech.2020-1878.

Williamson, R. S., Ahrens, M. B., Linden, J. F., and Sahani, M. (2016). “Input-specific gain modulation by local sensory context shapes cortical and thalamic responses to complex sounds,” Neuron 91(2), 467–481.

Woolley, S. M., Fremouw, T. E., Hsu, A., and Theunissen, F. E. (2005). “Tuning for spectro-temporal modulations as a mechanism for auditory discrimination of natural sounds,” Nature neuroscience 8, 1371–1379.

