Name and position of the project proponent: 
sb_p_1709192
Year: 
2019
Abstract: 

Next-generation technologies, ranging from driverless cars to immersive virtual reality, are expected to understand and analyse the surrounding world through a range of high-resolution sensors. In particular, 3D audio sensors will endow them with a clear spatial sense of the environment, on a par with the human auditory system. Exploiting this information can provide agents and autonomous applications with the capability of localizing and conveying sounds more efficiently and with a higher level of perceptual awareness. At the same time, analysing raw 3D audio data in real time poses new, significant research and implementation challenges that prevent successful deployment. Algorithms should be able to understand the spatial distribution of audio sources in the sound field while, at the same time, allowing for efficient inference from the raw waveforms in a variety of applications.
The aim of the HYD3A project (pronounced as "idea") is to design a family of deep learning algorithms tailored to such 3D audio signals for deployment in immersive environments. To accomplish this goal, the algorithms will leverage a new generation of deep neural networks to model and learn signals in hypercomplex (e.g., quaternion) domains.
Prior research has shown that 3D audio can be naturally modelled in a hypercomplex representation. HYD3A will build upon these insights to design a set of deep networks for analysing 3D audio coming from a variety of microphone sensors. Hypercomplex deep networks have the potential to reduce significantly the network complexity with respect to state-of-the-art competitors (thus simplifying their implementation on-device), while allowing for a more accurate learning and optimization procedure.
HYD3A is expected to have a positive impact, both in research and industry, for a range of problems involving the analysis of 3D audio, including immersive sound localization, audio enhancement, acoustic scene recognition, and audio super-resolution.

ERC: 
PE6_11
PE7_7
PE6_7
Research group members: 
sb_cp_is_2162834
sb_cp_is_2217879
sb_cp_is_2180395
sb_cp_is_2161210
sb_cp_is_2220521
sb_cp_is_2161850
sb_cp_es_307065
sb_cp_es_307066
Innovativeness: 

The representation and analysis of 3D audio signals in the hypercomplex domain yields several novel insights with respect to the state-of-the-art literature.
The ability to treat the four microphone signals of an Ambisonics recording as a single quaternionic signal not only makes these signals easier to handle, but also brings advantages that derive directly from hypercomplex algebra. The spherical harmonics in the classical Euler representation [5] are transformed into quaternionic harmonics, which adopt the zyx Tait-Bryan angular reference system [5], also known as Cardano nautical angles and widely used in avionics and navigation applications. Once this transformation is performed, it is possible to exploit the advantages of quaternionic algebra in handling rotation operations, which can be carried out by simply multiplying the quaternion ambisonic signal by predefined matrices depending on the rotation direction. The rotation operation is fundamental for a recording technique such as Ambisonics, because it enables "phase steering", i.e., the electronic orientation of the array towards a direction of interest without physically moving the microphone. Hence, rotations with quaternions are convenient, computationally faster, and independent of the coordinate system.
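The rotation described above can be sketched for a single first-order Ambisonics (B-format) sample; this is a minimal illustration, not the project's implementation. The function names are assumptions, and a z-axis (yaw) rotation is chosen for brevity: the directional components (X, Y, Z) are rotated with the standard quaternion identity v' = q v q*, while the omnidirectional W component is rotation-invariant.

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) arrays."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ])

def rotate_bformat(sample, yaw):
    """Rotate one B-format sample (W, X, Y, Z) about the z axis by `yaw` radians.
    W is omnidirectional, hence invariant; (X, Y, Z) rotate as a vector via q v q*."""
    w, x, y, z = sample
    q = np.array([np.cos(yaw / 2), 0.0, 0.0, np.sin(yaw / 2)])  # unit rotation quaternion
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])               # conjugate of q
    v = np.array([0.0, x, y, z])                                  # pure quaternion from (X, Y, Z)
    _, xr, yr, zr = hamilton(hamilton(q, v), q_conj)
    return np.array([w, xr, yr, zr])
```

For instance, a quarter-turn yaw maps the X component onto Y, steering the captured sound field without touching the physical array.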
Another important property is "virtual miking", which makes it possible to maximize the information captured from a given direction or, conversely, to minimize it by pointing a "zero" of the radiation pattern towards an interfering sound source. This operation can be carried out directly in the quaternion domain through simple multiplications.
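Virtual miking from B-format components can be sketched as a weighted combination of the omnidirectional and directional channels; the function name, the polar-pattern parameter `p`, and the unnormalized channel convention assumed here are all illustrative (Ambisonics formats differ in normalization, e.g. FuMa scales W by 1/sqrt(2)).

```python
import numpy as np

def virtual_mic(w, x, y, z, azimuth, elevation, p=0.5):
    """First-order virtual microphone steered from B-format components.
    p sets the polar pattern: 1.0 omni, 0.5 cardioid, 0.0 figure-of-eight.
    Pointing the pattern's null at an interferer suppresses that source."""
    dx = np.cos(azimuth) * np.cos(elevation)   # look direction, x component
    dy = np.sin(azimuth) * np.cos(elevation)   # look direction, y component
    dz = np.sin(elevation)                      # look direction, z component
    return p * w + (1 - p) * (x * dx + y * dy + z * dz)
```

With a cardioid pattern (p = 0.5), a source encoded at azimuth 0 is captured at full gain when the virtual microphone looks at azimuth 0, and cancelled when it looks at azimuth pi, since the cardioid's null then faces the source.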
In addition to the advantages and novelty stemming from these hypercomplex algebraic properties, the HYD3A project can rely on the benefits of processing directly in the quaternion domain. One of the most critical problems in analysing multidimensional signals with deep neural networks is the vanishing gradient, which prevents the identification of intrinsic relations between two elements of a long input sequence placed at a certain distance from each other. Deep quaternion neural networks (DQNNs) address this problem by using the Hamilton product in the quaternion domain between input features and network weights. This makes it possible to exploit both internal and external dependencies among the features by handling them as grouped sets. This property is even more beneficial in the case of 3D audio, since ambisonic signals are highly correlated with each other, so DQNNs can extract much more information from them than real-valued deep neural networks. This advantage strongly encourages the use of DQNNs for the analysis of 3D audio signals.
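The Hamilton product between features and weights can be made concrete with a minimal numpy sketch of a quaternion fully connected layer; the function name, shapes, and omission of bias and nonlinearity are illustrative assumptions, not the project's actual architecture. Note how every output component mixes all four input components, which is how the layer captures cross-channel (e.g. inter-microphone) dependencies.

```python
import numpy as np

def quaternion_dense(inputs, Wr, Wi, Wj, Wk):
    """Quaternion fully connected layer via the Hamilton product.
    inputs: (n, 4) array, one quaternion (r, i, j, k) per input unit.
    Wr..Wk: (m, n) real matrices, the four components of the quaternion weights.
    Returns (m, 4): each output quaternion mixes all four input components."""
    r, i, j, k = inputs[:, 0], inputs[:, 1], inputs[:, 2], inputs[:, 3]
    out_r = Wr @ r - Wi @ i - Wj @ j - Wk @ k
    out_i = Wr @ i + Wi @ r + Wj @ k - Wk @ j
    out_j = Wr @ j - Wi @ k + Wj @ r + Wk @ i
    out_k = Wr @ k + Wi @ j - Wj @ i + Wk @ r
    return np.stack([out_r, out_i, out_j, out_k], axis=1)
```

With a purely real weight (Wr = 1, the other components zero), the layer reduces to the identity on a single input quaternion, which is a quick sanity check of the product's structure.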
Further advantages and novelties with respect to existing solutions concern the required computational resources. The efficiency of quaternion-domain processing makes it possible to reduce the computational complexity of neural networks: a DQNN can match the performance of its real-valued counterpart using a much smaller number of parameters. This brings numerous implementation advantages, extending the range of applications in which 3D audio analysis is feasible, including on embedded devices.
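The parameter saving can be illustrated with a back-of-the-envelope count (a sketch under the assumption of plain fully connected layers without biases, not a measurement from the project): a quaternion layer groups its units in fours, and the Hamilton product shares four smaller real matrices across the four components, giving roughly a 4x reduction.

```python
def dense_params(n_in, n_out):
    """Weight count of a real fully connected layer (bias omitted)."""
    return n_in * n_out

def quaternion_dense_params(n_in, n_out):
    """Weight count of a quaternion layer covering the same number of real units:
    units are grouped in fours, and the Hamilton product reuses four
    (n_in/4 x n_out/4) real matrices across all four quaternion components."""
    return 4 * (n_in // 4) * (n_out // 4)

# Example, a 1024 -> 1024 layer:
#   real:       1024 * 1024  = 1_048_576 weights
#   quaternion: 4 * 256 * 256 =  262_144 weights  (4x fewer)
```

This factor-of-four saving per layer compounds across a deep network, which is what makes on-device deployment plausible.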
Apart from the modelling aspects described above, the techniques developed in the HYD3A project have a large potential thanks to the range of possible applications in the context of 3D audio processing, as depicted in the last part of Fig. 1.
For example, audio super-resolution refers to the reconstruction of high-resolution audio content from a low-resolution sampling. In the single-microphone case, [30] showed that deep networks can achieve results significantly superior to baseline methods. The algorithms developed within the HYD3A project would make it possible to extend these results to the 3D audio case, with important applications in improving the user's experience in immersive environments (by enlarging the bandwidth of a signal far beyond what is feasible today), or in decreasing the time needed to communicate audio signals in a distributed IoT environment. As argued previously, developing these solutions in a hypercomplex domain also reduces the computational budget needed by the neural networks, which is an important factor when deploying on embedded devices. For example, the solution in [30], despite having a single monaural input, can run in real time only on a sophisticated, GPU-enabled platform, and requires more than two days of training.

Call code: 
1709192

© Università degli Studi di Roma "La Sapienza" - Piazzale Aldo Moro 5, 00185 Roma