Bioacoustics & Deep Learning
Deep Learning Methods
Recent developments in deep learning approaches to the detection and classification of species' signals have proven successful at accurately detecting calls with trained models under varying noise conditions. Deep learning methods of detection and classification are increasing in utility within the underwater acoustic community (Allen et al., 2021; Shiu et al., 2020; Vickers et al., 2021). Deep neural networks automate spectrogram image classification and perform well when an appropriately trained model is used. By providing rapid, automated detection independent of operator experience, deep learning methods have the potential to minimize the drudgery of processing passive acoustic data, leading to more timely use of this information in research studies and biodiversity monitoring projects.
A variety of signal processing, pattern recognition, and machine learning techniques have improved the automated detection and classification of marine mammal sounds, though they vary with respect to performance (Usman et al., 2020). Deep learning models are a form of machine learning that applies filter banks at different scales and determines the features used to discriminate signals during a learning stage (Bianco et al., 2019). The models are thus not reliant on meeting criteria for a series of target values; instead, they independently determine the important features using one of several neural network architectures. Several studies report high precision and recall, and improved performance when testing involved multiple datasets collected under variable acoustic conditions (Kirsebom et al., 2020; Shiu et al., 2020). In addition to improved detection, deep neural networks offer capabilities in classifying marine mammal vocalizations, allowing their use in multi-species analyses (Thomas et al., 2019).
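To make the learning-stage idea concrete, the following minimal sketch (MATLAB Deep Learning Toolbox) defines a small convolutional network whose stacked convolutional layers act as learned filter banks at successively coarser scales. The input size, filter counts, and two-class output are illustrative assumptions, not taken from any of the cited studies.

```matlab
% A small CNN for spectrogram classification; each convolution layer is a
% learned filter bank, and pooling lets later layers see coarser scales.
layers = [
    imageInputLayer([224 224 1])                  % single-channel spectrogram image (assumed size)
    convolution2dLayer(3, 16, 'Padding', 'same')  % fine-scale filter bank
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)             % downsample: next filters see a coarser scale
    convolution2dLayer(3, 32, 'Padding', 'same')  % mid-scale filter bank
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 64, 'Padding', 'same')  % coarse-scale filter bank
    reluLayer
    fullyConnectedLayer(2)                        % e.g., "call" vs. "noise" (hypothetical classes)
    softmaxLayer
    classificationLayer];                         % discriminative features are learned during training
```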
Image-based Object Detection
Image-based object detection methods in bioacoustics use spectrograms generated from audio recordings as input to deep learning networks that detect specific calls or signals of interest, such as dolphin whistles. In this process, spectrograms undergo normalization and contrast enhancement to help the model distinguish calls from background noise. As outlined in Fig. 1, annotations from the audio files are converted into image datastores, which are then used to train object detection models in MATLAB. The duration of each spectrogram image is user-defined, with the segment length chosen to suit the duration and properties of the target calls. Each spectrogram image is linked to coordinates that identify the location of calls within the image.
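As a hedged sketch of this step, the code below computes a spectrogram image from one audio segment and pairs it with call coordinates through MATLAB's datastore interface (Signal Processing and Computer Vision Toolboxes). The file names, FFT settings, and annotation values are hypothetical placeholders.

```matlab
% Build one spectrogram image and link it to call coordinates.
[x, fs] = audioread('example_segment.wav');   % hypothetical user-defined segment
x = x(:, 1);                                  % use the first channel
nfft = 1024; win = hann(nfft); noverlap = round(0.9 * nfft);
[~, ~, ~, psd] = spectrogram(x, win, noverlap, nfft, fs);
img = mat2gray(10 * log10(psd + eps));        % normalize dB power to [0, 1]
imwrite(flipud(img), 'spec_0001.png');        % low frequencies at the bottom of the image

% Each row pairs an image with its [x y width height] call boxes in pixels.
gtruth = table({[120 40 85 30]}, 'VariableNames', {'whistle'});  % hypothetical annotation
imds = imageDatastore('spec_0001.png');
blds = boxLabelDatastore(gtruth);
trainingData = combine(imds, blds);           % input for a MATLAB object detection model
```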
To help the model learn not only calls but also noise, spectrograms from audio segments without calls are included. In these cases, bounding-box coordinates mimicking call-like features are placed at random positions within the images, allowing the model to learn the characteristics of “Noise” signals. This enables the network to better differentiate calls from non-call elements in the spectrograms.
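A minimal sketch of this noise-labeling step, assuming the image size and call-like box dimensions shown below, might look like the following:

```matlab
% Assign a randomly placed, call-sized bounding box to a call-free
% spectrogram so the detector can learn a "Noise" class.
imgSize = [416 416]; boxW = 80; boxH = 30;    % hypothetical image and box dimensions
x0 = randi(imgSize(2) - boxW);                % random top-left corner, kept inside the image
y0 = randi(imgSize(1) - boxH);
noiseBox = [x0 y0 boxW boxH];                 % [x y width height] in pixels
gtruthNoise = table({noiseBox}, 'VariableNames', {'Noise'});
bldsNoise = boxLabelDatastore(gtruthNoise);   % combined with noise-only images as before
```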
Further refinement of the spectrograms is achieved through contrast-limited adaptive histogram equalization (CLAHE), which enhances contrast and normalizes the power spectral densities. Additional variability is introduced by augmenting the spectrograms, adjusting parameters such as the fast Fourier transform (FFT) size and window overlap. The resulting images are then resized to meet the input requirements of the object detection networks, subject to available computational resources (Table II). This approach ensures the networks are trained on a diverse range of signals, improving their ability to detect calls across different acoustic environments.
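The contrast-enhancement and resizing steps can be sketched as follows, assuming the MATLAB Image Processing Toolbox; the clip limit and the 416 x 416 target size are illustrative assumptions, since the actual input size depends on the chosen network (Table II).

```matlab
% CLAHE contrast enhancement followed by resizing to the detector's input size.
img = imread('spec_0001.png');                % spectrogram image from the earlier sketch
imgEq = adapthisteq(img, 'ClipLimit', 0.02);  % contrast-limited adaptive histogram equalization
imgIn = imresize(imgEq, [416 416]);           % match the object detection network's input layer
imwrite(imgIn, 'spec_0001_eq.png');
```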
YOLO Method of Object Detection
YOLO (You Only Look Once) is a real-time object detection algorithm that has gained significant attention for its speed and accuracy in identifying objects within images. Unlike traditional methods that use sliding windows or region proposals to detect objects, YOLO treats object detection as a single regression problem, predicting both the class and the bounding box coordinates in a single pass through the network. This streamlined approach allows YOLO to process images much faster, making it well-suited for tasks requiring real-time detection. The algorithm divides the input image into a grid and assigns each grid cell the responsibility of detecting an object if its center falls within that cell. By generating predictions for multiple bounding boxes and their associated confidence scores in one forward pass, YOLO can efficiently detect objects at various scales and positions within the image.
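The grid-cell responsibility rule can be illustrated with a few lines of arithmetic; the grid size and bounding box below are toy values, not outputs of a trained network.

```matlab
% Which grid cell is responsible for a given object? The one containing
% the center of its bounding box.
S = 7; imgSize = [416 416];                      % S-by-S grid over the image (assumed values)
box = [120 40 85 30];                            % hypothetical [x y width height] in pixels
cx = box(1) + box(3) / 2;                        % box center, x
cy = box(2) + box(4) / 2;                        % box center, y
col = min(S, 1 + floor(cx / (imgSize(2) / S)));  % responsible grid column
row = min(S, 1 + floor(cy / (imgSize(1) / S)));  % responsible grid row
fprintf('Grid cell (%d, %d) predicts this box.\n', row, col);
```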
In the context of bioacoustics, YOLO can be adapted to detect specific sounds within spectrogram images. These spectrograms serve as visual representations of sound, and YOLO is trained to recognize the distinct patterns that correspond to specific calls. The model’s ability to quickly process entire images and localize calls within them makes it an effective tool for detecting signals of interest in large datasets. YOLO’s grid-based prediction system allows it to handle overlapping calls and background noise, which is critical in bioacoustic environments where signals can be complex and varied. By training the network on annotated spectrograms, YOLO can learn to differentiate between multiple classes of calls and noise in a single pass, allowing for rapid interpretation of large acoustic datasets.
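As a hedged end-to-end sketch, the code below configures, trains, and applies a YOLO v4 detector using MATLAB's Computer Vision Toolbox API (yolov4ObjectDetector and trainYOLOv4ObjectDetector, available from R2022a). The class names, anchor boxes, and training options are illustrative assumptions, and trainingData refers to the combined image/box-label datastore from the earlier sketch.

```matlab
% Configure a two-head tiny YOLO v4 detector for call and noise classes.
classes = ["whistle" "Noise"];                   % hypothetical class names
anchors = {[30 85; 25 60]; [40 110; 35 90]};     % assumed [height width] anchors, one cell per head
detector = yolov4ObjectDetector("tiny-yolov4-coco", classes, anchors, ...
    'InputSize', [416 416 3]);

% Replicate grayscale spectrograms to three channels to match the network input.
trainingDataRGB = transform(trainingData, @(d) [{repmat(d{1}, 1, 1, 3)}, d(2:3)]);
opts = trainingOptions('adam', 'MaxEpochs', 30, 'MiniBatchSize', 8);
detector = trainYOLOv4ObjectDetector(trainingDataRGB, detector, opts);

% A single forward pass returns boxes, confidence scores, and class labels.
I = repmat(imread('spec_0002_eq.png'), 1, 1, 3); % hypothetical new spectrogram
[bboxes, scores, labels] = detect(detector, I, 'Threshold', 0.5);
```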
References
Allen, A. N., Harvey, M., Harrell, L., Jansen, A., Merkens, K. P., Wall, C. C., … & Oleson, E. M. (2021). A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Frontiers in Marine Science, 8, 165.
Bianco, M. J., Gerstoft, P., Traer, J., Ozanich, E., Roch, M. A., Gannot, S., & Deledalle, C. A. (2019). Machine learning in acoustics: Theory and applications. The Journal of the Acoustical Society of America, 146(5), 3590-3628.
Coffey, K. R., Marx, R. E., & Neumaier, J. F. (2019). DeepSqueak: a deep learning-based system for detection and analysis of ultrasonic vocalizations. Neuropsychopharmacology, 44(5), 859-868.
Kirsebom, O. S., Frazao, F., Simard, Y., Roy, N., Matwin, S., & Giard, S. (2020). Performance of a deep neural network at detecting North Atlantic right whale upcalls. The Journal of the Acoustical Society of America, 147(4), 2636-2646.
Shiu, Y., Palmer, K. J., Roch, M. A., Fleishman, E., Liu, X., Nosal, E. M., … & Klinck, H. (2020). Deep neural networks for automated detection of marine mammal species. Scientific Reports, 10(1), 1-12.
Thomas, M., Martin, B., Kowarski, K., Gaudet, B., & Matwin, S. (2019). Marine mammal species classification using convolutional neural networks and a novel acoustic representation. arXiv preprint arXiv:1907.13188.
Usman, A. M., Ogundile, O. O., & Versfeld, D. J. (2020). Review of automatic detection and classification techniques for cetacean vocalization. IEEE Access, 8, 105181-105206.
Vickers, W., Milner, B., Risch, D., & Lee, R. (2021). Robust North Atlantic right whale detection using deep learning models for denoising. The Journal of the Acoustical Society of America, 149(6), 3797-3812.