Last summer, I had the opportunity to participate in the summer school and the JSALT workshop organized by Johns Hopkins University. The sixth edition of this event, named after Frederick Jelinek, was held in Canada for the first time, at ÉTS (École de technologie supérieure)!
As explained by the director of the software engineering and information technology department at ÉTS, Patrick Cardinal, “the summer school aims to advance scientific knowledge in the field of speech technologies, but it also serves to stimulate student interest in research.” It was with this objective that I participated in the event. I wanted to find out what research is and if it is something I would like to pursue.
In other words, after many years of studying in tech and with graduation approaching, I wanted to find out whether I would like to do a Ph.D…
The summer school – June 10 to 21
Picture by Jan Trmal
The first two weeks of the event were mainly for students, whether we were undergrads, master's students, or PhD candidates. Every day, a different subject was taught, and we had the opportunity to learn about topics outside our “expertise”. For example, I learned about languages and the construction of words, while linguists in the room had the chance to learn more about machine learning.
Here is a list of some topics that were covered during JSALT summer school:
- Automatic Speech Recognition (ASR) with Kaldi
- NLP
- Computer Vision
- Information Retrieval
- Neural Networks, Convolutional Neural Networks, Recurrent NN, Attention Mechanism, and Transformer
- Machine Translation
- Advanced Machine Learning
- Using Cooperative Ad-hoc Microphone Arrays for ASR
- Speaker Detection in Adverse Scenarios with a Single Microphone
- Neural Polysynthetic Language Modeling: Leveraging Related Low-Resource Languages and Rule-Based Resources
The slides of the presentations are available to the public on the website of the summer school.
Picture by Jan Trmal
In the morning, we had a classic lecture taught by a professor. In the afternoon, we did hands-on labs to put into practice the theory we had learned that morning.
During these first two weeks, we took the time to get to know one another. We went to the pub at ÉTS, climbed Mont-Royal, went for walks along the Lachine canal, and went to Peel Street to watch the Toronto Raptors win the NBA championship!
The JSALT workshop – June 24 to August 2
The next 6 weeks were spent working as a team on a particular subject. There were five teams in the workshop in 2019:
- Speaker Detection in Adverse Scenarios with a Single Microphone
- Distant Supervision for Representation Learning
- Improving Translation of Informal Language
- Using Cooperative Ad-hoc Microphone Arrays for ASR
- Neural Polysynthetic Language Modeling
My team and I were working on Speaker Detection in Adverse Scenarios with a Single Microphone. The objective of our research was to improve speaker recognition (detecting who is talking) and speaker diarization (determining who speaks when).
The work was done on datasets (BabyTrain, AMI, CHiME5, SRI) on which it is currently hard to obtain good speaker recognition and speaker diarization results. The audio was recorded with speakers far from the microphone and in uncontrolled environments. For example, in CHiME5, some recordings were made while a family was eating at the dinner table, so there are kitchen noises, maybe a television turned on in another room, and so on, all of which make the task difficult.
For the BabyTrain dataset, the audio was recorded over the course of an entire day while the child wore a microphone. This makes the data very difficult to analyze because there can be many scenarios: the child might be at the park with lots of background noise, or it might be nap time and the parents are whispering. These different contexts make it hard for an algorithm to generalize and perform well when there is so much variation in the recordings.
Since we were a big team of 25 people, four strategic tasks were identified, each handled by a sub-team:
- Enhancement
- Super VAD
- End-to-end diarization
- Resegmentation
I was working with the end-to-end diarization team.
My research at JSALT: Adversarial Training of Voice Activity Detection
In a few words, my research this summer focused on adversarial training of voice activity detection. It is quite difficult to explain these concepts in simple terms, but I will do my best! Feel free to ask me questions in the comments.
Diarization
Speaker diarization answers the question of “who spoke when?”. The following figure shows the different steps that lead to speaker diarization. Starting from an audio recording, the first step is called “voice activity detection”: it identifies the portions of the audio file that contain speech.
The second step is called “speaker change detection”. It splits the speech into segments whenever the speaker changes. So instead of just knowing that “someone is speaking here”, we can say “someone is speaking here, but at the 3rd second, it’s someone else”.
The final step, “speaker diarization”, takes all the speaker segments and groups together the ones that come from the same speaker.
Figure made by Alex Crista
On my end, I was focused on the Voice Activity Detection (VAD) step.
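To make this concrete, here is a deliberately naive, energy-based VAD sketch in plain NumPy. This is not the neural, domain-adversarial model we actually worked on during the workshop; it only illustrates what the VAD step outputs: time regions labeled as containing speech. The frame length and threshold are arbitrary values I picked for the example.

```python
import numpy as np

def naive_vad(signal: np.ndarray, sr: int = 16000,
              frame_dur: float = 0.03, threshold: float = 1e-3):
    """Return a list of (start, end) times, in seconds, labeled as speech.

    A 30 ms frame is marked as speech when its mean energy exceeds a threshold.
    """
    hop = int(frame_dur * sr)
    regions, start = [], None
    for i in range(0, len(signal), hop):
        frame = signal[i:i + hop]
        is_speech = frame.size > 0 and float(np.mean(frame ** 2)) > threshold
        t = i / sr
        if is_speech and start is None:
            start = t                      # a speech region begins
        elif not is_speech and start is not None:
            regions.append((start, t))     # the region ends
            start = None
    if start is not None:
        regions.append((start, len(signal) / sr))
    return regions

# Example: one second of "silence" followed by one second of louder "speech".
audio = np.concatenate([np.zeros(16000), 0.1 * np.random.randn(16000)])
print(naive_vad(audio))  # roughly [(1.0, 2.0)]
```

A real system replaces the energy threshold with a neural network that scores every frame, but the output format (speech regions over time) is the same.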
Each of the steps I described above has its own metric to evaluate the quality of the results. For VAD, the metric used is called Detection Error Rate (DER, not to be confused with Diarization Error Rate).
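For the curious, here is a tiny sketch of how the detection error rate can be computed from durations, following the usual definition (missed speech plus false alarms, divided by the total duration of actual speech, as in pyannote.metrics for example). The numbers in the example are made up.

```python
def detection_error_rate(missed: float, false_alarm: float, total_speech: float) -> float:
    """All arguments are durations in seconds.

    missed       -- speech that the VAD labeled as non-speech
    false_alarm  -- non-speech that the VAD labeled as speech
    total_speech -- total duration of actual speech in the reference
    """
    return (missed + false_alarm) / total_speech

# Example: 12 s of speech missed and 8 s of false alarms
# over 300 s of actual speech -> about 6.7 % detection error rate.
print(detection_error_rate(missed=12.0, false_alarm=8.0, total_speech=300.0))
```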
The challenge of VAD: adapting to different domains
A challenge for voice activity detection, as explained earlier, is to adapt to the different domains in audio recordings.
A domain can be thought of as the recording conditions of a file. For example, a recording made at a noisy party is one domain. A recording made in my car, where there is almost no background noise, is an entirely different domain.
If we run a voice activity detection algorithm on files recorded in such different conditions, its performance suffers greatly.
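This is where the “adversarial” part of my project comes in. The general idea of domain-adversarial training is to add a second branch that tries to predict the recording domain, and to reverse its gradient so that the shared encoder learns features that work regardless of the domain. Below is a generic PyTorch sketch of that standard gradient-reversal trick; the GRU encoder, layer sizes, and number of domains are illustrative choices of mine, not a faithful copy of our exact JSALT model (see the papers in the next section for the real thing).

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.alpha * grad_output, None

class DomainAdversarialVAD(nn.Module):
    def __init__(self, n_feats=40, hidden=64, n_domains=4):
        super().__init__()
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
        self.vad_head = nn.Linear(hidden, 1)             # speech / non-speech per frame
        self.domain_head = nn.Linear(hidden, n_domains)  # which dataset / recording domain

    def forward(self, x, alpha=1.0):
        h, _ = self.encoder(x)                           # (batch, frames, hidden)
        speech_logits = self.vad_head(h).squeeze(-1)     # per-frame speech scores
        # The domain classifier sees the features through the gradient-reversal layer.
        domain_logits = self.domain_head(GradReverse.apply(h.mean(dim=1), alpha))
        return speech_logits, domain_logits
```

During training, the total loss is the per-frame speech loss plus the domain classification loss; because the gradient of the domain branch is reversed, minimizing it actually pushes the encoder toward features from which the domain can no longer be guessed, which is exactly what we want for recordings from very different conditions.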
Results of our research
To read more about the work done, you can read these papers:
Paola Garcia, Jesus Villalba, Hervé Bredin, Jun Du, Diego Castan, Alejandrina Cristia, Latane Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Anonyme, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak. “Speaker detection in the wild: Lessons learned from JSALT 2019.” arXiv preprint arXiv:1912.00938 (2019).
Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, Hervé Bredin, Leibny Paola Garcia-Perera. “End-to-end Domain-Adversarial Voice Activity Detection.” arXiv preprint arXiv:1910.10655 (2019).
Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, Marie-Philippe Gill. “pyannote.audio: neural building blocks for speaker diarization.” arXiv preprint arXiv:1911.01255 (2019).
My experience attending JSALT as an undergrad
A few months later, I finally realized how much this research experience gave me. I was surrounded by inspiring researchers who generously took the time to explain several concepts that I did not yet understand. I am thinking in particular of Hervé Bredin and Marvin Lavechin, the two people I worked with most closely.
I also got to practice presenting my research results during the Brown Bag Lunch, where undergrad students were invited to present their work to senior researchers. While preparing, I showed my slides to my team members in advance and used their comments to improve the presentation. It truly was an enriching experience for me!
Here is the presentation I gave during the closing ceremony of the event; it is only a few minutes long. All of the team members had to present the results of their work:
Conclusion
My experience at JSALT was both wonderful and decisive for my journey! I believe that any undergraduate student studying in the United States can apply to participate in this annual event. If you meet the criteria and you are interested in machine learning and speech research, I strongly encourage you to apply!