Automatic Drum Transcription (ADT) is the task of isolating and identifying percussive events in audio signals. One of the major challenges in ADT research is the lack of large-scale labeled datasets featuring polyphonic audio mixtures. To tackle this issue, we propose a semi-automatic way of compiling a labeled dataset using an audio-to-MIDI alignment technique. The resulting dataset consists of 1565 polyphonic music mixtures with audio-aligned MIDI ground truth. To validate the quality and generality of this dataset, an ADT model based on a Convolutional Neural Network (CNN) is trained and evaluated on several publicly available datasets. The evaluation results suggest that our proposed methodology compares favorably with state-of-the-art ADT systems. The results also imply the possibility of leveraging audio-to-MIDI alignment to create datasets for a broader range of audio-related tasks. (paper submitted to ICASSP 2021)
Please check out the ADT tool implemented in Omnizart and download the dataset here.
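Below is a minimal sketch of the kind of audio-to-MIDI alignment the dataset pipeline relies on, written with librosa and pretty_midi; the library and feature choices here are illustrative assumptions, not the exact setup used in the paper.

```python
# Minimal sketch: align a MIDI file to its audio recording via chroma DTW,
# then map drum-note onsets onto the audio time axis as ground-truth labels.
# Library/feature choices are illustrative, not the paper's exact pipeline.
import librosa
import numpy as np
import pretty_midi

HOP = 512

def align_midi_to_audio(audio_path, midi_path):
    # Chroma features of the audio recording.
    y, sr = librosa.load(audio_path)
    chroma_audio = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=HOP)

    # Chroma features of the MIDI file (synthesized with simple waveforms;
    # a soundfont-based renderer would give closer timbres).
    midi = pretty_midi.PrettyMIDI(midi_path)
    y_midi = midi.synthesize(fs=sr)
    chroma_midi = librosa.feature.chroma_cqt(y=y_midi, sr=sr, hop_length=HOP)

    # DTW alignment between the two chromagrams.
    _, wp = librosa.sequence.dtw(X=chroma_audio, Y=chroma_midi, metric='cosine')
    wp = np.flip(wp, axis=0)  # warping path in increasing time order
    audio_times = librosa.frames_to_time(wp[:, 0], sr=sr, hop_length=HOP)
    midi_times = librosa.frames_to_time(wp[:, 1], sr=sr, hop_length=HOP)

    # Transfer each drum-note onset from MIDI time to audio time.
    aligned_onsets = []
    for inst in midi.instruments:
        if not inst.is_drum:
            continue
        for note in inst.notes:
            idx = min(np.searchsorted(midi_times, note.start),
                      len(audio_times) - 1)
            aligned_onsets.append((audio_times[idx], note.pitch))
    return sorted(aligned_onsets)
```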
In this project, a Variational Autoencoder (VAE) based drum pattern generation model is built to generate symbolic drum patterns given an accompaniment consisting of melodic sequences. A self-similarity matrix (SSM) is incorporated into the process to encapsulate structural information. A subjective listening test highlights the model's capability of creating musically meaningful transitions at structural boundaries.
For more details, please check our ISMIR 2019 conference paper, poster, slides and GitHub. Audio samples are also available here. (may take a while to open)
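As a quick illustration of the structural input, here is a minimal sketch of computing a self-similarity matrix from beat-synchronous chroma features with librosa; the feature choice is an assumption for illustration, not necessarily the SSM definition used in the paper.

```python
# Minimal sketch: beat-level self-similarity matrix (SSM) of an accompaniment.
# The feature choice (beat-synchronous chroma) is illustrative only.
import librosa
import numpy as np

def compute_ssm(audio_path):
    y, sr = librosa.load(audio_path)
    # Beat-synchronize the chroma so each column corresponds to one beat.
    _, beats = librosa.beat.beat_track(y=y, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    chroma_sync = librosa.util.sync(chroma, beats, aggregate=np.median)

    # Cosine self-similarity between all pairs of beat-level frames.
    chroma_norm = librosa.util.normalize(chroma_sync, norm=2, axis=0)
    ssm = chroma_norm.T @ chroma_norm
    return ssm  # shape: (n_beats, n_beats), values in [0, 1]
```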
This project is a collaboration with the Pacing Art Culture Education Foundation. In this work, a low-latency, efficient real-time score following system is proposed to predict the score position of a given piece during a live classical concert. The system implements a Parallel Dynamic Time Warping (PDTW) algorithm on a multi-core system, and its output (the predicted current score position) drives a visualization system that projects pre-programmed animation effects on screen. For more details, please check our MMSP 2018 conference paper and poster.
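As a rough illustration of spreading alignment work across CPU cores, the sketch below runs DTW over overlapping score segments in parallel and picks the lowest-cost segment as the current position; this is a simplified stand-in, not the exact PDTW formulation from the paper.

```python
# Rough illustration of distributing DTW alignment across CPU cores to keep
# latency low. This is NOT the exact PDTW formulation from the MMSP paper.
from multiprocessing import Pool
import librosa

def _segment_cost(args):
    live_chroma, score_chroma, start, length = args
    segment = score_chroma[:, start:start + length]
    D, _ = librosa.sequence.dtw(X=live_chroma, Y=segment, metric='cosine')
    return start, D[-1, -1] / length  # length-normalized alignment cost

def estimate_score_position(live_chroma, score_chroma, seg_len=200, hop=50):
    # Candidate score segments: overlapping windows over the full score.
    starts = range(0, max(1, score_chroma.shape[1] - seg_len), hop)
    jobs = [(live_chroma, score_chroma, s, seg_len) for s in starts]
    with Pool() as pool:
        costs = pool.map(_segment_cost, jobs)
    # The lowest-cost segment gives the predicted current score position.
    best_start, _ = min(costs, key=lambda c: c[1])
    return best_start
```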
This project utilizes the off-line beat tracking package Madmom to perform real-time beat tracking on a live audio signal.
Madmom is a well-known Music Information Retrieval (MIR) oriented Python package that performs off-line beat tracking with state-of-the-art performance. However, its downbeat tracking results are not stable in certain scenarios; for example, it often confuses the locations of the first and third beats in a standard 4/4 music clip.
To tackle this issue, a voting mechanism is employed to improve the downbeat tracking result. First, the live music input is segmented into multiple overlapping clips. Then, the downbeat tracking results of these clips are merged by a simple majority vote. The program uses multi-core computation to retrieve the beat locations of multiple music segments simultaneously. Although the idea is simple and somewhat brute-force, it works surprisingly well and reliably on live music input.
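A simplified sketch of the voting idea with madmom's downbeat processors is shown below; the real-time buffering and multi-core dispatch of the actual implementation are omitted.

```python
# Simplified sketch: majority vote on the downbeat phase over overlapping
# clips, using madmom's downbeat tracking. Real-time buffering and the
# multi-core dispatch of the actual system are omitted.
from collections import Counter
from madmom.features.downbeats import (RNNDownBeatProcessor,
                                        DBNDownBeatTrackingProcessor)

rnn = RNNDownBeatProcessor()
dbn = DBNDownBeatTrackingProcessor(beats_per_bar=[4], fps=100)

def vote_downbeat_phase(clips):
    """clips: overlapping audio clips (file paths or madmom Signal objects)."""
    votes = Counter()
    for clip in clips:
        beats = dbn(rnn(clip))  # rows of (beat_time, beat_position_in_bar)
        # The bar position assigned to the first detected beat is what tends
        # to flip between runs (e.g. beat 1 vs. beat 3), so vote on it.
        if len(beats):
            votes[int(beats[0, 1])] += 1
    phase, _ = votes.most_common(1)[0]
    return phase
```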
This project uses a microphone as the input source to query music clips in an audio database. If the similarity between the input and a database audio clip is higher than a certain threshold, a pre-programmed audio file is played as the query response. The system can be used in scenarios such as automatic accompaniment, human-AI duets, or an AI orchestra.
Chroma-based Dynamic Time Warping (DTW) is an extremely reliable method for calculating the similarity between audio clips. However, it requires heavy computation for chroma feature conversion and alignment path calculation. In this project, a two-stage audio filtering system is proposed to speed up the process for real-world applications. In the first stage, roughly 90% of the audio files in the database are filtered out using a twelve-bin chroma vector hash comparison. Then, an optimized DTW algorithm is applied to rank the remaining clips based on their similarity to the input source. The most similar clip in the database is regarded as the retrieved result, and the system acts according to its similarity value. With this two-stage filtering mechanism, the system is very responsive and can be adapted to any live interactive performance.
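The sketch below illustrates the two-stage idea: a cheap chroma-hash comparison prunes most of the database, and DTW ranks the survivors; the hash scheme and thresholds here are illustrative stand-ins, not the exact ones used in the system.

```python
# Sketch of the two-stage retrieval: cheap chroma-hash pruning, then DTW
# ranking of the surviving candidates. The 12-bit hash and the thresholds
# are illustrative stand-ins, not the system's exact parameters.
import librosa

def chroma_hash(chroma):
    # Collapse a chromagram into a 12-bit hash: set a bit if that pitch
    # class carries above-average energy across the clip.
    profile = chroma.mean(axis=1)
    bits = profile > profile.mean()
    return int(''.join('1' if b else '0' for b in bits), 2)

def hamming(a, b):
    return bin(a ^ b).count('1')

def retrieve(query_chroma, database, hash_radius=2, cost_threshold=50.0):
    """database: list of (name, chroma, precomputed_hash) tuples."""
    q_hash = chroma_hash(query_chroma)
    # Stage 1: discard clips whose hash is too far from the query's.
    candidates = [d for d in database if hamming(q_hash, d[2]) <= hash_radius]
    # Stage 2: rank the remaining clips by DTW alignment cost (lower = more similar).
    scored = []
    for name, chroma, _ in candidates:
        D, _ = librosa.sequence.dtw(X=query_chroma, Y=chroma, metric='cosine')
        scored.append((name, D[-1, -1]))
    if not scored:
        return None
    best = min(scored, key=lambda s: s[1])
    # Respond only if the best match is similar enough.
    return best if best[1] < cost_threshold else None
```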
VOICE MIXER is an automated audio processing system that creates personalized musical instruments from the user's speaking voice. The processing flow is presented in the figure above. First, the user is asked to read and record a short English passage (roughly 100 words) randomly extracted from Wikipedia. Then, the recorded speech is processed and segmented into thirty-nine phonemes using a neural network model trained on the TIMIT speech dataset. After that, the extracted phonemes are mixed with pre-processed audio materials to create three humanized musical instruments (lead vocal, bass, and drum kit). Finally, the instruments are applied to database MIDI files for music playback. Since the extracted phonemes vary greatly between users, everyone can create their own unique musical instruments with their voice.
Please check the Colab demo file here.
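As a toy illustration of the final mixing step, the sketch below slices the recording at given phoneme boundaries and pitch-shifts a segment to each MIDI note; the function names and the pitch-shifting approach are assumptions for illustration, not the actual VOICE MIXER code.

```python
# Toy sketch: turn phoneme segments into a playable "instrument" by slicing
# the user's recording at phoneme boundaries and pitch-shifting a chosen
# segment to each MIDI note. Names and approach are illustrative assumptions.
import numpy as np
import librosa

def slice_phonemes(y, sr, segments):
    """segments: list of (start_sec, end_sec, phoneme_label) from the segmenter."""
    return {label: y[int(start * sr):int(end * sr)]
            for start, end, label in segments}

def render_note(phoneme_audio, sr, midi_pitch, base_pitch=60):
    # Shift the phoneme so it sounds at the requested MIDI pitch
    # (base_pitch is the pitch we treat the raw phoneme as).
    n_steps = midi_pitch - base_pitch
    return librosa.effects.pitch_shift(phoneme_audio, sr=sr, n_steps=n_steps)

def render_melody(phoneme_audio, sr, notes):
    """notes: list of (midi_pitch, duration_sec); concatenate rendered notes."""
    out = []
    for pitch, dur in notes:
        note = render_note(phoneme_audio, sr, pitch)
        out.append(librosa.util.fix_length(note, size=int(dur * sr)))
    return np.concatenate(out)
```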
—
Thanks for reading!