Podcast Analyzer with AI for Ringr (Sponsored)
This project pitch was given by Prof. Patel in the 2/4 lecture. We have met with Prof. Patel about the details of this project and decided on the following.
Names: Elliot Couvignou (esc2), Sai Rajesh (srajesh2), Bhuvan Radj (br3)
For context, Ringr is an app that records multiple clients in high quality regardless of their location and compiles the audio into a single track that sounds as if they were in the same room. Their current problem is that analyzing the content of each recorded podcast and picking out the 'good parts' manually is far too time-consuming, so an AI is needed. With this AI, they would like to feed in raw recordings and get back an audio recording containing only the desired parts. They eventually want this feature implemented in their app with a reasonable run time. What counts as a 'good' part is still left for us to finalize when we meet with Ringr for the first time later this week.
Because this is a software-only project, we will go over the general flow of the AI from input to output and break it down into components. The overall idea is to split the audio input into small pieces, reconstruct phonemes into words, words into meaning, and so on, until the AI understands the recording's semantics and can pick out the sections the user wants. We plan to create the AI model(s) once and save them, so we don't need to build a new model for each input analysis. This should greatly improve performance, making the feature usable in the app.
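The staged flow described above can be sketched as a chain of three functions. This is a hypothetical outline, not Ringr's actual API; each stage is a stub standing in for the component described in the corresponding section below.

```python
# Hypothetical end-to-end pipeline sketch: the three stage functions are
# placeholders for the real components, and the toy data is invented.

def transcribe(audio_segments):
    """Stage 1: map raw audio chunks to words (stubbed here)."""
    return [seg["word"] for seg in audio_segments]

def extract_semantics(words):
    """Stage 2: attach a coarse topic label to each word (stubbed)."""
    return [(w, "sports" if w in {"game", "score"} else "other") for w in words]

def select_good_parts(labeled, wanted_topic):
    """Stage 3: keep only segments matching the desired topic."""
    return [w for w, topic in labeled if topic == wanted_topic]

# Toy run: pretend each "segment" already carries its recognized word.
segments = [{"word": w} for w in ["the", "game", "score", "was", "close"]]
good = select_good_parts(extract_semantics(transcribe(segments)), "sports")
```

In the real system each stage would pass along time spans into the original audio rather than bare words, so the final step can splice the source recording.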
Solution Components (listed in order from closest to the input at the top to farthest at the bottom):
Word Recognition: From the input, we need to recognize words by breaking the audio down into phonemes and discerning their types through formant filters. From there we can combine the phonemes to form words. Some extra touch-up is needed here to make sure the assembled words are spelled correctly. The result is a transcript of the input audio, which is much easier to work with.
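A minimal sketch of the phoneme-to-word step, assuming an upstream formant filter has already labeled each audio frame with a phoneme symbol and marked pauses with None. The tiny pronunciation dictionary and ARPAbet-style symbols here are invented for illustration; a real system would use a full dictionary such as CMUdict.

```python
# Toy pronunciation lookup; real entries would come from a dictionary.
PRONUNCIATIONS = {
    ("HH", "AY"): "hi",          # ARPAbet-style symbols, illustrative only
    ("DH", "EH", "R"): "there",
}

def phonemes_to_words(phonemes):
    """Split a phoneme stream on pauses (None) and look up each group."""
    words, current = [], []
    for p in phonemes + [None]:   # trailing None flushes the last group
        if p is None:
            if current:
                words.append(PRONUNCIATIONS.get(tuple(current), "<unk>"))
                current = []
        else:
            current.append(p)
    return words

transcript = phonemes_to_words(["HH", "AY", None, "DH", "EH", "R"])
```

The `<unk>` fallback marks groups that need the spelling touch-up pass mentioned above.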
Semantic Recognition: We need to understand what people are saying in order to know what is 'good'. This area is covered by ECE 448, so we hope to use similar methods here, such as word2vec or n-gram language models with an HMM. This is probably the most challenging and intricate part, as semantics is a difficult topic to get right in AI.
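The word2vec idea can be illustrated with cosine similarity over word vectors. The 3-d vectors below are hand-made toy values; real embeddings would be trained (e.g. with gensim) and have hundreds of dimensions.

```python
# Word2vec-style similarity check with made-up 3-d toy vectors.
import math

VECTORS = {
    "basketball": (0.9, 0.1, 0.0),
    "playoffs":   (0.8, 0.2, 0.1),
    "recipe":     (0.0, 0.1, 0.9),
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

sports_sim = cosine(VECTORS["basketball"], VECTORS["playoffs"])   # high
cooking_sim = cosine(VECTORS["basketball"], VECTORS["recipe"])    # low
```

Segments whose words score high against the user's target topic would be flagged as candidates for the next stage.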
'Good Part' Recognition: Now that we know what the audio is about, we can look for portions that closely resemble what the user wants and use those pieces in our result. We compile the good slices into one recording and return it as our output.
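The slicing step might look like the sketch below: score each transcript segment against the user's target, keep those above a threshold, and splice the corresponding audio spans. The relevance scores here are made-up stand-ins for the semantic stage's output.

```python
# Hypothetical stage-3 sketch; scores would come from semantic recognition.

def select_segments(segments, threshold=0.5):
    """segments: list of (start_sec, end_sec, relevance) tuples."""
    return [(start, end) for start, end, score in segments if score >= threshold]

def total_duration(spans):
    """Length of the spliced output, in seconds."""
    return sum(end - start for start, end in spans)

scored = [(0, 30, 0.9), (30, 95, 0.2), (95, 140, 0.7)]
keep = select_segments(scored)
```

The kept (start, end) spans would then be cut from the source audio and concatenated into the returned recording.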
The type of AI model we hope to use is some combination of RNNs and classifiers, as we found these to be among the most successful models in recent years. We want the model stored on the device, but we are aware that its size might grow too large as it becomes more capable, requiring remote computing (as Google does).
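The save-once, load-per-request pattern behind this plan can be shown with the standard library; here pickle stands in for a real model format such as TorchScript or a TF SavedModel, and the "model" is a placeholder dict.

```python
# Minimal save-once / load-per-request sketch (pickle as a stand-in
# for a real serialized-model format).
import os
import pickle
import tempfile

model = {"weights": [0.1, 0.2, 0.3]}   # placeholder for a trained RNN

path = os.path.join(tempfile.mkdtemp(), "podcast_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)              # done once, offline

with open(path, "rb") as f:
    loaded = pickle.load(f)            # done per analysis request
```

Loading a saved model per request avoids retraining per input, which is what makes an in-app feature with reasonable run time plausible.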
Criterion For Success:
Our model works if it correctly slices out unwanted segments and keeps the relevant segments of the recording. The feature should run on both mobile and desktop in a reasonable amount of time relative to the input length. If our model ends up using any libraries/APIs, they must still be economical to use (i.e., no paid services or licenses).