Deep learning-based models have shown remarkable performance in translating long sentences using the long short-term memory (LSTM) model. They are now also being used to convert video to text by combining a deep convolutional neural network (DCNN) with a deep LSTM (DLSTM).
Bengaluru-based startup Spext provides a text editor for voice content that automatically converts large volumes of recorded conversations to text and accurately aligns the text with the spoken words. Co-founded by Anup Gosavi, Spext is a SaaS platform: the user uploads media, and the voice is automatically converted to text, with the spoken words accurately aligned to the transcript. Gosavi explains that this is a text-based voice editor with which users can delete portions of the transcript. Just as in a text editor, one can use Ctrl-C and Ctrl-V to create clips, or search for keywords and hear them in context.
Understanding the technology behind Spext Intelligent Media
Accurate Time Coding: The editor is built on top of a new kind of intelligent media that stores time-coded information (for example, when a word was spoken, who the speakers are and what the context was) alongside the media itself. This makes it more granular and richer than traditional media formats such as .mp3 and .mp4.
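To make the idea of time-coded media concrete, here is a minimal Python sketch of how per-word timing and speaker information could sit next to a media file. It is an illustration only, not Spext's actual format: the class names `TimedWord` and `IntelligentMedia`, their fields, the file name and the sample data are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimedWord:
    """One transcribed word with its timing (hypothetical structure)."""
    text: str
    start: float   # seconds from the start of the media
    end: float     # seconds from the start of the media
    speaker: str


@dataclass
class IntelligentMedia:
    """A media file plus its time-coded transcript (hypothetical structure)."""
    media_path: str
    words: List[TimedWord] = field(default_factory=list)

    def transcript(self) -> str:
        # Plain text view of the recording.
        return " ".join(w.text for w in self.words)

    def speakers(self) -> List[str]:
        # Distinct speakers, in order of first appearance.
        seen: List[str] = []
        for w in self.words:
            if w.speaker not in seen:
                seen.append(w.speaker)
        return seen

    def words_between(self, start: float, end: float) -> List[TimedWord]:
        # Words that fall entirely inside the [start, end] window.
        return [w for w in self.words if w.start >= start and w.end <= end]


# Made-up sample data for illustration.
media = IntelligentMedia("interview.mp3", [
    TimedWord("Welcome", 0.00, 0.45, "host"),
    TimedWord("to", 0.45, 0.60, "host"),
    TimedWord("the", 0.60, 0.72, "host"),
    TimedWord("show", 0.72, 1.10, "host"),
    TimedWord("Thanks", 1.40, 1.85, "guest"),
])
```

With a structure like this, questions such as "who spoke, and when?" become simple lookups rather than operations on raw audio.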
Serverless Interaction: Spext’s technology allows users to interact with the media locally. No API calls are made to Spext’s servers during editing, so it works in the browser and users are not required to download any software.
Talking about the DL algorithms built for converting video to text, Gosavi shares that the early-stage startup has built DL algorithms to optimise the following functions:
Accurately Align Transcript With Spoken Words: Speech-to-text APIs are optimised for captions, so accurate timestamps for speech-text alignment are not expected or prioritised. Spext’s algorithms align text and speech at the microsecond level, so that any cuts or pastes of new media feel natural and don’t sound glitchy.
Punctuation: Spext works with long-form media, which is usually over 100 minutes in length, and has its own punctuation algorithms with around 85 percent accuracy. For a quality recording with minimal background noise, the models achieve an accuracy of 92-96 percent, which is not far from human transcription at 97-98 percent. However, the accuracy can be lower if the audio is of poor quality (for example, a low sampling rate), has background noise or contains unclear accents.
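The reason alignment matters for editing can be sketched in a few lines of Python. If every word carries a start and end time, deleting words from the transcript translates directly into a list of audio spans to keep. This is a simplified illustration under stated assumptions, not Spext's algorithm: the function name `keep_intervals`, the millisecond units and the sample data are hypothetical.

```python
from typing import List, Set, Tuple

Word = Tuple[str, int, int]  # (text, start_ms, end_ms), assumed layout


def keep_intervals(words: List[Word], deleted: Set[int],
                   total_ms: int) -> List[Tuple[int, int]]:
    """Return the audio spans that remain after deleting words by index."""
    # Time ranges of the deleted words, in playback order.
    cuts = sorted((words[i][1], words[i][2]) for i in deleted)
    kept: List[Tuple[int, int]] = []
    cursor = 0  # start of the next span to keep
    for start, end in cuts:
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_ms:
        kept.append((cursor, total_ms))
    return kept


# Made-up alignment for a 1.5-second recording.
words = [("So", 0, 200), ("um", 250, 600), ("let's", 650, 900), ("begin", 950, 1400)]

# Deleting the filler word "um" (index 1) in the text editor.
spans = keep_intervals(words, {1}, total_ms=1500)
```

Here `spans` describes the two audio segments on either side of the removed word. If the word boundaries were off by even a few tens of milliseconds, the cut would clip a neighbouring word, which is why precise alignment is needed for edits to sound natural.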
Enterprise Uses For Spext
The early-stage startup, founded in 2017, is working on the Enterprise version of Spext, which will feature intelligent labeling and automatic tagging in addition to transcription. The startup is also keen to add three Indian companies to its customer advisory group to help it launch by 2019. “We started in 2017 with a focus on the US market but are developing technology to support local languages. We currently support English (Indian accent) and Hindi. But in the future, we want to support all the local languages. The market for voice content in Indian languages is enormous,” said Gosavi. Broadening from its initial use case, podcasts, the startup is now testing its intelligent media technology for enterprise uses around voice content.
Voice Search Inside Media: The user can ask, “Show me when Messi scored a goal.” Instead of just showing a huge list of videos, Spext can find the moment when Messi scored, automatically create a short two-minute clip from the long 90-minute game and play it.
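The search-and-clip idea can be sketched as a keyword lookup over a time-coded transcript. The Python example below is a simplified illustration, not Spext's implementation: it assumes the spoken query has already been reduced to a single keyword, and the function name `find_clips`, the padding parameter and the commentary data are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]  # (text, start_s, end_s), assumed layout


def find_clips(words: List[Word], keyword: str,
               pad_s: int, total_s: int) -> List[Tuple[int, int]]:
    """Return (start, end) clip boundaries around each keyword match."""
    clips: List[Tuple[int, int]] = []
    for text, start, end in words:
        # Ignore case and trailing punctuation when matching.
        if text.lower().strip(".,!?") != keyword.lower():
            continue
        clip = (max(0, start - pad_s), min(total_s, end + pad_s))
        if clips and clip[0] <= clips[-1][1]:
            # Overlaps the previous clip: merge the two.
            clips[-1] = (clips[-1][0], max(clips[-1][1], clip[1]))
        else:
            clips.append(clip)
    return clips


# Made-up commentary for a 90-minute (5400-second) game.
commentary = [
    ("Messi", 1800, 1801), ("shoots", 1801, 1802), ("Goal!", 1802, 1803),
    ("Another", 4000, 4001), ("goal", 4001, 4002),
]

# One minute of padding on each side gives clips of roughly two minutes.
clips = find_clips(commentary, "goal", pad_s=60, total_s=5400)
```

Each returned pair marks the section of the game to extract and play, so the viewer gets short clips rather than the full recording.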
Automatic Media Tagging: A huge amount of media sits on storage solutions, and tagging it all manually is impractical. Spext can automatically classify, transcribe and tag the media with tags such as objects, when a word was spoken, what questions were asked, who the speakers are and what the context was. The editor will make it easy for employees to review or correct these tags if necessary. This will also save hundreds of thousands of man hours. “We plan to work with storage solutions so that the media is tagged while it is being stored for the first time,” he said.
Manage Corporate Content Libraries And Company Archives: A lot of stored corporate media is rarely accessed because it sits in large, monolithic files that are not tagged by speaker, context or topic. As a result, it is not repurposed for marketing, training or analysis. Spext tags and classifies this media intelligently and makes the content shareable and clippable, so archives become accessible and usable.
The Spext platform is built on technologies provided by IBM, Google and Amazon. Gosavi explains that the underlying technology is built on top of these APIs without depending on any single one, making it API agnostic. “For example, depending on the type of file uploaded, we select one of the APIs that we think is most accurate. E.g., for a phone call, we might select Amazon Transcribe but for a video lecture, we might select Google. This is done dynamically because the speech-to-text models are usually optimised for a particular type of media,” he explained.
The SaaS-based platform has monthly pricing. The subscription starts at $39.99 per month for four hours of media upload and includes bulk discounts (starting at $7 per hour) for additional hours if you have a lot of media.
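The dynamic selection Gosavi describes can be pictured as a routing table from media type to speech-to-text provider. The Python sketch below is hypothetical: only the two pairings mentioned in the quote come from the article, while the media-type labels, the function name `select_provider` and the fallback argument are assumptions, and no real API is called.

```python
from typing import Dict

# Pairings taken from the quote above; the keys are made-up labels.
ROUTING: Dict[str, str] = {
    "phone_call": "Amazon Transcribe",
    "video_lecture": "Google",
}


def select_provider(media_type: str, fallback: str) -> str:
    """Pick a speech-to-text provider for a media type.

    The caller supplies the fallback for media types with no known
    preference; the article does not say what Spext does in that case.
    """
    return ROUTING.get(media_type, fallback)


choice = select_provider("phone_call", fallback="IBM")
```

Keeping the provider choice behind a single function is what would make such a design API agnostic: the rest of the pipeline only sees a transcript, whichever service produced it.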