Introducing Video Indexer, a cloud service to unlock insights from your videos

The amount of video on the internet is growing exponentially and will continue to grow for the foreseeable future. This growth spans multiple industries, including entertainment, broadcasting, enterprise, and public safety.

Most videos have only a title and a description as metadata. But a video contains far more than a title and description typically capture, especially when it is more than a minute long. The lack of human-understandable, time-stamped metadata makes it difficult to discover videos, and the relevant moments within a video. Generating such metadata by hand is expensive, and it becomes impractical when you have a large number of videos. This is where artificial intelligence technologies can help.

At Build 2017, we are announcing the public preview of a cloud service called Video Indexer as part of Microsoft Cognitive Services. Video Indexer enables customers with digital video and audio content to automatically extract metadata and use it to build intelligent, innovative applications. Video Indexer is built using Azure Media Services, Microsoft Cognitive Services, Azure Search, Azure Storage and Azure DocumentDB. It brings the best of Microsoft AI technologies for video to customers in the form of a scalable cloud service.

Video Indexer has been built based on customer feedback on the Microsoft Garage project called Video Breakdown, which was launched in September 2016. Customers across multiple industries have experimented with Video Breakdown. The following quote from Jonathan Huberman, CEO of Ooyala, is a testament to the value of Video Indexer: “As a global provider of video monetization software and services, we are constantly looking for technologies that would help us provide more value to our customers. With Azure and Microsoft’s AI technologies for processing video we were really impressed with the combination of easy to use yet powerful AI services for videos. The integrations we have built between Video Indexer and our products will help our customers enhance content discovery and captioning as well as deliver targeted advertising based on the extracted metadata – a win-win for our customers and their viewers.”

You don’t need to have any background in machine learning or computer vision to use Video Indexer. You can even get started without writing a single line of code. Video Indexer offers simple but powerful APIs and puts the power of AI technologies for video within the reach of every developer. You can learn more about this in the Video Indexer documentation.

In what follows, let’s look at some of the customer use cases enabled by Video Indexer, followed by a high-level description of its features.

Customer use cases

  • Search – Insights extracted from the video can be used to enhance the search experience across a video library. For example, indexing spoken words and faces makes it possible to find the moments in a video where a particular person spoke certain words or where two people were seen together. Search based on such insights is applicable to news agencies, educational institutions, broadcasters, entertainment content owners, enterprise LOB apps and, in general, any industry that has a video library that users need to search.
  • Monetization – Video Indexer can help improve the value of videos. As an example, industries that rely on ad revenue (e.g. news media and social media) can deliver more relevant ads by using the extracted insights as additional signals to the ad server. A sports shoe ad, for instance, is more relevant in the middle of a football match than in a swimming competition.
  • User engagement – Video insights can be used to improve user engagement by taking users directly to the relevant moments in a video. As an example, consider an educational video that explains spheres for the first 30 minutes and pyramids for the next 30 minutes. A student reading about pyramids would benefit more if the video started playing from the 30-minute mark.
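
To make the search and user engagement scenarios concrete, here is a minimal Python sketch of how an application might query time-stamped metadata once it has it. Everything in it is an assumption made for illustration: the `TranscriptLine` shape, the function names and the sample data are hypothetical and do not reflect the actual Video Indexer output format.

```python
from dataclasses import dataclass


@dataclass
class TranscriptLine:
    """One time-stamped line of spoken text (hypothetical shape)."""
    start_seconds: float
    end_seconds: float
    speaker: str
    text: str


def find_moments(lines, speaker, phrase):
    """Return the lines in which `speaker` said `phrase` (case-insensitive)."""
    needle = phrase.lower()
    return [
        line for line in lines
        if line.speaker == speaker and needle in line.text.lower()
    ]


def first_mention(lines, phrase):
    """Return the start time of the earliest line mentioning `phrase`, or None."""
    needle = phrase.lower()
    for line in sorted(lines, key=lambda item: item.start_seconds):
        if needle in line.text.lower():
            return line.start_seconds
    return None


# Made-up sample data, for illustration only.
LINES = [
    TranscriptLine(0.0, 12.5, "Instructor", "Today we start with spheres."),
    TranscriptLine(1800.0, 1815.0, "Instructor", "Now let's move on to pyramids."),
    TranscriptLine(1815.0, 1830.0, "Student", "Are pyramids always square-based?"),
]

if __name__ == "__main__":
    # Where should playback start for a student reading about pyramids?
    print(first_mention(LINES, "pyramids"))
    # Which moments have the student talking about pyramids?
    for moment in find_moments(LINES, "Student", "pyramids"):
        print(moment.start_seconds, moment.text)
```

With this sample data, `first_mention` returns the 30-minute position (1800 seconds) from the educational video example, which an application could pass to its player as the starting offset.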


At a high level, the Video Indexer REST APIs include the following functionality. For more details, please take a look at the Video Indexer documentation.


  • Content upload – You can upload videos by providing a URL. Video Indexer starts processing videos as soon as they are uploaded. Multiple AI technologies are used to extract insights across multiple dimensions (spoken words, faces, visual text, objects, etc.)
  • Insights download – Once a video finishes processing, you can download the extracted insights in the form of a JSON file.
  • Search – You can submit search queries to find relevant moments within a video or across all videos in your Video Indexer account.
  • Player widget – You can obtain a player widget for a video and embed it in any web application. The player widget enables you to stream the video using adaptive bitrate.
  • Insights widget – You can also obtain an insights widget for showcasing the extracted insights. Just like the player widget, an insights widget can be embedded in any web application. You can also choose which parts of the insights widget you want to show and which you want to hide.
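
As a rough illustration of working with downloaded insights, the Python sketch below parses a JSON document and builds a single time-ordered list from only the dimensions an application chooses to show. The JSON layout, the dimension names (`transcript`, `faces`, `visual_text`) and the `build_timeline` helper are all assumptions made for this example; the real file format is described in the Video Indexer documentation.

```python
import json

# Hypothetical insights document. The real Video Indexer JSON layout may differ.
SAMPLE_INSIGHTS_JSON = """
{
  "transcript": [
    {"start": 3.0, "text": "Welcome to the show"}
  ],
  "faces": [
    {"start": 3.0, "name": "Unknown #1"}
  ],
  "visual_text": [
    {"start": 10.0, "text": "Breaking news"}
  ]
}
"""


def build_timeline(insights_json, dimensions):
    """Merge the chosen insight dimensions into one list sorted by start time.

    Each entry is a (start_seconds, dimension, label) tuple. Dimensions that
    are missing from the document are skipped.
    """
    insights = json.loads(insights_json)
    events = []
    for dimension in dimensions:
        for item in insights.get(dimension, []):
            # Use the text if present, otherwise fall back to a name.
            label = item.get("text") or item.get("name", "")
            events.append((item["start"], dimension, label))
    return sorted(events)


if __name__ == "__main__":
    # Show spoken words and faces, but hide on-screen text.
    for start, dimension, label in build_timeline(
        SAMPLE_INSIGHTS_JSON, ["transcript", "faces"]
    ):
        print(f"{start:>7.1f}s  {dimension:<12} {label}")
```

Passing a different list of dimensions changes what appears in the timeline, in the same spirit as choosing which parts of the insights widget to show or hide.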


The Video Indexer portal enables you to:

  • Upload videos from a local machine.
  • View the insights extracted from the video in a UI built using the various widgets mentioned above.
  • Curate the insights and submit them back to the service. This includes providing names for faces that have been detected but not recognized, and correcting text extracted from spoken words or through optical character recognition.
  • Obtain an embed code for the player or insights widget.

Video Indexer includes the following video AI technologies. Each technology listed below is applied to every video that is uploaded to Video Indexer.

  • Audio Transcription – Video Indexer has speech-to-text functionality which enables customers to get a transcript of the spoken words. Supported languages include English, Spanish, French, German, Italian, Chinese (Simplified), Portuguese (Brazilian), Japanese and Russian (with many more to come in the future). The speech-to-text functionality is based on the same speech engine that is used by Cortana and Skype.
  • Face tracking and identification – Face technologies enable detection of faces in a video. The detected faces are matched against a celebrity database to evaluate which celebrities are present in the video. Customers can also label faces that do not match a celebrity. Video Indexer builds a face model based on those labels and can recognize those faces in videos submitted in the future.
  • Speaker indexing – Video Indexer has the ability to map and understand which speaker spoke which words and when.
  • Visual text recognition – With this technology, the Video Indexer service extracts text that is displayed in the videos.
  • Voice activity detection – This enables Video Indexer to detect silence, speech and hand-clapping. 
  • Scene detection – Video Indexer has the ability to perform visual analysis on the video to determine when a scene changes in a video.
  • Keyframe extraction – Video Indexer automatically detects keyframes in a video.
  • Sentiment analysis – Video Indexer performs sentiment analysis on the text extracted using speech-to-text as well as optical character recognition, and provides that information in the form of positive, negative or neutral sentiments, along with timecodes.
  • Translation – Video Indexer has the ability to translate the audio transcript from one language to another. Multiple languages (English, Spanish, French, German, Italian, Chinese (Simplified), Portuguese (Brazilian), Japanese and Russian) are supported. Once the transcript is translated, the user can even get captioning in the video player in other languages.
  • Visual content moderation – This technology enables detection of adult and/or racy material present in the video and can be used for content filtering.
  • Keywords extraction – Video Indexer extracts keywords based on the transcript of the spoken words and the text recognized by the visual text recognizer.
  • Annotation – Video Indexer annotates the video based on a pre-defined model of 2,000 objects.

We hope you share our excitement about the new opportunities Video Indexer opens up to transform your apps and your business. We are looking forward to seeing how you will use this new service. Try it out today at https://vi.microsoft.com.