How is AI for video different from AI for images

Published on February 1, 2018

Principal Program Manager, Azure Media Services

Extracting insights from video using AI technologies presents an additional set of challenges, and opportunities for optimization, compared to images. There is a misconception that AI for video is simply a matter of extracting frames from a video and running computer vision algorithms on each frame. You can certainly do that, but it will not give you the insights that you are truly after. In this blog post, I will use a few examples to explain the shortcomings of processing only individual video frames. I will not go into the details of the additional algorithms that are required to overcome these shortcomings. Video Indexer implements several such video-specific algorithms.

Person presence in the video

Look at the first 25 seconds of this video.

Notice that Doug is present for the entire 25 seconds.

If I were to draw a timeline for when Doug is present in the video, it should be something like this.



Note that Doug is not always facing the camera. Seven seconds into the video he is looking at Emily. The same thing happens at 23 seconds.

If you were to run face detection at these times in the video, Doug’s face would not be detected (see screenshots below).






In other words, if you just run face detection on each video frame, you will not be able to draw the timeline shown above. To get to that timeline, you need to track the face across video frames and account for the side views of the face in between. Video Indexer does face tracking, and as a result you see the full timeline that was showcased earlier.
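To make the gap concrete, here is a minimal sketch of how per-frame detections could be merged into a presence timeline. It is not Video Indexer's tracking algorithm, which this post does not describe. It simply bridges short gaps between detections, a crude stand-in for real face tracking. The function name `presence_intervals`, the `max_gap` parameter, and the sample detections are all assumptions made for illustration.

```python
def presence_intervals(detected_seconds, max_gap=2):
    """Merge timestamps (in seconds) where a face was detected into intervals.

    Gaps of up to `max_gap` seconds are bridged, on the assumption that the
    person merely turned away for a moment. This is a simple heuristic, not
    a real tracker.
    """
    intervals = []
    for t in sorted(detected_seconds):
        if intervals and t - intervals[-1][1] <= max_gap:
            intervals[-1][1] = t          # extend the current interval
        else:
            intervals.append([t, t])      # start a new interval
    return [tuple(interval) for interval in intervals]


# Hypothetical detections, one per second, for a 25-second clip.
# The detector misses the face at 7 s and 23 s (side views).
detections = [t for t in range(0, 26) if t not in (7, 23)]

# Treating every missed second as an absence fragments the timeline.
print(presence_intervals(detections, max_gap=1))  # [(0, 6), (8, 22), (24, 25)]

# Bridging short gaps recovers one continuous presence interval.
print(presence_intervals(detections, max_gap=2))  # [(0, 25)]
```

A fixed gap threshold is easy to fool, for example when a different person enters the frame during the gap, which is why an actual tracker follows the face region from frame to frame rather than relying on timing alone.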

Extracting topics/keywords using optical character recognition

Look at the following two frames.


These two frames are from a video where the presenter is on stage and obscuring the word “Microsoft” that is imprinted on the back wall. As a human, you know that the word is “Microsoft”. If you run OCR over these two images, you will get “Microsc” and “crosoft” as the output. If you process the full sequence of frames in the video clip, you will get many partial words like these. To reduce the noise and extract the correct word over the sequence of frames in a shot, you need to apply an algorithm to the partial words. Video Indexer does so, and hence you get better insights from the video.
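One way to picture such an algorithm is to score every OCR fragment from a shot against a list of candidate words and accumulate the scores. The sketch below uses Python's standard `difflib.SequenceMatcher` for the similarity score. The candidate vocabulary, the function name `best_word`, and the scoring rule are assumptions for illustration only; the post does not say how Video Indexer actually resolves partial words.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def best_word(fragments, vocabulary):
    """Pick the vocabulary word that best explains all fragments in a shot.

    Each fragment adds its similarity ratio to every candidate word, so
    evidence accumulates across frames. `vocabulary` must be non-empty.
    """
    scores = defaultdict(float)
    for fragment in fragments:
        for word in vocabulary:
            matcher = SequenceMatcher(None, fragment.lower(), word.lower())
            scores[word] += matcher.ratio()
    return max(scores, key=scores.get)


# Hypothetical candidate words; a real system would need a far larger
# vocabulary or a different source of candidates.
vocabulary = ["Microsoft", "Microscope", "Azure"]

# Taken alone, the first fragment is closer to the wrong word.
print(best_word(["Microsc"], vocabulary))             # Microscope

# Combining both fragments from the shot settles on the right one.
print(best_word(["Microsc", "crosoft"], vocabulary))  # Microsoft
```

The single-frame result shows why per-frame OCR output is noisy: a partial word can be a better match for an unrelated term than for the word that is actually on the wall.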

Face recognition

A face recognition system consists of a face database that is built using a set of training images per person. It also provides a query function that does the job of extracting facial features from the query image and matching it against the face database. The output from the query function consists of a list of possible matches along with confidence values. The quality of the output of the query function will depend on the quality of the face database and the query image.

In the case of video, there will be multiple frames where the person is present in different head poses and lighting conditions. One approach is to take each frame where a person is present and query the face recognition system. If you do that, you will get a list of possible matches from the face database, with different confidence values, for every frame. There is also no guarantee that the potential matches will be the same across the sequence of frames. In other words, an additional logic layer is needed to determine the matching face. There is also an opportunity to reduce the number of queries by selecting an appropriate subset of frames to send to the face recognition system.
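A minimal sketch of such a logic layer follows, assuming each per-frame query returns a list of `(name, confidence)` pairs. It sums confidences per candidate over the whole sequence and accepts the winner only if its mean confidence clears a threshold. The function name `identify_track`, the threshold, and the sample results (including the name "Alex") are hypothetical, and this is not presented as Video Indexer's method.

```python
from collections import defaultdict


def identify_track(frame_results, min_mean_confidence=0.5):
    """Combine per-frame face recognition results into one decision.

    `frame_results` holds one list of (name, confidence) pairs per queried
    frame. Returns the best candidate, or None if the evidence is too weak.
    """
    totals = defaultdict(float)
    for matches in frame_results:
        for name, confidence in matches:
            totals[name] += confidence
    if not totals:
        return None
    best = max(totals, key=totals.get)
    # Frames that did not return the candidate count as zero confidence.
    mean_confidence = totals[best] / len(frame_results)
    return best if mean_confidence >= min_mean_confidence else None


# Hypothetical query results for three frames of the same tracked face.
frame_results = [
    [("Doug", 0.9), ("Alex", 0.4)],
    [("Alex", 0.55), ("Doug", 0.5)],   # side view: the top match flips
    [("Doug", 0.8)],
]

print(identify_track(frame_results))  # Doug
```

To reduce the number of queries, the same function can be fed results from a subset of frames only, for example every tenth frame or the frames where the face is most frontal.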

Video also provides the opportunity to build and augment the face database by using the correct variation of training images for a person from multiple video frames. This is possible if you have the logic to track a person across frames and a heuristic algorithm to evaluate the variations. Video Indexer does this and hence is able to build a higher quality face database from the provided videos.