Analyze Live Video in Near Real Time with Microsoft Cognitive Services APIs

Posted on September 8, 2016

Senior Program Manager, Cognitive Services

Microsoft Cognitive Services Vision APIs put state-of-the-art computer vision algorithms at developers' fingertips, with APIs for analyzing individual images and for offline video processing. We want to showcase how you can use a combination of these APIs to create a solution that performs near-real-time analysis of frames taken from a live video stream. You could, in theory, build an app that analyzes live television events, videos of crowds, or people's reactions, or gives real-time information about what a person might be feeling.


This solution might be especially useful for developers looking to create an app that generates useful data from video streams. For example, a developer might want to create an app that can read the reactions of a 10-person focus group while those people are shown a new product or browse through a website. This solution can do that in near real time.
In this post, I will discuss some ways you might achieve near-real-time video analysis using these APIs, along with a C# library we're publishing to make it easier to build your own solution.

The basic components in such a system are:

  • Acquire frames from a video source.
  • Select which frames to analyze.
  • Submit these frames to the API.
  • Consume each analysis result that is returned from the API call.

If you just want the sample code, you can find it on GitHub: https://github.com/Microsoft/Cognitive-Samples-VideoFrameAnalysis/.


A Simple Approach


The simplest design for a near-real-time analysis system is an infinite loop, where in each iteration we grab a frame, analyze it, and then consume the result:


while (true)
{
    Frame f = GrabFrame();
    if (ShouldAnalyze(f))
    {
        AnalysisResult r = await Analyze(f);
        ConsumeResult(r);
    }
}


If our analysis consisted of a lightweight client-side algorithm, this approach would be suitable. However, when our analysis happens in the cloud, an API call might take several seconds, during which time we are not capturing images and our thread is essentially doing nothing. Our maximum frame rate is limited by the latency of the API calls: if each call takes two seconds, we can process at most one frame every two seconds. So we need a solution that lets the API calls run in parallel with frame grabbing.

Parallelizing API Calls


The solution to this problem is to allow the long-running API calls to execute in parallel with the program elements that grab frames. In C#, we could achieve this goal using task-based parallelism. For example:


while (true)
{
    Frame f = GrabFrame();
    if (ShouldAnalyze(f))
    {
        var t = Task.Run(async () =>
        {
            AnalysisResult r = await Analyze(f);
            ConsumeResult(r);
        });
    }
}


This launches each analysis in a separate task, which can run in the background while we continue grabbing new frames. This avoids blocking the main thread while waiting for an API call to return; however, we have lost some of the guarantees that the simple version provided: multiple API calls might occur in parallel, and the results might be returned in the wrong order. Multiple threads could also enter the ConsumeResult() function simultaneously, which could be dangerous if the function is not thread-safe. Finally, this simple code does not keep track of the tasks that get created, so exceptions will silently disappear. The final ingredient, then, is a "consumer" thread that will track the analysis tasks, raise exceptions, kill long-running tasks, and ensure that the results get consumed in the correct order, one at a time.
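As an aside, the thread-safety concern alone could be addressed with a lock. The sketch below (using the same hypothetical GrabFrame/Analyze/ConsumeResult helpers as above, not code from the published library) serializes access to ConsumeResult(), though it does nothing about out-of-order results or lost exceptions:

```csharp
private static readonly object consumeLock = new object();

// Inside the loop, as before:
var t = Task.Run(async () =>
{
    AnalysisResult r = await Analyze(f);

    // Ensure only one thread at a time runs the non-thread-safe consumer.
    lock (consumeLock)
    {
        ConsumeResult(r);
    }
});
```

This is why the producer-consumer design in the next section is preferable: it solves all three problems at once rather than patching them individually.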

A Producer-Consumer Design


In our final "producer-consumer" system, we have a producer thread that looks very similar to our previous infinite loop. However, instead of consuming analysis results as soon as they are available, the producer puts the tasks into a queue to keep track of them.

// Queue that will contain the API call tasks.
var taskQueue = new BlockingCollection<Task<ResultWrapper>>();

// Producer thread.
while (true)
{
    // Grab a frame.
    Frame f = GrabFrame();

    // Decide whether to analyze the frame.
    if (ShouldAnalyze(f))
    {
        // Start a task that will run in parallel with this thread.
        var analysisTask = Task.Run(async () =>
        {
            // Put the frame, and the result/exception, into a wrapper object.
            var output = new ResultWrapper(f);
            try
            {
                output.Analysis = await Analyze(f);
            }
            catch (Exception e)
            {
                output.Exception = e;
            }
            return output;
        });

        // Push the task onto the queue.
        taskQueue.Add(analysisTask);
    }
}


We also have a consumer thread that takes tasks off the queue, waits for them to finish, and either displays the result or raises the exception that was thrown. By using the queue, we can guarantee that results get consumed one at a time, in the correct order, without limiting the maximum frame rate of the system.
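The consumer loop itself is not shown in the snippet above. Assuming the ResultWrapper from the producer exposes the analysis result and any caught exception, a rough sketch of the consumer might look like this:

```csharp
// Consumer thread.
while (true)
{
    // Block until the producer has queued a task, then wait for it to complete.
    Task<ResultWrapper> analysisTask = taskQueue.Take();
    ResultWrapper result = await analysisTask;

    // Either surface the exception from the API call, or consume the result.
    if (result.Exception != null)
        throw result.Exception;
    ConsumeResult(result);
}
```

Because BlockingCollection's Take() blocks until an item is available and returns items in FIFO order, results are consumed one at a time, in the order the frames were submitted.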

Sample Implementation


Together with this blog post, we are making available an implementation of the system described above. It is intended to be flexible enough to implement many such scenarios while being easy to use. The library contains the class FrameGrabber, which implements the producer-consumer system discussed previously to process video frames from a webcam. The user can specify the exact form of the API call, and the class uses events to let the calling code know when a new frame is acquired or a new analysis result is available.


To illustrate some of the possibilities, we are also publishing two sample apps that use the library. The first is a simple console app, and a simplified version of this app is reproduced below. It grabs frames from the default webcam and submits them to the Face API for face detection.

using System;
using VideoFrameAnalyzer;
using Microsoft.ProjectOxford.Face;
using Microsoft.ProjectOxford.Face.Contract;

namespace VideoFrameConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create grabber, with analysis type Face[].
            FrameGrabber<Face[]> grabber = new FrameGrabber<Face[]>();

            // Create Face API Client. Insert your Face API key here.
            FaceServiceClient faceClient = new FaceServiceClient("<subscription key>");

            // Set up our Face API call.
            grabber.AnalysisFunction = async frame => await faceClient.DetectAsync(frame.Image.ToMemoryStream(".jpg"));

            // Set up a listener for when we receive a new result from an API call.
            grabber.NewResultAvailable += (s, e) =>
            {
                if (e.Analysis != null)
                    Console.WriteLine("New result received for frame acquired at {0}. {1} faces detected", e.Frame.Metadata.Timestamp, e.Analysis.Length);
            };

            // Tell grabber to call the Face API every 3 seconds.
            grabber.TriggerAnalysisOnInterval(TimeSpan.FromSeconds(3));

            // Start running.
            grabber.StartProcessingCameraAsync().Wait();

            // Wait for a keypress to stop.
            Console.WriteLine("Press any key to stop...");
            Console.ReadKey();

            // Stop, blocking until done.
            grabber.StopProcessingAsync().Wait();
        }
    }
}


The second sample app is a bit more interesting and allows you to choose which API to call on the video frames. On the left side, the app shows a preview of the live video; on the right side, it shows the most recent API result overlaid on the corresponding frame.

In most modes, there will be a visible delay between the live video on the left, and the visualized analysis on the right. This delay is the time taken to make the API call. The exception to this rule is in the "EmotionsWithClientFaceDetect" mode, which performs face detection locally on the client computer using OpenCV before submitting any images to Cognitive Services. By doing this, we can visualize the detected face immediately, and then update the emotions later once the API call returns. This demonstrates the possibility of a "hybrid" approach, where some simple processing can be performed on the client, and then Cognitive Services APIs can be used to augment this approach with more advanced analysis when necessary.
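The hybrid idea can be sketched roughly as follows. The helper names here (DetectFacesLocally, ShowRectangles) and the emotionClient variable are hypothetical stand-ins, not the sample app's actual code:

```csharp
grabber.AnalysisFunction = async frame =>
{
    // Fast, local face detection (e.g. an OpenCV cascade classifier on the client).
    var faceRects = DetectFacesLocally(frame.Image);

    // Visualize the detected faces immediately, before any network call.
    ShowRectangles(faceRects);

    // Slower cloud call; the visualization updates with emotions when this returns.
    return await emotionClient.RecognizeAsync(frame.Image.ToMemoryStream(".jpg"));
};
```

The user perceives a responsive UI because the cheap local step paints something right away, while the richer cloud analysis fills in asynchronously.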

I hope this post has given you a sense of the possibilities for performing near-real-time image analysis on a live video stream, and how you can use our sample code to get started. Please feel free to provide feedback and suggestions in the comments below, on our UserVoice site, or in the GitHub repository for the sample code.