
Azure Media Analytics is a collection of speech and vision services offered with enterprise-grade scale, compliance, security, and global reach. The services offered as part of Azure Media Analytics are built on the core Azure Media Services platform components and are therefore ready to handle media processing at scale from day one. For the other media processors included in this announcement, see Milan Gada's blog post Introducing Azure Media Analytics.

We are very excited to announce the free public preview of the Azure Media Face Detector, and this blog post will detail the use and output of this technology. This Media Processor (MP) can be used for people counting, movement tracking, and even gauging audience participation and reaction via facial expressions. You can access these features in our new Azure portal, through our APIs with the presets below, or using the free Azure Media Services Explorer tool. This service contains two features, Face Detection and Emotion Detection, and I'll be going over their details in that order.

Face detection

Face detection finds and tracks human faces within a video. Multiple faces can be detected and subsequently be tracked as they move around, with the time and location metadata returned in a JSON file. During tracking, it will attempt to give a consistent ID to the same face while the person is moving around on screen, even if they are obstructed or briefly leave the frame.

Note: This service does not perform facial recognition. An individual who leaves the frame or becomes obstructed for too long will be given a new ID when they return.

Input preset

Here's an example of a JSON configuration preset for face detection only.


 

{"version":"1.0"}

Input video

JSON output (truncated)


 


{
  "version": 1,
  "timescale": 30000,
  "offset": 0,
  "framerate": 29.97,
  "width": 1280,
  "height": 720,
  "fragments": [
    {
      "start": 0,
      "duration": 60060
    },
    {
      "start": 60060,
      "duration": 60060,
      "interval": 1001,
      "events": [
        [
          {
            "id": 0,
            "x": 0.519531,
            "y": 0.180556,
            "width": 0.0867188,
            "height": 0.154167
          }
        ],
        [
          {
            "id": 0,
            "x": 0.517969,
            "y": 0.181944,
            "width": 0.0867188,
            "height": 0.154167
          }
        ],

Emotion detection

Emotion detection is an optional component of the Face Detection Media Processor that returns analysis on multiple emotional attributes for the detected faces, including happiness, sadness, fear, anger, and more. The results are currently returned as aggregate values over a customizable window and interval.

Input configuration

JSON Preset sample:


 

{
  "version": "1.0",
  "options": {
    "aggregateEmotionWindowMs": "987",
    "mode": "aggregateEmotion",
    "aggregateEmotionIntervalMs": "342"
  }
}

The preset supports the following attributes:

  • Mode: Faces (face detection only) or AggregateEmotion (return average emotion values for all faces in the frame).
  • AggregateEmotionWindowMs: Use if AggregateEmotion mode is selected. The length of video used to produce each aggregate result, in milliseconds.
  • AggregateEmotionIntervalMs: Use if AggregateEmotion mode is selected. How frequently to produce aggregate results, in milliseconds.

Aggregate Defaults

Below are recommended values for the aggregate window and interval settings. Window should be longer than Interval.

                 Defaults (s)   Max (s)   Min (s)
  Window Length  2              3         1
  Interval       0.5            1         0.25
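
For example, a preset that sticks to the recommended defaults (a 2-second window with a result every 0.5 seconds, both expressed in milliseconds) would look like the following sketch, mirroring the sample preset above:

{
  "version": "1.0",
  "options": {
    "mode": "aggregateEmotion",
    "aggregateEmotionWindowMs": "2000",
    "aggregateEmotionIntervalMs": "500"
  }
}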

Output

JSON output for aggregate emotion (truncated)


 


{
  "version": 1,
  "timescale": 30000,
  "offset": 0,
  "framerate": 29.97,
  "width": 1280,
  "height": 720,
  "fragments": [
    {
      "start": 0,
      "duration": 60060,
      "interval": 15015,
      "events": [
        [
          {
            "windowFaceDistribution": {
              "neutral": 0,
              "happiness": 0,
              "surprise": 0,
              "sadness": 0,
              "anger": 0,
              "disgust": 0,
              "fear": 0,
              "contempt": 0
            },
            "windowMeanScores": {
              "neutral": 0,
              "happiness": 0,
              "surprise": 0,
              "sadness": 0,
              "anger": 0,
              "disgust": 0,
              "fear": 0,
              "contempt": 0
            }
          }
        ],
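
As a hedged sketch of how this output might be consumed (Newtonsoft.Json is assumed, and the file name below is just a placeholder for wherever you saved the MP output), the following picks the emotion with the highest mean score in each aggregation window:

using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

// Load the aggregate-emotion JSON produced by the MP (placeholder path).
var aggregateEmotionJson = File.ReadAllText("emotion_output.json");
var result = JObject.Parse(aggregateEmotionJson);

foreach (var fragment in (JArray)result["fragments"])
{
    var intervals = fragment["events"] as JArray;   // "events" is an array of arrays
    if (intervals == null) continue;                // fragments with no faces carry no events
    foreach (var window in intervals)
    {
        foreach (var evt in (JArray)window)
        {
            // Report the dominant emotion for this aggregation window.
            var dominant = ((JObject)evt["windowMeanScores"])
                .Properties()
                .OrderByDescending(p => (double)p.Value)
                .First();
            Console.WriteLine($"Dominant emotion: {dominant.Name} ({(double)dominant.Value:F3})");
        }
    }
}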

Understanding the output

The face detection and tracking API provides high precision face location detection and tracking that can detect up to 64 human faces in a video. Frontal faces provide the best results, while side faces and small faces (less than or equal to 24×24 pixels) are challenging.

The detected and tracked faces are returned with coordinates (left, top, width, and height) indicating the location of faces in the frame, normalized to a 0.0 to 1.0 scale, along with a face ID number used to track that individual. Face ID numbers may reset when a frontal face is lost or overlapped in the frame, so some individuals can end up assigned multiple IDs.

JSON reference

For the face detection and tracking operation, the output result contains the metadata from the faces within the given file in JSON format.

The face detection and tracking JSON includes the following attributes:

  • Version: This refers to the version of the Video API.
  • Timescale: “Ticks” per second of the video.
  • Offset: This is the time offset for timestamps. In version 1.0 of Video APIs, this will always be 0. In future scenarios we support, this value may change.
  • Framerate: Frames per second of the video.
  • Fragments: The metadata is chunked up into different segments called fragments. Each fragment contains a start, duration, interval number, and event(s).
  • Start: The start time of the first event in ‘ticks’.
  • Duration: The length of the fragment, in “ticks”.
  • Interval: The interval of each event entry within the fragment, in “ticks”.
  • Events: Each event contains the faces detected and tracked within that time interval. It is an array of arrays: the outer array represents one interval of time, and each inner array consists of zero or more events that happened at that point in time. An empty bracket [ ] means no faces were detected.
  • ID: The ID of the face that is being tracked. This number may change if a face becomes undetected. A given individual should keep the same ID throughout the video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, etc.).
  • X, Y: The upper left X and Y coordinates of the face bounding box in a normalized scale of 0.0 to 1.0.
    • X and Y coordinates are relative to landscape always, so if you have a portrait (or upside-down, in the case of iOS) video, you'll have to transpose the coordinates accordingly.
  • Width, Height: The width and height of the face bounding box in a normalized scale of 0.0 to 1.0.
  • facesDetected: This is found at the end of the JSON results and summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (e.g. face goes off screen, looks away), this number may not always equal the true number of faces in the video.

The reason we have formatted the JSON in this way is to set the APIs up for future scenarios, where it will be important to retrieve metadata quickly and manage a large stream of results. We use both the technique of fragmentation (allowing us to break up the metadata into time-based chunks, where you can download only what you need) and segmentation (allowing us to break up the events if they get too large). Some simple calculations can help you transform the data. For example, if an event started at 6300 (ticks), with a timescale of 2997 (ticks/sec) and a framerate of 29.97 (frames/sec), then:

• Start/Timescale = 6300/2997 = 2.1 seconds

• Start × (Framerate/Timescale) = 6300 × (29.97/2997) = 63 frames

Below is a simple example of extracting the JSON into a per frame format for face detection and tracking:


 

// operationResult and settings come from the surrounding sample code; the MP's
// JSON output is deserialized here with Newtonsoft.Json.
var faceDetectionResultJsonString = operationResult.ProcessingResult;
var faceDetectionTracking =
    JsonConvert.DeserializeObject(faceDetectionResultJsonString, settings);
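
If you want typed access, here is a hedged sketch of the same idea: the file path and the class and property names below are our own (the MP defines only the lowercase JSON field names, which Newtonsoft.Json matches case-insensitively). The loop prints one line per detected face, converting ticks to seconds and frames and scaling the normalized bounding box to pixels.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Newtonsoft.Json;

// Load and deserialize the face detection/tracking output (placeholder path).
var json = File.ReadAllText("face_output.json");
var result = JsonConvert.DeserializeObject<FaceDetectionResult>(json);

foreach (var fragment in result.Fragments.Where(f => f.Events != null))
{
    long ticks = fragment.Start;
    foreach (var facesInFrame in fragment.Events)
    {
        double seconds = (double)ticks / result.Timescale;
        long frame = (long)Math.Round(ticks * result.Framerate / result.Timescale);
        foreach (var face in facesInFrame)
        {
            // The bounding box is normalized (0.0 to 1.0); scale by the frame size to get pixels.
            Console.WriteLine(
                $"frame {frame} ({seconds:F2}s): face {face.Id} at " +
                $"({face.X * result.Width:F0},{face.Y * result.Height:F0}), " +
                $"size {face.Width * result.Width:F0}x{face.Height * result.Height:F0}");
        }
        ticks += fragment.Interval ?? 0;   // one event entry per "interval" ticks
    }
}

// Classes mirroring the JSON fields described above (names here are our own choice).
public class FaceDetectionResult
{
    public int Version { get; set; }
    public long Timescale { get; set; }
    public long Offset { get; set; }
    public double Framerate { get; set; }
    public int Width { get; set; }
    public int Height { get; set; }
    public List<Fragment> Fragments { get; set; }
}

public class Fragment
{
    public long Start { get; set; }
    public long Duration { get; set; }
    public long? Interval { get; set; }
    public List<List<FaceEvent>> Events { get; set; }
}

public class FaceEvent
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
    public double Width { get; set; }
    public double Height { get; set; }
}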

Getting started

To use this service, simply create a Media Services account within your Azure subscription and use our REST API/SDKs or the Azure Media Services Explorer tool.
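
As a rough, hedged sketch of the SDK route (the Azure Media Services .NET SDK, Microsoft.WindowsAzure.MediaServices.Client; the account name, key, and input asset name below are placeholders you supply), submitting a face detection job with the preset shown earlier looks roughly like this:

using System;
using System.Linq;
using System.Threading;
using Microsoft.WindowsAzure.MediaServices.Client;

// Placeholders: your Media Services account credentials and an existing asset with the input video.
var accountName = "<your-account-name>";
var accountKey  = "<your-account-key>";
var context = new CloudMediaContext(new MediaServicesCredentials(accountName, accountKey));
IAsset inputAsset = context.Assets.Where(a => a.Name == "MyInputVideo").First();

// Pick the latest version of the Face Detector Media Processor.
IMediaProcessor processor = context.MediaProcessors
    .Where(p => p.Name == "Azure Media Face Detector")
    .ToList()
    .OrderBy(p => new Version(p.Version))
    .Last();

IJob job = context.Jobs.Create("Face detection job");
ITask task = job.Tasks.AddNew("Face detection task",
                              processor,
                              "{\"version\":\"1.0\"}",        // the face detection preset shown earlier
                              TaskOptions.None);
task.InputAssets.Add(inputAsset);
task.OutputAssets.AddNew("Face detection output", AssetCreationOptions.None);

job.Submit();
job.GetExecutionProgressTask(CancellationToken.None).Wait();  // block until the job completes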

For sample code, check out our documentation page.

Limitations

  • The supported input video formats include MP4, MOV, and WMV.
  • The detectable face size range is 24×24 to 2048×2048 pixels. Faces outside this range will not be detected.
  • For each video, the maximum number of faces returned is 64.
  • Some faces may not be detected due to technical challenges; e.g. very large face angles (head-pose), and large occlusion. Frontal and near-frontal faces have the best results.

Contact us

Keep up with the Azure Media Services blog to hear more updates on the Face Detection Media Processor and the Media Analytics initiative!

If you have any questions about any of the Media Analytics products, send an email to amsanalytics@microsoft.com.
