Azure Media Analytics is a collection of speech and vision services offered with enterprise-grade scale, compliance, security, and global reach. The services in Azure Media Analytics are built on the core Azure Media Services platform components, and are therefore ready to handle media processing at scale from day one. For the other media processors included in this announcement, see Milan Gada's blog post, Introducing Azure Media Analytics.
We are very excited to announce the free public preview of the Azure Media Face Detector, and this post details the use and output of this technology. This Media Processor (MP) can be used for people counting, movement tracking, and even gauging audience participation and reaction via facial expressions. You can access these features in the new Azure portal, through our APIs with the presets below, or using the free Azure Media Services Explorer tool. The service contains two features, Face Detection and Emotion Detection, and I'll go over their details in that order.
Face detection finds and tracks human faces within a video. Multiple faces can be detected and subsequently tracked as they move around, with the time and location metadata returned in a JSON file. During tracking, the service attempts to give a consistent ID to the same face while the person moves around on screen, even if they are obstructed or briefly leave the frame.
Note: This service does not perform facial recognition. An individual who leaves the frame or becomes obstructed for too long will be given a new ID when they return.
Here’s an example of a JSON configuration preset for face detection only.
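As a sketch of such a preset, using the `Mode` option documented in the table below (the `version`/`options` envelope is an assumption, not confirmed by this post):

```json
{
  "version": "1.0",
  "options": {
    "Mode": "Faces"
  }
}
```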
Emotion detection is an optional component of the Face Detection Media Processor that returns analysis of multiple emotional attributes for the faces detected, including happiness, sadness, fear, anger, and more. This data is currently returned as an aggregate value across all detected faces, averaged over a customizable window and reported at a customizable interval.
JSON Preset sample:
| Option | Description |
| --- | --- |
| Mode | Faces: face detection only. AggregateEmotion: return average emotion values for all faces in frame. |
| AggregateEmotionWindowMs | Use if AggregateEmotion mode is selected. The length of video used to produce each aggregate result, in milliseconds. |
| AggregateEmotionIntervalMs | Use if AggregateEmotion mode is selected. How frequently to produce aggregate results, in milliseconds. |
Below are the recommended ranges for the aggregate window and interval settings, expressed as default, maximum, and minimum values in seconds. The window should be longer than the interval.

| Setting | Default (s) | Max (s) | Min (s) |
| --- | --- | --- | --- |
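Putting the options together, an aggregate-emotion preset might look like the following sketch; the option names come from the table above, while the envelope and the specific window/interval values are illustrative (note the window is longer than the interval, per the recommendation):

```json
{
  "version": "1.0",
  "options": {
    "Mode": "AggregateEmotion",
    "AggregateEmotionWindowMs": "2000",
    "AggregateEmotionIntervalMs": "1000"
  }
}
```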
JSON output for Aggregate Emotion (truncated)
Understanding the output
The face detection and tracking API provides high-precision face location detection and tracking, and can detect up to 64 human faces in a video. Frontal faces give the best results, while side faces and small faces (24x24 pixels or smaller) are challenging.
The detected and tracked faces are returned with coordinates (left, top, width, and height) indicating the location of faces in the image in pixels, as well as a face ID number indicating the tracking of that individual. Face ID numbers are prone to reset in circumstances where the frontal face is lost or overlapped in the frame, resulting in some individuals being assigned multiple IDs.
For the face detection and tracking operation, the output result contains the metadata from the faces within the given file in JSON format.
The face detection and tracking JSON includes the following attributes:
- Version: This refers to the version of the Video API.
- Timescale: “Ticks” per second of the video.
- Offset: This is the time offset for timestamps. In version 1.0 of Video APIs, this will always be 0. In future scenarios we support, this value may change.
- Framerate: Frames per second of the video.
- Fragments: The metadata is chunked up into different segments called fragments. Each fragment contains a start, duration, interval number, and event(s).
- Start: The start time of the first event in ‘ticks’.
- Duration: The length of the fragment, in “ticks”.
- Interval: The interval of each event entry within the fragment, in “ticks”.
- Events: Each event contains the faces detected and tracked within that time duration. It is an array of arrays of events: the outer array represents one interval of time, and the inner array consists of zero or more events that happened at that point in time. An empty bracket [ ] means no faces were detected.
- ID: The ID of the face that is being tracked. This number may inadvertently change if a face becomes undetected. A given individual should have the same ID throughout the overall video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, etc.)
- X, Y: The upper left X and Y coordinates of the face bounding box in a normalized scale of 0.0 to 1.0.
- X and Y coordinates are always relative to landscape orientation, so if you have a portrait (or upside-down, in the case of iOS) video, you'll have to transpose the coordinates accordingly.
- Width, Height: The width and height of the face bounding box in a normalized scale of 0.0 to 1.0.
- facesDetected: This is found at the end of the JSON results and summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (e.g. face goes off screen, looks away), this number may not always equal the true number of faces in the video.
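Putting these attributes together, a hypothetical output fragment might look like the sketch below. The field names follow the list above; the values are made up for illustration and were not produced by the service:

```json
{
  "version": 1,
  "timescale": 2997,
  "offset": 0,
  "framerate": 29.97,
  "fragments": [
    {
      "start": 6300,
      "duration": 1199,
      "interval": 1199,
      "events": [
        [
          { "id": 0, "x": 0.519, "y": 0.381, "width": 0.056, "height": 0.1 }
        ]
      ]
    }
  ],
  "facesDetected": 1
}
```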
The reason we have formatted the JSON this way is to set the APIs up for future scenarios where it will be important to retrieve metadata quickly and manage a large stream of results. We use both fragmentation (breaking the metadata into time-based chunks, so you can download only what you need) and segmentation (breaking up the events if they get too large). Some simple calculations can help you transform the data. For example, if an event started at 6300 (ticks), with a timescale of 2997 (ticks/sec) and a framerate of 29.97 (frames/sec), then:
· Start/Timescale = 6300/2997 = 2.1 seconds
· Start x (Framerate/Timescale) = 6300 x (29.97/2997) = 63 frames
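As a quick sanity check, here is a small sketch of these conversions in C#, using the numbers from the example above. The normalized-coordinate conversion at the end assumes a hypothetical 1920x1080 frame; the frame size is not part of the API output.

```csharp
using System;

class TickMath
{
    static void Main()
    {
        long startTicks = 6300;     // event start, in "ticks"
        double timescale = 2997;    // "ticks" per second
        double framerate = 29.97;   // frames per second

        // Ticks -> seconds, then ticks -> frame count
        double seconds = startTicks / timescale;              // 2.1 seconds
        double frames = startTicks * (framerate / timescale); // 63 frames

        // Normalized bounding-box coordinates -> pixels
        // (the 1920x1080 frame size is a hypothetical example)
        double x = 0.519, y = 0.381;
        int frameWidth = 1920, frameHeight = 1080;
        double xPixels = x * frameWidth;   // 996.48 px
        double yPixels = y * frameHeight;  // 411.48 px

        Console.WriteLine($"{seconds:F1} s, frame {frames:F0}, box at ({xPixels:F0}, {yPixels:F0})");
    }
}
```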
Below is a simple example of extracting the JSON into a per-frame format for face detection and tracking:

```csharp
var faceDetectionResultJsonString = operationResult.ProcessingResult;
// Deserialize the JSON string into an object model (e.g., with Newtonsoft.Json);
// FaceDetectionResult here is a type matching the schema described above.
var faceDetectionTracking =
    JsonConvert.DeserializeObject<FaceDetectionResult>(faceDetectionResultJsonString);
```
For full sample code, check out our documentation page. Finally, keep the following limitations in mind:
- The supported input video formats include MP4, MOV, and WMV.
- The detectable face size ranges from 24x24 to 2048x2048 pixels. Faces outside this range will not be detected.
- For each video, the maximum number of faces returned is 64.
- Some faces may not be detected due to technical challenges, such as very large face angles (head pose) or heavy occlusion. Frontal and near-frontal faces give the best results.
Keep up with the Azure Media Services blog to hear more updates on the Face Detection Media Processor and the Media Analytics initiative!
If you have any questions about any of the Media Analytics products, send an email to email@example.com.