Azure Media OCR Public Preview
Not sure what Azure Media OCR is? Check out the introductory blog post on Azure Media OCR.
After gathering valuable feedback from the private preview, we are excited to announce the release of the Azure Media OCR Media Processor for Public Preview.
Along with a release to all public data centers and free availability for all Azure Media Services customers, the public preview release also comes with a new configuration schema, support for JSON, and a new output format.
Configuration
Based on user feedback from the private preview, we have simplified our configuration schema and now support both JSON and XML formats.
JSON
The following is a JSON configuration that detects English text oriented to the right (rotated 90 degrees clockwise) every 1.5 seconds in the following regions:
[Image: the two detect regions drawn over a sample frame. Not to scale.]
```json
{
  "Version": "1.0",
  "Options": {
    "Language": "English",
    "TimeInterval": "00:00:01.5",
    "DetectRegions": [
      { "Left": "12", "Top": "10", "Width": "150", "Height": "100" },
      { "Left": "190", "Top": "200", "Width": "160", "Height": "140" }
    ],
    "TextOrientation": "Right"
  }
}
```
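Here, Left and Top give the pixel offset of a region's top-left corner from the frame's top-left corner, and Width and Height give its size in pixels; the first region above therefore spans pixels 12–162 horizontally and 10–110 vertically.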
XML
The following is an XML configuration that detects text every 0.5 seconds, in any direction, in any supported language:
```xml
<?xml version="1.0" encoding="utf-16"?>
<VideoOcrPreset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                Version="1.0"
                xmlns="http://www.windowsazure.com/media/encoding/Preset/2014/03">
  <Options>
    <Language>AutoDetect</Language>
    <TimeInterval>PT0.5S</TimeInterval>
    <TextOrientation>AutoDetect</TextOrientation>
  </Options>
</VideoOcrPreset>
```
The following table explains the attributes available in the Azure Media OCR configuration:
| Attribute | Description |
| --- | --- |
| Language | (Optional) Describes the language of text for which to look. One of the following: AutoDetect (default), Arabic, Chinese Simplified, Chinese Traditional, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish, Swedish, Turkish. |
| TextOrientation | (Optional) Describes the orientation of text for which to look. "Left" means that the tops of all letters point toward the left. Default text (like that found in a book) is "Up" oriented. One of the following: AutoDetect (default), Up, Right, Down, Left. |
| TimeInterval | (Optional) Describes the sampling rate. Default is every 0.5 seconds. JSON format: HH:mm:ss.SSS (default 00:00:00.500). XML format: W3C XSD duration primitive (default PT0.5S). |
| DetectRegions | (Optional) An array of DetectRegion objects specifying regions within the video frame in which to detect text. A DetectRegion object is made of four integer values: Left (pixels from the left edge of the frame), Top (pixels from the top edge), Width, and Height (the size of the region in pixels). |
Using these options, you can tune the OCR job precisely for your scenario. For example, if you are using Azure Media OCR to extract hard-coded subtitles from a video, you can use the DetectRegions field to ensure you only receive results from the lower third of the screen.
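As an illustration, here is a minimal Python sketch of that subtitle scenario. The subtitle_preset helper is hypothetical (not part of any Azure SDK); it simply assembles a preset like the JSON shown earlier, with a single DetectRegion covering the lower third of a 720p frame:

```python
import json

# Hypothetical helper: build an OCR preset that only scans the lower
# third of a 1280x720 frame, e.g. for extracting hard-coded subtitles.
def subtitle_preset(width=1280, height=720):
    region_top = height * 2 // 3  # top edge of the lower third (480 for 720p)
    return {
        "Version": "1.0",
        "Options": {
            "Language": "English",
            "TimeInterval": "00:00:00.500",
            "DetectRegions": [
                {
                    "Left": "0",
                    "Top": str(region_top),
                    "Width": str(width),
                    "Height": str(height - region_top),
                }
            ],
        },
    }

# The emitted JSON string is what you would supply as the task
# configuration when submitting the OCR job.
print(json.dumps(subtitle_preset(), indent=2))
```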
Output
The output format has also changed significantly, to bring it more in line with the other Azure Media Analytics Media Processors. At a high level, it is a JSON object consisting of general video information and a series of time-based "fragments," each containing 0 or more "events" representing metadata detected by the Media Processor (OCR in this case).
The video above was processed with the new version of Azure Media OCR using a TimeInterval of one second. The following is part of the JSON output:
(Note: You can download the full JSON output here).
{ "version": 1, "timescale": 30000, "offset": 0, "framerate": 29.97, "width": 1280, "height": 720, "fragments": [ { "start": 0, "duration": 87087, "interval": 29029, "events": [ [ { "region": { "language": "English", "orientation": "Up", "lines": [ { "text": "Digital media landscape is always changing", "left": 386, "top": 143, "right": 1175, "bottom": 189, "word": [ { "text": "Digital", "left": 386, "top": 144, "right": 497, "bottom": 189, "confidence": 890 }, { "text": "media", "left": 514, "top": 144, "right": 624, "bottom": 179, "confidence": 998 }, { "text": "landscape", "left": 641, "top": 143, "right": 823, "bottom": 189, "confidence": 881 }, { "text": "is", "left": 838, "top": 145, "right": 861, "bottom": 179, "confidence": 996 }, { "text": "always", "left": 875, "top": 144, "right": 993, "bottom": 189, "confidence": 874 }, { "text": "changing", "left": 1007, "top": 144, "right": 1175, "bottom": 189, "confidence": 997 } ] } ] } }, { "region": { "language": "English", "orientation": "Up", "lines": [ { "text": "Video is the new currency", "left": 395, "top": 438, "right": 679, "bottom": 465, "word": [ { "text": "Video", "left": 395, "top": 438, "right": 458, "bottom": 459, "confidence": 994 }, { "text": "is", "left": 467, "top": 439, "right": 481, "bottom": 459, "confidence": 992 }, …
Understanding the output
The Video OCR output provides time-segmented data on the characters found in your video. You can use attributes such as language or orientation to hone in on exactly the words you are interested in analyzing; the sketch following the table below shows one way to do this.
The output contains the following attributes:
| Attribute | Description |
| --- | --- |
| timescale | "ticks" per second of the video |
| offset | time offset for timestamps; in version 1.0 of the Video APIs, this will always be 0 |
| framerate | frames per second of the video |
| width | width of the video in pixels |
| height | height of the video in pixels |
| fragments | array of time-based chunks into which the metadata is divided |
| start | start time of a fragment in "ticks" |
| duration | length of a fragment in "ticks" |
| interval | interval between events within the given fragment, in "ticks" |
| events | array containing regions |
| region | object representing detected words or phrases |
| language | language of the text detected within a region |
| orientation | orientation of the text detected within a region |
| lines | array of lines of text detected within a region |
| text | the actual text |
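Putting this together, here is a minimal Python sketch that walks the output and prints each detected line with a timestamp in seconds. The file name ocr_output.json is illustrative, and the code assumes the structure shown in the sample above:

```python
import json

# Illustrative file name: the JSON output downloaded from your OCR job.
with open("ocr_output.json") as f:
    data = json.load(f)

ticks_per_second = data["timescale"]

for fragment in data.get("fragments", []):
    # Each entry in "events" is one sample within the fragment, spaced
    # "interval" ticks apart; fragments in which no text was detected
    # are assumed to omit the "events" array.
    for i, event in enumerate(fragment.get("events", [])):
        seconds = (fragment["start"] + i * fragment["interval"]) / ticks_per_second
        for item in event:
            region = item["region"]
            for line in region["lines"]:
                print(f"{seconds:7.2f}s [{region['language']}] {line['text']}")
```

Run against the sample above, this would print both detected lines ("Digital media landscape is always changing" and "Video is the new currency") with their timestamps.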
If you want to learn more about this product, and the scenarios that it enables, read the introductory blog post on Azure Media OCR.
To learn more about Azure Media Analytics, check out the introductory blog post.
If you have any questions about any of the Media Analytics products, send an email to amsanalytics@microsoft.com.