Announcing: Video OCR Public Preview

Posted on 28 June, 2016

Program Manager, Azure Media Services

Azure Media OCR Public Preview

Not sure what Azure Media OCR is? Check out the introductory blog post on Azure Media OCR.

After gathering valuable feedback from the private preview, we are excited to announce the release of the Azure Media OCR Media Processor for Public Preview.

Along with a release to all public datacenters and free availability for all Azure Media Services customers, the public preview also comes with a new configuration schema, support for JSON, and a new output format.

Configuration

After hearing user feedback from the private preview, we have simplified our configuration schema and now support both JSON and XML formats.

JSON

The following is a JSON configuration that detects English text oriented to the right (rotated 90 degrees clockwise) every 1.5 seconds in the following regions:

(Note: Image not to scale).

{
  "Version":"1.0",
  "Options":
  {
    "Language":"English",
    "TimeInterval":"00:00:01.5",
    "DetectRegions":
    [
      {"Left":"12","Top":"10","Width":"150","Height":"100"},
      {"Left":"190","Top":"200","Width":"160","Height":"140"}
    ],
    "TextOrientation":"Right"
  }
}
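If the configuration is generated by a script, building it from a dictionary avoids quoting mistakes. The following is a minimal Python sketch that reproduces the configuration above. The helper name `build_ocr_config` is hypothetical; only the key names and values come from the sample, and region values are kept as strings to match it.

```python
import json


def build_ocr_config(language="English", time_interval="00:00:01.5",
                     orientation="Right", regions=None):
    """Return an Azure Media OCR JSON configuration as a string.

    Hypothetical helper: the key names are taken from the sample above,
    everything else is an illustration.
    """
    options = {"Language": language, "TimeInterval": time_interval}
    if regions:
        # Each region is a (left, top, width, height) tuple in pixels.
        options["DetectRegions"] = [
            {"Left": str(left), "Top": str(top),
             "Width": str(width), "Height": str(height)}
            for (left, top, width, height) in regions
        ]
    options["TextOrientation"] = orientation
    return json.dumps({"Version": "1.0", "Options": options}, indent=2)


config = build_ocr_config(regions=[(12, 10, 150, 100), (190, 200, 160, 140)])
print(config)
```

Serializing with `json.dumps` guarantees double-quoted, well-formed JSON regardless of how the values were assembled.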

 

XML

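The following is an XML configuration that detects text in any supported language. Because TimeInterval and TextOrientation are omitted, their defaults apply: text is sampled every 0.5 seconds and detected in any direction.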

<?xml version="1.0" encoding="utf-16"?>
<VideoOcrPreset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.0" xmlns="http://www.windowsazure.com/media/encoding/Preset/2014/03">
  <Options>
     <Language>AutoDetect</Language>
  </Options>
</VideoOcrPreset>
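The same preset can be produced programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module; the helper name `build_ocr_preset_xml` is hypothetical, and the namespace and element names are copied from the sample above. It omits the XML declaration and the unused `xsi`/`xsd` namespace declarations, which is an assumption about what the service tolerates rather than documented behavior.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the sample preset above.
PRESET_NS = "http://www.windowsazure.com/media/encoding/Preset/2014/03"


def build_ocr_preset_xml(language="AutoDetect"):
    """Return a VideoOcrPreset document as a string (hypothetical helper)."""
    # Register the preset namespace as the default so tags are unprefixed.
    ET.register_namespace("", PRESET_NS)
    root = ET.Element(f"{{{PRESET_NS}}}VideoOcrPreset", {"Version": "1.0"})
    options = ET.SubElement(root, f"{{{PRESET_NS}}}Options")
    ET.SubElement(options, f"{{{PRESET_NS}}}Language").text = language
    return ET.tostring(root, encoding="unicode")


preset_xml = build_ocr_preset_xml()
print(preset_xml)
```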

 

The following attributes are available in the Azure Media OCR configuration:

Language

(Optional) The language of the text to look for. One of the following:

AutoDetect (default), Arabic, Chinese Simplified, Chinese Traditional, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish, Swedish, Turkish.

TextOrientation

(Optional) The orientation of the text to look for. “Left” means that the tops of all letters point toward the left. Default text (like that found in a book) is “Up” oriented. One of the following:

  • AutoDetect (default)

  • Up

  • Right

  • Down

  • Left

TimeInterval

(Optional) Describes the sampling rate. The default is one sample every half second.

JSON format – HH:mm:ss.SSS (default 00:00:00.500)

XML format – W3C XSD duration primitive (default PT0.5)
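For the JSON format, a sampling interval given in seconds has to be written as HH:mm:ss.SSS. A minimal Python sketch of that conversion follows; the function name `format_time_interval` is hypothetical and not part of any Azure SDK.

```python
def format_time_interval(seconds):
    """Format a sampling interval in seconds as HH:mm:ss.SSS."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


print(format_time_interval(0.5))  # 00:00:00.500
print(format_time_interval(1.5))  # 00:00:01.500
```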

DetectRegions

(Optional) An array of DetectRegion objects specifying regions within the video frame in which to detect text.

A DetectRegion object is made of the following four integer values:

  • Left – Pixels from the left margin

  • Top – Pixels from the top margin

  • Width – Width of the region in pixels

  • Height – Height of the region in pixels

 

Using these inputs, you can tailor the OCR job to your use case. For example, if you want to use Azure Media OCR to extract hard-coded subtitles from a video, you can use the DetectRegions attribute to make sure you receive results only from the lower third of the screen.
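As an illustration of the subtitle scenario, the following Python sketch computes a DetectRegion covering the lower third of a frame. The function name `lower_third_region` is hypothetical, and the frame size must be supplied by the caller. Values are returned as integers, following the attribute description; convert them to strings if you want to match the JSON sample exactly.

```python
def lower_third_region(frame_width, frame_height):
    """Return a DetectRegion dict covering the bottom third of the frame."""
    # The region starts two thirds of the way down the frame.
    top = (frame_height * 2) // 3
    return {
        "Left": 0,
        "Top": top,
        "Width": frame_width,
        "Height": frame_height - top,
    }


region = lower_third_region(1280, 720)
print(region)  # {'Left': 0, 'Top': 480, 'Width': 1280, 'Height': 240}
```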

Output

The output format has also changed significantly, to be more in line with the other Azure Media Analytics Media Processors. At a high level, it is a JSON object composed of general video information and a series of time-based “fragments,” each containing zero or more “events” representing metadata detected by the Media Processor (OCR in this case).

 

The above video was processed with the new version of Azure Media OCR using a TimeInterval value of 1 second. The following is part of the JSON output:

(Note: You can download the full JSON output here).

{
  "version": 1,
  "timescale": 30000,
  "offset": 0,
  "framerate": 29.97,
  "width": 1280,
  "height": 720,
  "fragments": [
    {
      "start": 0,
      "duration": 87087,
      "interval": 29029,
      "events": [
        [
          {
            "region": {
              "language": "English",
              "orientation": "Up",
              "lines": [
                {
                  "text": "Digital media landscape is always changing",
                  "left": 386,
                  "top": 143,
                  "right": 1175,
                  "bottom": 189,
                  "word": [
                    {
                      "text": "Digital",
                      "left": 386,
                      "top": 144,
                      "right": 497,
                      "bottom": 189,
                      "confidence": 890
                    },
                    {
                      "text": "media",
                      "left": 514,
                      "top": 144,
                      "right": 624,
                      "bottom": 179,
                      "confidence": 998
                    },
                    {
                      "text": "landscape",
                      "left": 641,
                      "top": 143,
                      "right": 823,
                      "bottom": 189,
                      "confidence": 881
                    },
                    {
                      "text": "is",
                      "left": 838,
                      "top": 145,
                      "right": 861,
                      "bottom": 179,
                      "confidence": 996
                    },
                    {
                      "text": "always",
                      "left": 875,
                      "top": 144,
                      "right": 993,
                      "bottom": 189,
                      "confidence": 874
                    },
                    {
                      "text": "changing",
                      "left": 1007,
                      "top": 144,
                      "right": 1175,
                      "bottom": 189,
                      "confidence": 997
                    }
                  ]
                }
              ]
            }
          },
          {
            "region": {
              "language": "English",
              "orientation": "Up",
              "lines": [
                {
                  "text": "Video is the new currency",
                  "left": 395,
                  "top": 438,
                  "right": 679,
                  "bottom": 465,
                  "word": [
                    {
                      "text": "Video",
                      "left": 395,
                      "top": 438,
                      "right": 458,
                      "bottom": 459,
                      "confidence": 994
                    },
                    {
                      "text": "is",
                      "left": 467,
                      "top": 439,
                      "right": 481,
                      "bottom": 459,
                      "confidence": 992
                    },
…

Understanding the output

The Video OCR output provides time-segmented data on the characters found in your video. You can use attributes such as language or orientation to zero in on exactly the words that you are interested in analyzing.
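Filtering by these attributes can be sketched in Python as follows. The function name `iter_region_text` is hypothetical, the traversal assumes the nesting shown in the sample output (fragments, then events, then regions, then lines), and the small `sample_output` dictionary is made-up test data, not real service output.

```python
def iter_region_text(ocr_output, language=None, orientation=None):
    """Yield the text of each line in regions matching the given filters."""
    for fragment in ocr_output.get("fragments", []):
        for event in fragment.get("events", []):
            for item in event:
                region = item.get("region", {})
                if language and region.get("language") != language:
                    continue
                if orientation and region.get("orientation") != orientation:
                    continue
                for line in region.get("lines", []):
                    yield line["text"]


# Made-up data in the shape of the sample output above.
sample_output = {
    "fragments": [{
        "events": [[
            {"region": {"language": "English", "orientation": "Up",
                        "lines": [{"text": "Video is the new currency"}]}},
            {"region": {"language": "German", "orientation": "Right",
                        "lines": [{"text": "Beispiel"}]}},
        ]]
    }]
}

english_lines = list(iter_region_text(sample_output, language="English"))
print(english_lines)
```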

The output contains the following attributes:

  • timescale – “ticks” per second of the video

  • offset – time offset for timestamps. In version 1.0 of Video APIs, this will always be 0.

  • framerate – frames per second of the video

  • width – width of the video in pixels

  • height – height of the video in pixels

  • fragments – array of time-based chunks of video into which the metadata is chunked

  • start – start time of a fragment in “ticks”

  • duration – length of a fragment in “ticks”

  • interval – interval of each event within the given fragment

  • events – array containing regions

  • region – object representing detected words or phrases

  • language – language of the text detected within a region

  • orientation – orientation of the text detected within a region

  • lines – array of lines of text detected within a region

  • text – the actual text
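Because start, duration, and interval are expressed in ticks, dividing by timescale converts them to seconds. The Python sketch below attaches a timestamp to each event. It assumes that the entries of the events array are consecutive, each covering one interval beginning at the fragment's start, which matches the sample (a duration of 87087 ticks is three intervals of 29029 ticks) but is not stated explicitly here. The function name `event_times` is hypothetical, and `sample_output` is a trimmed-down illustration rather than the full output.

```python
def event_times(ocr_output):
    """Yield (start_seconds, text_lines) for each event in each fragment."""
    timescale = ocr_output["timescale"]
    for fragment in ocr_output["fragments"]:
        start = fragment["start"]
        interval = fragment.get("interval", 0)
        for index, event in enumerate(fragment.get("events", [])):
            # Assumption: event number `index` begins `index` intervals
            # after the fragment start.
            seconds = (start + index * interval) / timescale
            texts = [line["text"]
                     for item in event
                     for line in item["region"]["lines"]]
            yield seconds, texts


# Trimmed-down illustration in the shape of the sample output.
sample_output = {
    "timescale": 30000,
    "fragments": [{
        "start": 0,
        "duration": 87087,
        "interval": 29029,
        "events": [
            [{"region": {"lines": [
                {"text": "Digital media landscape is always changing"}]}}],
            [],
            [{"region": {"lines": [{"text": "Video is the new currency"}]}}],
        ],
    }],
}

for seconds, texts in event_times(sample_output):
    print(f"{seconds:.3f}s", texts)
```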

 

If you want to learn more about this product, and the scenarios that it enables, read the introductory blog post on Azure Media OCR.

To learn more about Azure Media Analytics, check out the introductory blog post.

If you have any questions about any of the Media Analytics products, send an email to amsanalytics@microsoft.com.