Azure Media OCR Public Preview
Not sure what Azure Media OCR is? Check out the introductory blog post on Azure Media OCR.
After gathering valuable feedback from the private preview, we are excited to announce the release of the Azure Media OCR Media Processor for Public Preview.
Along with a release to all public data centers and free availability for all Azure Media Services customers, the public preview release also comes with a new configuration schema, support for JSON, and a new output format.
Configuration
Based on user feedback from the private preview, we have simplified our configuration schema and now support both JSON and XML formats.
JSON
The following is a JSON configuration that detects English text oriented to the right (rotated 90 degrees clockwise) every 1.5 seconds in the following regions:
[Image: the two detect regions drawn over a sample frame. Not to scale.]
```json
{
  "Version": "1.0",
  "Options": {
    "Language": "English",
    "TimeInterval": "00:00:01.5",
    "DetectRegions": [
      { "Left": "12", "Top": "10", "Width": "150", "Height": "100" },
      { "Left": "190", "Top": "200", "Width": "160", "Height": "140" }
    ],
    "TextOrientation": "Right"
  }
}
```
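Here, Left and Top give the pixel offset of a region's top-left corner from the frame's top-left corner, and Width and Height give its size in pixels; the first region above therefore spans pixels 12–162 horizontally and 10–110 vertically.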
XML
The following is an XML configuration that detects text every 0.5 seconds, in any direction, in any supported language:
```xml
<?xml version="1.0" encoding="utf-16"?>
<VideoOcrPreset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                Version="1.0"
                xmlns="http://www.windowsazure.com/media/encoding/Preset/2014/03">
  <Options>
    <Language>AutoDetect</Language>
    <TimeInterval>PT0.5S</TimeInterval>
    <TextOrientation>AutoDetect</TextOrientation>
  </Options>
</VideoOcrPreset>
```
The following table explains the attributes available in the Azure Media OCR configuration:
| Attribute | Description |
| --- | --- |
| Language | (Optional) Describes the language of text for which to look. One of the following: AutoDetect (default), Arabic, Chinese Simplified, Chinese Traditional, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish, Swedish, Turkish. |
| TextOrientation | (Optional) Describes the orientation of text for which to look. "Left" means that the tops of all letters point toward the left. Default text (like that found in a book) is "Up" oriented. One of the following: AutoDetect (default), Up, Right, Down, Left. |
| TimeInterval | (Optional) Describes the sampling rate. Default is every 0.5 seconds. JSON format: HH:mm:ss.SSS (default 00:00:00.500). XML format: W3C XSD duration primitive (default PT0.5S). |
| DetectRegions | (Optional) An array of DetectRegion objects specifying regions within the video frame in which to detect text. A DetectRegion object is made of four integer values: Left (pixels from the left edge of the frame), Top (pixels from the top edge), Width, and Height (the size of the region in pixels). |
Using these options, you can tune the OCR job precisely for your scenario. For example, if you are using Azure Media OCR to extract hard-coded subtitles from a video, you can use the DetectRegions field to ensure you only receive results from the lower third of the screen.
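As an illustration, here is a minimal Python sketch of that subtitle scenario. The subtitle_preset helper is hypothetical (not part of any Azure SDK); it simply assembles a preset like the JSON shown earlier, with a single DetectRegion covering the lower third of a 720p frame:

```python
import json

# Hypothetical helper: build an OCR preset that only scans the lower
# third of a 1280x720 frame, e.g. for extracting hard-coded subtitles.
def subtitle_preset(width=1280, height=720):
    region_top = height * 2 // 3  # top edge of the lower third (480 for 720p)
    return {
        "Version": "1.0",
        "Options": {
            "Language": "English",
            "TimeInterval": "00:00:00.500",
            "DetectRegions": [
                {
                    "Left": "0",
                    "Top": str(region_top),
                    "Width": str(width),
                    "Height": str(height - region_top),
                }
            ],
        },
    }

# The emitted JSON string is what you would supply as the task
# configuration when submitting the OCR job.
print(json.dumps(subtitle_preset(), indent=2))
```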
Output
The output format has also changed significantly, to bring it more in line with the other Azure Media Analytics Media Processors. At a high level, it is a JSON object consisting of general video information and a series of time-based "fragments," each containing 0 or more "events" representing metadata detected by the Media Processor (OCR in this case).
The video above was processed with the new version of Azure Media OCR using a TimeInterval of one second. The following is part of the JSON output:
(Note: You can download the full JSON output here).
{ "version": 1, "timescale": 30000, "offset": 0, "framerate": 29.97, "width": 1280, "height": 720, "fragments": [ { "start": 0, "duration": 87087, "interval": 29029, "events": [ [ { "region": { "language": "English", "orientation": "Up", "lines": [ { "text": "Digital media landscape is always changing", "left": 386, "top": 143, "right": 1175, "bottom": 189, "word": [ { "text": "Digital", "left": 386, "top": 144, "right": 497, "bottom": 189, "confidence": 890 }, { "text": "media", "left": 514, "top": 144, "right": 624, "bottom": 179, "confidence": 998 }, { "text": "landscape", "left": 641, "top": 143, "right": 823, "bottom": 189, "confidence": 881 }, { "text": "is", "left": 838, "top": 145, "right": 861, "bottom": 179, "confidence": 996 }, { "text": "always", "left": 875, "top": 144, "right": 993, "bottom": 189, "confidence": 874 }, { "text": "changing", "left": 1007, "top": 144, "right": 1175, "bottom": 189, "confidence": 997 } ] } ] } }, { "region": { "language": "English", "orientation": "Up", "lines": [ { "text": "Video is the new currency", "left": 395, "top": 438, "right": 679, "bottom": 465, "word": [ { "text": "Video", "left": 395, "top": 438, "right": 458, "bottom": 459, "confidence": 994 }, { "text": "is", "left": 467, "top": 439, "right": 481, "bottom": 459, "confidence": 992 }, …
Understanding the output
The Video OCR output provides time-segmented data on the characters found in your video. You can use attributes such as language or orientation to hone in on exactly the words you are interested in analyzing; the sketch following the table below shows one way to do this.
The output contains the following attributes:
| Attribute | Description |
| --- | --- |
| timescale | "ticks" per second of the video |
| offset | time offset for timestamps; in version 1.0 of the Video APIs, this will always be 0 |
| framerate | frames per second of the video |
| width | width of the video in pixels |
| height | height of the video in pixels |
| fragments | array of time-based chunks into which the metadata is divided |
| start | start time of a fragment in "ticks" |
| duration | length of a fragment in "ticks" |
| interval | interval between events within the given fragment, in "ticks" |
| events | array containing regions |
| region | object representing detected words or phrases |
| language | language of the text detected within a region |
| orientation | orientation of the text detected within a region |
| lines | array of lines of text detected within a region |
| text | the actual text |
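Putting this together, here is a minimal Python sketch that walks the output and prints each detected line with a timestamp in seconds. The file name ocr_output.json is illustrative, and the code assumes the structure shown in the sample above:

```python
import json

# Illustrative file name: the JSON output downloaded from your OCR job.
with open("ocr_output.json") as f:
    data = json.load(f)

ticks_per_second = data["timescale"]

for fragment in data.get("fragments", []):
    # Each entry in "events" is one sample within the fragment, spaced
    # "interval" ticks apart; fragments in which no text was detected
    # are assumed to omit the "events" array.
    for i, event in enumerate(fragment.get("events", [])):
        seconds = (fragment["start"] + i * fragment["interval"]) / ticks_per_second
        for item in event:
            region = item["region"]
            for line in region["lines"]:
                print(f"{seconds:7.2f}s [{region['language']}] {line['text']}")
```

Run against the sample above, this would print both detected lines ("Digital media landscape is always changing" and "Video is the new currency") with their timestamps.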
If you want to learn more about this product, and the scenarios that it enables, read the introductory blog post on Azure Media OCR.
To learn more about Azure Media Analytics, check out the introductory blog post.
If you have any questions about any of the Media Analytics products, send an email to amsanalytics@microsoft.com.