Announcing Video OCR Private Preview on Azure Media Analytics

Published on April 20, 2016

Program Manager, Azure Media Services

Note: The following blog post describes a component of Azure Media Analytics. To learn more and find out how to get started, please read the introductory blog post on Azure Media Analytics.

Video OCR is now in Public Preview.  Read the follow-up blog to learn more here.

Video OCR

OCR (Optical Character Recognition) is the conversion of visual text from video into editable, searchable digital text. Video OCR detects text content in video files and generates text files for your use. This allows you to automate the extraction of meaningful metadata from the video signal of your media. 

When Video OCR is used in conjunction with a search engine, you can easily index your media by text and make your content more discoverable. This is extremely useful for highly textual video, such as a video recording or screen capture of a slideshow presentation. The Azure OCR Media Processor is optimized for digital text.

Today we are proud to announce the private preview of the Azure OCR Media Processor as part of Azure Media Analytics.

Sample video from Oceaneering

The engineering company Oceaneering routinely surveys and maps submarine oil and gas operations. As a result, they often have large volumes of video data with “burned-in” text overlaid on the video stream, such as GPS coordinates, depth, and other meaningful tags. To index and organize their massive video library efficiently, Oceaneering engineers would like to extract these text overlays, and Azure Media OCR lets them process that content in the cloud at enterprise scale.

Let’s take a look at a real-world sample video from Oceaneering:

Note, in particular, the text overlays along the top of the video frame. This section contains valuable data that tags and labels the video and its contents. Using the Azure Media OCR Media Processor, we are able to extract this text-based information from the video file. You can download the full XML output, or simply see the truncated results below, generated under default settings for the first frame (timestamp 0:00) of the sample video above:

Truncated output

    <FrameTextData time="0.00">
      <TextRegion left="15" top="4" width="705" height="50">
        <TextData>  MD-13  Mad Dog  cp -148</TextData>
        <TextLine left="15" top="4" width="51" height="13">
          <TextData> MD-13</TextData>
          <TextBlob left="15" top="4" width="51" height="13" confidence="464" language="6">
            <TextData>MD-13</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="15" top="38" width="68" height="16">
          <TextData> Mad Dog</TextData>
          <TextBlob left="15" top="38" width="30" height="14" confidence="494" language="6">
            <TextData>Mad</TextData>
          </TextBlob>
          <TextBlob left="53" top="38" width="30" height="16" confidence="698" language="6">
            <TextData>Dog</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="659" top="38" width="61" height="13">
          <TextData> cp -148</TextData>
          <TextBlob left="659" top="38" width="19" height="13" confidence="424" language="6">
            <TextData>cp</TextData>
          </TextBlob>
          <TextBlob left="685" top="38" width="35" height="13" confidence="603" language="6">
            <TextData>-148</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
      <TextRegion left="332" top="4" width="54" height="47">
        <TextData>  Drilling  33</TextData>
        <TextLine left="332" top="4" width="54" height="16">
          <TextData> Drilling</TextData>
          <TextBlob left="332" top="4" width="54" height="16" confidence="594" language="6">
            <TextData>Drilling</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="333" top="38" width="18" height="13">
          <TextData> 33</TextData>
          <TextBlob left="333" top="38" width="18" height="13" confidence="355" language="6">
            <TextData>33</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
      <TextRegion left="394" top="4" width="38" height="13">
        <TextData>  Riser</TextData>
        <TextLine left="394" top="4" width="38" height="13">
          <TextData> Riser</TextData>
          <TextBlob left="394" top="4" width="38" height="13" confidence="606" language="6">
            <TextData>Riser</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
      <TextRegion left="885" top="4" width="83" height="47">
        <TextData>  8/12/2012  59</TextData>
        <TextLine left="885" top="4" width="83" height="15">
          <TextData> 8/12/2012</TextData>
          <TextBlob left="885" top="4" width="83" height="15" confidence="524" language="6">
            <TextData>8/12/2012</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="940" top="38" width="18" height="13">
          <TextData> 59</TextData>
          <TextBlob left="940" top="38" width="18" height="13" confidence="333" language="6">
            <TextData>59</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
      <TextRegion left="1004" top="4" width="112" height="47">
        <TextData>  E 2528185.  79  N 9875573.57</TextData>
        <TextLine left="1004" top="4" width="88" height="13">
          <TextData> E 2528185.</TextData>
          <TextBlob left="1004" top="4" width="8" height="13" confidence="562" language="6">
            <TextData>E</TextData>
          </TextBlob>
          <TextBlob left="1019" top="4" width="73" height="13" confidence="603" language="6">
            <TextData>2528185.</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="1096" top="4" width="17" height="13">
          <TextData> 79</TextData>
          <TextBlob left="1096" top="4" width="17" height="13" confidence="317" language="6">
            <TextData>79</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="1004" top="38" width="112" height="13">
          <TextData> N 9875573.57</TextData>
          <TextBlob left="1004" top="38" width="10" height="13" confidence="566" language="6">
            <TextData>N</TextData>
          </TextBlob>
          <TextBlob left="1021" top="38" width="95" height="13" confidence="669" language="6">
            <TextData>9875573.57</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
      <TextRegion left="1140" top="4" width="82" height="48">
        <TextData>  H 2.13  D 2245.90</TextData>
        <TextLine left="1140" top="4" width="51" height="13">
          <TextData> H 2.13</TextData>
          <TextBlob left="1140" top="4" width="10" height="13" confidence="608" language="6">
            <TextData>H</TextData>
          </TextBlob>
          <TextBlob left="1157" top="4" width="34" height="13" confidence="608" language="6">
            <TextData>2.13</TextData>
          </TextBlob>
        </TextLine>
        <TextLine left="1140" top="39" width="82" height="13">
          <TextData> D 2245.90</TextData>
          <TextBlob left="1140" top="39" width="11" height="12" confidence="536" language="6">
            <TextData>D</TextData>
          </TextBlob>
          <TextBlob left="1158" top="39" width="64" height="13" confidence="427" language="6">
            <TextData>2245.90</TextData>
          </TextBlob>
        </TextLine>
      </TextRegion>
    </FrameTextData>

With this XML file, Oceaneering is able to automate the tagging of their multimedia without investing in the rearchitecture required to pass this overlay data through their post-processing pipeline. They can easily index videos (or segments of videos) that represent views at a certain depth or in a certain geographical vicinity based on the overlaid position data on their camera feeds.
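As a rough illustration of how such output might be consumed, the following Python sketch parses one `FrameTextData` element with the standard library and pulls out each text line with its position. The sample string is an abbreviated copy of the output above; the function names `extract_lines` and `labelled_values` are hypothetical, and the "single letter followed by a number" layout is an assumption based only on this one frame. What each label (such as `H` or `D`) stands for is not stated in the output itself.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the truncated output above (last TextRegion only).
SAMPLE = """
<FrameTextData time="0.00">
  <TextRegion left="1140" top="4" width="82" height="48">
    <TextData>  H 2.13  D 2245.90</TextData>
    <TextLine left="1140" top="4" width="51" height="13">
      <TextData> H 2.13</TextData>
      <TextBlob left="1140" top="4" width="10" height="13" confidence="608" language="6">
        <TextData>H</TextData>
      </TextBlob>
      <TextBlob left="1157" top="4" width="34" height="13" confidence="608" language="6">
        <TextData>2.13</TextData>
      </TextBlob>
    </TextLine>
    <TextLine left="1140" top="39" width="82" height="13">
      <TextData> D 2245.90</TextData>
      <TextBlob left="1140" top="39" width="11" height="12" confidence="536" language="6">
        <TextData>D</TextData>
      </TextBlob>
      <TextBlob left="1158" top="39" width="64" height="13" confidence="427" language="6">
        <TextData>2245.90</TextData>
      </TextBlob>
    </TextLine>
  </TextRegion>
</FrameTextData>
"""


def extract_lines(frame_xml):
    """Return (time, text, left, top) for every TextLine in one frame."""
    frame = ET.fromstring(frame_xml.strip())
    time = float(frame.get("time"))
    rows = []
    for line in frame.iter("TextLine"):
        # The direct TextData child of a TextLine holds the whole line's text;
        # the per-word TextData elements sit one level deeper, inside TextBlob.
        text = line.findtext("TextData", default="").strip()
        rows.append((time, text, int(line.get("left")), int(line.get("top"))))
    return rows


def labelled_values(rows):
    """Collect lines shaped like 'D 2245.90' into {label: number}.

    Assumption: a one-letter label followed by a single numeric value.
    Lines that do not match this shape are skipped.
    """
    values = {}
    for _time, text, _left, _top in rows:
        parts = text.split()
        if len(parts) == 2 and len(parts[0]) == 1 and parts[0].isalpha():
            try:
                values[parts[0]] = float(parts[1])
            except ValueError:
                pass  # second token is not a number; ignore this line
    return values


rows = extract_lines(SAMPLE)
values = labelled_values(rows)
```

Values collected this way could then be stored next to the frame timestamp, so that segments can be looked up by the overlaid readings.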

Sample enterprise video

Oceaneering, however, represents a niche scenario for Video OCR. A more common use case would be the extraction of text data from PowerPoint slides in a recorded lecture.

Check out the following clip of an Azure Media Services presentation at //Build.

From this, we were able to extract all of the text (except for the //Build logo): 

Digital media landscape is always changing
Huge capital investment required Delivering video is hard, expensive, time - consuming, with a need for high scale and high availability especially hard and costly as both audience sizes and content libraries grow and shrink and grow again.
Delivering video is hard, expensive, time - consuming, with a need for
high scale and high availability especially hard and costly as both
audience sizes and content libraries grow and shrink and grow
Video is the new currency Audiences of all kinds are changing and demanding content on their own devices, wherever they are.That isn 't easy: So many different device profiles and different delivery technologies.
Audiences of all kinds are changing and demanding content
on their own devices, wherever they are.That isn 't easy: So
many different device profiles and different delivery
Azure Media Services Microsoft's cloud platform enables on demand and live streaming video solutions for consumer and enterprise scenarios.
Microsoft's cloud platform enables on
demand and live streaming video solutions
for consumer and enterprise scenarios.

This shows the power of Video OCR in processing slideshow presentations. Ideally every enterprise or educational institution should be able to index the valuable data contained in their presentation recordings, easily enabling search and discovery.
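To make the search-and-discovery idea concrete, here is a minimal sketch of an inverted index over extracted slide text, written in plain Python. It is not part of Azure Media Analytics; `build_index` and `search` are hypothetical helpers, and the video identifier and timestamps in the example are placeholders rather than values from the actual OCR output.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(entries):
    """Map each lowercase word to the set of (video_id, timestamp) where it appears.

    entries: iterable of (video_id, timestamp_seconds, text) tuples.
    """
    index = defaultdict(set)
    for video_id, timestamp, text in entries:
        for word in WORD.findall(text.lower()):
            index[word].add((video_id, timestamp))
    return index


def search(index, query):
    """Return sorted (video_id, timestamp) hits that contain every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Slide text taken from the extracted output above; the id and times are placeholders.
entries = [
    ("build-talk", 0.0, "Digital media landscape is always changing"),
    ("build-talk", 30.0, "Video is the new currency"),
    ("build-talk", 60.0, "Azure Media Services Microsoft's cloud platform enables "
                         "on demand and live streaming video solutions"),
]
index = build_index(entries)
video_hits = search(index, "video")
streaming_hits = search(index, "live streaming")
```

A real deployment would more likely hand the extracted text to a dedicated search engine; the sketch only shows why timestamped text makes individual segments of a recording findable.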

The configuration preset for this private preview version of Video OCR includes the following:

TimeInterval: Integer greater than or equal to 0. Specifies the sampling frequency for OCR. A value of 1.5 would sample one frame every 1.5 seconds. Default is 0 (samples every frame).

Language: One of the following strings: Arabic, Chinese Simplified, Chinese Traditional, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish, Swedish, and Turkish.

Getting started

To gain access to our Private Preview of OCR, sign up on the Azure website here.

Once you have been granted access to the private preview, you can submit jobs with the following configuration and Media Processor name:

Note: This blog post is out of date. For the most accurate information, check out the documentation page for Azure Media OCR or a more recent documentation page.

task configuration

<?xml version="1.0" encoding="utf-8"?>

<configuration version="2.0">
  <input>
    <metadata key="title" value="" />
    <metadata key="description" value="" />
  </input>
  <features>
    <feature name="ocr">
      <settings>       
        <add key="OutputFormats" value="txt|xml" />
        <add key="Language" value="english"/>
        <add key="TimeInterval" value="5"/>
      </settings>
    </feature>
  </features>
  <settings>

  </settings>
</configuration>

Media Processor name: “Azure Media OCR”
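If you generate the task configuration programmatically, a small helper along these lines could build the same XML and reject values outside the documented preset. This is only a sketch: `build_ocr_config` is a hypothetical function, not part of any Azure SDK. The sample configuration writes the language as lowercase `english`, so the helper lowercases whatever it is given; whether multi-word names such as "Chinese Simplified" are accepted in that form is an assumption.

```python
import xml.etree.ElementTree as ET

# Language names as listed in this post, lowercased for comparison.
SUPPORTED_LANGUAGES = {
    "arabic", "chinese simplified", "chinese traditional", "czech", "danish",
    "dutch", "english", "finnish", "french", "german", "greek", "hungarian",
    "italian", "japanese", "korean", "norwegian", "polish", "portuguese",
    "romanian", "russian", "serbian cyrillic", "serbian latin", "slovak",
    "spanish", "swedish", "turkish",
}


def build_ocr_config(language="english", time_interval=5, output_formats="txt|xml"):
    """Return the task configuration XML as a string (hypothetical helper)."""
    if language.lower() not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if time_interval < 0:
        raise ValueError("TimeInterval must be greater than or equal to 0")

    config = ET.Element("configuration", version="2.0")
    source = ET.SubElement(config, "input")
    ET.SubElement(source, "metadata", key="title", value="")
    ET.SubElement(source, "metadata", key="description", value="")

    features = ET.SubElement(config, "features")
    feature = ET.SubElement(features, "feature", name="ocr")
    settings = ET.SubElement(feature, "settings")
    for key, value in (
        ("OutputFormats", output_formats),
        ("Language", language.lower()),  # assumption: lowercase, as in the sample
        ("TimeInterval", str(time_interval)),
    ):
        ET.SubElement(settings, "add", key=key, value=value)

    # Empty top-level <settings/> element, mirroring the sample configuration.
    ET.SubElement(config, "settings")

    body = ET.tostring(config, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body


config_xml = build_ocr_config()
```

Calling `build_ocr_config(language="French", time_interval=0)` would request French OCR on every frame, under the same assumptions.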

 

To learn more about Azure Media Analytics, check out the introductory blog post.

If you have any questions about any of the Media Analytics products, send an email to amsanalytics@microsoft.com.