
Easily Transform JSON Data When Importing to Azure DocumentDB

The Azure DocumentDB data migration tool allows you to easily transform data during import.

The Azure DocumentDB Data Migration Tool is an open-source solution that imports data to DocumentDB, Azure's NoSQL document database service.  You may already know the tool (available on GitHub or the Microsoft Download Center) supports importing data to DocumentDB from a variety of sources, including JSON files, CSV files, SQL Server, MongoDB, Azure Table storage, Amazon DynamoDB, HBase, and DocumentDB collections.  But did you know the tool also supports custom transformation of data during import?  Let’s see how easy it is to apply custom transformations during the import process.

When importing data to DocumentDB using the migration tool, the default option is bulk import. This leverages a DocumentDB JavaScript stored procedure to import data as efficiently and quickly as possible.

[Screenshot: DataMigrationTool_1]

Two stored procedures are available in the migration tool directory, BulkInsert.js and BulkTransformationInsert.js, both of which perform bulk import operations.

BulkInsert.js is the default import stored procedure the tool uses, but you can specify a custom stored procedure you wish to use in the Advanced Options section.

[Screenshot: DataMigrationTool_2]

The difference between BulkInsert.js and BulkTransformationInsert.js is the latter includes a transformDocument function “stub” you can use to apply custom transformation logic to every document being imported.

    function transformDocument(doc) {
        /*
            Add custom transformation code here. You can either modify an existing document and
            return it or create a new one.

            Example #1, rename "ExternalId" field to "id":

                doc["id"] = doc["ExternalId"];
                delete doc.ExternalId;

            Example #2, convert string field to an array:

                if (doc["Cities"]) {
                    doc["Cities"] = doc["Cities"].split("|");
                }
        */
        return doc;
    }
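As a concrete illustration, here is the stub filled in with Example #1 from its comments, applied to a hypothetical input document (the `ExternalId` value below is made up for demonstration):

```javascript
// transformDocument filled in with Example #1 from the stub:
// rename the "ExternalId" field to "id".
function transformDocument(doc) {
    doc["id"] = doc["ExternalId"];
    delete doc.ExternalId;
    return doc;
}

// Hypothetical input document, for illustration only.
const result = transformDocument({ ExternalId: "12345", Name: "Contoso" });
// result.id is "12345" and result no longer has an ExternalId property.
console.log(result.id);
```

Since the migration tool calls `transformDocument` on every document it imports, logic like this runs once per document with no extra wiring on your part.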

You can use BulkTransformationInsert.js as a template, modify it as you see fit, then have the migration tool use your custom stored procedure during import.  Let’s walk through an example.

The city of Chicago has a food inspection data set* available in CSV format.  If you perform a default import of this data set with the migration tool, a typical document looks like this:

{
  "id": "1575294321",
  "DBA Name": "CONTOSO CO.",
  "AKA Name": "CONTOSO CO.",
  "License #": 2423456321,
  "Facility Type": "Restaurant",
  "Risk": "Risk 1 (High)",
  "Address": "3141 N FABRIKAM ST",
  "City": "CHICAGO",
  "State": "IL",
  "Zip": 60657,
  "Inspection Date": "2015-09-09T00:00:00.0000000Z",
  "Inspection Type": "License",
  "Results": "Pass",
  "Violations": "30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: OBSERVED WRAPPED TO GO DESSERTS AT THE FRONT COUNTER UNLABELED. INSTRUCTED FACILITY TO LABEL AND MAINTAIN WITH PRODUCTION FACILITY NAME AND CONTACT INFORMATION. | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: INSTRUCTED FACILITY TO INSTALL AND MAINTAIN COVING ALONG THE WALLS IN THE PREP AREA. | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES OF LIGHT PROVIDED, FIXTURES SHIELDED - Comments: INSTRUCTED FACILITY TO INSTALL LIGHT SHIELDS OR TO OBTAIN SHATTER PROOF LIGHT BULBS AND TO MAINTAIN."
}

Note the Violations property.  It contains a number of violations, each separated by a ‘|’ character.  We can use a custom transformation to store the list of violations as a proper JSON array within each document.  We’ll modify the transformDocument function of BulkTransformationInsert.js to do this and save the modified script as TransformChicagoDataset.js.

    function transformDocument(doc) {
        if (doc["Violations"]) {
            doc["Violations"] = doc["Violations"].split(" | ");
        }
        return doc;
    }
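To see the effect of this function outside the tool, you can run it against a document with an abbreviated Violations string (the sample values below are shortened for readability):

```javascript
// The same transformation used in TransformChicagoDataset.js:
// split the pipe-delimited Violations string into a JSON array.
function transformDocument(doc) {
    if (doc["Violations"]) {
        doc["Violations"] = doc["Violations"].split(" | ");
    }
    return doc;
}

// Abbreviated sample document, for illustration only.
const sample = transformDocument({
    id: "1575294321",
    Violations: "30. FOOD IN ORIGINAL CONTAINER... | 34. FLOORS... | 36. LIGHTING..."
});
console.log(sample.Violations.length); // 3
```

Note the delimiter passed to `split` is `" | "` (pipe with surrounding spaces), matching how the source data separates violations, so the resulting array members have no stray leading or trailing spaces.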

And we’ll specify the modified stored procedure during import.

[Screenshot: DataMigrationTool_3]

After performing the import with the modified stored procedure, a typical document now looks like this.

{
  "id": "1575294321",
  "DBA Name": "CONTOSO CO.",
  "AKA Name": "CONTOSO CO.",
  "License #": 2423456321,
  "Facility Type": "Restaurant",
  "Risk": "Risk 1 (High)",
  "Address": "3141 N FABRIKAM ST",
  "City": "CHICAGO",
  "State": "IL",
  "Zip": 60657,
  "Inspection Date": "2015-09-09T00:00:00.0000000Z",
  "Inspection Type": "License",
  "Results": "Pass",
  "Violations": [
    "30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: OBSERVED WRAPPED TO GO DESSERTS AT THE FRONT COUNTER UNLABELED. INSTRUCTED FACILITY TO LABEL AND MAINTAIN WITH PRODUCTION FACILITY NAME AND CONTACT INFORMATION.",
    "34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: INSTRUCTED FACILITY TO INSTALL AND MAINTAIN COVING ALONG THE WALLS IN THE PREP AREA.",
    "36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES OF LIGHT PROVIDED, FIXTURES SHIELDED - Comments: INSTRUCTED FACILITY TO INSTALL LIGHT SHIELDS OR TO OBTAIN SHATTER PROOF LIGHT BULBS AND TO MAINTAIN."
  ]
}

Note the Violations property now contains a proper JSON array of violations, with each array member representing an individual violation.
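Storing the violations as an array also makes downstream processing much simpler. As a small sketch (with abbreviated, hypothetical sample data), counting or filtering violations per document becomes a one-liner:

```javascript
// A document after import, with Violations as an array (values abbreviated).
const doc = {
    id: "1575294321",
    Results: "Pass",
    Violations: [
        "30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED...",
        "34. FLOORS: CONSTRUCTED PER CODE...",
        "36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES..."
    ]
};

// Count the violations recorded for this inspection.
const violationCount = doc.Violations.length;

// Find violations mentioning a specific keyword.
const lightingIssues = doc.Violations.filter(v => v.includes("LIGHTING"));

console.log(violationCount);        // 3
console.log(lightingIssues.length); // 1
```

With the original pipe-delimited string, each of these operations would require re-parsing the text on every read.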

We hope you find lots of creative uses for this functionality.  Let us know in the comments how you’re using custom transformations; your feedback helps us continue to provide relevant information and features.

If you need any help or have questions or feedback, please reach out to us on the developer forums on Stack Overflow or schedule a 1:1 chat with the DocumentDB engineering team.  Stay up to date on the latest DocumentDB news and features by following us on Twitter @DocumentDB.


*The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site.  The data provided at this site is subject to change at any time.  It is understood that the data provided at this site is being used at one’s own risk.