
With the release of Cloudera Enterprise Data Hub 5.12, you can now run Spark, Hive, HBase, Impala, and MapReduce workloads in a Cloudera cluster on Azure Data Lake Store (ADLS). Running on ADLS has the following benefits:

  • Grow or shrink a cluster independent of the size of the data.
  • Data persists independently as you spin up or tear down a cluster. Other clusters and compute engines, such as Azure Data Lake Analytics or Azure SQL Data Warehouse, can run workloads on the same data.
  • Enable role-based access controls integrated with Azure Active Directory and authorize users and groups with fine-grained POSIX-based ACLs.
  • Cloud HDFS with performance optimized for analytics workloads, supporting concurrent reads and writes of hundreds of terabytes of data.
  • No limits on account size or individual file size.
  • Data is encrypted at rest by default using service-managed or customer-managed keys in Azure Key Vault, and is encrypted with SSL while in transit.
  • High data durability at lower cost: data replication is managed by Data Lake Store and exposed through an HDFS-compatible interface, rather than having to replicate data both in HDFS and at the cloud storage infrastructure level.

To get started, you can use the Cloudera Enterprise Data Hub template or the Cloudera Director template on Azure Marketplace to create a Cloudera cluster. Once the cluster is up, use one or both of the following approaches to enable ADLS.

Add a Data Lake Store for cluster-wide access

Step 1: ADLS uses Azure Active Directory for identity management and authentication. To access ADLS from a Cloudera cluster, first create a service principal in Azure AD. You will need the Application ID, Authentication Key, and Tenant ID of the service principal.
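
If you prefer the command line over the portal, the following is a rough sketch of creating the service principal with the Azure CLI 2.0 (assuming az is installed; the service principal name is a placeholder of your choosing). The output includes appId (Application ID), password (Authentication Key), and tenant (Tenant ID):

# Sign in, then create a service principal with a password credential
az login
az ad sp create-for-rbac --name <service-principal-name>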

Step 2: To access ADLS, assign permissions to the service principal created in the previous step. To do this, go to the Azure portal, navigate to the Data Lake Store, and select Data Explorer. Then navigate to the target path, select Access, and add the service principal with the appropriate access rights. Refer to this document for details on access control in ADLS.

[Screenshot adls2_acl: assigning access to the service principal in Data Lake Store Data Explorer]
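
If you would rather script the ACL assignment than use Data Explorer, a sketch along these lines can work (assuming the az dls command group is available in your CLI version; the account, path, and service principal object ID are placeholders):

az dls fs access set-entry --account <account_name> --path <target_path> --acl-spec user:<service-principal-object-id>:rwx

Keep in mind that the service principal also needs execute (x) permission on every parent folder of the target path.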

Step 3: Go to Cloudera Manager -> HDFS -> Configuration. Add the following configurations to core-site.xml:

[Screenshot adls1_hdfscfg: adding the core-site.xml properties in Cloudera Manager]

Use the service principal property values obtained from Step 1 to set these parameters:


<property>
    <name>dfs.adls.oauth2.client.id</name>
    <value><Application ID></value>
</property>
<property>
    <name>dfs.adls.oauth2.credential</name>
    <value><Authentication Key></value>
</property>
<property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/<Tenant ID>/oauth2/token</value>
</property>
<property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
</property>

Step 4: Verify you can access ADLS by running a Hadoop command, for example:

hdfs dfs -ls adl://<account_name>.azuredatalakestore.net/<path>
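
To also confirm write access, a quick sanity check like the following can help (the account name and test path are placeholders):

hdfs dfs -mkdir -p adl://<account_name>.azuredatalakestore.net/tmp/adls_smoke_test
hdfs dfs -put /etc/hosts adl://<account_name>.azuredatalakestore.net/tmp/adls_smoke_test/
hdfs dfs -rm -r adl://<account_name>.azuredatalakestore.net/tmp/adls_smoke_test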

Specify a Data Lake Store in the Hadoop command line

Instead of, or in addition to, configuring a Data Lake Store for cluster-wide access, you can also provide ADLS access information on the command line of a MapReduce or Spark job. With this method, if you use an Azure AD refresh token instead of a service principal and store the credentials, encrypted, in a .jceks file under the user's home directory, you gain the following benefits:

  • Each user can use their own credentials instead of a cluster-wide credential
  • Nobody can see another user's credential, because it is encrypted in a .jceks file in that user's home directory
  • No need to store credentials in clear text in a configuration file
  • No need to wait for someone who has rights to create service principals in Azure AD

The following steps show an example of how to set this up using the refresh token obtained by signing in with the Azure cross-platform command-line tool.

Step 1: Sign in to the Azure CLI by running the command "azure login", then get the refreshToken and _clientId from .azure/accessTokens.json under the user's home directory.
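
If you would rather not copy the values out by hand, something like the following can print them (a sketch, assuming jq is installed and the first entry in accessTokens.json corresponds to your sign-in):

# Print the refresh token and client id from the Azure CLI token cache
jq -r '.[0].refreshToken' ~/.azure/accessTokens.json
jq -r '.[0]._clientId' ~/.azure/accessTokens.json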

Step 2: Run the following commands to set up credentials to access ADLS:

export HADOOP_CREDSTORE_PASSWORD=<password>
hadoop credential create dfs.adls.oauth2.client.id -value <_clientId from Step 1> -provider jceks://hdfs/user/<username>/cred.jceks
hadoop credential create dfs.adls.oauth2.refresh.token -value '<refreshToken from Step 1>' -provider jceks://hdfs/user/<username>/cred.jceks
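
You can confirm that both entries were written to the credential store (same placeholders as above):

hadoop credential list -provider jceks://hdfs/user/<username>/cred.jceks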

Step 3: Verify you can access ADLS by running a Hadoop command, for example:

hdfs dfs -Ddfs.adls.oauth2.access.token.provider.type=RefreshToken -Dhadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks -ls adl://<account_name>.azuredatalakestore.net/<path>
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Dmapred.child.env="HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD" -Dyarn.app.mapreduce.am.env="HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD" -Ddfs.adls.oauth2.access.token.provider.type=RefreshToken -Dhadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks 1000 adl://<account_name>.azuredatalakestore.net/<path>
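
A Spark job can use the same credential store. The sketch below forwards the Hadoop properties with spark.hadoop.* and exposes the credential store password to the application master and executors through the YARN environment (the class name, JAR, and paths are placeholders):

spark-submit --master yarn \
  --conf spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD \
  --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=$HADOOP_CREDSTORE_PASSWORD \
  --conf spark.hadoop.dfs.adls.oauth2.access.token.provider.type=RefreshToken \
  --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/<username>/cred.jceks \
  --class <main_class> <application_jar> adl://<account_name>.azuredatalakestore.net/<path>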

Limitations of ADLS support in EDH 5.12

  • ADLS is supported as secondary storage. To access ADLS, use fully qualified URLs in the form adl://<account_name>.azuredatalakestore.net/<path>.
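
For example, because ADLS is not the cluster's default filesystem, a bare path resolves to HDFS, while ADLS paths must always carry the adl:// scheme and account name (placeholders below):

hdfs dfs -ls /user/<username>                                      # resolves to HDFS (default filesystem)
hdfs dfs -ls adl://<account_name>.azuredatalakestore.net/<path>    # resolves to ADLS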
