Using S3 Credentials with YARN, MapReduce, or Spark
This topic describes how to access data stored in S3 for applications that use YARN, MapReduce, or Spark.
You can also copy data using the Hadoop distcp command. See Using DistCp with Amazon S3.
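For example, a DistCp command that copies a directory from HDFS to S3 might look like the following sketch (the paths and bucket name are placeholders):
hadoop distcp /source/path s3a://bucket_name/destination/path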
Referencing Credentials for Clients Using the Amazon S3 Service
If you have selected IAM authentication, no additional steps are needed. If you are not using IAM authentication, use one of the following three options to provide Amazon S3 credentials to clients:
Note: These methods of specifying AWS credentials to clients do not distribute the secrets securely, because the credentials are not encrypted. Use caution when operating in a multi-tenant environment.
- Programmatic
- Specify the credentials in the configuration for the job. This option is most useful for Spark jobs.
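For example, a minimal sketch of passing the credentials at submission time with spark-submit (the key values, class, JAR, and bucket name are placeholders); Spark copies properties prefixed with spark.hadoop. into the Hadoop configuration used by the job:
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key="Amazon S3 Access Key" \
  --conf spark.hadoop.fs.s3a.secret.key="Amazon S3 Secret Key" \
  --class com.example.MyApp myapp.jar s3a://bucket_name/input
Alternatively, application code can set fs.s3a.access.key and fs.s3a.secret.key directly on the SparkContext's Hadoop configuration.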
- Make a modified copy of the configuration files
- Make a copy of the configuration files and add the S3 credentials:
- For YARN and MapReduce jobs, copy the contents of the /etc/hadoop/conf directory to a local working directory under the home directory of the host where you will submit the job. For Spark jobs, copy /etc/spark/conf to a local directory under the home directory of the host where you will submit the job.
- Set the permissions for the configuration files appropriately for your environment and ensure that unauthorized users cannot access sensitive configurations in these files.
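For example, for a YARN or MapReduce job, assuming a working directory of ~/s3-conf (the directory name is an example):
mkdir ~/s3-conf
cp -r /etc/hadoop/conf/* ~/s3-conf/
chmod -R go-rwx ~/s3-conf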
- Add the following to the core-site.xml file within the <configuration> element:
<property>
  <name>fs.s3a.access.key</name>
  <value>Amazon S3 Access Key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>Amazon S3 Secret Key</value>
</property>
- Reference these versions of the configuration files when submitting jobs by running the following command:
- YARN or MapReduce:
export HADOOP_CONF_DIR=path to local configuration directory
- Spark:
export SPARK_CONF_DIR=path to local configuration directory
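For example, to submit a MapReduce job with the local copy of the configuration (the directory, JAR, class, and bucket names are placeholders):
export HADOOP_CONF_DIR=~/s3-conf
hadoop jar wordcount.jar WordCount s3a://bucket_name/input s3a://bucket_name/output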
Note: If you update the client configuration files from Cloudera Manager, you must repeat these steps to use the new configurations.
- Reference the managed configuration files and add AWS credentials
- This option allows you to continue to use the configuration files managed by Cloudera Manager. If new configuration files are deployed, your copy picks up the new values by reference while still providing the Amazon S3 credentials:
- Create a local directory under your home directory.
- Copy the configuration files from /etc/hadoop/conf to the new directory.
- Set the permissions for the configuration files appropriately for your environment.
- Edit each configuration file:
- Remove all elements within the <configuration> element.
- Add an XML <include> element within the <configuration> element to reference the configuration files managed
by Cloudera Manager. For example:
<include xmlns="http://www.w3.org/2001/XInclude" href="/etc/hadoop/conf/hdfs-site.xml">
  <fallback />
</include>
- Add the following to the core-site.xml file within the <configuration> element:
<property>
  <name>fs.s3a.access.key</name>
  <value>Amazon S3 Access Key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>Amazon S3 Secret Key</value>
</property>
- Reference these versions of the configuration files when submitting jobs by running the following command:
- YARN or MapReduce:
export HADOOP_CONF_DIR=path to local configuration directory
- Spark:
export SPARK_CONF_DIR=path to local configuration directory
Example core-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude" href="/etc/hadoop/conf/core-site.xml">
    <fallback />
  </include>
  <property>
    <name>fs.s3a.access.key</name>
    <value>Amazon S3 Access Key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>Amazon S3 Secret Key</value>
  </property>
</configuration>
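For example, to submit a Spark job with a local copy of the Spark configuration prepared in the same way (the directory, JAR, class, and bucket names are placeholders):
export SPARK_CONF_DIR=~/spark-s3-conf
spark-submit --class com.example.MyApp myapp.jar s3a://bucket_name/input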
Referencing Amazon S3 in URIs
If you do not specify a protocol in the URI, files are placed on HDFS (the default file system) rather than on Amazon S3. After you add the Amazon S3 service, use one of the following options to construct the URIs that you reference when submitting jobs:
- Amazon S3:
s3a://bucket_name/path
- HDFS:
hdfs://path
or
/path
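For example (the bucket, host, and path names are placeholders):
hadoop fs -ls s3a://bucket_name/path     # Amazon S3
hadoop fs -ls hdfs://namenode_host/path  # HDFS, fully qualified
hadoop fs -ls /path                      # HDFS, default file system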
For more information, see the documentation on using Impala, Hive, and Spark with Amazon S3.