Support for Kerberos authentication in Hadoop

Data Processing components can be configured to run in a cluster that has enabled Kerberos authentication.

The Kerberos Network Authentication Service version 5, defined in RFC 1510, provides a means of verifying the identities of principals in a Hadoop environment. Hadoop uses Kerberos to create secure communications among its various components and clients. Kerberos is an authentication mechanism in which users and the services they want to access rely on the Kerberos server to authenticate each to the other. The Kerberos server is called the Key Distribution Center (KDC). At a high level, it has three parts:
  • A database of the users and services (known as principals) that it knows about, together with their Kerberos passwords.
  • An Authentication Server (AS), which performs the initial authentication and issues a Ticket Granting Ticket (TGT).
  • A Ticket Granting Server (TGS), which issues service tickets based on the initial TGT.

The principal gets service tickets from the TGS. Service tickets are what allow a principal to access various Hadoop services.
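
For illustration only, the following Java sketch shows how a Hadoop client typically authenticates against the KDC by logging in with a principal and keytab; the principal name, keytab path, and krb5.conf path are placeholder values, not the ones BDD uses.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.security.UserGroupInformation;

  public class KerberosLoginSketch {
      public static void main(String[] args) throws Exception {
          // Placeholder paths and principal; real values come from the cluster configuration.
          System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");

          Configuration conf = new Configuration();
          conf.set("hadoop.security.authentication", "kerberos");
          UserGroupInformation.setConfiguration(conf);

          // The keytab is presented to the Authentication Server, which issues a TGT;
          // Hadoop RPC later uses that TGT to request service tickets from the TGS.
          UserGroupInformation.loginUserFromKeytab("bdd@EXAMPLE.COM", "/etc/security/keytabs/bdd.keytab");

          System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
      }
  }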

To ensure that Data Processing workflows can run on a secure Hadoop cluster, these three BDD components are enabled for Kerberos support:
  • Data Processing (including the DP CLI)
  • Dgraph HDFS Agent
  • Studio

All three BDD components share one principal and keytab. Note that there is no authorization support (that is, these components do not verify permissions for users).

The BDD components are enabled for Kerberos support at installation time, via the ENABLE_KERBEROS parameter in the bdd.conf file. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as paths to the Kerberos keytab file and the Kerberos configuration file. For details on these parameters, see the Installation and Deployment Guide.

Note: If you use Sentry for authorization in your Hadoop cluster, you must configure it to grant BDD access to your Hive tables.

Kerberos support in DP workflows

Support for Kerberos authentication ensures that Data Processing workflows can run on a secure Hadoop cluster. This support extends to the DP CLI, which is configured via the Kerberos properties in the edp.properties configuration file.

The spark-submit script in Spark's bin directory is used to launch DP applications on a cluster, as follows:
  1. Prior to the call to spark-submit, DP logs in using the local keytab. spark-submit then uses those credentials during job submission to authenticate with YARN and Spark.
  2. Spark obtains HDFS delegation tokens for the name nodes listed in the spark.yarn.access.namenodes property, which allows the workflow to access HDFS (see the sketch after this list).
  3. When the workflow starts, DP logs in using the cluster keytab.
  4. When the DP Hive Client is initialized, a SASL client is used along with the Kerberos credentials on the node to authenticate with the Hive Metastore. Once authenticated, the DP Hive Client can communicate with the Hive Metastore.
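
As a hedged illustration of step 2, the sketch below shows how a Spark application can name the HDFS file systems for which it needs delegation tokens; the name node host, port, and input path are placeholders, and in DP these values are supplied by the workflow configuration rather than hard-coded.

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaSparkContext;

  public class DelegationTokenSketch {
      public static void main(String[] args) {
          // spark-submit requests HDFS delegation tokens for every file system
          // listed in spark.yarn.access.namenodes (placeholder host and port).
          SparkConf conf = new SparkConf()
                  .setAppName("dp-workflow-sketch")
                  .set("spark.yarn.access.namenodes", "hdfs://namenode.example.com:8020");

          JavaSparkContext sc = new JavaSparkContext(conf);
          // With valid tokens, the executors can read workflow input from HDFS.
          long lines = sc.textFile("hdfs://namenode.example.com:8020/user/bdd/input").count();
          System.out.println("Input lines: " + lines);
          sc.stop();
      }
  }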

When a Hive JDBC connection is used, the credentials are used to authenticate with Hive so that the workflow can use the service.
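
For illustration, a Kerberized HiveServer2 JDBC connection typically names the Hive service principal in the connection URL, as in the sketch below; the host, port, database, and principal are placeholders, and the caller is assumed to hold a valid Kerberos ticket already (for example, from a keytab login).

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcKerberosSketch {
      public static void main(String[] args) throws Exception {
          // Register the HiveServer2 JDBC driver.
          Class.forName("org.apache.hive.jdbc.HiveDriver");

          // Placeholder URL; the principal parameter identifies the Hive service principal.
          String url = "jdbc:hive2://hive.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM";
          try (Connection conn = DriverManager.getConnection(url);
               Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
              while (rs.next()) {
                  System.out.println(rs.getString(1));
              }
          }
      }
  }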

Kerberos support in Dgraph HDFS Agent

In BDD, the Dgraph HDFS Agent is a client for Hadoop HDFS because it reads files from, and writes files to, HDFS. For Kerberos support, the Dgraph HDFS Agent is started with three Kerberos flags:
  • The --principal flag specifies the name of the principal.
  • The --keytab flag specifies the path to the principal's keytab.
  • The --krb5conf flag specifies the path to the krb5.conf configuration file.

The values for the flag arguments are set by the installation script.

When started, the Dgraph HDFS Agent logs in with the specified principal and keytab. If the login succeeds, the Dgraph HDFS Agent has passed Kerberos authentication and starts up successfully. Otherwise, the Dgraph HDFS Agent cannot start.
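
For illustration only (this is not the agent's own code), the sketch below shows the kind of HDFS reads and writes that any client can perform once its keytab login has succeeded; the output path is a placeholder.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
          // Assumes the process has already logged in with its principal and keytab,
          // as the Dgraph HDFS Agent does at startup.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          Path out = new Path("/user/bdd/agent-sketch.txt");  // placeholder path
          try (FSDataOutputStream stream = fs.create(out, true)) {
              stream.writeUTF("written by an authenticated HDFS client");
          }
          System.out.println("File exists: " + fs.exists(out));
      }
  }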

Kerberos support in Studio

Studio also has support for running the following jobs in a Hadoop Kerberos environment:
  • Transforming data sets
  • Uploading files
  • Exporting data

The Kerberos login is configured via the following properties in portal-ext.properties:
  • kerberos.principal
  • kerberos.keytab
  • kerberos.krb5.location

The values for these properties are inserted during the installation procedure.
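
As an illustrative sketch only (not Studio's actual implementation), the three properties map naturally onto a keytab-based Hadoop login such as the one below; the properties file path is a placeholder.

  import java.io.FileInputStream;
  import java.util.Properties;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.security.UserGroupInformation;

  public class StudioStyleLoginSketch {
      public static void main(String[] args) throws Exception {
          // Placeholder path to a portal-ext.properties-style file.
          Properties props = new Properties();
          try (FileInputStream in = new FileInputStream("portal-ext.properties")) {
              props.load(in);
          }

          // Point the JVM at the Kerberos configuration named by kerberos.krb5.location.
          System.setProperty("java.security.krb5.conf", props.getProperty("kerberos.krb5.location"));

          Configuration conf = new Configuration();
          conf.set("hadoop.security.authentication", "kerberos");
          UserGroupInformation.setConfiguration(conf);

          // Log in with the principal and keytab named by the other two properties.
          UserGroupInformation.loginUserFromKeytab(
                  props.getProperty("kerberos.principal"),
                  props.getProperty("kerberos.keytab"));
      }
  }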

Kerberos support for bdd-admin commands

In addition to support for the components listed above, the following bdd-admin script commands work in a Kerberos-enabled environment:
  • get-logs command
  • backup command
  • restore command

For details on those commands, see the Administrator's Guide.