Data Processing components can be configured to run in a Hadoop cluster that has Kerberos authentication enabled.
The Kerberos Network Authentication Service version 5, defined in RFC 1510, provides a means of verifying the identities of principals in a Hadoop environment. Hadoop uses Kerberos to create secure communications among its various components and clients. Kerberos is an authentication mechanism in which users and the services they access rely on the Kerberos server to authenticate each to the other. The Kerberos server is called the Key Distribution Center (KDC). At a high level, it has three parts:
- A database of the users and services (known as principals) and their respective Kerberos passwords
- An authentication server (AS) that performs the initial authentication and issues a Ticket Granting Ticket (TGT)
- A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT
The principal obtains service tickets from the TGS. Service tickets allow a principal to access various Hadoop services.
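For example, on a client machine a user can obtain a TGT with kinit and inspect the ticket cache with klist (the principal, realm, and output below are illustrative):

    $ kinit alice@EXAMPLE.COM
    Password for alice@EXAMPLE.COM:
    $ klist
    Ticket cache: FILE:/tmp/krb5cc_1000
    Default principal: alice@EXAMPLE.COM

    Valid starting       Expires              Service principal
    07/01/2016 10:00:00  07/01/2016 20:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM

The krbtgt entry is the TGT itself; additional entries appear as service tickets are issued by the TGS.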
To ensure that Data Processing workflows can run on a secure Hadoop cluster, these BDD components are enabled for Kerberos support:
- Dgraph and Dgraph HDFS Agent
- Data Processing workflows (whether initiated by Studio or the DP CLI)
- Studio
All these BDD components share one principal and keytab. Note that there
is no authorization support (that is, these components do not verify
permissions for users).
The BDD components are enabled for Kerberos support at installation time, via the ENABLE_KERBEROS parameter in the bdd.conf file. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as paths to the Kerberos keytab file and the Kerberos configuration file. For details on these parameters, see the Installation Guide.
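For example, a Kerberos-enabled installation might contain entries of the following shape in bdd.conf. ENABLE_KERBEROS is named above; the other parameter names and all values here are illustrative, so check the Installation Guide for the exact names:

    ENABLE_KERBEROS=TRUE
    KERBEROS_PRINCIPAL=bdd-service@EXAMPLE.COM
    KERBEROS_KEYTAB_PATH=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    KRB5_CONF_PATH=/etc/krb5.conf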
Note: If you use Sentry for authorization in your Hadoop cluster, you must
configure it to grant BDD access to your Hive tables.
Kerberos support in DP workflows
Support for Kerberos authentication ensures that Data Processing workflows can run on a secure Hadoop cluster. This support extends to the DP CLI, which is configured via the Kerberos properties in the edp-cli.properties configuration file.
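As an illustration only (these property names are hypothetical; consult the edp-cli.properties file shipped with your installation for the authoritative names), the DP CLI takes the same principal/keytab/krb5.conf triple as the rest of BDD:

    kerberosEnabled=true
    kerberosPrincipal=bdd-service@EXAMPLE.COM
    kerberosKeytabPath=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    kerberosKrb5ConfPath=/etc/krb5.conf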
The spark-submit script in Spark's bin directory is used to launch DP applications on a cluster, as follows (a sketch of a corresponding invocation appears after this list):
- Before the call to spark-submit, Data Processing logs in using the local keytab. The spark-submit process grabs the Data Processing credentials during job submission to authenticate with YARN and Spark.
- Spark gets the HDFS delegation tokens for the NameNodes listed in the spark.yarn.access.namenodes property, which enables the Data Processing workflow to access HDFS.
- When the workflow starts, it logs in using the Hadoop cluster keytab.
- When the Data Processing Hive Client is initialized, a SASL client is used along with the Kerberos credentials on the node to authenticate with the Hive Metastore. Once authenticated, the Data Processing Hive Client can communicate with the Hive Metastore.
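Put together, the submission resembles the following sketch. The --principal and --keytab options and the spark.yarn.access.namenodes property are standard Spark-on-YARN Kerberos settings; the principal, paths, NameNode URI, and application JAR placeholder are illustrative:

    $ spark-submit \
        --master yarn --deploy-mode cluster \
        --principal bdd-service@EXAMPLE.COM \
        --keytab /localdisk/Oracle/bdd/kerberos/bdd.keytab \
        --conf spark.yarn.access.namenodes=hdfs://namenode.example.com:8020 \
        <data-processing-application-jar>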
When a Hive JDBC connection is used, the credentials are used to authenticate with Hive and thereby gain access to the service.
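With Kerberos enabled, a HiveServer2 JDBC URL carries the Hive service principal; for example (host, port, and realm are illustrative):

    jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM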
Kerberos support in Dgraph and Dgraph HDFS Agent
In BDD, the Dgraph HDFS Agent is an HDFS client, because it reads files from and writes files to HDFS. If your Dgraph databases are stored on HDFS, you must also enable Kerberos for the Dgraph.
For Kerberos support for the Dgraph, make sure these bdd.conf properties are set correctly:
- KERBEROS_TICKET_REFRESH_INTERVAL specifies the interval (in minutes) at which the Dgraph's Kerberos ticket is refreshed.
- KERBEROS_TICKET_LIFETIME sets the amount of time that the Dgraph's Kerberos ticket is valid.
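For example (the values are illustrative; the lifetime is expressed in the usual Kerberos duration format):

    KERBEROS_TICKET_REFRESH_INTERVAL=60
    KERBEROS_TICKET_LIFETIME=10h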
See the Administrator's Guide for instructions on setting up the Dgraph for Kerberos support.
For Kerberos support, the Dgraph HDFS Agent is started with three Kerberos flags:
- The --principal flag specifies the name of the principal.
- The --keytab flag specifies the path to the principal's keytab.
- The --krb5conf flag specifies the path to the krb5.conf configuration file.
The values for the flag arguments are set by the installation script.
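The resulting startup command therefore has the following shape (the launcher name and the paths are illustrative; the three flags are those described above):

    $ <dgraph-hdfs-agent-launcher> \
        --principal bdd-service@EXAMPLE.COM \
        --keytab /localdisk/Oracle/bdd/kerberos/bdd.keytab \
        --krb5conf /etc/krb5.conf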
When started, the Dgraph HDFS Agent logs in with the specified principal and keytab. If the login succeeds, the Dgraph HDFS Agent has passed Kerberos authentication and starts up successfully. Otherwise, the HDFS Agent cannot start.
Kerberos support in Studio
Studio also supports running the following jobs in a Hadoop Kerberos environment:
- Transforming data sets
- Uploading files
- Exporting data
The Kerberos login is configured via the following properties in portal-ext.properties:
- kerberos.principal
- kerberos.keytab
- kerberos.krb5.location
The values for these properties are inserted during the installation
procedure for Big Data Discovery.
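For example (the values shown are illustrative):

    kerberos.principal=bdd-service@EXAMPLE.COM
    kerberos.keytab=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    kerberos.krb5.location=/etc/krb5.conf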