Data Processing components can be configured to run in a cluster
that has Kerberos authentication enabled.
The Kerberos
Network Authentication Service version 5, originally defined in RFC 1510 and
currently specified in RFC 4120, provides a means of verifying the identities
of principals in a Hadoop environment. Hadoop uses Kerberos to secure
communications among its various components and clients. Kerberos is an
authentication mechanism in which users, and the services those users want to
access, rely on the Kerberos server to authenticate each to the other. The
Kerberos server is called the Key Distribution Center (KDC). At a high level,
it has three parts:
- A database of the users and
services (known as principals) and their respective Kerberos passwords
- An Authentication Server
(AS) that performs the initial authentication and issues a Ticket Granting
Ticket (TGT)
- A Ticket Granting Server
(TGS) that issues subsequent service tickets based on the initial TGT
A principal obtains service tickets from the TGS. These service tickets are
what allow the principal to access the various Hadoop services.
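As a concrete illustration of this flow, the standard MIT Kerberos client tools can be used to obtain and inspect tickets in a Kerberos-enabled environment; the principal name and keytab path below are hypothetical:

```shell
# Obtain a TGT from the Authentication Server using a keytab
# (principal and keytab path are hypothetical examples).
kinit -kt /opt/bdd/kerberos/bdd.keytab bdd@EXAMPLE.COM

# Inspect the ticket cache: the TGT (krbtgt/EXAMPLE.COM@EXAMPLE.COM)
# appears first; service tickets issued by the TGS are added as the
# client contacts each Hadoop service.
klist
```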
To ensure that Data Processing workflows can run on a secure Hadoop
cluster, these three BDD components are enabled for Kerberos support:
- Dgraph HDFS Agent
- Data Processing workflows
(whether initiated by Studio or the DP CLI)
- Studio
All three BDD components share one principal and keytab. Note that there
is no authorization support (that is, these components do not verify
permissions for users).
The BDD components are enabled for Kerberos support at installation
time, via the
ENABLE_KERBEROS parameter in the
bdd.conf file. The
bdd.conf file also has parameters for specifying the
name of the Kerberos principal, as well as paths to the Kerberos keytab file
and the Kerberos configuration file. For details on these parameters, see the
Installation and Deployment Guide.
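For illustration only, the Kerberos section of bdd.conf might look like the following sketch. ENABLE_KERBEROS is the parameter named above; the other parameter names and all values are placeholders, so consult the Installation and Deployment Guide for the exact names:

```
ENABLE_KERBEROS=TRUE
# The names below are illustrative placeholders, not confirmed parameter names:
KERBEROS_PRINCIPAL=bdd@EXAMPLE.COM
KERBEROS_KEYTAB_PATH=/opt/bdd/kerberos/bdd.keytab
KRB5_CONF_PATH=/etc/krb5.conf
```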
Note: If you use Sentry for authorization in your Hadoop cluster, you must
configure it to grant BDD access to your Hive tables.
Kerberos support in DP workflows
Support for Kerberos authentication ensures that
Data Processing workflows can run on a secure Hadoop cluster. The support for
Kerberos includes the DP CLI, via the Kerberos properties in the
edp.properties configuration file.
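As a sketch, the Kerberos-related entries in edp.properties might look like this; the property keys and values shown are illustrative placeholders, not confirmed names:

```
# Illustrative placeholders only; see the configuration reference for exact keys.
kerberosEnabled=true
kerberosPrincipal=bdd@EXAMPLE.COM
kerberosKeytabPath=/opt/bdd/kerberos/bdd.keytab
kerberosKrb5Conf=/etc/krb5.conf
```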
The
spark-submit script in Spark's
bin directory is used to launch DP applications on
a cluster, as follows:
- Prior to the call to
spark-submit, DP logs in using the local keytab.
spark-submit picks up those credentials during job
submission to authenticate with YARN and Spark.
- Spark obtains the HDFS
delegation tokens for the NameNodes listed in the
spark.yarn.access.namenodes property, so the
workflow can access HDFS.
- When the workflow starts,
DP logs in using the cluster keytab.
- When the DP Hive Client is
initialized, a SASL client is used along with the Kerberos credentials on the
node to authenticate with the Hive Metastore. Once authenticated, the DP Hive
Client can communicate with the Hive Metastore.
When a Hive JDBC connection is used, the credentials are used to
authenticate with Hive and thus gain access to the service.
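The launch sequence above can be sketched as a single spark-submit invocation. The --principal and --keytab options and the spark.yarn.access.namenodes property are standard Spark-on-YARN Kerberos settings; the application class, JAR, principal, and paths are all hypothetical placeholders:

```shell
# Class, JAR, principal, and paths are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal bdd@EXAMPLE.COM \
  --keytab /opt/bdd/kerberos/bdd.keytab \
  --conf spark.yarn.access.namenodes=hdfs://namenode.example.com:8020 \
  --class com.example.dp.Workflow \
  dp-workflow.jar
```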
Kerberos support in Dgraph HDFS Agent
In BDD, the Dgraph HDFS Agent is an HDFS client because it reads files
from, and writes files to, HDFS. For
Kerberos support, the Dgraph HDFS Agent is started with three Kerberos
flags:
- The
--principal flag specifies the name of the
principal.
- The
--keytab flag specifies the path to the
principal's keytab.
- The
--krb5conf flag specifies the path to the
krb5.conf configuration file.
The values for the flag arguments are set by the installation script.
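For illustration, a launch with these flags might look like the following; the launcher script name, principal, and paths are hypothetical, since the installation script normally supplies these values:

```shell
# All names and paths are illustrative; the installer sets the real values.
./dgraph-hdfs-agent \
  --principal bdd@EXAMPLE.COM \
  --keytab /opt/bdd/kerberos/bdd.keytab \
  --krb5conf /etc/krb5.conf
```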
When started, the Dgraph HDFS Agent logs in with the specified
principal and keytab. If the login succeeds, the Dgraph HDFS Agent has passed
Kerberos authentication and starts up successfully. Otherwise, the HDFS Agent
cannot start.
Kerberos support in Studio
Studio also supports running the
following jobs in a Hadoop Kerberos environment:
- Transforming data sets
- Uploading files
- Exporting data
The Kerberos login is configured via the following properties in
portal-ext.properties:
- kerberos.principal
- kerberos.keytab
- kerberos.krb5.location
The values for these properties are inserted during the installation
procedure.
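For example, the inserted entries in portal-ext.properties might look like this sketch; the property names are those listed above, while the values are hypothetical:

```
kerberos.principal=bdd@EXAMPLE.COM
kerberos.keytab=/opt/bdd/kerberos/bdd.keytab
kerberos.krb5.location=/etc/krb5.conf
```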
Kerberos support for bdd-admin commands
In addition to support for the components listed above, the following
bdd-admin script commands work in a Kerberos-enabled
environment:
- get-logs
- backup
- restore
For details on those commands, see the
Administrator's Guide.