Data Processing components can be configured to run in a Hadoop cluster that has Kerberos authentication enabled.
The Kerberos Network Authentication Service version 5, defined in RFC 1510, provides a means of verifying the identities of principals in a Hadoop environment. Hadoop uses Kerberos to create secure communications among its various components and clients. Kerberos is an authentication mechanism in which users and the services they access rely on the Kerberos server to authenticate each to the other. The Kerberos server is called the Key Distribution Center (KDC). At a high level, it has three parts:
- A database of the users and services (known as principals) and their respective Kerberos passwords
- An authentication server (AS) that performs the initial authentication and issues a Ticket Granting Ticket (TGT)
- A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT
The principal obtains service tickets from the TGS. Service tickets allow a principal to access various Hadoop services.
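For example, on a client machine a user can obtain a TGT with kinit and inspect the ticket cache with klist (the principal, realm, and output below are illustrative):

    $ kinit alice@EXAMPLE.COM
    Password for alice@EXAMPLE.COM:
    $ klist
    Ticket cache: FILE:/tmp/krb5cc_1000
    Default principal: alice@EXAMPLE.COM

    Valid starting       Expires              Service principal
    07/01/2016 10:00:00  07/01/2016 20:00:00  krbtgt/EXAMPLE.COM@EXAMPLE.COM

The krbtgt entry is the TGT itself; additional entries appear as service tickets are issued by the TGS.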
To ensure that Data Processing workflows can run on a secure Hadoop cluster, these BDD components are enabled for Kerberos support:
- Dgraph and Dgraph HDFS Agent
- Data Processing workflows (whether initiated by Studio or the DP CLI)
- Studio
All these BDD components share one principal and keytab. Note that there
is no authorization support (that is, these components do not verify
permissions for users).
The BDD components are enabled for Kerberos support at installation time, via the ENABLE_KERBEROS parameter in the bdd.conf file. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as paths to the Kerberos keytab file and the Kerberos configuration file. For details on these parameters, see the Installation Guide.
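For example, a Kerberos-enabled installation might contain entries of the following shape in bdd.conf. ENABLE_KERBEROS is named above; the other parameter names and all values here are illustrative, so check the Installation Guide for the exact names:

    ENABLE_KERBEROS=TRUE
    KERBEROS_PRINCIPAL=bdd-service@EXAMPLE.COM
    KERBEROS_KEYTAB_PATH=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    KRB5_CONF_PATH=/etc/krb5.conf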
Note: If you use Sentry for authorization in your Hadoop cluster, you must
configure it to grant BDD access to your Hive tables.
Kerberos support in DP workflows
Support for Kerberos authentication ensures that Data Processing workflows can run on a secure Hadoop cluster. This support extends to the DP CLI, which is configured via the Kerberos properties in the edp-cli.properties configuration file.
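As an illustration only (these property names are hypothetical; consult the edp-cli.properties file shipped with your installation for the authoritative names), the DP CLI takes the same principal/keytab/krb5.conf triple as the rest of BDD:

    kerberosEnabled=true
    kerberosPrincipal=bdd-service@EXAMPLE.COM
    kerberosKeytabPath=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    kerberosKrb5ConfPath=/etc/krb5.conf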
The spark-submit script in Spark's bin directory is used to launch DP applications on a cluster, as follows (a sketch of a corresponding invocation appears after this list):
- Before the call to spark-submit, Data Processing logs in using the local keytab. The spark-submit process grabs the Data Processing credentials during job submission to authenticate with YARN and Spark.
- Spark gets the HDFS delegation tokens for the NameNodes listed in the spark.yarn.access.namenodes property, which enables the Data Processing workflow to access HDFS.
- When the workflow starts, it logs in using the Hadoop cluster keytab.
- When the Data Processing Hive Client is initialized, a SASL client is used along with the Kerberos credentials on the node to authenticate with the Hive Metastore. Once authenticated, the Data Processing Hive Client can communicate with the Hive Metastore.
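Put together, the submission resembles the following sketch. The --principal and --keytab options and the spark.yarn.access.namenodes property are standard Spark-on-YARN Kerberos settings; the principal, paths, NameNode URI, and application JAR placeholder are illustrative:

    $ spark-submit \
        --master yarn --deploy-mode cluster \
        --principal bdd-service@EXAMPLE.COM \
        --keytab /localdisk/Oracle/bdd/kerberos/bdd.keytab \
        --conf spark.yarn.access.namenodes=hdfs://namenode.example.com:8020 \
        <data-processing-application-jar>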
When a Hive JDBC connection is used, the credentials are used to authenticate with Hive and thereby gain access to the service.
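With Kerberos enabled, a HiveServer2 JDBC URL carries the Hive service principal; for example (host, port, and realm are illustrative):

    jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM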
Kerberos support in Dgraph and Dgraph HDFS Agent
In BDD, the Dgraph HDFS Agent is an HDFS client, because it reads files from and writes files to HDFS. If your Dgraph databases are stored on HDFS, you must also enable Kerberos for the Dgraph.
For Kerberos support for the Dgraph, make sure these bdd.conf properties are set correctly:
- KERBEROS_TICKET_REFRESH_INTERVAL specifies the interval (in minutes) at which the Dgraph's Kerberos ticket is refreshed.
- KERBEROS_TICKET_LIFETIME sets the amount of time that the Dgraph's Kerberos ticket is valid.
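For example (the values are illustrative; the lifetime is expressed in the usual Kerberos duration format):

    KERBEROS_TICKET_REFRESH_INTERVAL=60
    KERBEROS_TICKET_LIFETIME=10h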
See the Administrator's Guide for instructions on setting up the Dgraph for Kerberos support.
For Kerberos support, the Dgraph HDFS Agent is started with three Kerberos flags:
- The --principal flag specifies the name of the principal.
- The --keytab flag specifies the path to the principal's keytab.
- The --krb5conf flag specifies the path to the krb5.conf configuration file.
The values for the flag arguments are set by the installation script.
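The resulting startup command therefore has the following shape (the launcher name and the paths are illustrative; the three flags are those described above):

    $ <dgraph-hdfs-agent-launcher> \
        --principal bdd-service@EXAMPLE.COM \
        --keytab /localdisk/Oracle/bdd/kerberos/bdd.keytab \
        --krb5conf /etc/krb5.conf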
When started, the Dgraph HDFS Agent logs in with the specified principal and keytab. If the login succeeds, the Dgraph HDFS Agent has passed Kerberos authentication and starts up successfully. Otherwise, the HDFS Agent cannot start.
Kerberos support in Studio
Studio also supports running the following jobs in a Hadoop Kerberos environment:
- Transforming data sets
- Uploading files
- Exporting data
The Kerberos login is configured via the following properties in portal-ext.properties:
- kerberos.principal
- kerberos.keytab
- kerberos.krb5.location
The values for these properties are inserted during the installation
procedure for Big Data Discovery.
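For example (the values shown are illustrative):

    kerberos.principal=bdd-service@EXAMPLE.COM
    kerberos.keytab=/localdisk/Oracle/bdd/kerberos/bdd.keytab
    kerberos.krb5.location=/etc/krb5.conf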