Data Processing components can be configured to run in a Hadoop cluster that has Kerberos authentication enabled.
The Kerberos Network Authentication Service version 5, defined in RFC 1510, provides a means of verifying the identities of principals in a Hadoop environment. Hadoop uses Kerberos to create secure communications among its various components and clients. Kerberos is an authentication mechanism in which users and the services they want to access rely on the Kerberos server to authenticate each to the other. The Kerberos server is called the Key Distribution Center (KDC). At a high level, it has three parts:
- A database of the users and services (known as principals) and their respective Kerberos passwords.
- An Authentication Server (AS), which performs the initial authentication and issues a Ticket Granting Ticket (TGT).
- A Ticket Granting Server (TGS), which issues service tickets based on the initial TGT.
A principal first obtains a TGT from the AS and then uses it to get service tickets from the TGS. Service tickets are what allow a principal to access various Hadoop services.
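As a quick illustration of this flow, the standard Kerberos client tools can be used to obtain and inspect tickets for a principal. The principal name and keytab path below are examples only, not values created by the BDD installer:

    # Obtain a TGT for the principal using its keytab (non-interactive login).
    kinit -kt /etc/security/keytabs/bdd.keytab bdd@EXAMPLE.COM

    # List the cached tickets. Immediately after kinit, only the TGT
    # (krbtgt/EXAMPLE.COM@EXAMPLE.COM) is present; service tickets for HDFS,
    # YARN, or Hive are added to the cache as those services are contacted.
    klist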
All of these BDD components, described in the sections below, share one principal and keytab. Note that there is no authorization support (that is, these components do not verify permissions for users).
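The shared principal and keytab are normally created by the Hadoop administrator. For orientation, a typical sequence on the KDC host looks like the following; the principal name and keytab path are placeholders:

    # Create one principal for the BDD components and export its key to a keytab.
    kadmin.local -q "addprinc -randkey bdd@EXAMPLE.COM"
    kadmin.local -q "ktadd -k /etc/security/keytabs/bdd.keytab bdd@EXAMPLE.COM"

    # The keytab file must be readable by the OS user that runs the BDD components.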
The BDD components are enabled for Kerberos support at installation time, via the ENABLE_KERBEROS parameter in the bdd.conf file. The bdd.conf file also has parameters for specifying the name of the Kerberos principal, as well as paths to the Kerberos keytab file and the Kerberos configuration file. For details on these parameters, see the Installation Guide.
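For orientation, the relevant fragment of bdd.conf might look like the following. ENABLE_KERBEROS is the parameter named above; the other three parameter names and all of the values shown are illustrative only, so check the Installation Guide for the exact names used by your release:

    # Turn on Kerberos integration for the BDD components.
    ENABLE_KERBEROS=TRUE

    # Illustrative parameter names only: principal, keytab path, krb5.conf path.
    KERBEROS_PRINCIPAL=bdd@EXAMPLE.COM
    KERBEROS_KEYTAB_PATH=/etc/security/keytabs/bdd.keytab
    KRB5_CONF_PATH=/etc/krb5.conf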
Note:
If you use Sentry for authorization in your Hadoop cluster, you must configure it to grant BDD access to your Hive tables.
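With Sentry enabled, grants along the following lines are typically issued through Beeline by a Sentry administrator; the role, group, and database names are placeholders:

    -- Run in Beeline as a Sentry admin; names below are placeholders.
    CREATE ROLE bdd_role;
    GRANT ROLE bdd_role TO GROUP bdd_group;
    GRANT SELECT ON DATABASE default TO ROLE bdd_role;

Kerberos support in DP workflows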
Support for Kerberos authentication ensures that Data Processing workflows can run on a secure Hadoop cluster. Kerberos support also covers the DP CLI, via the Kerberos properties in the edp.properties configuration file.
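As a sketch only: the exact property names in edp.properties are release-specific (see the Data Processing CLI documentation), but the Kerberos-related entries generally identify the principal, the keytab, and the Kerberos configuration file, along the lines of:

    # Hypothetical property names; consult your edp.properties for the real keys.
    kerberosEnabled=true
    kerberosPrincipal=bdd@EXAMPLE.COM
    kerberosKeytabPath=/etc/security/keytabs/bdd.keytab
    kerberosConfigLocation=/etc/krb5.conf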
The spark-submit script in Spark's bin directory is used to launch DP applications on a cluster, as follows:
- Before invoking spark-submit, Data Processing logs in using the local keytab. The spark-submit process grabs the Data Processing credentials during job submission to authenticate with YARN and Spark.
- HDFS delegation tokens are obtained for the NameNodes listed in the spark.yarn.access.namenodes property, which enables the Data Processing workflow to access HDFS.
- When a Hive JDBC connection is used, the same credentials are used to authenticate with Hive, so the workflow is able to use that service.
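For reference, a Kerberized Spark-on-YARN submission generally passes the principal and keytab to spark-submit and lists the secured NameNodes, roughly as follows; the NameNode host, application class, and JAR are placeholders, not the actual DP invocation:

    # Placeholders for the principal, keytab, NameNode, and application JAR.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --principal bdd@EXAMPLE.COM \
      --keytab /etc/security/keytabs/bdd.keytab \
      --conf spark.yarn.access.namenodes=hdfs://namenode.example.com:8020 \
      --class com.example.dp.Workflow \
      dp-workflow.jar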
Kerberos support in Dgraph and Dgraph HDFS Agent
In BDD, the Dgraph HDFS Agent is a client of Hadoop HDFS because it reads files from and writes files to HDFS. If your Dgraph databases are stored on HDFS, you must also enable Kerberos for the Dgraph.
For the Dgraph, make sure that the following bdd.conf properties are set correctly:
- KERBEROS_TICKET_REFRESH_INTERVAL specifies the interval (in minutes) at which the Dgraph's Kerberos ticket is refreshed.
- KERBEROS_TICKET_LIFETIME sets the amount of time that the Dgraph's Kerberos ticket is valid.
See the Administrator's Guide for instructions on setting up the Dgraph for Kerberos support.
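An illustrative setting for these two properties follows; the values shown are examples rather than defaults, and the expected value format is described in the Administrator's Guide:

    # Refresh the Dgraph's Kerberos ticket every 60 minutes (example value).
    KERBEROS_TICKET_REFRESH_INTERVAL=60
    # Keep the Dgraph's Kerberos ticket valid for 10 hours (example value).
    KERBEROS_TICKET_LIFETIME=10h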
The Dgraph HDFS Agent is started with flag arguments that specify the Kerberos principal, the path of the keytab file, and the path of the krb5.conf configuration file. The values for the flag arguments are set by the installation script.
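A minimal krb5.conf typically identifies the default realm and its KDC; the realm and host names below are examples:

    [libdefaults]
      default_realm = EXAMPLE.COM

    [realms]
      EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
      }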
When started, the Dgraph HDFS Agent logs in with the specified principal and keytab. If the login is successful, the Dgraph HDFS Agent has passed Kerberos authentication and starts up successfully. Otherwise, the HDFS Agent cannot start.
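If the HDFS Agent fails to start, a quick way to check that the principal and keytab are usable is to try the login by hand; the path and principal name here are examples:

    # Show the principals stored in the keytab.
    klist -kt /etc/security/keytabs/bdd.keytab

    # Attempt a non-interactive login with the same principal and keytab;
    # if this fails, the HDFS Agent's Kerberos login will fail as well.
    kinit -kt /etc/security/keytabs/bdd.keytab bdd@EXAMPLE.COM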
Kerberos support in Studio
Studio's Kerberos support is configured via the following properties in its portal-ext.properties file:
- kerberos.principal
- kerberos.keytab
- kerberos.krb5.location
The values for these properties are inserted during the installation procedure for Big Data Discovery.
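After installation, these entries typically look like the following; the principal name and paths are examples only:

    kerberos.principal=bdd@EXAMPLE.COM
    kerberos.keytab=/etc/security/keytabs/bdd.keytab
    kerberos.krb5.location=/etc/krb5.conf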