Using Notebooks

You can use notebooks to explore and visualize data. This section describes how to install Jupyter notebooks and how to use the Big Data Studio notebooks in Oracle Big Data.

Notebooks are web-based, interactive environments where data scientists can run code. They support libraries, graph analytics, and visualizations that make it faster to explore data and gain insights.

For Oracle Distribution including Apache Hadoop (ODH) and Cloudera Distribution including Apache Hadoop (CDH), you have the following notebook options.

  • Jupyter notebooks (available for ODH only)
    You can install Jupyter on your ODH cluster nodes and access it through a browser.
  • Big Data Studio notebooks (available for ODH and CDH)
    When you create a cluster, Big Data Studio is installed and configured on your cluster nodes.

You can import data into your notebooks from sources such as HDFS, Spark databases, and files. You can then analyze the data with interpreter environments for languages such as Python, PySpark, and Spark.
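
For example, once a Spark session is available in a notebook (it is created in step 9 of the Jupyter installation later in this section), you can load a file from HDFS and explore it with PySpark. The following is a minimal sketch; the HDFS path and the "region" column are hypothetical examples.

    # Minimal sketch, assuming a SparkSession named spark has already been
    # created (see step 9 under "Installing Jupyter"). The HDFS path and the
    # "region" column are hypothetical examples.
    df = spark.read.csv("hdfs:///user/training/sales.csv", header=True, inferSchema=True)
    df.printSchema()                      # inspect the inferred column types
    df.groupBy("region").count().show()   # simple aggregation for exploration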

Installing Jupyter for ODH

To set up Jupyter notebooks on ODH, first set up PySpark integration and then install Jupyter.

Setting Up PySpark Integration

To integrate Jupyter and PySpark, install the findspark package.

ODH cluster nodes include Python 3, the Apache Spark 3 client, and PySpark.

PySpark is an interface for Apache Spark in Python. With PySpark, you can write Spark applications using Python APIs. The PySpark shell provides an interactive environment for analyzing data on a distributed cluster.

The findspark package locates the Spark installation and adds PySpark to the Python system path, so that Jupyter and PySpark integrate seamlessly.
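
After you complete the steps below, you can verify the integration from a Python 3 shell on the node. This is a minimal sketch using the findspark functions find and init together with the installed pyspark package:

    # Minimal verification sketch: locate Spark and make PySpark importable.
    import findspark
    print(findspark.find())     # prints the detected Spark installation directory
    findspark.init()            # adds PySpark to sys.path for this Python session
    import pyspark
    print(pyspark.__version__)  # confirms that PySpark can now be imported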

  1. Access your ODH cluster node:
    • The second utility node of an HA (highly available) cluster.
    • The first (and only) utility node of a non-HA cluster.

  2. Install Py4J (Python for Java).
    sudo python3 -m pip install py4j
  3. Install findspark.
    sudo python3 -m pip install findspark

Installing Jupyter

Install Jupyter on the same node that you set up for PySpark integration.

  1. Install Jupyter.
    sudo python3 -m pip install jupyter
  2. Upgrade the Pygments package.
    $ pip3 install --upgrade Pygments
  3. Check the Jupyter installation location.
    $ which jupyter
    /usr/local/bin/jupyter
  4. Check the available kernels.
    $ /usr/local/bin/jupyter kernelspec list
    Available kernels:
      python3    /usr/local/share/jupyter/kernels/python3
  5. Check the Jupyter package versions.
    $ /usr/local/bin/jupyter --version
    Selected Jupyter core packages...
    IPython          : 7.16.2
    ipykernel        : 5.5.6
    ipywidgets       : 7.6.5
    jupyter_client   : 7.1.0
    jupyter_core     : 4.9.1
    jupyter_server   : not installed
    jupyterlab       : not installed
    nbclient         : 0.5.9
    nbconvert        : 6.0.7
    nbformat         : 5.1.3
    notebook         : 6.4.6
    qtconsole        : 5.2.2
    traitlets        : 4.3.3
  6. Request a Kerberos ticket.
    kinit -kt <spark-user-keytabfile> <principal>
    Keytab file location: /etc/security/keytabs/**.keytab
    Example
    $ kinit -kt /etc/security/keytabs/spark.headless.keytab spark-trainingcl@BDACLOUDSERVICE.ORACLE.COM

    A Kerberos ticket is required only for highly available (HA) clusters. Request the ticket as a user that has the appropriate Ranger permissions on HDFS, YARN, and so on. The ticket is valid for 24 hours only.

    For non-HA clusters, Ranger permissions and a Kerberos ticket are not required.

  7. Launch Jupyter from the utility node.
    <jupyter-location> notebook --ip=0.0.0.0 --allow-root

    Example:

    /usr/local/bin/jupyter notebook --ip=0.0.0.0 --allow-root

    Example output:

    [xxxx NotebookApp] To access the notebook, open this file in a browser:
    file:////xxxx
    Or copy and paste one of these URLs:
    xxxx
    or http://<some link>
    or http://127.0.0.1:8888/?token=<your-token>
  8. From the output, copy the URL for the notebook, and replace 127.0.0.1 with the public IP address of the utility node.
    http://<utility-node-public-ip-address>:8888/?token=<your-token>
  9. Run the following commands in your notebook.
    import findspark
    findspark.init()
    import pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession \
        .builder \
        .enableHiveSupport() \
        .appName("ODH-ML-WorkBench") \
        .getOrCreate()
  10. Test by getting the Spark version:
    spark.version

    Example output:

    '3.0.2'
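
Once the session is created, you can optionally run a quick smoke test to confirm that Spark can execute a job and reach the Hive metastore. This is a minimal sketch; the databases listed depend on your cluster.

    # Minimal smoke test, assuming the SparkSession named spark from step 9.
    spark.range(10).selectExpr("sum(id) AS total").show()   # runs a small Spark job
    spark.sql("SHOW DATABASES").show()                       # confirms Hive metastore access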

Troubleshooting

Set up the Hive configuration in Apache Ambari to avoid Hive exception errors.

If you get the following exception message when you create a Hive table:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table odh.emp failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);

Then follow these steps:

  1. Go through the Apache Ambari prerequisites.
  2. Access Apache Ambari.
  3. In the list of Services, click Hive.
  4. Click the Advanced tab.
  5. In the Advanced hive-interactive-site and the Advanced hive-site sections, set the following field to false:
    hive.strict.managed.tables
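
After you set hive.strict.managed.tables to false in both sections and restart the affected Hive services if Ambari prompts you to, you can retry the table creation from the notebook. The following sketch is only an illustration; it reuses the odh.emp table name from the example exception and assumes the odh database exists.

    # Hypothetical retry of the table creation that previously failed the
    # strict managed table check (odh.emp is the table from the example above).
    spark.sql("CREATE TABLE odh.emp (id INT, name STRING)")
    spark.sql("SHOW TABLES IN odh").show()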