Using Notebooks
You can use notebooks to explore and visualize data. This section describes how to install Jupyter notebooks and how to use the Big Data Studio notebooks in Oracle Big Data.
Notebooks are web-based, interactive environments for data scientists to run code. They support libraries, graph analytics, and visualizations that accelerate exploring data and gaining insights from it.
For Oracle Distribution including Apache Hadoop (ODH) and Cloudera Distribution including Apache Hadoop (CDH), you have the following notebook options:
- Jupyter notebooks, available only for ODH: You can install Jupyter on your ODH cluster nodes and access it through a browser.
- Big Data Studio notebooks, available for ODH and CDH: When you create a cluster, Big Data Studio is installed and configured on your cluster nodes.
You can import data into your notebooks from sources such as HDFS or Spark databases and files, and then analyze the data with interpreter environments for languages such as Python, PySpark, and Spark.
Installing Jupyter for ODH
To set up Jupyter notebooks on ODH, you must install both PySpark and Jupyter.
Setting Up PySpark Integration
To integrate Jupyter and PySpark, install the findspark package.
ODH cluster nodes include Python 3, the Apache Spark 3 client, and PySpark.
PySpark is an interface for Apache Spark in Python. With PySpark, you can write Spark applications using Python APIs. The PySpark shell is an interactive environment for analyzing data in a distributed setting.
The findspark package locates PySpark and adds it to the system path so that Jupyter and PySpark integrate seamlessly.
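The mechanism behind this can be sketched with the standard library alone: locate the Spark installation (typically through the SPARK_HOME environment variable) and prepend its Python directories to sys.path so that `import pyspark` works from any interpreter, including a Jupyter kernel. This is a minimal sketch assuming a conventional Spark layout; the real `findspark.init()` also handles other installation layouts, so install and use findspark itself rather than this illustration.

```python
import glob
import os
import sys

def init_spark_path(spark_home=None):
    """Sketch of what findspark.init() does: make pyspark importable.

    Finds the Spark installation from SPARK_HOME (or an explicit path)
    and prepends its Python directories to sys.path.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError("SPARK_HOME is not set and no path was given")
    # pyspark itself lives in $SPARK_HOME/python; the py4j bridge that
    # pyspark depends on is shipped as a zip under $SPARK_HOME/python/lib.
    paths = [os.path.join(spark_home, "python")]
    paths += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
    # Prepend only the entries that are not already on the path.
    sys.path[:0] = [p for p in paths if p not in sys.path]
    return paths

# After this runs, `import pyspark` succeeds in plain Python or Jupyter:
# init_spark_path()
# import pyspark
```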
Installing Jupyter
Install Jupyter on the same node that you set up for PySpark integration.
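After installing, a quick way to confirm the setup is to check that the relevant packages are importable from the node's Python 3 environment. This is an informal sketch, not an official verification procedure, and it assumes the usual PyPI package names (`findspark`, `notebook` for Jupyter, `pyspark`):

```python
import importlib.util

# Report whether each package the Jupyter + PySpark integration relies on
# can be imported from this Python environment.
for pkg in ("findspark", "notebook", "pyspark"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'available' if found else 'not found'}")
```

If any package reports "not found", install it with pip on that node before continuing.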
Troubleshooting
Set up the Hive configuration in Apache Ambari to avoid Hive exception errors.
If you get the following exception message when you create a Hive table:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table odh.emp failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);
Then follow these steps: