1.4 Key Concepts
This section provides an overview of key concepts in the Compliance Studio:
- Interpreter: An interpreter is a program that directly executes instructions written in a programming or scripting language, without requiring them to be compiled into a machine-language program first. Interpreters are plug-ins that enable users to use a specific language to process data in the backend. Examples of interpreters are the jdbc, spark, and python interpreters. Interpreters allow you to define customized drivers, URLs, passwords, connections, SQL results to display, and so on.
- Zeppelin Interpreter: A plug-in that enables Zeppelin users to use a specific language or data-processing backend. For example, to use Python code in Zeppelin, you need a %python interpreter.
- Zeppelin: Interactive browser-based notebooks that enable data engineers, data analysts, and data scientists to be more productive by developing, organizing, executing, and sharing data code and visualizing results without referring to the command line or requiring the cluster details. Notebooks allow these users not only to execute code but also to work interactively with long workflows.
- Markdown (md): A plain text formatting syntax designed so that it can be converted to HTML. Use this section to configure the markdown parse type.
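As an illustration of what markdown conversion does (a minimal sketch for a few constructs, not the parser Compliance Studio itself uses), the following Python snippet converts basic Markdown to HTML:

```python
import re

def md_to_html(line: str) -> str:
    """Convert a single line of basic Markdown (headings, bold, italics) to HTML."""
    # ATX headings: one to six leading '#' characters followed by a space.
    m = re.match(r"(#{1,6})\s+(.*)", line)
    if m:
        level = len(m.group(1))
        return f"<h{level}>{m.group(2)}</h{level}>"
    # Inline bold (**text**) first, then italics (*text*).
    line = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)
    line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", line)
    return f"<p>{line}</p>"

print(md_to_html("# Key Concepts"))     # <h1>Key Concepts</h1>
print(md_to_html("Use **bold** text"))  # <p>Use <strong>bold</strong> text</p>
```

A full markdown parser handles many more constructs (lists, links, code fences); this sketch only shows the plain-text-to-HTML idea behind the parse type setting.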
- Parallel Graph Analytics (PGX): Graph analysis lets you reveal latent information that is not directly apparent from fields in your data but is encoded as direct and indirect relationships - metadata - between elements of your data. This connectivity-related information is not obvious to the naked eye but can have tremendous value when uncovered. PGX is a toolkit for graph analysis, supporting both efficient graph algorithms and fast SQL-like graph pattern matching queries.
- PySpark: PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. Spark is a distributed computational framework for Big Data analysis that works with huge data sets by processing them in parallel and in batches.
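A minimal PySpark sketch is shown below; it assumes a local PySpark installation (for example, via `pip install pyspark`) and the DataFrame contents are invented for illustration:

```python
# Sketch only: requires the third-party pyspark package.
from pyspark.sql import SparkSession

# Start a local Spark session (the appName is arbitrary).
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame and run a filter, which Spark executes in parallel.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()

spark.stop()
```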
- Spark: A fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. Spark is an optimized engine that supports general execution graphs.
- PGQL: A graph query language built on top of SQL, bringing graph pattern matching capabilities to existing SQL users as well as to new users interested in graph technology who do not have an SQL background.
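For illustration, a hypothetical PGQL query (the labels and property names below are invented for this example) might look like:

```sql
/* Hypothetical example: find accounts that reach another account
   through two transfer hops. */
SELECT a.name AS source, c.name AS destination
FROM MATCH (a:Account) -[:transfers_to]-> (b:Account)
                       -[:transfers_to]-> (c:Account)
WHERE a.name <> c.name
```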
- Data discovery, exploration, reporting, and visualization are key components of the data science workflow. Zeppelin provides a "Modern Data Science Studio" with ready-to-use Spark and Hive support. Zeppelin supports multiple language backends and a growing ecosystem of data sources. Zeppelin notebooks provide an interactive snippet-at-a-time experience to data scientists. You can see a collection of Zeppelin notebooks in the Hortonworks Gallery.
- Keytab File: A keytab is a file containing pairs of Kerberos principals and encrypted keys (which are derived from the Kerberos password). You can use a keytab file to authenticate to various remote systems using Kerberos without entering a password. However, when you change your Kerberos password, you must recreate all your keytab files. Keytabs are commonly used to allow scripts to authenticate automatically using Kerberos, without requiring human interaction or storing a password in a plain-text file. The script can then use the acquired credentials to access files stored on a remote system.
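For example, a script might authenticate non-interactively with a keytab as sketched below (the keytab path and principal are placeholders):

```shell
# Obtain a Kerberos ticket using a keytab instead of a password.
kinit -kt /path/to/service.keytab svc_user@EXAMPLE.COM

# Verify the acquired credentials.
klist
```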
- Oracle Wallet: Oracle Wallet is a file that stores database authentication and signing credentials. It allows users to securely access databases without providing credentials to third-party software, and to easily connect to Oracle products.
- OpenSearch: OpenSearch is a distributed search and analytics engine for all data types, including textual, numerical, geospatial, structured, and unstructured.
- Conda: Miniconda3 is a minimal installer for Conda, a package management and environment management system. It is a smaller, lighter alternative to Anaconda, which is a more comprehensive distribution. Here are some key points about Miniconda3:
- Minimal Installer: Miniconda3 includes only Conda, Python, and a small number of necessary packages. It allows users to create custom Python environments with only the packages they need.
- Package Management: Conda, the package manager included with Miniconda3, can install, update, and manage software packages and their dependencies. It can handle multiple versions of Python and other packages.
- Environment Management: Miniconda3 allows users to create isolated environments for different projects, ensuring that dependencies for one project do not interfere with those for another. This is particularly useful for managing different versions of Python or other software libraries.
- Customizable: Because it starts with a minimal set of packages, users can customize their environment by installing only the packages they need using Conda. This can lead to a more efficient and lightweight setup compared to a full Anaconda installation.
Miniconda3 is particularly useful for users who want more control over their environment and prefer to install only the necessary packages for their specific projects.
Note:
Conda makes it possible to upgrade the Python stack across Compliance Studio versions while maintaining older Conda environments for backward compatibility.
For example, a model deployed with an older version of Conda/Compliance Studio can co-exist with a model developed in a later version of Conda/Compliance Studio.
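As a sketch of how an older and an upgraded stack can co-exist, the environment names and Python versions below are illustrative only:

```shell
# Create two isolated environments side by side.
conda create -n cs_legacy python=3.8    # stack used by an older model
conda create -n cs_current python=3.11  # upgraded stack for new models

# Activate whichever environment a given model needs.
conda activate cs_legacy
```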
- Workspace: Compliance Studio provides the ability to create and manage sandbox workspaces for the creation and testing of models in a discrete schema with a subset of production data before deployment to the production workspace, where the model will run directly on FCCM application data. The application comes with two predefined schemas to be used in sandboxes for model development with different subsets of production data. The workspace administration functionality and orchestration capability manage the movement of data from production to the sandbox.
The workspace provides granular user access control for various activities performed within the workspace, including data access, Notebook access, Scheduler access, and so on. These workspaces allow models to be tested in the sandbox before deployment into the production environment.