Oracle Stream Explorer supports Big Data with the Hadoop, NoSQLDB, and HBase cartridges. The Hadoop cartridge is an extension for an Oracle CQL processor to access large quantities of data in a Hadoop distributed file system (HDFS), a non-relational data store. The NoSQL cartridge is an extension for an Oracle CQL processor to access large quantities of data in an Oracle NoSQL database, which stores data in key-value pairs. The HBase cartridge is an extension for an Oracle CQL processor to access large quantities of data in an HBase database, a distributed, column-oriented database built on top of the Hadoop file system.
This chapter includes the following sections:
Big Data refers to data sets so large and complex that traditional data processing applications are insufficient to handle them. Big Data describes a holistic information management strategy that includes and integrates many new types of data and data management alongside traditional data.
Big data has also been defined by the four Vs:
Volume — the amount of data. While volume indicates more data, it is the granular nature of the data that is unique. Big Data requires processing high volumes of low-density, unstructured Hadoop data—that is, data of unknown value, such as Twitter data feeds, click streams on a web page and a mobile application, network traffic, sensor-enabled equipment capturing data at the speed of light, and many more. It is the task of Big Data to convert such Hadoop data into valuable information.
Velocity — the fast rate at which data is received and acted upon. The highest velocity data normally streams directly into memory versus being written to disk.
Variety — new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional processing to both derive meaning and support metadata. Once understood, unstructured data has many of the same requirements as structured data, such as summarization, lineage, auditability, and privacy.
Value — data has intrinsic value, but it must be discovered. A range of quantitative and investigative techniques can be used to derive value from data.
Hadoop is an open source technology that provides access to large data sets that are distributed across clusters. One strength of the Hadoop software is that it provides access to large quantities of data not stored in a relational database. The Oracle Stream Explorer data cartridge is based on the Cloudera distribution for Hadoop (CDH), version 3u5.
The content in this guide assumes that you are already familiar with, and likely running, a Hadoop system. If you need more information about Hadoop, start with the Hadoop project web site at http://hadoop.apache.org/.
Note:
You can use the Hadoop data cartridge on UNIX and Windows even though Hadoop itself runs only in the Linux environment.
You can use the Hadoop data cartridge to integrate an existing Hadoop data source into an event processing network that can process data from files on the Hadoop distributed file system. With the data source integrated, you can write Oracle CQL query code that incorporates data from files on the Hadoop system.
When integrating a Hadoop system, keep the following guidelines in mind:
The Hadoop cluster must have been started through its own mechanism and must be accessible. The cluster is not managed directly by Oracle Stream Explorer.
A file from a Hadoop system supports only joins using a single key in Oracle CQL. However, any property of the associated event type may be used as the key, including properties whose type is other than String; the one exception is a property whose type is byte array, which cannot be used as a key.
Joins must use the equals operator. Other operators are not supported in a join condition.
For the event type you define to represent data from the Hadoop file, only tuple-based event types are supported.
The order of properties in the event type specification must match the order of fields in the Hadoop file.
To avoid throwing a NullPointerException, wait for the Hadoop Data Cartridge to finish processing before attempting to shut down the server or undeploy.
Only the following Oracle CQL to Hadoop types are supported. Any other type will cause a configuration exception to be raised.
Table 5-1 Mapping Between Datatypes for Oracle CQL and Hadoop
Oracle CQL Datatype | Hadoop Datatype
---|---
To understand how a Hadoop data source might be used with an Oracle Stream Explorer application, consider a scenario with an application that requires quick access to a very large amount of customer purchase data in real time.
In this case, the data stored in Hadoop includes all purchases by all customers from all stores. Values in the data include customer identifiers, store identifiers, product identifiers, and so on. The purchase data includes information about which products are selling best in each store. To reduce the data to a manageable state, a MapReduce function examines the data and produces a list of top buyers (those to whom incentives will be sent).
This data is collected and managed by a mobile application vendor as part of a service designed to send product recommendations and incentives (including coupons) to customers. The data is collected from multiple retailers and maintained separately for each retailer.
The Oracle Stream Explorer application provides the middle-tier logic for a client-side mobile application that is designed to offer purchase incentives to top buyers.
You use the Hadoop support included with Oracle Stream Explorer by integrating a file in an existing Hadoop system into an event processing network. With the file integrated, you have access to data in the file from Oracle CQL code.
This section describes the following:
In order to use Hadoop from Oracle Stream Explorer, you must first make configuration changes on both the Oracle Stream Explorer and Hadoop servers:
On the Oracle Stream Explorer server, add the following Hadoop configuration files to the server's bootclasspath: core-site.xml, hdfs.xml, and mapred.xml. See Administering Oracle Stream Explorer for information about the bootclasspath.
On the Hadoop server, copy the Pig JAR file to the lib directory and include it as part of the HADOOP_CLASSPATH defined in the hadoop-env.sh file.
Note:
A connection with a Hadoop data source through the cartridge might require many input/output operations, such that undeploying the application can time out or generate errors that prevent the application from being deployed again. Before undeploying an application that uses a Hadoop cartridge, be sure to discontinue event flow into the application.
Integrating a file from an existing Hadoop system is similar to the way you might integrate a table from an existing relational database. For a Hadoop file, you use the file XML element from the Oracle Stream Explorer schema specifically added for Hadoop support.
The file element is from the http://www.oracle.com/ns/ocep/hadoop namespace, so your EPN assembly file needs to reference that namespace. The file element includes the following attributes:
id -- Uniquely identifies the file in the EPN. You will use this attribute's value to reference the data source in a processor.
event-type -- A reference to the event type to which data from the file should be bound. The event type must be defined in the EPN.
path -- The path to the file in the Hadoop file system.
separator -- Optional. The character delimiter to use when parsing the lines in the Hadoop file into separate fields. The default delimiter is the comma (',') character.
operation-timeout -- Optional. The maximum amount of time, in milliseconds, to wait for the operation to complete.
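To illustrate how the separator and the event-type property order interact, here is a minimal sketch, not the cartridge's actual parser: each line of the Hadoop file is split on the configured separator, and the resulting fields are bound to the event type's properties in declaration order. The sample line and field names below are hypothetical.

```java
public class LineParseDemo {
    public static void main(String[] args) {
        // A hypothetical line from a Hadoop CSV file of customer data.
        String line = "user42,750,12 Main St,Jane Doe";
        char separator = ',';  // the cartridge's documented default

        // Split the line into fields on the separator.
        String[] fields = line.split(String.valueOf(separator));

        // Property order in the event type must match field order in the file:
        String userId = fields[0];
        int creditScore = Integer.parseInt(fields[1]);
        String address = fields[2];
        String customerName = fields[3];

        System.out.println(userId + " -> " + creditScore); // user42 -> 750
    }
}
```

This is also why the guideline above requires the order of properties in the event type specification to match the order of fields in the Hadoop file: the binding is positional, not by name.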
With the Hadoop file to integrate specified with the file element, you use the table-source element to add the file as a data source for the Oracle CQL processor in which you will be using the file's data.
In the following example, note that the http://www.oracle.com/ns/ocep/hadoop namespace (and hadoop prefix) is referenced in the beans element. The file element references a CustomerDescription.txt file for data, along with a CustomerDescription event type defined in the event type repository.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:wlevs="http://www.bea.com/ns/wlevs/spring"
       xmlns:hadoop="http://www.oracle.com/ns/ocep/hadoop/"
       xsi:schemaLocation="
         http://www.bea.com/ns/wlevs/spring
         http://www.bea.com/ns/wlevs/spring/ocep-epn.xsd
         http://www.oracle.com/ns/ocep/hadoop
         http://www.oracle.com/ns/ocep/hadoop/ocep-hadoop.xsd">
  <!-- Some schema references omitted for brevity. -->

  <!-- Event types that will be used in the query. -->
  <wlevs:event-type-repository>
    <wlevs:event-type type-name="SalesEvent">
      <wlevs:class>com.bea.wlevs.example.SalesEvent</wlevs:class>
    </wlevs:event-type>
    <wlevs:event-type type-name="CustomerDescription">
      <wlevs:properties>
        <wlevs:property name="userId" type="char"/>
        <wlevs:property name="creditScore" type="int"/>
        <wlevs:property name="address" type="char"/>
        <wlevs:property name="customerName" type="char"/>
      </wlevs:properties>
    </wlevs:event-type>
  </wlevs:event-type-repository>

  <!-- Input adapter omitted for brevity. -->

  <!-- Channel sending SalesEvent instances to the processor. -->
  <wlevs:channel id="S1" event-type="SalesEvent">
    <wlevs:listener ref="P1"/>
  </wlevs:channel>

  <!-- The file element to integrate the CustomerDescription.txt file
       from the Hadoop system into the EPN. -->
  <hadoop:file id="CustomerDescription" event-type="CustomerDescription"
               path="CustomerDescription.txt"/>

  <!-- The file from the Hadoop system tied into the query processor
       with the table-source element. -->
  <wlevs:processor id="P1">
    <wlevs:table-source ref="CustomerDescription"/>
  </wlevs:processor>

  <!-- Other stages omitted for brevity. -->
</beans>
After you have integrated a Hadoop file into an event processing network, you can query the file from Oracle CQL code.
The preceding example illustrates how you can add a file from a Hadoop system into an EPN. With the file added, you can query it from Oracle CQL code, as shown in the following example. The processor receives SalesEvent instances from a channel, but also has access to a file in the Hadoop system as CustomerDescription instances. The Hadoop file is essentially a CSV file that lists customers. Both event types have a userId property.
<n1:config xsi:schemaLocation="http://www.bea.com/ns/wlevs/config/application wlevs_application_config.xsd"
           xmlns:n1="http://www.bea.com/ns/wlevs/config/application"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <processor>
    <name>P1</name>
    <rules>
      <query id="q1"><![CDATA[
        SELECT customerName, creditScore, price, item
        FROM S1 [Now], CustomerDescription AS cust
        WHERE S1.userId = cust.userId AND S1.price > 1000
      ]]></query>
    </rules>
  </processor>
</n1:config>
The Oracle NoSQL Database is a distributed key-value database. In it, data is stored as key-value pairs, which are written to particular storage nodes. Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure, and optimal load balancing of queries.
The content in this guide assumes that you are already familiar with, and likely running, an Oracle NoSQL database. If you need more information about Oracle NoSQL, see its Oracle Technology Network page at http://www.oracle.com/technetwork/database/database-technologies/nosqldb/documentation/index.html.
Note:
To use the NoSQL Data Cartridge, you must have a license for NoSQL Enterprise Edition.
You can use the Oracle Stream Explorer NoSQL Database data cartridge to refer to data stored in Oracle NoSQL Database as part of an Oracle CQL query. The cartridge makes it possible for queries to retrieve values from an Oracle NoSQL Database store by specifying a key in the query and then referring to fields of the value associated with the key.
When integrating an Oracle NoSQL database, keep the following guidelines in mind:
The NoSQL database must have been started through its own mechanisms and must be accessible. It is not managed directly by Oracle Stream Explorer.
This release of the cartridge provides access to the database using release 2.1.54 of the Oracle NoSQL Database API.
The property used as a key in queries must be of type String. Joins can use a single key only.
Joins must use the equals operator. Other operators are not supported in a join condition.
Runaway queries that involve the NoSQL database are not supported. A runaway query has an execution time that takes longer than the execution time estimated by the optimizer.
The Oracle Stream Explorer NoSQL cartridge uses the cartridge ID com.oracle.cep.cartridge.nosqldb.
To use the Oracle Stream Explorer NoSQL Database data cartridge in a CQL application, you must declare and configure it in one or more application-scoped cartridge contexts for the application.
Integrating an existing NoSQL database is similar to the way you might integrate a table from a relational database. For a NoSQL database, you update the EPN assembly file to declare the store and link it to a CQL processor, as shown in the following example.
The following example illustrates how you can connect an event processing network to a NoSQL database. The store element provides access to a store named "kvstore-customers", using port 5000 on host kvhost-alpha or port 5010 on host kvhost-beta to make the initial connection. It defines the Oracle CQL processor P1 and makes the data in the key-value store available to it as a relation named "CustomerDescription".
The store can be referred to within Oracle CQL queries using the name "CustomerDescription". All values retrieved from the store should be serialized instances of the CustomerDescription class.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:osgi="http://www.springframework.org/schema/osgi"
       xmlns:wlevs="http://www.bea.com/ns/wlevs/spring"
       xmlns:nosqldb="http://www.oracle.com/ns/oep/nosqldb/"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans
         http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/osgi
         http://www.springframework.org/schema/osgi/spring-osgi.xsd
         http://www.bea.com/ns/wlevs/spring
         http://www.bea.com/ns/wlevs/spring/ocep-epn.xsd
         http://www.oracle.com/ns/oep/nosqldb
         http://www.oracle.com/ns/oep/nosqldb/oep-nosqldb.xsd">

  <!-- Provide access to the CustomerDescription class, which represents
       the type of values in the store. -->
  <wlevs:event-type-repository>
    <wlevs:event-type type-name="CustomerDescription">
      <wlevs:class>com.bea.wlevs.example.CustomerDescription</wlevs:class>
    </wlevs:event-type>
    <wlevs:event-type type-name="SalesEvent">
      <wlevs:class>com.bea.wlevs.example.SalesEvent</wlevs:class>
    </wlevs:event-type>
  </wlevs:event-type-repository>

  <!-- The store element declares the key-value store, along with the
       event type to which incoming NoSQL data will be bound. -->
  <nosqldb:store store-name="kvstore-customers"
                 store-locations="kvhost-alpha:5000 kvhost-beta:5010"
                 id="CustomerDescription" event-type="CustomerDescription"/>

  <wlevs:channel id="S1" event-type="SalesEvent">
    <wlevs:listener ref="P1"/>
  </wlevs:channel>

  <!-- The table-source element links the store to the CQL processor. -->
  <wlevs:processor id="P1">
    <wlevs:table-source ref="CustomerDescription"/>
  </wlevs:processor>
</beans>
If Oracle CQL queries refer to entries in a store specified by a store element, then the values of those entries must be serialized instances of the type specified by the event-type attribute. The event type class must implement java.io.Serializable.
If a query retrieves a value from the store that is not a valid serialized form, or if the value is not the serialized form for the specified class, then Oracle Stream Explorer throws an exception and event processing is halted. You can declare multiple store elements to return values of different types from the same or different stores.
After you have integrated a NoSQL database into an event processing network, you can access data from Oracle CQL code. The query can look up an entry from the store by specifying an equality relation in the query's WHERE clause.
<n1:config xsi:schemaLocation="http://www.bea.com/ns/wlevs/config/application wlevs_application_config.xsd"
           xmlns:n1="http://www.bea.com/ns/wlevs/config/application"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <processor>
    <name>P1</name>
    <rules>
      <query id="q1"><![CDATA[
        SELECT customerName, creditScore, price, item
        FROM S1 [Now], CustomerDescription AS cust
        WHERE S1.userId = cust.userId AND creditScore > 5
      ]]></query>
    </rules>
  </processor>
</n1:config>
In this example, the event type instances representing data from the S1 channel and the CustomerDescription NoSQL data source are both implemented as JavaBeans classes. Because both event types are JavaBeans classes, the Oracle CQL query can access the customer description associated with a particular event by equating the event's user ID with that of the customer description in the WHERE clause, treating both as JavaBeans properties:
WHERE S1.userId = CustomerDescription.userId
This clause requests that an entry be retrieved from the store with the key specified by the value of the event's userId field. Only equality relations are supported for obtaining entries from the store.
Once an entry from the store has been selected, fields from the retrieved value can be referred to in the SELECT portion of the query or in additional conditions in the WHERE clause.
The creditScore value specified in the SELECT clause includes, in the query output, the value of the creditScore field of the CustomerDescription object retrieved from the store. The reference to creditScore in the WHERE clause further restricts the query to events where the value of the CustomerDescription creditScore field is greater than 5.
The key used to obtain entries from the store can be formatted in one of two ways: by beginning the value with a forward slash ('/') or by omitting a slash.
If the value specified on the left-hand side of the equality relation starts with a forward slash, then the key is treated as a full key path that specifies one or more major components, as well as minor components if desired. For more details on the syntax of key paths, see the information about the oracle.kv.Key class in the Oracle NoSQL Database API documentation at http://docs.oracle.com/cd/NOSQL/html/javadoc/index.html.
For example, if the userId field of a SalesEvent object has the value "/users/user42/-/custDesc", then that value will be treated as a full key path that specifies "users" as the first major component, the user ID "user42" as the second major component, and a minor component named "custDesc".
As a convenience, if the value specified on the left-hand side of the equality relation does not start with a forward slash, then it is treated as a single major component that comprises the entire key.
Note that keys used to retrieve entries from the store must be specified in full by a single field accessed by the Oracle CQL query. In particular, if a key path with multiple components is required to access entries in the key-value store, then the full key path expression must be stored in a single field that is accessed by the query.
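As an illustration of the key-path convention described above, the following sketch decomposes a full key path string into its major and minor components. This is a sketch of the string format only, not the oracle.kv.Key API; the class and variable names are hypothetical.

```java
public class KeyPathDemo {
    public static void main(String[] args) {
        // A full key path: major components precede "/-/", minor components follow.
        String keyPath = "/users/user42/-/custDesc";

        // Drop the leading slash, then split into major and minor halves.
        String[] parts = keyPath.substring(1).split("/-/", 2);
        String[] majorComponents = parts[0].split("/");  // "users", "user42"
        String[] minorComponents = parts.length > 1
                ? parts[1].split("/")                    // "custDesc"
                : new String[0];

        System.out.println("major: " + String.join("/", majorComponents)); // major: users/user42
        System.out.println("minor: " + String.join("/", minorComponents)); // minor: custDesc
    }
}
```

A value without a leading slash, such as "user42", would not be decomposed this way; as noted above, it is treated as a single major component.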
The HBase Big Data Cartridge is an integration of HBase with Oracle Stream Explorer. HBase is a distributed, versioned, non-relational NoSQL database.
The HBase Big Data Cartridge does not support SQL as a primary means to access data. HBase provides Java APIs to retrieve the data. Every row has a key, and all columns belong to a particular column family. Each column family consists of one or more qualifiers, so a combination of row key, column family, and column qualifier is required to retrieve data. HBase is suitable for storing Big Data without an RDBMS.
Every table has a row key, similar to a primary key in a relational database. The HBase column qualifier is similar to the concept of minor keys in Oracle NoSQL Database. For example, the major key for records could be the name of a person, whereas the minor keys would be the different pieces of information that you want to store for that person.
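Conceptually, this addressing model can be sketched as a nested map from row key to column family to column qualifier. The following is an illustration of the data model only, not HBase client code, and the keys and values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class HBaseModelDemo {
    public static void main(String[] args) {
        // row key -> column family -> column qualifier -> value
        Map<String, Map<String, Map<String, String>>> table = new HashMap<>();

        // Store a value under row "user42", family "data", qualifier "firstname".
        table.computeIfAbsent("user42", k -> new HashMap<>())
             .computeIfAbsent("data", k -> new HashMap<>())
             .put("firstname", "Jane");

        // Retrieval requires the full (row key, family, qualifier) triple.
        String value = table.get("user42").get("data").get("firstname");
        System.out.println(value); // Jane
    }
}
```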
In HBase, the priority is to store Big Data efficiently and not perform any complex data retrieval operations.
The code snippets below give an idea of how data can be stored and retrieved in HBase:
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, "myTable");

// Store a value: a BatchUpdate collects changes for the row "myRow".
BatchUpdate batchUpdate = new BatchUpdate("myRow");
batchUpdate.put("myColumnFamily:columnQualifier1", "columnQualifier1value".getBytes());
table.commit(batchUpdate);

// Retrieve the value by row key and family:qualifier.
Cell cell = table.get("myRow", "myColumnFamily:columnQualifier1");
String valueStr = new String(cell.getValue());
HBase is used to store metadata for various applications. For example, a company may store customer information associated with various sales in an HBase database. In this case, you can use the HBase Big Data Cartridge that enables you to write CQL queries using HBase as an external data source.
The HBase database store EPN component is provided as a data cartridge. The HBase database cartridge provides a <store> EPN component with the following properties:
id: the ID of the EPN component.
store-location: the location, in the form domain:clientport, of an HBase database server.
event-type: the schema for the store as seen by the CQL processor. The event type must be a Java class that implements the java.io.Serializable interface.
table-name: the name of the HBase table.
The column mappings for the store are declared with the following properties:
name: the id of the <store> EPN component for which the mappings are being declared.
rowkey: the row key attribute used for the HBase table. This must be a String.
cql-attribute: the CQL attribute name used in the CQL query. This name should match a corresponding field declared in the Java event type.
hbase-family: the HBase column family.
hbase-qualifier: the HBase column qualifier.
To use an HBase cartridge, you must specify the hbase-family and hbase-qualifier if the cql-attribute is a primitive data type.
The <hbase:store> component is linked to a CQL processor using the table-source element, as in the following example:
<hbase:store id="User" table-name="User" event-type="UserEvent"
             store-location="localhost:5000">
</hbase:store>
<wlevs:processor id="P1">
  <wlevs:table-source ref="User"/>
</wlevs:processor>
You must specify the column mappings for the <hbase:store> component in the Oracle Stream Explorer HBase configuration file, as shown in the following example:
Example 5-1 HBase Cartridge Column Mappings
In the example below, the CQL column address is a map because it holds all the column qualifiers from the address column family. The CQL columns firstname, lastname, email, and role hold primitive data types; these are specific column qualifiers from the data column family. The userName field from the event type is the row key, and hence it does not have any mapping to an HBase column family or qualifier.
<hbase:column-mappings>
  <name>User</name>
  <rowkey>userName</rowkey>
  <mapping cql-attribute="address" hbase-family="address"/>
  <mapping cql-attribute="firstname" hbase-family="data" hbase-qualifier="firstname"/>
  <mapping cql-attribute="lastname" hbase-family="data" hbase-qualifier="lastname"/>
  <mapping cql-attribute="email" hbase-family="data" hbase-qualifier="email"/>
  <mapping cql-attribute="role" hbase-family="data" hbase-qualifier="role"/>
</hbase:column-mappings>
The UserEvent class has the following fields:
String userName;
java.util.Map address;
String firstname;
String lastname;
String email;
String role;
The HBase schema is dynamic in nature: additional column families and column qualifiers can be added at any point after an HBase table is created. Oracle Stream Explorer allows you to retrieve the event fields as a map that contains all dynamically added column qualifiers. In this case, you need to declare a java.util.Map as one of the event fields in the Java event type; hence the UserEvent event type must have a java.util.Map field named address. The cartridge does not support dynamically added column families, so the event type needs to be modified if the Oracle Stream Explorer application needs to use a newly added column family.
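The event type described above can be sketched as a plain Serializable JavaBean. This is a hypothetical sketch: the field names follow the column-mapping example, but the accessor shape and the Map's generic parameters are assumptions.

```java
import java.io.Serializable;
import java.util.Map;

public class UserEvent implements Serializable {
    private String userName;              // row key; no family/qualifier mapping
    private Map<String, String> address;  // receives dynamic qualifiers of the address family
    private String firstname;
    private String lastname;
    private String email;
    private String role;

    public String getUserName() { return userName; }
    public void setUserName(String userName) { this.userName = userName; }
    public Map<String, String> getAddress() { return address; }
    public void setAddress(Map<String, String> address) { this.address = address; }
    public String getFirstname() { return firstname; }
    public void setFirstname(String firstname) { this.firstname = firstname; }
    public String getLastname() { return lastname; }
    public void setLastname(String lastname) { this.lastname = lastname; }
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
    public String getRole() { return role; }
    public void setRole(String role) { this.role = role; }
}
```

New qualifiers added to the address family at runtime would simply appear as new entries in the address map, whereas a new column family would require adding a field to this class.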
An HBase database may be run as a cluster. The hostname and client port of the master node need to be configured.
Supported Operators
The =, !=, like, <, and > operators are supported. The first sub-clause in the query must be an equality join with the HBase data source based on the row key:
S1.userName = user.userName
The like operator accepts a Java regular expression as its argument. This operator is for String data only:
user.firstname like "Y.*"
The < and > operators are for integer and double data types only.
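Because the like operator takes a Java regular expression, its matching semantics correspond to Java's String.matches, as this small illustration shows (the sample strings are hypothetical):

```java
public class LikeDemo {
    public static void main(String[] args) {
        // "Y.*" is a Java regex: 'Y' followed by any characters.
        System.out.println("Yolanda".matches("Y.*")); // true
        System.out.println("Amy".matches("Y.*"));     // false
    }
}
```

Note that this differs from the SQL LIKE operator, which uses % and _ wildcards rather than regex syntax.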
The HBase Cartridge has the following limitations in the 12.2.1 release:
Only HBase server version 0.94.8 is supported.
When an HBase server is unreachable, you cannot deploy an HBase cartridge application to the Oracle Stream Explorer server.
If an HBase server is shut down after an HBase application has been deployed to Oracle Stream Explorer, you cannot reconnect to the HBase server unless you restart the Oracle Stream Explorer server.
The event property data type must be String when you want to join it with the HBase table row key.
In an HBase cartridge application, the store id and table-name must be the same in the Spring file; otherwise, table identification fails.
Use the wrapper data types Integer or Double for HBase event property data types to avoid runtime exceptions.
A clear error message is not shown when there is a syntax error in the CQL query of an HBase cartridge application.
A runtime exception is thrown when you try to join a null string value with the HBase row key.
The first sub-clause in the CQL query must be an equality join with the HBase data source based on row key.
You must suffix the letter d to a double value when you use the = or != operator with the double data type in a CQL query.
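For example, a hypothetical comparison against a double attribute would be written with the d suffix (the attribute name balance is illustrative, not from the examples above):

```sql
SELECT customerName
FROM S1 [Now], CustomerDescription AS cust
WHERE S1.userId = cust.userId AND cust.balance != 5000.0d
```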