19 Content Management

This chapter provides an overview of Oracle's content management features.

Introduction to Content Management

Oracle Database includes datatypes to handle all the types of rich Internet content such as relational data, object-relational data, XML, text, audio, video, image, and spatial. These datatypes appear as native types in the database. They can all be queried using SQL. A single SQL statement can include data belonging to any or all of these datatypes.

As applications evolve to encompass increasingly richer semantics, they encounter the need to deal with the following kinds of data:

  • Simple structured data

  • Complex structured data

  • Semi-structured data

  • Unstructured data

Traditionally, the relational model has been very successful at dealing with simple structured data—the kind which can fit into simple tables. Oracle added object-relational features so that applications can deal with complex structured data—collections, references, user-defined types, and so on. Queuing technologies, such as Oracle Streams Advanced Queuing, deal with messages and other semi-structured data. This chapter discusses Oracle's technologies to support unstructured data.

Unstructured data cannot be decomposed into standard components. Data about an employee can be "structured" into a name (probably a character string), an identification (likely a number), a salary, and so on. But if you are given a photo, you find that the data really consists of a long stream of 0s and 1s. These 0s and 1s are used to switch pixels on or off, so that you see the photo on a display, but the photo cannot be broken down into any finer structure in terms of database storage.

Unstructured data such as text, graphic images, still video clips, full motion video, and sound waveforms tends to be large. A typical employee record may be a few hundred bytes, but even small amounts of multimedia data can be thousands of times larger. Some multimedia data may reside in operating system files, and it is desirable to access those files from the database.

Overview of XML in Oracle Database

Extensible Markup Language (XML) is a tag-based markup language that lets developers create their own tags to describe data that's exchanged between applications and systems. XML is widely adopted as the common language of information exchange between companies. It is human-readable; that is, it is plain text. Because it is plain text, XML documents and XML-based messages can be sent easily using common protocols, such as HTTP or FTP.

Oracle XML DB treats XML as a native datatype in the database. Oracle XML DB is not a separate server. The XML data model encompasses both unstructured content and structured data. Applications can use standard SQL and XML operators to generate complex XML documents from SQL queries and to store XML documents.
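As a rough illustration of generating an XML document from a SQL query, the following Python sketch wraps each result row in an element. It uses an in-memory SQLite database as a stand-in for Oracle Database. The table, the column names, and the rows_to_xml helper are assumptions made for this example and are not part of Oracle XML DB.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory SQLite stands in for the database; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "King", 24000.0), (2, "Kochhar", 17000.0)],
)


def rows_to_xml(cursor, root_tag, row_tag):
    """Wrap each result row in an element, with one child element per column."""
    root = ET.Element(root_tag)
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        row_el = ET.SubElement(root, row_tag)
        for col, value in zip(columns, row):
            ET.SubElement(row_el, col).text = str(value)
    return root


cur = conn.execute("SELECT id, name, salary FROM employees ORDER BY id")
doc = rows_to_xml(cur, "employees", "employee")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

The column names become element names, so the shape of the document follows the shape of the query.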

Oracle XML DB provides capabilities for both content-oriented and data-oriented access. For developers who see XML as documents (news stories, articles, and so on), Oracle XML DB provides an XML repository accessible from standard protocols and SQL.

For others, the structured-data aspect of XML (invoices, addresses, and so on) is more important. For these users, Oracle XML DB provides a native XMLType, support for XML Schema, XPath, XSLT, DOM, and so on. The data-oriented access is typically more query-intensive.
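To make the data-oriented view concrete, here is a minimal sketch that queries a small invoice document with path expressions. It uses Python's standard ElementTree module, which supports only a subset of XPath, rather than Oracle XML DB itself. The invoice layout and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice document; element and attribute names are illustrative.
INVOICE_XML = """
<invoice id="1001">
  <customer>Acme Corp</customer>
  <lines>
    <line sku="A-1" qty="2" price="10.00"/>
    <line sku="B-7" qty="1" price="4.50"/>
  </lines>
</invoice>
"""

root = ET.fromstring(INVOICE_XML)

# Path expressions select nodes by structure, much as a query selects rows.
customer = root.findtext("customer")
skus = [line.get("sku") for line in root.findall("./lines/line")]
total = sum(int(l.get("qty")) * float(l.get("price")) for l in root.iter("line"))

print(customer, skus, total)
```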

The Oracle XML developer's kits (XDK) contain the basic building blocks for reading, manipulating, transforming, and viewing XML documents, whether on a file system or stored in a database. They are available for Java, C, and C++. Unlike many shareware and trial XML components, the production Oracle XDKs are fully supported and come with a commercial redistribution license. Oracle XDKs consist of the following components:

  • XML Parsers: supporting Java, C, and C++, the components create and parse XML using industry standard DOM and SAX interfaces.

  • XSLT Processor: transforms or renders XML into other text-based formats, such as HTML.

  • XML Schema Processor: supporting Java, C, and C++, allows use of XML simple and complex datatypes.

  • XML Class Generator: automatically generates Java and C++ classes from XML schemas to send XML data from Web forms or applications.

  • XML Java Beans: visually view and transform XML documents and data with Java components.

  • XML SQL Utility: supporting Java, generates XML documents, DTDs, and schemas from SQL queries.

  • XSQL Servlet: combines XML, SQL, and XSLT in the server to deliver dynamic Web content.
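The difference between the DOM and SAX interfaces mentioned for the XML parsers can be sketched with Python's standard library, used here in place of the Oracle XDK. DOM builds the whole tree in memory before you navigate it, whereas SAX delivers events while the parser streams through the document. The catalog document and the TitleHandler class are assumptions made for this example.

```python
import xml.sax
from xml.dom import minidom

CATALOG_XML = (
    "<catalog>"
    "<book id='b1'><title>SQL Basics</title></book>"
    "<book id='b2'><title>XML Primer</title></book>"
    "</catalog>"
)

# DOM: parse the whole document into a tree, then navigate the tree.
dom = minidom.parseString(CATALOG_XML)
dom_titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]


# SAX: react to start, text, and end events as the parser streams the input.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until the end tag.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleHandler()
xml.sax.parseString(CATALOG_XML.encode("utf-8"), handler)
sax_titles = handler.titles
print(dom_titles, sax_titles)
```

The DOM approach is convenient for small documents that are navigated repeatedly. The event-driven approach keeps memory use roughly constant for large inputs.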

Overview of LOBs

The large object (LOB) datatypes BLOB, CLOB, NCLOB, and BFILE enable you to store and manipulate large blocks of unstructured data (such as text, graphic images, video clips, and sound waveforms) in binary or character format. They provide efficient, random, piece-wise access to the data.
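Random, piece-wise access means that an application can read or write a small range of a large value without transferring the rest of it. The following sketch illustrates the idea with an in-memory byte stream standing in for a LOB. The read_piece and write_piece helpers are hypothetical names for this example, not Oracle interfaces.

```python
import io


def read_piece(lob, offset, amount):
    """Read `amount` bytes starting at zero-based `offset`."""
    lob.seek(offset)
    return lob.read(amount)


def write_piece(lob, offset, data):
    """Overwrite bytes in place starting at zero-based `offset`."""
    lob.seek(offset)
    lob.write(data)


# 1,024 bytes of sample content stand in for a much larger object.
LOB = io.BytesIO(bytes(range(256)) * 4)
piece = read_piece(LOB, 300, 4)
print(piece)
```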

With the growth of the internet and content-rich applications, it has become imperative that databases support a datatype that fulfills the following:

  • Can store unstructured data with compression, encryption, or deduplication.

  • Is optimized for large amounts of such data: up to 128 terabytes.

  • Provides a uniform way of accessing large unstructured data within the database or outside in operating system files which are read-only.

For the LOBs with STORE AS SECUREFILE option (introduced in release 11.1) you can specify the SQL parameter DEDUPLICATE in CREATE TABLE and ALTER TABLE statements. This enables you to specify that LOB data that are identical in two or more rows in a LOB column will all share the same data blocks, thus saving disk space. KEEP_DUPLICATES turns off this capability. The following options are also used with SECUREFILE:

  • The parameter COMPRESS turns on LOB compression. NOCOMPRESS turns LOB compression off.

  • The parameter ENCRYPT turns on LOB encryption and optionally selects an encryption algorithm. NOENCRYPT turns off LOB encryption.
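The effect of deduplication and compression can be pictured with a toy content-addressed store, in which rows holding identical data share one compressed copy. This is only a conceptual sketch under that assumption. It does not describe how SECUREFILE LOBs are implemented, and the DedupLobStore class is invented for the example.

```python
import hashlib
import zlib


class DedupLobStore:
    """Toy store: identical values share one compressed copy."""

    def __init__(self):
        self._blocks = {}  # content digest -> compressed bytes
        self._rows = {}    # row id -> content digest

    def put(self, row_id, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blocks:  # store new content only once
            self._blocks[digest] = zlib.compress(data)
        self._rows[row_id] = digest

    def get(self, row_id):
        return zlib.decompress(self._blocks[self._rows[row_id]])

    def stored_copies(self):
        return len(self._blocks)

    def stored_bytes(self):
        return sum(len(b) for b in self._blocks.values())


store = DedupLobStore()
report = b"quarterly report " * 1000
store.put(1, report)
store.put(2, report)  # same content as row 1, so no new copy is stored
store.put(3, b"a different document")
print(store.stored_copies(), store.stored_bytes(), len(report))
```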

The pre-release 11.1 LOBs paradigm is the default. It is also now explicitly set by the option STORE AS BASICFILE.

The following SQL and PL/SQL statements, and OCI functions are used with the SECUREFILE features:

Overview of Oracle Text

Oracle Text indexes any document or textual content to add fast, accurate retrieval of information to internet content management applications, e-Business catalogs, news services, job postings, and so on. It can index content stored in file systems, databases, or on the Web.

Oracle Text allows text searches to be combined with regular database searches in a single SQL statement. It can find documents based on their textual content, metadata, or attributes. The Oracle Text SQL API makes it simple and intuitive to create and maintain Text indexes and run Text searches.

Oracle Text is completely integrated with Oracle Database, making it inherently fast and scalable. The Text index is in the database, and Text queries are run in the Oracle Database process. The Oracle Database optimizer can choose the best execution plan for any query, giving the best performance for ad hoc queries involving Text and structured criteria. Additional advantages include the following:

  • Oracle Text supports multilingual querying and indexing.

  • You can index and define sections for searching in XML documents. Section searching lets you narrow down queries to blocks of text within documents. Oracle Text can automatically create XML sections for you.

  • A Text index can span many Text columns, giving the best performance for Text queries across more than one column.

  • Oracle Text has enhanced performance for operations that are common in Text searching, like count hits.

  • Oracle Text leverages scalability features, such as replication.

  • Oracle Text supports local partitioned indexes.
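Combining a text search with structured criteria in one query can be sketched with a small inverted index. The Python example below is a conceptual illustration only and does not use Oracle Text. The job postings, the tokenizer, and the search function are assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical job postings: structured columns plus free text.
POSTINGS = {
    1: {"city": "Austin", "salary": 90000,
        "text": "Senior database administrator, Oracle experience"},
    2: {"city": "Boston", "salary": 120000,
        "text": "Database developer with SQL and XML skills"},
    3: {"city": "Austin", "salary": 130000,
        "text": "Java developer for web applications"},
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize(doc["text"]):
            index[token].add(doc_id)
    return index


def search(index, docs, words, predicate):
    """Ids of documents that contain every word and satisfy the predicate."""
    ids = None
    for word in tokenize(words):
        hits = index.get(word, set())
        ids = hits if ids is None else ids & hits
    ids = ids or set()
    return sorted(i for i in ids if predicate(docs[i]))


INDEX = build_index(POSTINGS)
well_paid_db_jobs = search(INDEX, POSTINGS, "database",
                           lambda d: d["salary"] > 100000)
print(well_paid_db_jobs)
```

Counting hits is then just the length of the result list, for example len(search(INDEX, POSTINGS, "developer", lambda d: True)).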

Oracle Text Index Types

There are three Oracle Text index types to cover all text search needs.

  • Standard index type (CONTEXT) for traditional full-text retrieval over documents and Web pages. This index type provides a rich set of text search capabilities for finding the content you need, without returning pages of spurious results.

  • Catalog index type (CTXCAT), designed specifically for e-Business catalogs. This index type provides flexible searching and sorting at Web speed.

  • Classification index type (CTXRULE) for building classification or routing applications. This index is created on a table of queries, where the queries define the classification or routing criteria.

Oracle Text also provides substring and prefix indexes. Substring indexing improves performance for left-truncated or double-truncated wildcard queries. Prefix indexing improves performance for right-truncated wildcard queries.
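One way to see why a dedicated index helps truncated wildcard queries is to look at a sorted vocabulary. All terms sharing a prefix are adjacent, so a right-truncated query such as data% becomes a binary search plus a short scan. Storing the reversed terms gives the same benefit for a left-truncated query such as %data. The sketch below shows this general technique. It is not a description of Oracle Text internals, it does not cover double-truncated queries, and the vocabulary is invented.

```python
from bisect import bisect_left

VOCAB = ["data", "database", "databases", "date", "metadata", "update", "xml"]


def prefix_matches(sorted_terms, prefix):
    """Terms that start with `prefix`, located by binary search."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # matching terms are contiguous in sorted order
        matches.append(term)
    return matches


def suffix_matches(sorted_reversed_terms, suffix):
    """Terms that end with `suffix`, via a prefix search over reversed terms."""
    hits = prefix_matches(sorted_reversed_terms, suffix[::-1])
    return sorted(term[::-1] for term in hits)


TERMS = sorted(VOCAB)
REVERSED_TERMS = sorted(term[::-1] for term in VOCAB)

right_truncated = prefix_matches(TERMS, "data")          # query: data%
left_truncated = suffix_matches(REVERSED_TERMS, "data")  # query: %data
print(right_truncated, left_truncated)
```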

Oracle Text Document Services

Oracle Text provides a number of utilities to view text, no matter how that text is stored.

  • Oracle Text supports over 150 document formats through its Inso filtering technology, including all common document formats like XML, PDF, and MS Office. You can also create your own custom filter.

  • You can view the HTML version of any text, including formatted documents such as PDF, MS Office, and so on.

  • You can view the HTML version of any text, with search terms highlighted and with navigation to next/previous term in the text.

  • Oracle Text provides markup information, such as the offset and length of each search term in the text, which can be used, for example, by a third-party viewer.
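The relationship between markup information and highlighting can be sketched as follows: first compute the offset and length of each search term, then use those spans to wrap the terms in HTML tags. This example uses Python regular expressions rather than the Oracle Text document services. The function names and sample sentence are assumptions.

```python
import html
import re


def term_markup(text, terms):
    """Return (offset, length) pairs for whole-word, case-insensitive matches."""
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]


def highlight_html(text, spans):
    """Wrap each span in <b> tags and HTML-escape all text."""
    parts, pos = [], 0
    for offset, length in spans:
        parts.append(html.escape(text[pos:offset]))
        parts.append("<b>" + html.escape(text[offset:offset + length]) + "</b>")
        pos = offset + length
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


TEXT = "Stocks fell while bonds & stocks abroad rose."
SPANS = term_markup(TEXT, ["stocks", "bonds"])
HIGHLIGHTED = highlight_html(TEXT, SPANS)
print(SPANS)
print(HIGHLIGHTED)
```

A separate viewer could consume the spans directly instead of the generated HTML, for example to step to the next or previous term.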

Oracle Text Query Package

The CTX_QUERY PL/SQL package can be used to generate query feedback, count hits, and create stored query expressions.

See Also:

Oracle Text Reference for information about this package

Oracle Text Advanced Features

With Oracle Text, you can find, classify, and cluster documents based on their text, metadata, or attributes.

Document classification performs an action based on document content. An action can be assigning a category ID to a document for future lookup or sending the document to a user. The result is a set, or stream, of categorized documents. For example, assume that there is an incoming stream of news articles. You can define a rule to represent the category of Finance. The rule is essentially one or more queries that select documents about the subject of finance. The rule might have the form "stocks or bonds or earnings." When a document arrives that satisfies the rules for this category, the application takes an action, such as tagging the document as Finance or e-mailing one or more users.
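The Finance rule described above can be sketched as a table of categories, each with a list of words combined by OR. The Python example below is a simplified illustration of rule-based routing and not the Oracle Text classification interface. The Sports category and the classify function are assumptions added for the example.

```python
import re

# Hypothetical rule table: a document matches a category if it contains
# any of the listed words, mirroring "stocks or bonds or earnings".
RULES = {
    "Finance": ["stocks", "bonds", "earnings"],
    "Sports": ["football", "tennis"],
}


def classify(document, rules):
    """Return the sorted list of categories whose rule the document satisfies."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return sorted(cat for cat, keywords in rules.items() if words & set(keywords))


tags = classify("Earnings beat forecasts as stocks rallied", RULES)
print(tags)
```

An application would then act on the returned categories, for example by tagging the document or notifying subscribed users.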

Clustering is the unsupervised division of patterns into groups. The interface lets users select the appropriate clustering algorithm. Each cluster contains a subset of the documents in the collection. A document within a cluster is believed to be more similar to the other documents in that cluster than to documents outside it. Clusters can be used to build features such as presenting similar documents in the collection.
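As a minimal sketch of unsupervised grouping, the following example assigns each document to the first cluster whose first member shares enough words with it, and otherwise starts a new cluster. This simple leader-style method with Jaccard similarity is chosen for brevity. It is not claimed to be one of the algorithms that Oracle Text offers, and the documents and threshold are invented.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by all distinct words in the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs, threshold):
    """Group document indexes; no category labels are supplied in advance."""
    clusters = []  # each entry: (tokens of the first member, member indexes)
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for leader_tokens, members in clusters:
            if jaccard(doc_tokens, leader_tokens) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((doc_tokens, [i]))
    return [members for _, members in clusters]


DOCS = [
    "stocks and bonds rally",
    "football season opens",
    "bonds and stocks slide",
    "football season ends",
]
groups = leader_cluster(DOCS, threshold=0.5)
print(groups)
```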

Overview of Oracle Ultra Search

Oracle Ultra Search is built on Oracle Database and Oracle Text technology. It provides uniform search-and-locate capabilities over multiple repositories: Oracle databases, other ODBC-compliant databases, IMAP mail servers, HTML documents served by a Web server, files on disk, and more.

Oracle Ultra Search uses a crawler to index documents; the documents stay in their own repositories, and the crawled information is used to build an index that stays within your firewall in a designated Oracle database. Oracle Ultra Search also provides APIs for building content management solutions.

Oracle Ultra Search offers the following:

  • A complete text query language for text search inside the database

  • Full integration with Oracle Database and the SQL query language

  • Advanced features like concept searching and theme analysis

  • Indexing of all common file formats (150+)

  • Full globalization, including support for Chinese, Japanese and Korean (CJK), and Unicode

Overview of Oracle Multimedia

Oracle Multimedia (formerly known as Oracle interMedia) is a feature that enables Oracle Database to store, manage, and retrieve images, Digital Imaging and Communications in Medicine (DICOM), audio, and video data in an integrated fashion with other enterprise information. Oracle Multimedia extends Oracle Database reliability, availability, and data management to media content and medical image content in traditional, medical, Internet, electronic commerce, and media-rich applications.

Oracle Multimedia manages media content by providing the following:

  • Storage and retrieval of media data in the database to synchronize the media data with the associated business data

  • Support for popular image, audio, and video formats

  • Extraction of format and application metadata into XML documents

  • Full object and relational interfaces to Oracle Multimedia services

  • Access through traditional and Web interfaces

  • Querying using associated relational data and extracted metadata

  • Image processing, such as thumbnail generation

  • Delivery through RealNetworks and Windows Media Streaming Servers

Oracle Multimedia manages DICOM content by providing the following:

  • Storage and retrieval of medical imaging data in the database to synchronize the DICOM data with the associated business data

  • Full object and relational interfaces to Oracle Multimedia DICOM services

  • Extraction of DICOM metadata into user-specifiable XML documents

  • Querying using associated relational data and extracted metadata

  • Image processing, such as thumbnail generation

  • Creation of new DICOM objects

  • Conformance validation based on a set of user-specified conformance rules

  • Making DICOM objects anonymous based on user-defined rules that specify the set of attributes to be made anonymous and how to make those attributes anonymous

  • The ability to update run-time behaviors, such as the version of the DICOM standard supported, without installing a new release of Oracle Database
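Rule-driven anonymization, as described in the list above, can be pictured as a mapping from attribute names to actions. The following sketch applies "remove" and "replace" rules to a dictionary of attributes. The rule format and the anonymize function are assumptions made for this example and are not the Oracle Multimedia DICOM interface. A plain dictionary stands in for a parsed DICOM object.

```python
def anonymize(attributes, rules):
    """Return a copy of `attributes` with each rule applied.

    A rule is ("remove",) to drop the attribute or ("replace", value)
    to overwrite it. Attributes without a rule are kept unchanged.
    """
    result = dict(attributes)
    for name, rule in rules.items():
        if name not in result:
            continue
        action = rule[0]
        if action == "remove":
            del result[name]
        elif action == "replace":
            result[name] = rule[1]
        else:
            raise ValueError(f"unknown action {action!r} for {name}")
    return result


# Sample values are invented for illustration.
SCAN = {
    "PatientName": "Doe^Jane",
    "PatientID": "123456",
    "PatientBirthDate": "19700101",
    "Modality": "MR",
}
RULES = {
    "PatientName": ("replace", "ANONYMOUS"),
    "PatientID": ("replace", "0"),
    "PatientBirthDate": ("remove",),
}
anonymous_scan = anonymize(SCAN, RULES)
print(anonymous_scan)
```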

Overview of Oracle Spatial

Oracle Spatial is designed to make spatial data management easier and more natural to users of location-enabled applications and geographic information system (GIS) applications. When spatial data is stored in Oracle Database, it can be easily manipulated, retrieved, and related to all other data stored in the database.

A common example of spatial data can be seen in a road map. A road map is a two-dimensional object that contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or provinces. A road map is a visualization of geographic information. The locations of cities, roads, and political boundaries on the surface of the Earth are projected onto a two-dimensional display or piece of paper, preserving the relative positions and relative distances of the rendered objects.

The data that indicates the Earth location (such as longitude and latitude) of these rendered objects is the spatial data. When the map is rendered, this spatial data is used to project the locations of the objects on a two-dimensional piece of paper. A GIS is often used to store, retrieve, and render this Earth-relative spatial data.
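Projecting Earth locations onto a flat surface can be sketched with the equirectangular projection, which maps longitude and latitude linearly to x and y. It is used here only because it is the simplest projection to write down. It keeps relative positions but distorts distances away from the equator, and it is not presented as the method that any particular GIS uses. The city coordinates are approximate.

```python
def project_equirectangular(lon_deg, lat_deg, width, height):
    """Map longitude/latitude in degrees to (x, y) on a width-by-height canvas."""
    if not (-180.0 <= lon_deg <= 180.0 and -90.0 <= lat_deg <= 90.0):
        raise ValueError("coordinates out of range")
    x = (lon_deg + 180.0) / 360.0 * width
    y = (90.0 - lat_deg) / 180.0 * height  # y grows downward, as on a screen
    return x, y


# Approximate (longitude, latitude) pairs, for illustration only.
CITIES = {"Quito": (-78.5, -0.2), "Nairobi": (36.8, -1.3), "Oslo": (10.8, 59.9)}
POINTS = {
    name: project_equirectangular(lon, lat, 720, 360)
    for name, (lon, lat) in CITIES.items()
}
print(POINTS)
```

West-to-east and north-to-south ordering is preserved on the canvas: Quito lands left of Oslo, and Oslo lands above Quito.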

Types of spatial data (other than GIS data) that can be stored using Spatial include data from computer-aided design (CAD) and computer-aided manufacturing (CAM) systems. Instead of operating on objects on a geographic scale, CAD/CAM systems work on a smaller scale, such as for an automobile engine or printed circuit boards.

The differences among these systems are in the size and precision of the data, not the data's complexity. The systems might all involve the same number of data points. On a geographic scale, the location of a bridge can vary by a few tenths of an inch without causing any noticeable problems to the road builders, whereas if the diameter of an engine's pistons is off by a few tenths of an inch, the engine will not run.

In addition, the complexity of data is independent of the absolute scale of the area being represented. For example, a printed circuit board is likely to have many thousands of objects etched on its surface, containing in its small area information that may be more complex than the details shown on a road builder's blueprints.

These applications all store, retrieve, update, or query some collection of features that have both nonspatial and spatial attributes. Examples of nonspatial attributes are name, soil_type, landuse_classification, and part_number. The spatial attribute is a coordinate geometry, or vector-based representation of the shape of the feature.

Oracle Spatial provides a SQL schema and functions that facilitate the storage, retrieval, update, and query of collections of spatial features in Oracle Database. Spatial consists of the following:

  • A schema (MDSYS) that prescribes the storage, syntax, and semantics of supported geometric datatypes

  • A spatial indexing mechanism

  • Operators, functions, and procedures for performing area-of-interest queries, spatial join queries, and other spatial analysis operations

  • Functions and procedures for utility and tuning operations

  • Topology data model for working with data about nodes, edges, and faces in a topology.

  • Network data model for representing capabilities or objects that are modeled as nodes and links in a network.

  • GeoRaster, a feature that lets you store, index, query, analyze, and deliver GeoRaster data, that is, raster image and gridded data and its associated metadata.
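An area-of-interest query over features that carry both spatial and nonspatial attributes can be sketched as follows. The example tests each feature's point location against a polygon using the ray-casting method and then filters on a nonspatial attribute. It is a conceptual illustration in plain Python and does not use the Oracle Spatial operators or the MDSYS schema. The feature data and function names are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical features: nonspatial attributes plus a point geometry.
FEATURES = [
    {"name": "well_1", "soil_type": "clay", "location": (2.0, 2.0)},
    {"name": "well_2", "soil_type": "sand", "location": (8.0, 3.0)},
    {"name": "well_3", "soil_type": "loam", "location": (3.0, 4.5)},
]
AREA = [(0, 0), (5, 0), (5, 5), (0, 5)]  # area of interest as a square


def features_in_area(features, area, **attrs):
    """Names of features inside `area` whose attributes equal `attrs`."""
    return [
        f["name"]
        for f in features
        if point_in_polygon(*f["location"], area)
        and all(f.get(key) == value for key, value in attrs.items())
    ]


inside_area = features_in_area(FEATURES, AREA)
clay_inside = features_in_area(FEATURES, AREA, soil_type="clay")
print(inside_area, clay_inside)
```

A spatial index serves the same query without testing every feature, typically by first discarding features whose bounding boxes do not overlap the area.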