4 Configuring Content Collections

About Content Collections

Content collections are logical groupings of the documents that make up your knowledge base. You can create and manage external collections to make information from your organization’s web sites and other repositories available within your knowledge base.

You create external collections by logically grouping documents that have similar characteristics and content processing requirements into a single collection. You can create as many collections as you need to accommodate your organization’s content and knowledge base requirements.

Creating External Collections

You create external collections by specifying the following information.

  • The web servers that host the collection’s documents

  • The documents that make up the collection

  • How the content acquisition process interacts with the web servers and documents

  • The size and status of the collection

You create and edit external collections by using one of the following methods.

  • The Collections Wizard, which guides you through a step-by-step process

  • The Collections Form, which provides a streamlined process for experienced users

Using the Collections Form

Experienced administrators can use the collections form to define, validate, and enable document collections. You can also use the form to edit a collection by changing required or optional information in any field within the form.

You define external collections by specifying all of the required information about the web servers that host the collection and the documents that make up the collection, as well as any relevant optional information.

The form is divided into sections that prompt you to perform the following collection tasks.

  • Define the collection. You define basic collection information such as the name, locale, display locales, and so on.

  • Access the collection. You define who can access this collection.

  • Describe the collection. You specify how Search will crawl the web servers that host the documents for this collection.

  • Validate the collection. You provide optional validation rules, as needed, to the collection.

  • Review the collection options and activate the collection.

Using the Collections Wizard

The collections wizard guides you through a five-step process to create or edit an external collection.

You create an external collection by specifying all of the required collection information, as well as any relevant optional information.

You edit a collection by changing required or optional information in any of the wizard’s screens.

The wizard presents a sequence of screens that prompt you to complete the following collection tasks.

  • Define the collection.

  • Access the collection.

  • Describe the collection.

  • Validate the collection.

  • Review the collection options and activate the collection.

Managing External Collections

You can manage external collections by performing the following tasks.

  • View their status to determine whether the content processor is processing and indexing the collection properly

  • Edit collection definitions to correct errors or change configuration

  • Enable or disable collections for processing, indexing, and availability to end users

  • Delete them from the knowledge base

Defining External Collections

You can define a new external collection, or update an existing collection definition, using either the collection wizard or the collection form. You define an external collection by specifying a name and basic information, such as the locale in which the collection documents are written, and whether you want to automatically add product names and terminology to the spelling checker.

The following table describes the fields used to define an external collection using the Collection wizard or the Collection form.

Table Defining External Collections Fields

Field Description

Collection Name

Enter a name for the collection. You can use spaces and any of the following characters in collection names:

  • upper and lowercase alphabetic characters

  • numerals (0-9)

  • ( )

  • -

  • _

  • { }

  • "

  • .

Display Label

Enter the name for the collection that search users see. The display label can be in any language and can use any characters. You can edit it later if necessary.

Locale

Select Auto to use automatic language selection based on document properties, or select a specific locale (language and regional variant) for the collection. This does not change the locale for any documents where the locale has already been set to something other than Auto in Authoring and does not change URL Answers.

Add product names and internal terms in this collection to the spelling checker?

Select Yes to automatically add product names and internal terminology to the spell checker. We recommend using this option for content that is edited for spelling before publishing, such as product documentation, but not for user-generated content, such as online forums.

Collection Notes

Use this optional field to enter any general descriptive information about this collection.

Use Robots?

Robots.txt files and robots <META> tags specify which documents and links on a site are available for web crawlers to index and follow. Click Yes to use robots.txt and the robots <META> tag. The default is No.
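The Collection Name rules in the table above can be checked before you enter a name. The following is a minimal sketch, not part of the product: the function name is hypothetical, "alphabetic characters" is assumed to mean ASCII letters, and an empty name is assumed to be invalid.

```python
import re

# Characters listed in the Collection Name field description:
# spaces, letters, digits, and ( ) - _ { } " .
# Assumption: "alphabetic characters" means ASCII A-Z and a-z.
_ALLOWED_NAME = re.compile(r'[A-Za-z0-9 ()\-_{}".]+')


def is_valid_collection_name(name: str) -> bool:
    """Return True if name is non-empty and uses only the listed characters."""
    return _ALLOWED_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ['Product Docs (v1.2)', 'Docs/2018']:
        print(candidate, is_valid_collection_name(candidate))
```

A name such as Docs/2018 fails the check because the slash is not in the listed character set.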

About Languages and Locales for External Collections

When you create an external collection, you must select a locale or a language. The locale or language that you select determines which users can view the content in the collection.

A locale designates a language as it is used in a specific region or by a specific population. For example, there are differences in how English is used in the United States compared with its use in Canada. Locales have codes that identify their language and region. For example, the code for English Canada is en_CA, and the code for English United States is en_US.

You select a locale, a language, or automatic language detection for a collection based on the language of the documents it contains, as follows:

  • If the collection includes documents authored in a single language, but not a specific regional variant, then select the language.

  • If the collection includes documents authored in multiple languages, then select Auto, which uses document properties to assign a language to each document in the collection.

  • If the collection includes documents authored in a specific regional variant of a language, then select the locale.

For information on assigning languages and locales, see How to Assign Languages and Locales to External Collections.
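The three selection rules above can be expressed as a small decision function. This is an illustrative sketch only: the function name and the idea of tagging each document with a language code (such as en) or a locale code (such as en_CA) are assumptions made for the example.

```python
def recommend_locale_setting(document_tags):
    """Suggest "Auto", a language code, or a locale code for a collection.

    document_tags holds one tag per document: a language code such as
    "en", or a locale code such as "en_CA" (hypothetical input format).
    """
    languages = {tag.split("_")[0] for tag in document_tags}
    if len(languages) != 1:
        return "Auto"              # documents in multiple languages
    tags = set(document_tags)
    if len(tags) == 1 and "_" in next(iter(tags)):
        return next(iter(tags))    # one specific regional variant
    return languages.pop()         # one language, no single variant


if __name__ == "__main__":
    print(recommend_locale_setting(["en_CA", "en_CA"]))
    print(recommend_locale_setting(["en_US", "fr_FR"]))
```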

    How to Assign Languages and Locales to External Collections

    In applications that support multiple locales, each supported locale has a corresponding interface. The locale or language that you assign to a collection determines the interfaces in which it will be available, as follows:

    • Collections that have a specified locale are available only in the interface that corresponds to that locale

    • Collections that have a specified language are available in all interfaces that correspond to the base language.

    • Documents in collections that use automatic language detection are available in all interfaces that correspond to their assigned language.

    For example, an application may support the following locales:

    • English United States

    • English Canada

    • French Canada

    • French France

    You can assign collections to locales to make them available to only one interface. For example:

    • The EnglishCanadadocs collection is assigned to the English Canada locale, and is available only in the English Canada interface.

    • The FrenchFrancedocs collection is assigned to the French France locale, and is available only in the French France interface.

    You can assign collections to languages to make them available to all interfaces based on that language. For example:

    • The Englishdocs collection is assigned to the English language, and is available in the English United States and English Canada interfaces.

    • The Frenchdocs collection is assigned to the French language, and is available in the French Canada and French France interfaces.

    You can assign collections to Auto to make documents in the collection available to any interfaces based on the detected language. For example, the Mixedlanguagedocs collection contains both English and French documents. English documents in the collection will be available in the English United States and English Canada interfaces, and French documents will be available in the French Canada and French France interfaces.
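The availability rules in this section can be sketched as a lookup over the supported locales. This is an illustration under stated assumptions, not the application's implementation: the function name is hypothetical, and the codes fr_CA and fr_FR are assumed by analogy with the en_CA and en_US codes given earlier.

```python
# The four locales from the example above. fr_CA and fr_FR are assumed codes.
SUPPORTED_LOCALES = ["en_US", "en_CA", "fr_CA", "fr_FR"]


def interfaces_for(assignment, supported=SUPPORTED_LOCALES):
    """Return the interfaces (by locale code) in which content is available.

    assignment is a locale code ("en_CA") or a language code ("en").
    For an Auto collection, call this once per document with the
    language assigned to that document.
    """
    if "_" in assignment:
        # A locale is available only in its own interface.
        return [loc for loc in supported if loc == assignment]
    # A language is available in every interface based on that language.
    return [loc for loc in supported if loc.split("_")[0] == assignment]


if __name__ == "__main__":
    print(interfaces_for("en_CA"))
    print(interfaces_for("fr"))
```

With these inputs, the EnglishCanadadocs case corresponds to interfaces_for("en_CA") and the Frenchdocs case to interfaces_for("fr").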

      About Document Titles in Search Results

      A document title appears in Search results as the full title when the title is short, or as a section or excerpt of a longer title. When a document has an obvious title (simply the string of words on the first page of a document that you would recognize as its title), Search selects that title automatically for the list of search results. If a document does not have an obvious full title, Search determines the title based on the following:

      • The position of the text on a front page, for example, the first few sentences of the document.

      • The size of the font. Usually the larger text on a first page denotes a title.

      • The entry in the Title field in the Document Properties.

      In some cases the document titles may not reflect the content properly or completely. A good document title identifies the topic and begins to inform the reader about the content. A poor title can obscure vital information so your customers may not select it although it contains the desired content. For example, the title Service Cloud Administrator Guide describes the product and audience, but the title Admin Guide does not provide enough information.

      The best way to ensure a good document title is to encourage authors to add good titles to their documents. However, when communicating with authors is not possible, you can replace current document titles with more meaningful ones. For more information on editing document titles, see Editing Generated Document Titles.

        Editing Generated Document Titles

        There may be instances when you need to edit or override the generated title of a document in your Web collection to create a more relevant title. These instances might include the following:

        • The title is missing a subtitle that can provide important details for users, such as a category, model number, or release date. For example, Product ABC User Guide, Release 1.2, is a more useful title than Product User Guide.

        • The title is too ubiquitous or commonplace to be useful, for example User Manual or White Paper.

        • The author of the document provided a different title from the generated title.

        • There are multiple documents or attachments with the same name.

        Only authors and other roles that have the proper entitlements to create and edit documents and collections can override document titles.

        Use the following procedure to edit the generated title of a document or attachment.

        1. Navigate to the Collection Setup page and then click on the collection in which the document resides.

          Note: You cannot edit the name of the collection itself. You can only edit or override titles of the documents and attachments within the collection.
        2. Select the document or attachment for which you want to change an existing title.

        3. Type the new title over the existing title.

        4. Click Save.

          After the application runs nightly (or weekly) content processing on the collection, the new title appears for the document.

          Configuring Access to External Collections

          You configure content visibility for user groups and content processing authentication for external collections using either the Collection Wizard or Manage Collections, Collection Form.

          You can specify whether the content in this collection will be visible to all users, or to specified groups of users. You must also specify whether the content acquisition process needs authorization to access one or more of the web sites that host the documents in the collection.

            Restricting Collection Visibility to Specified User Groups

            If you want all users to view content in this collection, select the Yes option in the Can everyone see this collection? field. The default is Yes.

            If you want only members of selected user groups to view content in this collection, select No, and then select the desired user groups from the Available User Groups column. The Selected User Groups column displays the user groups that you have selected to view the contents of this collection. If you do not select a user group, the content is public and unrestricted.
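The visibility rules above reduce to a short check. The following sketch is illustrative only; the function and parameter names are hypothetical and do not correspond to any product API.

```python
def can_view_collection(user_groups, everyone_can_see=True, selected_groups=()):
    """Return True if a user in user_groups can view the collection.

    everyone_can_see mirrors the "Can everyone see this collection?" field.
    selected_groups mirrors the Selected User Groups column.
    """
    # With no selected group, the content is public and unrestricted.
    if everyone_can_see or not selected_groups:
        return True
    return bool(set(user_groups) & set(selected_groups))


if __name__ == "__main__":
    print(can_view_collection(["support"], False, ["support", "partners"]))
    print(can_view_collection(["guests"], False, ["support", "partners"]))
```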

              Specifying Authentication for Content Acquisition

              If the content acquisition process does not require any authorization, select the No option in the Do you need authentication to access these documents? field. No is the default.

              If content acquisition does need to be authenticated, select Yes, then specify the authentication method. You may also need to specify additional authentication information, such as form data, proxy server information, and cookie values.

                Specifying the Authentication Method

                If you select Yes in the Do you need authentication to access these documents? field, you must select one of the following methods for authenticating the content acquisition process.

                • BASIC specifies to use plain text user name and password information. You can also specify an optional Kerberos realm, which may be required for environments that are configured to trust non-Windows Kerberos realms.

                • NTLM specifies to use the Windows NT LAN Manager (NTLM) protocol for access to servers using Integrated Windows Authentication for HTTP authentication. You must specify the domain in which the user name and password are valid, and you can also specify an optional Kerberos realm, which may be required for environments that are configured to trust non-Windows Kerberos realms.

                • NONE enables you to specify only required additional authentication information, such as form data, proxy server information, and cookie values, if no user name and password are required.

                The following table describes the fields to authenticate content acquisition.

                Table Content Acquisition Authentication Fields

                Field Description

                Authentication Method

                Select the appropriate authentication method:
                • BASIC

                • NTLM

                • NONE

                Username

                Specify a valid username to access the web site(s).

                Password

                Specify the password for the username.

                Realm

                Specify an optional Kerberos realm, which may be required for environments that are configured to trust non-Windows Kerberos realms.

                Domain

                In NTLM environments only, specify the domain in which the user name and password are valid.
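The field requirements for each authentication method can be summarized as a validation routine. This is a sketch under stated assumptions: the AuthSettings class and validate_auth function are hypothetical names, and the rule that BASIC and NTLM both require a user name and password is inferred from the field descriptions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuthSettings:
    method: str                       # "BASIC", "NTLM", or "NONE"
    username: Optional[str] = None
    password: Optional[str] = None
    realm: Optional[str] = None       # optional Kerberos realm
    domain: Optional[str] = None      # NTLM environments only


def validate_auth(settings: AuthSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    if settings.method not in ("BASIC", "NTLM", "NONE"):
        return [f"unknown authentication method: {settings.method}"]
    errors = []
    if settings.method in ("BASIC", "NTLM"):
        if not settings.username:
            errors.append("username is required")
        if not settings.password:
            errors.append("password is required")
    if settings.method == "NTLM" and not settings.domain:
        errors.append("domain is required for NTLM")
    if settings.method != "NTLM" and settings.domain:
        errors.append("domain applies to NTLM only")
    return errors


if __name__ == "__main__":
    print(validate_auth(AuthSettings("NTLM", "crawler", "secret")))
```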

                  Specifying Additional Authentication Data

                  When you define an external collection, you may need to specify additional authentication information, including the following details.

                  • Form data, so that the content acquisition process can properly complete any HTML-based forms to supply information that the web sites require in order to access content.

                  • Proxy server information, so that content acquisition can use an HTTP proxy server to access the content.

                  • Cookie values, so that content acquisition can supply required cookies to maintain a session or state while accessing content.

                    Specifying Form Data

                    You specify form data as a set of three types of information.

                    • A form action

                    • One or more field names associated with the action

                    • A value for each specified field name

                    You can specify multiple sets of form data; for each form action, you can specify multiple field name and value pairs.

                    The following table describes the fields you complete to specify form data.

                    Table Specifying Form Data Fields

                    Field Description

                    Form Action

                    Enter the URL at which the content acquisition process enters the associated field name-value pairs.

                    Field Name

                    Enter the field name in plain text.

                    Field Value

                    Enter the value for the specified field name in plain text.
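Form data has a natural representation as a mapping from each form action to its field name and value pairs. The sketch below shows one way to model it and to produce the URL-encoded body that an HTML form submission would carry. The URLs, field names, and function name are hypothetical; how the content acquisition process actually submits forms is not described in this guide.

```python
from urllib.parse import urlencode

# Form action URL -> list of (field name, field value) pairs.
# All URLs and field names here are made up for illustration.
FORM_DATA = {
    "https://www.example.com/login": [("user", "crawler"), ("lang", "en")],
    "https://www.example.com/terms": [("accept", "yes")],
}


def encode_form(action, form_data=FORM_DATA):
    """Return the URL-encoded body for the fields of one form action."""
    return urlencode(form_data[action])


if __name__ == "__main__":
    for action in FORM_DATA:
        print(action, "->", encode_form(action))
```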

                      Specifying Proxy Server Information

                      You specify proxy server information as a set of header data and one or more key-value pairs that the content acquisition process adds to the header information to access the proxy server.

                      The following table describes the proxy server information fields.

                      Table Proxy Server Fields

                      Field Description

                      HTTP Header Data

                      The header information.

                      Header Key

                      The header key.

                      Header Value

                      The header value.
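Header key-value pairs and cookie values both end up as HTTP request headers. The following sketch shows how such values might be combined into one header set. The function name and the sample header are hypothetical, and the exact headers that content acquisition sends are an assumption; only the standard Cookie header format (name=value pairs separated by "; ") is relied on.

```python
def build_request_headers(header_pairs, cookies=None):
    """Merge header key-value pairs and cookie values into one header dict."""
    headers = dict(header_pairs)
    if cookies:
        # Standard Cookie header format: name=value pairs joined by "; ".
        headers["Cookie"] = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return headers


if __name__ == "__main__":
    print(build_request_headers([("X-Example-Token", "abc")], {"session": "123"}))
```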

                        Specifying Content Acquisition for an External Collection

                        You must describe the collection’s contents by specifying how the content acquisition process behaves when it collects content from the host web sites. You specify the web sites by entering one or more starting point URLs. The content acquisition process begins at these specified points, and collects documents according to the content acquisition parameters that you specify, including the following data.

                        • The number of successive links from the starting point that the content acquisition process follows

                        • The types of documents to include and exclude

                        • The URLs of any sitemaps and individual documents that you want to explicitly include in the collection

                        • The URLs that the application can use to display collection documents to end users, if they are different than those that the content acquisition process uses to access the documents

                        The following table describes the web site specifications for the content acquisition process.

                        Table Web Site Specifications for Content Acquisition Process Fields

                        Field Description

                        How many URL levels do you want to include?

                        Specify the number of URL levels (crawl depth) to include in the collection. The default is 10 levels.

                        For example, a value of 4 specifies that the content acquisition process includes pages up to four levels deep, counting the starting point page as the first level and following one link (->) to reach each subsequent level:

                        Starting point page -> level 2 page (linked from starting point) -> level 3 page (linked from level 2) -> level 4 page (linked from level 3)

                        Do you have a sitemap.xml or web document url?

                        • If you have sitemaps or web document URLs, select Yes. Add them to the Sitemap URL or Web Document URL field.

                        • If you do not have sitemaps or Web Document URLs, select No. Specify where to begin acquiring data and how deep to crawl in the Starting point URLs fields.

                        Starting point URLs

                        Specify one or more top-level URLs for the collection. The content acquisition process starts at each URL, and collects documents and follows links as specified in its configuration.

                        File name patterns to include or exclude

                        Specify one or more optional accept or reject document patterns. Document patterns are regular expressions that logically define desired document characteristics. Enter each pattern in a separate field. Content acquisition accepts all documents by default; in most cases you do not need to specify explicit document acceptance patterns.

                        Web Document URL

                        Specify any specific documents that otherwise are not accessed by the collection configuration.

                        Custom Display URL

                        Specify whether the user interface displays documents using a different URL than the collection configuration.
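The interaction of starting point URLs, URL levels, and include or exclude patterns can be sketched as a breadth-first traversal over a link graph. This is a simplified model, not the product's crawler: it works on an in-memory dictionary instead of fetching pages, the function name is hypothetical, and it assumes that patterns filter which documents are collected while links are still followed from filtered pages.

```python
import re
from collections import deque


def crawl(start_urls, links, max_levels=10, accept=(), reject=()):
    """Collect URLs breadth-first from an in-memory link graph.

    links maps each URL to the URLs it links to. A starting point URL
    is level 1, so max_levels=4 reaches pages three links away.
    """
    def wanted(url):
        if any(re.search(pattern, url) for pattern in reject):
            return False
        # With no accept patterns, every document is accepted by default.
        return not accept or any(re.search(pattern, url) for pattern in accept)

    collected = []
    seen = set(start_urls)
    queue = deque((url, 1) for url in start_urls)
    while queue:
        url, level = queue.popleft()
        if wanted(url):
            collected.append(url)
        if level >= max_levels:
            continue  # do not follow links past the last level
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return collected


if __name__ == "__main__":
    site = {"/": ["/a"], "/a": ["/b"], "/b": ["/c.pdf"], "/c.pdf": ["/d"], "/d": []}
    print(crawl(["/"], site, max_levels=4))
    print(crawl(["/"], site, max_levels=4, reject=[r"\.pdf$"]))
```

In the sample graph, the page /d sits at level 5 and is therefore never collected when the level limit is 4.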

                        Using XML Sitemaps for Content Acquisition

                        Oracle recommends that you use XML sitemaps to define the contents of external collections. Sitemaps are structured indexes created specifically for use by search engines. They list the URLs of the documents in your site and include important metadata about each document, such as when it was last updated, and how frequently it changes. Using XML sitemaps is the most efficient way to specify the documents that content processing will include in a web collection.

                        You can use existing XML sitemaps, or create new sitemaps using any of the many available tools. You can also manually create sitemap files.

                        The following example shows a simple XML sitemap.

                        <?xml version="1.0" encoding="UTF-8"?>

                        <urlset xmlns="http://www.exampleofsitemap.com/schemas/oursitemap/0.5">

                            <url>
                                <loc>http://www.location.com/</loc>
                                <lastmod>2018-01-01</lastmod>
                                <changefreq>monthly</changefreq>
                                <priority>0.6</priority>
                            </url>

                        </urlset>
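To check that a sitemap is well-formed and lists the documents you expect, you can parse it with a short script. The following sketch uses the Python standard library and a sample that mirrors the example above, including its placeholder namespace. The function name is hypothetical, and the script is independent of how content processing reads sitemaps.

```python
import xml.etree.ElementTree as ET

SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.exampleofsitemap.com/schemas/oursitemap/0.5">
  <url>
    <loc>http://www.location.com/</loc>
    <lastmod>2018-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry, keyed by child element name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree reports namespaced tags as "{namespace}name".
    namespace = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    entries = []
    for url in root.findall(f"{namespace}url"):
        entry = {}
        for child in url:
            entry[child.tag[len(namespace):]] = (child.text or "").strip()
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE_SITEMAP):
        print(entry["loc"], entry.get("lastmod"))
```

A malformed file, such as one with an unclosed urlset start tag, raises xml.etree.ElementTree.ParseError instead of returning entries.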

                        Completing an External Collection

                        You complete an external collection by specifying its enablement status and setting its size limit. You set the enablement status to save, test and review, and ultimately add a collection to your application. You specify size limits to optimize content processing and index size for a collection based on the documents it contains and your organization’s requirements.

                          Enable a Collection

                          To enable a collection, select one of the following options:

                          • Keep the collection definition, but do not index for searching. The content in the collection is not available to users. Use this option to disable a collection.

                          • Place the collection in review, so that it is included in the search index, but is not available to users. Use this option to test the collection before making the content available to users.

                          • Enable this collection to go live, so that its contents are added to the search index, and are available to end users.

                          If you re-enable a previously disabled collection, content processing and indexing begins with the next scheduled content processing for the type of collection.
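The three enablement options differ only in whether the collection is indexed and whether users can see it. The sketch below models that distinction. The enum and function names are hypothetical labels chosen for the example, not values used by the application.

```python
from enum import Enum


class CollectionStatus(Enum):
    DISABLED = "disabled"   # definition kept, not indexed for searching
    REVIEW = "review"       # in the search index, not available to users
    LIVE = "live"           # in the search index and available to users


def is_indexed(status: CollectionStatus) -> bool:
    return status in (CollectionStatus.REVIEW, CollectionStatus.LIVE)


def is_visible_to_users(status: CollectionStatus) -> bool:
    return status is CollectionStatus.LIVE


if __name__ == "__main__":
    for status in CollectionStatus:
        print(status.value, is_indexed(status), is_visible_to_users(status))
```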

                            Specifying the Size Limit for a Collection

                            You specify a size limit for a collection by selecting a pre-defined size limit type or defining a custom limit. You specify size limits to optimize content processing and index size for a collection based on the documents it contains and your organization’s requirements.

                            The following table describes the size limits for a collection.

                            Table Size Limits for a Collection

                            Size Limit Collection Contents

                            Regular

                            Small and medium-sized documents, most of which require complete indexing. This is the default collection size limit. Content processing:
                            • Indexes only the first 100KB of each document

                            • Does not index documents larger than 5MB

                            Archive

                            Small and medium-sized documents that can be returned as answers based on content that occurs near the beginning of the document, and therefore do not require complete indexing. Content processing:
                            • Indexes only the first 2KB of each document

                            • Does not index documents larger than 2MB

                            For Archive collections, content processing indexes only the important sections of documents, thereby enabling faster content processing and reduced memory use.

                            Manuals

                            Small, medium and large documents that require indexing of virtually all pages. Content processing:
                            • Indexes only the first 500KB of each document

                            • Does not index documents larger than 5MB

                            For Manuals collections, content processing indexes entire documents (up to 500KB), thereby enabling more complete and detailed search results based on the entire document contents. This setting requires greater content processing resources, resulting in increased content processing time and greater memory use.

                            Custom

                            Documents of any size for which the pre-defined size types do not meet collection requirements. You can specify limits on:
                            • The amount of each document’s content that is indexed, using the Trim Index After field

                            • The maximum size of documents to be indexed, using the Skip Documents Larger Than field

                            Note: You can leave these fields empty to specify no limit. You can specify values in KB, MB, and GB.
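The effect of the two limits on a single document can be worked through in a few lines. This sketch uses the values from the table above and treats 1KB as 1024 bytes, which is an assumption. The function names are hypothetical, and the sketch measures raw document size, which may differ from how content processing measures it.

```python
KB = 1024          # assumption: binary units
MB = 1024 * KB
GB = 1024 * MB

# Size limit type -> (trim index after, skip documents larger than), in bytes
SIZE_LIMITS = {
    "Regular": (100 * KB, 5 * MB),
    "Archive": (2 * KB, 2 * MB),
    "Manuals": (500 * KB, 5 * MB),
}


def parse_size(text):
    """Convert a Custom value such as "500KB" to bytes; empty means no limit."""
    text = text.strip().upper()
    if not text:
        return None
    for suffix, factor in (("KB", KB), ("MB", MB), ("GB", GB)):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized size: {text!r}")


def indexed_bytes(document_size, trim_after, skip_larger_than):
    """Return how many bytes are indexed, or None if the document is skipped."""
    if skip_larger_than is not None and document_size > skip_larger_than:
        return None
    if trim_after is None:
        return document_size
    return min(document_size, trim_after)


if __name__ == "__main__":
    print(indexed_bytes(300 * KB, *SIZE_LIMITS["Regular"]))
    print(indexed_bytes(6 * MB, *SIZE_LIMITS["Regular"]))
```

For a Custom collection, pass the results of parse_size for the Trim Index After and Skip Documents Larger Than fields; an empty field yields None, which means no limit.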

                              Editing an External Collection

                              You can edit any configuration settings within existing collections using either the collection wizard or the collection form. You edit an external collection by selecting its name in the existing collections list on the Manage Collections page. When you select a collection and a method for editing, you can change the following data.

                              • The collection’s name, display label, and basic information

                              • Whether the content acquisition process needs authorization, the authentication method, and additional authentication information

                              • The collection description

                              • The validation rules

                              • Whether the collection is active

                              Delete or Disable a Collection

                              You can delete external collections from Knowledge Advanced. When you delete an external collection, the following operations occur.

                              • Knowledge Advanced removes the configuration information and all indexing data.

                              • Content processing stops collecting and indexing the documents in the collection.

                              • The collection’s contents are not available as answers after the next content processing cycle completes.

                              You cannot delete internal collections from the Manage Collections page. You delete an internal collection by deleting the corresponding authoring content type.

                              Use the following procedure to delete an external collection:

                              1. In the Existing Collections list, select the collection you want to delete.

                              2. In the Edit <collection_name> dialog, select Collection Form.

                              3. Scroll down to Complete the Collection.

                              4. For Enable this Collection, select Delete this Collection.

                              5. Select Save.

                              6. Confirm the delete action.

                              Knowledge Advanced deletes all of the collection configuration information, and the Manage Collections page removes the collection from the list.

                              You can disable an external collection rather than deleting one. When you disable an external collection:

                              • Knowledge Advanced preserves the configuration information but removes all indexing data.

                              • The collection still appears in Existing Collections with the status of Disabled.

                              • Content processing stops collecting and indexing the documents in the collection.

                              • The collection contents are not available as answers after the next content processing cycle completes.

                              Use the following procedure to disable an external collection.

                              1. In the Existing Collections list, select the collection you want to disable.

                              2. In the Edit <collection_name> dialog, select Collection Form.

                              3. Scroll down to Complete the Collection.

                              4. For Enable this Collection, select Keep the collection definition, but do not index for searching.

                              5. Select Save.

                              Viewing Collection Information

                              You can view detailed information about internal and external collections, including the following details.

                              • How the application processed the collection content.

                              • The current validation rules.

                              • Documents included in the collection.

                              You view collection details by selecting View Detail in the Actions Column of the Existing Collections list on the Manage Collections page.

                              You can also use the View Detail page to create or edit collection validation rules and document titles and locales.

                              Note: This information pertains to the documents that were crawled but not indexed. For example, you may see the number 5 in the Number of Documents field, but only 2 documents appear in the Documents in this Collection table. This means that the remaining 3 documents were indexed and therefore do not appear in the table.

                                Reviewing Collection Processing Information

                                You can view collection details by selecting View Detail in the Actions Column of the Existing Collections list on the Manage Collections page.

                                The Collection Processing Detail section of the View Detail page shows processing information for a collection. Processing information helps you to determine whether the application is processing the collection and creating the index as required.

                                The following table lists and describes the fields in the Collection Processing Detail section.

                                Table Collection Processing Detail Columns

                                Field Description

                                Status

                                The most recently completed content processing step.

                                Number of Documents

                                The number of documents that the content acquisition process collected in its most recent cycle.

                                Number of Published Documents

                                The number of documents that content processing indexed in the most recent content processing run. This count includes each translation. For example, if you have one document in 5 languages in a content type, the count is 5.

                                Raw Size

                                The size of the collection after content acquisition.

                                Index Size

                                The size of the search index for the collection.

                                  About Content Processing Schedules

                                  The application automatically schedules both incremental and full processing jobs for internal and external collections. Content processing jobs render the collection content visible and available to your users for searching.

                                  Incremental jobs process only content in the collection that has changed, and full jobs process all content in a collection.

                                  The Collection Setup page shows the current scheduling for the following content processing jobs:

                                  • KB Incremental content update. This job processes changes to knowledge base (internal) content. The application schedules internal incremental processing to run every 15 minutes. This job processes a maximum of 10,000 documents at a time, then reschedules itself to process the remaining documents in batches of 10,000.

                                  • KB Full content update. This job processes the knowledge base collection content. The application schedules full processing for internal collections once a week. This job processes a maximum of 10,000 documents at a time, and the remaining documents are processed in batches of 10,000 by the KB Incremental content update job.

                                  • Web Sitemap Incremental content update. This job processes changed or additional sitemaps that define collection content. The application schedules incremental sitemap processing overnight when sitemaps have been updated or changed.

                                  • Web Full content update. This job processes all external collection content. The application schedules full processing to run once per week.

                                  • Index. This job indexes the updated documents. The application schedules these processing jobs automatically every 15 minutes. The scheduled time and day appear on this button: Index Processing is scheduled to be run at: {date} at {time}.

                                  • Indexing and Maintenance. The indexing and maintenance jobs perform the index cleanup tasks. The application schedules these processing jobs automatically each midnight. The scheduled time and day appear on this button: Index Processing and Maintenance is scheduled to be run at: {date} at {time}.
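The batching behavior described for the KB content update jobs can be illustrated with a short sketch. It is only an illustration of the arithmetic, not the application's implementation: the function name `plan_batches` and the constant `BATCH_LIMIT` are hypothetical, and the only figure taken from this guide is the 10,000-document maximum per run.

```python
from typing import List

# Per-run maximum stated in this guide for the KB content update jobs.
BATCH_LIMIT = 10_000


def plan_batches(pending_documents: int, batch_limit: int = BATCH_LIMIT) -> List[int]:
    """Return how many documents each successive run would process.

    Hypothetical helper: it models a job that processes at most
    `batch_limit` documents, then reschedules itself for the remainder.
    """
    if pending_documents < 0:
        raise ValueError("pending_documents must not be negative")
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")

    batches: List[int] = []
    remaining = pending_documents
    while remaining > 0:
        size = min(batch_limit, remaining)  # never exceed the per-run maximum
        batches.append(size)
        remaining -= size
    return batches


if __name__ == "__main__":
    # 25,000 pending documents would need three runs under this model.
    print(plan_batches(25_000))
```

Under this model, a backlog larger than 10,000 documents becomes fully searchable only after several runs, which is worth keeping in mind when you check processing status shortly after a large content change.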

                                  You can schedule content processing jobs as follows:

                                  • KB Full content update. You can schedule this job to run immediately by clicking KB Full content update is to be run on-demand. Otherwise, the application schedules this update once a week.

                                  • Web Full content update. You can schedule this job to run immediately by clicking Web Full content is to be run on-demand. Otherwise, the application schedules this update once a week.

                                    Validating External Collections

                                    You can specify validation rules to ensure that you have defined the collection correctly, and that the content acquisition process is collecting the desired content from the configured web sites. You can specify validation rules for a new collection, and add or update rules for an existing collection.

                                    You specify validation rules by selecting one or more of the available rules, and supplying a value that the content processor uses to validate the corresponding aspect of the collection. You can specify validation rules for:

                                    • The minimum and maximum number of documents in the collection

                                    • Key documents that must be present in the collection

                                    Specifying validation rules is optional; you can specify any, all, or none of the available validation rules.

                                    The following table describes the fields to complete to specify validation rules.

                                    Table Validation Rules Fields

                                    Field Description

                                    The minimum number of documents in the collection

                                    The minimum number of documents that the collection must contain. If content acquisition collects fewer documents than the number specified, Knowledge Advanced issues a warning.

                                    The maximum number of documents in the collection

                                    The maximum number of documents that the collection may contain. If content acquisition collects more documents than the number specified, Knowledge Advanced issues a warning.

                                    Key documents that the collection must contain

                                    The URLs of one or more documents that the collection must contain. If content acquisition fails to collect one or more of these documents, Knowledge Advanced issues a warning.
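The three validation rules can be summarized as simple checks on the set of collected documents. The following sketch is a hypothetical model of that logic, written to make the rules concrete. The names `ValidationRules` and `validate_collection`, the warning text, and the example URLs are assumptions for illustration and do not reflect the product's internal API or its actual messages.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class ValidationRules:
    """Hypothetical container for the optional rules described above."""
    min_documents: Optional[int] = None   # None means the rule is not set
    max_documents: Optional[int] = None
    key_documents: List[str] = field(default_factory=list)


def validate_collection(collected_urls: Iterable[str], rules: ValidationRules) -> List[str]:
    """Return one warning per violated rule; an empty list means no warnings."""
    collected = set(collected_urls)
    count = len(collected)
    warnings: List[str] = []

    if rules.min_documents is not None and count < rules.min_documents:
        warnings.append(
            f"Collected {count} documents; expected at least {rules.min_documents}."
        )
    if rules.max_documents is not None and count > rules.max_documents:
        warnings.append(
            f"Collected {count} documents; expected at most {rules.max_documents}."
        )
    for url in rules.key_documents:
        if url not in collected:
            warnings.append(f"Key document not collected: {url}")
    return warnings


if __name__ == "__main__":
    rules = ValidationRules(
        min_documents=3,
        key_documents=["https://example.com/faq.html"],  # placeholder URL
    )
    for warning in validate_collection(["https://example.com/index.html"], rules):
        print(warning)
```

Because every rule is optional, a rule that is left unset is skipped, which mirrors the statement that you can specify any, all, or none of the available validation rules.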