Setting Up Web Sources

A Web source enables users to search a Web site. The following procedures identify the basic steps for setting up a Web source using the Oracle SES Administration GUI. For more information on each page, click Help.

Oracle SES is configured to crawl Web sites on the intranet within the corporate firewall. To crawl Web sites on the Internet (external Web sites), Oracle SES requires the HTTP proxy server information. See the Global Settings - Proxy Settings page.

You should review the default crawling parameters before you start crawling Internet sources.

To create a Web source: 

  1. On the Home page, select the Sources secondary tab to display the Sources page.

  2. For Source Type, select Web.

  3. Click Create to display the Create Web Source page.

  4. Complete the following fields:

    • Source Name: Name that you assign to this Web source.

    • Starting URLs: The HTTP or HTTPS address of the Web site, starting at the top page to be searched.

    • Self Service: Select Disabled to use an identity management system, or Enabled to prompt users for their credentials.

    • Start Crawling Immediately: Select this option to accept the default parameters and begin crawling, or deselect it to defer crawling.

  5. Click Create or Create & Customize.

  6. Follow the steps for crawling and indexing a source in "Getting Started Basics for the Oracle SES Administration GUI".

Figure 6-1 shows the Create Web Source page.

Figure 6-1 Creating a Web Source


To customize a Web source: 

  1. When creating a Web source, click Create & Customize on the Create Web Source page to display the Customize Web Source page.

    Alternatively, after creating a source, click the Edit icon on the Home - Sources page.

  2. Click the following subtabs and make the desired changes.

    • Basic Settings: The choices entered on the Create Web Source page.

    • Boundary Rules: Contents of a URL that include or exclude a page from crawling.

    • Document Types: Common document and image types that you can include or exclude from crawling. By default, Oracle SES crawls HTML, Excel, PowerPoint, Word, PDF and plain text.

    • Authentication: Configuration of HTTP, HTML forms, or Oracle Single Sign-On methods of authentication. By default, no authentication is required.

    • Authorization: Configuration of an Access Control List or an authorization manager plug-in.

    • Metatag Mappings: Maps document attributes to Oracle SES search attributes. See "Web Document Attributes".

    • Crawling Parameters: Sets a variety of crawling conditions, such as depth, language, and HTTP cookies.

  3. Click Apply.

Figure 6-2 shows the Customize Web Source page.

Figure 6-2 Customizing a Web Source


Boundary Rules for Web Sources

When you create a Web source, the host name of the seed (top-level URL) is automatically added to the boundary rule. However, subsequent changes to the seed URL are not automatically reflected in the rule, so remember to synchronize the boundary rule whenever you change the seed URL. Currently, Oracle SES does not remove crawled URLs even if the original seed is removed: everything is controlled by the boundary rules.
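The effect of a host-based boundary rule can be illustrated with a short Python sketch. This is not Oracle SES code: the function names (`host_of`, `in_boundary`), the example URLs, and the simple "host must match, excluded substrings must not appear" logic are assumptions made only to show why a changed seed URL stays outside the boundary until the rule itself is updated.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_boundary(url, include_hosts, exclude_substrings=()):
    """Hypothetical boundary check: the host must be listed, and the URL
    must not contain any excluded substring."""
    if host_of(url) not in include_hosts:
        return False
    return not any(part in url for part in exclude_substrings)


# Creating the source: the seed's host name is added to the rule once.
seed = "http://www.example.com/index.html"
include_hosts = {host_of(seed)}

# Pages on the seed's host fall inside the boundary.
inside = in_boundary("http://www.example.com/products/a.html", include_hosts)

# Changing the seed later does not change the rule by itself.
new_seed = "http://docs.example.com/"
before_sync = in_boundary(new_seed, include_hosts)

# Synchronizing the rule by hand brings the new seed inside the boundary.
include_hosts.add(host_of(new_seed))
after_sync = in_boundary(new_seed, include_hosts)
```

In this sketch, `before_sync` is false and `after_sync` is true, which mirrors the need to update the boundary rule manually after editing the seed URL.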

Web Document Attributes

Oracle SES crawls and indexes these Web document attributes:

  • Title

  • Author

  • Description

  • Host

  • Keywords

  • Language

  • LastModifiedDate

  • Mimetype

  • Subject: Mapped to "Description". If the HTML file has no description metatag, then this attribute is ignored.

  • Headline1: The highest H tag text; for example, "Annual Report" from <H2>Annual Report</H2> when there is no H1 tag in the page.

  • Headline2: The second highest H tag text.

  • Reference Text: The anchor text from another Web page that points to this page.

You can define additional HTML metatags to map to a String attribute on the Home - Sources - Metatag Mapping page.
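The Headline1 and Headline2 rule in the attribute list above can be sketched in Python with the standard `html.parser` module. This is an illustrative approximation, not the Oracle SES crawler: the class and function names are hypothetical, and the sketch assumes that "second highest" means the next heading level present in the page after the highest one, and that the first heading found at each level supplies the text.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the first text found at each heading level (h1 to h6)."""

    def __init__(self):
        super().__init__()
        self.found = {}     # heading level -> first heading text at that level
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <H2> arrives as "h2".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = " ".join("".join(self._buf).split())
            if text:
                self.found.setdefault(self._level, text)
            self._level = None


def headlines(html):
    """Return (Headline1, Headline2) under the assumptions stated above."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    levels = sorted(parser.found)
    first = parser.found[levels[0]] if levels else None
    second = parser.found[levels[1]] if len(levels) > 1 else None
    return first, second


page = "<html><body><H2>Annual Report</H2><p>Text</p><h3>Summary</h3></body></html>"
result = headlines(page)
```

For the sample page, which has no H1 tag, `result` is `("Annual Report", "Summary")`, matching the example given for Headline1.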