Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl even though minor errors occurred. Use Log4J or Log4Net to track progress. For more information, see Logging and Troubleshooting.
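The sketch below shows one way to wire this up with the Log4J 1.x API; the class and method names are placeholders rather than IDK names, and Log4Net offers equivalent calls on the .NET side.

```java
import org.apache.log4j.Logger;

public class CrawlerLogging {
    // One logger per provider class; the class name here is only a placeholder.
    private static final Logger LOG = Logger.getLogger(CrawlerLogging.class);

    // Hypothetical helper showing where progress logging fits in a crawl.
    public static void logFolder(String folderPath, int documentCount) {
        LOG.info("Crawling folder " + folderPath + " (" + documentCount + " documents)");
    }

    // Log failures explicitly: the portal may still report the overall crawl
    // as successful even though individual items failed.
    public static void logFailure(String folderPath, Exception e) {
        LOG.error("Error while crawling folder " + folderPath, e);
    }
}
```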
Use relative URLs in your code to allow migration to another remote server. Note: These URLs might be relative to different base URL endpoints: the click-through URL is relative to the remote server base URL, and the indexing URL is relative to the SOAP URL. Depending on whether you implemented your Content Crawler in Java or .NET, the two endpoints might differ. For example, the Java IDK uses Axis, which exposes programs as services; in Axis, the SOAP URL is the remote server base URL with '/services' appended. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both the click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs differ: the relative URL for indexing would be "../customdocfetch?docId=12345" and the relative URL for click-through would be "customdocfetch?docId=12345". (Since the indexing URL is relative to the SOAP URL, the '../' reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL, http://server:port/sitename/customdocfetch?docId=12345.)
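As a rough illustration of that resolution, the following sketch uses java.net.URI with a placeholder host and port; it assumes the base URLs are treated as directories (note the trailing slashes), which is what makes the '../' example above resolve as described.

```java
import java.net.URI;

public class RelativeUrlDemo {
    public static void main(String[] args) {
        // Placeholder bases; the portal supplies the real values at run time.
        URI remoteBase = URI.create("http://server:8080/sitename/");          // click-through base
        URI soapBase   = URI.create("http://server:8080/sitename/services/"); // indexing (SOAP) base

        // Relative URLs returned by the Content Crawler.
        String clickThrough = "customdocfetch?docId=12345";
        String indexing = "../customdocfetch?docId=12345";

        // Both resolve to the same servlet URL:
        // http://server:8080/sitename/customdocfetch?docId=12345
        System.out.println(remoteBase.resolve(clickThrough));
        System.out.println(soapBase.resolve(indexing));
    }
}
```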
Do your initial implementations of IDocumentProvider and IDocFetchProvider in separate classes, but factor out shared code so the GetDocument and GetMetaData logic can be reused. See the Viewer sample application included with the IDK for sample code.
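One possible shape for that factoring is a plain helper class that both providers delegate to; the names and signatures below are illustrative only, and the IDK's actual interfaces are shown in the Viewer sample.

```java
import java.io.InputStream;
import java.util.Map;

// Illustrative helper shared by the IDocumentProvider and IDocFetchProvider
// implementations; the method signatures are assumptions, not IDK definitions.
public class BackEndDocumentHelper {

    // Shared content fetch used by both GetDocument code paths.
    public InputStream getDocument(String docId) throws Exception {
        // ... repository-specific fetch logic goes here ...
        throw new UnsupportedOperationException("repository-specific");
    }

    // Shared metadata lookup used by both GetMetaData code paths.
    public Map<String, String> getMetaData(String docId) throws Exception {
        // ... repository-specific metadata lookup goes here ...
        throw new UnsupportedOperationException("repository-specific");
    }
}
```

Each provider class then keeps only the portal-facing wiring and calls into the helper for repository access.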
Do not make your calls order-dependent. The portal can make the calls described above in any order, so your code must not depend on a particular calling sequence.
If a document or container does not exist, always throw a new NoLongerExistsException. This is the only way the portal can determine whether the file or folder has been deleted. Failing to throw the exception could result in an infinite loop.
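A minimal sketch of the check follows; only the NoLongerExistsException name comes from the IDK, while the import path, constructor, and surrounding method are assumptions to verify against your IDK version.

```java
// Assumed package; verify against your IDK version.
import com.plumtree.remote.crawler.NoLongerExistsException;

public class ExistenceCheck {
    // Hypothetical helper called from a provider method after the back end
    // has been queried for the document or container.
    public static void assertStillExists(String id, boolean existsInBackEnd)
            throws NoLongerExistsException {
        if (!existsInBackEnd) {
            // Signal the deletion to the portal; swallowing this condition can
            // leave the crawler re-requesting the same item indefinitely.
            throw new NoLongerExistsException();
        }
    }
}
```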
If there are no results, return a zero-length array, not an array of empty strings. (For example, return new ChildContainer[0];)
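For instance, a container listing might end like the sketch below; ChildContainer is the IDK type named above, while the import path and surrounding method are placeholders.

```java
// Assumed package; verify against your IDK version.
import com.plumtree.remote.crawler.ChildContainer;

public class ContainerResults {
    // Illustrative conversion of a back-end folder listing into the array
    // the portal expects.
    public static ChildContainer[] toChildContainers(String[] subFolderNames) {
        if (subFolderNames == null || subFolderNames.length == 0) {
            // No results: return a zero-length array, never null or an
            // array padded with empty strings.
            return new ChildContainer[0];
        }
        ChildContainer[] containers = new ChildContainer[subFolderNames.length];
        // ... populate each element from the corresponding back-end folder ...
        return containers;
    }
}
```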
Check the SOAP timeout for the back-end server and calibrate your response accordingly. In version 5.0 and above, the SOAP timeout is set in the Web Service editor. In version 4.5, the SOAP timeout must be set via a Service Configuration page.
Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Web Service editor on the HTTP Configuration page, and in the Content Source editor. You can gateway all URLs relative to the remote server, or enter individual URLs and add paths to other servers to gateway additional pages. For details, see Deploying Custom Content Crawlers.
You must define mappings for any associated Content Types before a Content Crawler is run. The portal uses the mappings in the Content Type definition to map the data returned by the Content Crawler to portal properties. Properties are stored only if you configure the Content Type mapping before running the Content Crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)
To import security settings, the back-end repository must have an associated Authentication Source. Content Crawlers that import security need the user and category (domain) defined by an Authentication Source, so you must configure the Authentication Source before the Content Crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create a new one. For details on Authentication Sources, see the portal online help. For details on security, see Configuring Custom Content Crawlers: Importing File Security.
If you use a mirrored crawl, run it only when you first import documents, and always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.
For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.
Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unnecessary directory structures; filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Content Crawler editor. For details on filters, see the portal online help.
Do not use automatic approval unless you have tested the Content Crawler. It is dangerous to use automatic approval without first testing the structure, metadata, and logs for a Content Crawler.
To clear the deletion history, you must re-open the Content Crawler editor. To re-crawl documents that have been deleted from the portal, re-open the Content Crawler editor and configure the Importing Documents settings on the Advanced Settings page, as explained in Deploying Custom Content Crawlers.