org.apache.nutch.protocol
Interface Protocol

All Superinterfaces:
Configurable, Pluggable

public interface Protocol
extends Pluggable, Configurable

Retrieves URL content. Implemented by protocol extensions.


Field Summary
static String CHECK_BLOCKING
          Property name; when true, protocol implementations enforce "politeness" limits internally.
static String CHECK_ROBOTS
          Property name; when true, protocol implementations enforce robot exclusion rules internally.
static String X_POINT_ID
          The name of the extension point.
 
Method Summary
 ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
          Returns the ProtocolOutput (the fetched Content and its status) for a fetchlist entry.
 RobotRules getRobotRules(Text url, CrawlDatum datum)
          Retrieves the robot rules applicable to this URL.
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

X_POINT_ID

static final String X_POINT_ID
The name of the extension point.


CHECK_BLOCKING

static final String CHECK_BLOCKING
Property name. If this property is set to true in the current configuration, protocol implementations should enforce "politeness" limits internally. If it is set to false, these limits are assumed to be enforced elsewhere, and protocol implementations should not enforce them internally.

See Also:
Constant Field Values

CHECK_ROBOTS

static final String CHECK_ROBOTS
Property name. If this property is set to true in the current configuration, protocol implementations should enforce robot exclusion rules internally. If it is set to false, these rules are assumed to be enforced elsewhere, and protocol implementations should not enforce them internally.

See Also:
Constant Field Values
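
The following is a minimal sketch of how an implementation might read the two flags above. It is not taken from the Nutch sources: java.util.Properties stands in for the Hadoop Configuration, and the property-name strings, the default of true, and the class name ProtocolFlagsSketch are all assumptions made for illustration. Consult Constant Field Values for the actual names.

```java
import java.util.Properties;

class ProtocolFlagsSketch {
    // Assumed property names, for illustration only.
    static final String CHECK_BLOCKING = "protocol.plugin.check.blocking";
    static final String CHECK_ROBOTS = "protocol.plugin.check.robots";

    final boolean checkBlocking;
    final boolean checkRobots;

    ProtocolFlagsSketch(Properties conf) {
        // Defaulting to true when a property is unset is an assumption of this sketch.
        checkBlocking = Boolean.parseBoolean(conf.getProperty(CHECK_BLOCKING, "true"));
        checkRobots = Boolean.parseBoolean(conf.getProperty(CHECK_ROBOTS, "true"));
    }

    /** Describes which checks this (hypothetical) protocol would enforce itself. */
    String describe() {
        return "politeness=" + (checkBlocking ? "internal" : "external")
             + ", robots=" + (checkRobots ? "internal" : "external");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Simulate a caller that enforces robot rules elsewhere.
        conf.setProperty(CHECK_ROBOTS, "false");
        System.out.println(new ProtocolFlagsSketch(conf).describe());
    }
}
```

A caller that enforces the limits itself would set the corresponding property to false before the protocol is configured, so that the same check is not applied twice.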

Method Detail

getProtocolOutput

ProtocolOutput getProtocolOutput(Text url,
                                 CrawlDatum datum)
Returns the ProtocolOutput (the fetched Content and its status) for a fetchlist entry.


getRobotRules

RobotRules getRobotRules(Text url,
                         CrawlDatum datum)
Retrieves the robot rules applicable to this URL.

Parameters:
url - URL to check
datum - page datum
Returns:
robot rules (specific to this URL, or the default rules); never null
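
The "never null" contract can be illustrated with a small sketch. Everything here is hypothetical: the Rules interface is a simplified stand-in for RobotRules (whose real methods may differ), and the host-keyed map and the allow-all default are assumptions chosen only to show the fallback behaviour.

```java
import java.util.List;
import java.util.Map;

class RobotRulesSketch {
    /** Hypothetical, simplified stand-in for RobotRules. */
    interface Rules {
        boolean isAllowed(String path);
    }

    /** Assumed default when no host-specific rules are known: allow everything. */
    static final Rules ALLOW_ALL = path -> true;

    /** Builds rules that deny any path starting with one of the given prefixes. */
    static Rules disallowing(List<String> prefixes) {
        return path -> prefixes.stream().noneMatch(path::startsWith);
    }

    /** Returns host-specific rules, or the default; the result is never null. */
    static Rules rulesFor(String host, Map<String, Rules> known) {
        return known.getOrDefault(host, ALLOW_ALL);
    }

    public static void main(String[] args) {
        Map<String, Rules> known = Map.of("example.org", disallowing(List.of("/private")));
        // No null check is needed before using the result.
        System.out.println(rulesFor("example.org", known).isAllowed("/private/a"));
        System.out.println(rulesFor("unknown.test", known).isAllowed("/anything"));
    }
}
```

Because a default is always available, callers can apply the returned rules directly without guarding against a missing value.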


Copyright © 2007, 2012, Oracle and/or its affiliates. All rights reserved.