org.apache.nutch.parse
Class ParseData

java.lang.Object
  extended by org.apache.hadoop.io.VersionedWritable
      extended by org.apache.nutch.parse.ParseData
All Implemented Interfaces:
Configurable, Writable

public final class ParseData
extends VersionedWritable
implements Configurable

Data extracted from a page's content.

See Also:
Parse.getData()

Field Summary
static String DIR_NAME
           
 
Constructor Summary
ParseData()
           
ParseData(ParseStatus status, String title, Outlink[] outlinks, Metadata contentMeta)
           
ParseData(ParseStatus status, String title, Outlink[] outlinks, Metadata contentMeta, Metadata parseMeta)
           
ParseData(ParseStatus status, String title, Outlink[] outlinks, Metadata contentMeta, Metadata parseMeta, DocumentFragment root, HTMLMetaTags metaTags)
           
 
Method Summary
 boolean equals(Object o)
           
 Configuration getConf()
          Return the configuration used by this object.
 Metadata getContentMeta()
          The original Metadata retrieved from the content.
 DocumentFragment getDOMRoot()
          Retrieve the DOM, if there is one.
 String getMeta(String name)
          Get a metadata single value.
 HTMLMetaTags getMetaTag()
          Returns the HTML meta tags which are populated by parsing the meta tags in the head of an HTML document.
 Outlink[] getOutlinks()
          The outlinks of the page.
 Metadata getParseMeta()
          Other content properties.
 ParseStatus getStatus()
          The status of parsing the page.
 String getTitle()
          The title of the page.
 byte getVersion()
          Return the version number of the current implementation.
 int hashCode()
           
static ParseData read(DataInput in)
           
 void readFields(DataInput in)
          Reads the fields of this object from in.
 void setConf(Configuration conf)
          Set the configuration to be used by this object.
 void setDOMRoot(DocumentFragment root)
          Set the DOM.
 void setMetaTag(HTMLMetaTags metaTags)
           
 void setParseMeta(Metadata parseMeta)
           
 String toString()
           
 void write(DataOutput out)
          Writes the fields of this object to out.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

DIR_NAME

public static final String DIR_NAME
See Also:
Constant Field Values
Constructor Detail

ParseData

public ParseData()

ParseData

public ParseData(ParseStatus status,
                 String title,
                 Outlink[] outlinks,
                 Metadata contentMeta)

ParseData

public ParseData(ParseStatus status,
                 String title,
                 Outlink[] outlinks,
                 Metadata contentMeta,
                 Metadata parseMeta)

ParseData

public ParseData(ParseStatus status,
                 String title,
                 Outlink[] outlinks,
                 Metadata contentMeta,
                 Metadata parseMeta,
                 DocumentFragment root,
                 HTMLMetaTags metaTags)
Method Detail

getStatus

public ParseStatus getStatus()
The status of parsing the page.


getTitle

public String getTitle()
The title of the page.


getOutlinks

public Outlink[] getOutlinks()
The outlinks of the page.


getContentMeta

public Metadata getContentMeta()
The original Metadata retrieved from the content.


getParseMeta

public Metadata getParseMeta()
Other content properties. This is the place to find format-specific properties. Different parser implementations for different content types will populate this differently.


setParseMeta

public void setParseMeta(Metadata parseMeta)

getMeta

public String getMeta(String name)
Get a single metadata value. This method first looks for the value in the parse metadata. If no value is found there, it then looks for it in the content metadata.

See Also:
getContentMeta(), getParseMeta()
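
The lookup order described for getMeta can be sketched as follows. This is not the Nutch source: the class name MetaLookupSketch is hypothetical, and plain java.util.Map instances stand in for the parse and content Metadata objects so that the example runs without Nutch on the classpath. It only illustrates "parse metadata first, content metadata as fallback".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ParseData; Maps replace the Metadata type.
class MetaLookupSketch {
    private final Map<String, String> parseMeta;
    private final Map<String, String> contentMeta;

    MetaLookupSketch(Map<String, String> parseMeta, Map<String, String> contentMeta) {
        this.parseMeta = parseMeta;
        this.contentMeta = contentMeta;
    }

    // Parse metadata wins; content metadata is consulted only when
    // the parse metadata has no value for the name.
    String getMeta(String name) {
        String value = parseMeta.get(name);
        if (value == null) {
            value = contentMeta.get(name);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> parse = new HashMap<>();
        Map<String, String> content = new HashMap<>();
        parse.put("Content-Type", "text/html");
        content.put("Content-Type", "application/octet-stream");
        content.put("Server", "example");

        MetaLookupSketch data = new MetaLookupSketch(parse, content);
        System.out.println(data.getMeta("Content-Type")); // found in parse metadata
        System.out.println(data.getMeta("Server"));       // falls back to content metadata
        System.out.println(data.getMeta("Missing"));      // absent from both: null
    }
}
```

A caller that needs to know which source supplied a value would query getParseMeta() and getContentMeta() directly instead of relying on the fallback.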

getDOMRoot

public DocumentFragment getDOMRoot()
Retrieve the DOM, if there is one. The DOM has been added to the ParseData object so that downstream plugins (such as ParseFilters and RecordGenerators) have access to it. Currently, the HTML parser is the only parser that generates a DOM.
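
Because only the HTML parser supplies a DOM, downstream code should be prepared for the root to be absent. The sketch below assumes that absence is signalled by null, which this page does not state explicitly. The class DomRootSketch and its helpers are hypothetical, and the fragment is built with the JDK's own DOM API in place of a real getDOMRoot() result.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

// Hypothetical helper showing defensive handling of an optional DOM root.
class DomRootSketch {

    // Counts element nodes at or below the given node.
    // A null node (assumed to mean "no DOM available") counts as zero.
    static int countElements(Node node) {
        if (node == null) {
            return 0;
        }
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    // Builds a tiny fragment standing in for what an HTML parser might produce.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            DocumentFragment root = doc.createDocumentFragment();
            Node body = root.appendChild(doc.createElement("body"));
            body.appendChild(doc.createElement("p"));
            body.appendChild(doc.createTextNode("hello"));
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements(sampleFragment())); // prints 2 (body and p)
        System.out.println(countElements(null));             // prints 0
    }
}
```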


setDOMRoot

public void setDOMRoot(DocumentFragment root)
Set the DOM. The DOM has been added to the ParseData object so that downstream plugins (such as ParseFilters and RecordGenerators) have access to it. Currently, the HTML parser is the only parser that generates a DOM.

Parameters:
root - the root of the DOM to store

getMetaTag

public HTMLMetaTags getMetaTag()
Returns the HTML meta tags which are populated by parsing the meta tags in the head of an HTML document.


setMetaTag

public void setMetaTag(HTMLMetaTags metaTags)

getVersion

public byte getVersion()
Description copied from class: VersionedWritable
Return the version number of the current implementation.

Specified by:
getVersion in class VersionedWritable

readFields

public final void readFields(DataInput in)
                      throws IOException
Description copied from interface: Writable
Reads the fields of this object from in. For efficiency, implementations should attempt to re-use storage in the existing object where possible.

Specified by:
readFields in interface Writable
Overrides:
readFields in class VersionedWritable
Throws:
IOException

write

public final void write(DataOutput out)
                 throws IOException
Description copied from interface: Writable
Writes the fields of this object to out.

Specified by:
write in interface Writable
Overrides:
write in class VersionedWritable
Throws:
IOException

read

public static ParseData read(DataInput in)
                      throws IOException
Throws:
IOException
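
The relationship between write, readFields, and the static read factory can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the ParseData implementation: the class VersionedRecordSketch, its single title field, and the version value are hypothetical, and the assumption that read simply creates an instance and delegates to readFields is not confirmed by this page. Only java.io types are used, so it runs without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record that mimics a versioned, Writable-style class.
class VersionedRecordSketch {
    private static final byte VERSION = 1; // assumed version number
    private String title = "";

    VersionedRecordSketch() {}

    VersionedRecordSketch(String title) {
        this.title = title;
    }

    String getTitle() {
        return title;
    }

    // Writes the version byte first, then the fields.
    void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(title);
    }

    // Reads into the existing object, rejecting an unknown version.
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != VERSION) {
            throw new IOException("Unexpected version: " + version);
        }
        title = in.readUTF();
    }

    // Static factory: create an empty instance, then populate it.
    static VersionedRecordSketch read(DataInput in) throws IOException {
        VersionedRecordSketch record = new VersionedRecordSketch();
        record.readFields(in);
        return record;
    }

    // Serializes a record to bytes and reads it back; returns the restored title.
    static String roundTrip(String title) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            new VersionedRecordSketch(title).write(new DataOutputStream(buffer));
            DataInput in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            return read(in).getTitle();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Example page")); // prints Example page
    }
}
```

readFields reuses an existing object, which matches the storage re-use advice in its description, while read is a convenience for callers that have no instance yet.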

equals

public boolean equals(Object o)
Overrides:
equals in class Object

hashCode

public int hashCode()
Overrides:
hashCode in class Object

toString

public String toString()
Overrides:
toString in class Object

setConf

public void setConf(Configuration conf)
Description copied from interface: Configurable
Set the configuration to be used by this object.

Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Description copied from interface: Configurable
Return the configuration used by this object.

Specified by:
getConf in interface Configurable


Copyright © 2007, 2012, Oracle and/or its affiliates. All rights reserved.