Sun Java System Portal Server 7 Release Notes

URLScraper Authorized Access

The URLScraper includes a mechanism to get authenticated content from different URLs and to scrape content from password-protected sites. The URLScraper sends a request to the specified URL along with the user credentials. The cookies returned in the response are used for session tracking and are sent with subsequent requests to the same site.
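The cookie handling described above can be illustrated with a minimal Java sketch. It is not the URLScraper implementation. The class name SessionCookies and the method toCookieHeader are hypothetical, and the sketch assumes the login response carries ordinary Set-Cookie header values that are replayed as a single Cookie header on later requests.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Portal Server: turns the Set-Cookie
// header values from a login response into one Cookie request header value.
final class SessionCookies {

    static String toCookieHeader(List<String> setCookieHeaders) {
        List<String> pairs = new ArrayList<>();
        for (String header : setCookieHeaders) {
            // HttpCookie.parse drops attributes such as Path and Secure
            // from the name=value pair we need to send back.
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                pairs.add(cookie.getName() + "=" + cookie.getValue());
            }
        }
        // Cookies in a request header are separated by "; ".
        return String.join("; ", pairs);
    }
}
```

A caller would collect the Set-Cookie values from the login response and pass the resulting string as the Cookie header of each later request to the same site.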

The administrator provides:

loginUrl

The loginUrl is the URL given in the action attribute of the HTML form that is presented for user authentication. The loginUrl is different from the URL to be scraped. For example, to scrape http://my.yahoo.com, the loginUrl is http://login.yahoo.com/config/login.

loginFormData

The loginFormData contains the user credentials as HTTP query parameters, that is, the HTML form fields that must be passed for authentication. The keys are the names of the HTML form fields and the values are the user credentials that need to be passed. Values in square brackets are filled in at runtime.
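As an illustration of how square-bracket values might be filled in at runtime, here is a minimal Java sketch. It is an assumption about the mechanism, not the URLScraper code: the class LoginFormDataFiller, the field names login and passwd, and the placeholder names userName and password are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces each [name] token in a loginFormData-style
// template with the URL-encoded value supplied for that name.
final class LoginFormDataFiller {

    // Matches a token such as [userName]; group 1 is the text inside the brackets.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[([^\\]]+)\\]");

    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + matcher.group(1));
            }
            String encoded;
            try {
                // Encode so that characters such as '&' cannot break the query string.
                encoded = URLEncoder.encode(value, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always available", e);
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(encoded));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

With the hypothetical template login=[userName]&passwd=[password], the keys login and passwd stay as written and only the bracketed values are substituted.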

isHttpAuth

Specifies whether or not the site uses HTTP authentication. Only HTTP Basic authentication is supported at this time. The request is sent with an Authorization header that is built from the user credentials.
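For reference, the standard HTTP Basic scheme builds the Authorization header value by Base64-encoding the user name and password joined by a colon. The following Java sketch shows that encoding. The class name BasicAuthHeader is hypothetical, and the use of UTF-8 for the credentials is an assumption; the sketch does not show how URLScraper itself sets the header.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the value of an HTTP Basic Authorization header.
final class BasicAuthHeader {

    static String build(String user, String password) {
        // Basic authentication sends "user:password", Base64-encoded.
        String pair = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected when the connection itself is secured.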

formData

The formData is the data that needs to be posted when the URL is invoked.