Configuring document extractors

Create a JavaScript document extractor

Create a JavaScript document extractor that uses a function to extract attribute values. This type of extractor is typically used in complex scenarios where an XPath expression will not suffice.

Note

You can use a JavaScript document extractor with the advanced web crawler and the API crawler sources.

To use JavaScript to extract attribute values from URLs that match the condition of a specific GLOB pattern:

On the menu bar, click Sources, and select the source you created.
On the Source Settings page, next to Document Extractors, click Edit.
To create a document extractor, on the Document Extractors page:
- In the Name field, enter a meaningful name for the extractor.
  
  For example, enter Home loans JS extractor.
- In the Extractor Type drop-down menu, click JS.
- Optionally, to ensure that this extractor's logic only applies to URLs that match a certain pattern, configure URLs to Match. To do this, in the URLs To Match field, click Add Matcher, select the TYPE of expression you want to use, and enter the VALUE of that expression.
  
  For example, to match all URLs in this format: <some text>/homeloans/<some text>, you can click Glob Expression in the TYPE drop-down menu and enter **/homeloans/**.* as the VALUE.
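To get a feel for which URLs a glob matcher selects, the following is a minimal sketch that converts a glob into a regular expression. It assumes simplified semantics (`**` matches any characters including `/`, and a single `*` matches any characters except `/`); the matcher that the product actually uses may behave differently. The helper name `globToRegExp` and the example.com URLs are hypothetical.

```javascript
// Hypothetical helper: converts a glob into a RegExp under simplified rules.
// Assumption: "**" crosses path separators, a single "*" does not.
function globToRegExp(glob) {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        pattern += '.*';      // "**": any characters, including "/"
        i++;                  // skip the second "*"
      } else {
        pattern += '[^/]*';   // "*": any characters except "/"
      }
    } else {
      // Escape characters that have a special meaning in a RegExp.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp('^' + pattern + '$');
}

const matcher = globToRegExp('**/homeloans/**.*');

console.log(matcher.test('https://example.com/homeloans/fixed-rate.html')); // true
console.log(matcher.test('https://example.com/carloans/fixed-rate.html'));  // false
```

Checking a few representative URLs this way before saving the matcher can help confirm that the extractor will run on the intended pages only.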
In the Taggers section, click Edit next to the first tag. Usually, it is the content tag.

In the tag editor, you see a sample JavaScript function that returns the description, name, type, image_url and url attributes. Search provides this sample to help with configuration.

Note

In a document extractor, you can create multiple taggers where each is linked to a unique tag.

For example, if you have one tagger with a JavaScript function that defines how to extract five attributes, you get one set of index documents where each document has five attributes. However, if you have three taggers, and each has a JavaScript function that defines how to extract one attribute, you get three sets of index documents where each document has one attribute.
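The difference between one combined tagger and several single-attribute taggers can be sketched as follows. This is an illustration only: the function names, the selectors, and the `fake$` stub that stands in for the Cheerio object are all hypothetical, and in the tag editor each tagger has its own function.

```javascript
// Stand-in for the Cheerio object; only text() is stubbed (assumption for local runs).
const pageText = { 'h1': 'Fixed-rate home loans', 'p.summary': 'Rates and terms' };
const fake$ = (selector) => ({ text: () => pageText[selector] || '' });

const request = { url: 'https://example.com/homeloans/fixed-rate' };
const response = { body: fake$ };

// One tagger: a single function returns documents that carry several attributes.
function extractCombined(request, response) {
  const $ = response.body;
  return [{ name: $('h1').text(), description: $('p.summary').text(), url: request.url }];
}

// Two taggers: each function returns documents that carry one attribute.
function extractNameOnly(request, response) {
  const $ = response.body;
  return [{ name: $('h1').text() }];
}

function extractDescriptionOnly(request, response) {
  const $ = response.body;
  return [{ description: $('p.summary').text() }];
}

const combined = extractCombined(request, response);
const separate = [
  extractNameOnly(request, response),
  extractDescriptionOnly(request, response),
];

console.log(Object.keys(combined[0]));                      // [ 'name', 'description', 'url' ]
console.log(separate.map((docs) => Object.keys(docs[0])));  // [ [ 'name' ], [ 'description' ] ]
```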

In the tag editor, edit the sample function or paste a new JavaScript function that returns attribute values.

Note

The function must use Cheerio syntax and must return an array of objects.

For example, paste:

function extract(request, response) {
  // response.body is used as the Cheerio object for the crawled page.
  const $ = response.body;
  const url = request.url;

  // type is fixed; subtype depends on the URL path.
  const type = 'homes';
  let subtype;

  if (url.includes('/moving/')) {
    subtype = 'Moving';
  } else if (url.includes('/plans-products/')) {
    subtype = 'Plans and products';
  } else if (url.includes('/partnerships/')) {
    subtype = 'Partnerships';
  } else if (url.includes('/referrals/')) {
    subtype = 'Referrals';
  } else {
    subtype = 'misc';
  }

  // Collapse runs of whitespace into a single space and trim the ends.
  // text() returns an empty string when nothing matches, so no null check is needed.
  const clean = (text) => text.replace(/\s\s+/g, ' ').trim();

  const name = clean($('h1[class="main-title"]').text());
  const subtitle = clean($('h2[class="htmlSubtitle"]').text());
  const description = clean($('div[class="bodycopy"]').text());

  // attr() returns undefined when the attribute is missing, so test for a truthy value.
  const src = $('div[class="hero-img"]').find('img').attr('src');
  const image = src ? 'https://www.txu.com' + src : '';

  return [{
    'name': name,
    'subtitle': subtitle,
    'description': description,
    'image_url': image,
    'type': type,
    'subtype': subtype,
    'url': url
  }];
}

This function uses the following logic to get attributes:

type - use a fixed value, for example, homes.
subtype - variable that can take one of five values depending on the URL path.
url - get this from the request.url.
name - if the text from the <h1 class="main-title"> tag has a value, use that text, with runs of whitespace collapsed and the ends trimmed, as the Name. If it is empty, use an empty string for Name.
subtitle - if the text from the <h2 class="htmlSubtitle"> tag has a value, use that text, with runs of whitespace collapsed and the ends trimmed, as the Subtitle. If it is empty, use an empty string for Subtitle.
description - if the text from the <div class="bodycopy"> tag has a value, use that text, with runs of whitespace collapsed and the ends trimmed, as the Description. If it is empty, use an empty string for Description.
image_url - if the src attribute of the img inside the <div class="hero-img"> tag has a value, prefix it with https://www.txu.com and use the result as the Image_url. If it is empty, use an empty string for Image_url.
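The text cleanup and the image URL fallback described above can be tried in isolation. The following is a minimal sketch; the helper names `cleanText` and `absoluteImageUrl` and the sample inputs are hypothetical and are not part of the sample function.

```javascript
// Collapse runs of two or more whitespace characters into one space, then trim.
function cleanText(text) {
  return text.replace(/\s\s+/g, ' ').trim();
}

// Prefix a site-relative image path with the site origin;
// fall back to an empty string when the src value is missing.
function absoluteImageUrl(src, baseUrl) {
  return src ? baseUrl + src : '';
}

console.log(cleanText('  Fixed-rate \n\n  home loans  '));
// "Fixed-rate home loans"

console.log(absoluteImageUrl('/images/hero.png', 'https://www.txu.com'));
// "https://www.txu.com/images/hero.png"

console.log(absoluteImageUrl(undefined, 'https://www.txu.com'));
// "" (empty string)
```

Note that the pattern /\s\s+/g only replaces runs of two or more whitespace characters, so a single line break between two words is left as it is.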

In the tag editor, click Save.
Optionally, to extract attributes for another tag, click Add Tagger, and in the Tag drop-down menu, click a tag. Then repeat the two previous steps: edit or paste the JavaScript function in the tag editor, and click Save.
On the Document Extractors page, click Save.
