Configuring document extractors

PDF content

We've provided some sample JavaScript document extractors to help you extract attribute values from PDF content. Each example has an extract of the HTML structure of the PDF and a corresponding document extractor.

Sample 1: Simple JavaScript extractor to get attribute values from PDF content

This example shows how to create a simple JavaScript extractor that uses selectors and fallback values to get attribute values.

The PDF is available at https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. Here's what the HTML structure looks like, shortened for brevity:

<head>
  <meta name="pdf:PDFVersion" content="1.6">
  <meta name="xmp:CreatorTool" content="Adobe Acrobat Standard DC 18.11.20058">
  <meta name="dc:description" content="All the official Sitecore documentation.">
  <meta name="dc:creator" content="John Doe">
  <meta name="dcterms:created" content="2018-09-13T09:53:31Z">
  <meta name="dcterms:modified" content="2018-09-13T09:53:31Z">
  <meta name="dc:format" content="application/pdf; version=1.6">
  ...
  <title>�</title>
</head>

<body>
  <div class="page">
    <p> </p>
    <p> </p>
    <p>Web Forms for Marketers 8.0 Rev: September 13, 2018 </p>
    <p> </p>
    <p> </p>
    <p>Web Forms for Marketers 8.0 </p>
    <p>All the official Sitecore documentation. </p>
  </div>
  <div class="page">
    <p> </p>
    <p>Add an ASCX control to the page In the Web Forms for Marketers module, you can convert and export a form to an
      .ascx file and then add it to your website as an ASCX control. For developers, this can make it easier to develop
      their custom form control. </p>
    <p>To add an ASCX control to the page: </p>
    <p>1. Using a text editor, in the \layouts folder of your Sitecore installation, create a new default.aspx page and
      insert the following code: </p>
    <p>&lt;%@ Page Language="C#" AutoEventWireup="true" %&gt; </p>
    <p>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" </p>
    <p>"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt; </p>
    <p>&lt;html xmlns="http://www.w3.org/1999/xhtml" &gt; </p>
    <p>&lt;head runat="server"&gt; </p>
    <p> &lt;title&gt;Untitled Page&lt;/title&gt; </p>
    <p>&lt;/head&gt; </p>
    <p>&lt;body&gt; </p>
    <p> &lt;form id="form1" runat="server"&gt; </p>
    <p> &lt;/form&gt; </p>
    <p>&lt;/body&gt; </p>
    <p>&lt;/html&gt; </p>
    ....
    <p> </p>
  </div>
</body>

Here's a sample JavaScript document extractor to extract the name, type, website, description, author, and last_modified attributes from this PDF:

function extract(request, response) {
    // response.body exposes a jQuery-style selector over the PDF's HTML structure.
    const $ = response.body;
    // Trimmed text of the first page, reused for name and description.
    const firstPageText = $('div.page:eq(0)').text().trim();

    return [{
        'name': firstPageText.substring(0, 40) || 'No Name',
        'type': 'pdf',
        'website': 'Sitecore Documentation',
        'description': firstPageText.substring(0, 100) || 'No Description',
        'author': $('meta[name="dc:creator"]').attr('content') || 'No Author',
        'last_modified': $('meta[name="dcterms:created"]').attr('content') || 'No Last Modified Date'
    }];
}

This function uses the following logic to get attribute values:

name - trim the text in the first <div> tag with the class page. Then, use only the first 40 characters. If there is no text, set the value to No Name.
type - use a fixed value, pdf.
website - use a fixed value, Sitecore Documentation.
description - trim the text in the first <div> tag with the class page. Then, use only the first 100 characters. If there is no text, set the value to No Description.
author - use the content value of the first meta tag with the name dc:creator. If there is no value, set the value to No Author.
last_modified - use the content value of the first meta tag with the name dcterms:created. If there is no value, set the value to No Last Modified Date.
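
The trim, truncate, and fallback pattern in the list above can be tried on plain strings, outside the crawler. This is a minimal sketch: the helper name textOrFallback is hypothetical and is not part of the document extractor API.

```javascript
// Hypothetical helper for illustration only; not part of the extractor API.
// Trims the text, keeps at most maxLength characters, and returns the
// fallback when nothing is left.
function textOrFallback(text, maxLength, fallback) {
    const shortened = String(text || '').trim().substring(0, maxLength);
    return shortened || fallback;
}

console.log(textOrFallback('  Web Forms for Marketers 8.0  ', 40, 'No Name')); // Web Forms for Marketers 8.0
console.log(textOrFallback('   ', 40, 'No Name')); // No Name
```

Because the || operator treats an empty string as missing, a page that contains only whitespace also gets the default value.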

Sample 2: Complex JavaScript extractor to get attribute values from PDF content

This example shows how to create a more complex JavaScript extractor that uses nested helper functions to get the attribute values you want. It also shows how to extract a parent_url attribute to track the parent page that the PDF is linked from.

The PDF is available at https://www.sitecore.com/customers/associations/us-masters-swimming, by clicking Download case study. Here's what the HTML structure looks like, shortened for brevity:

<head>
    <meta name="pdf:PDFVersion" content="1.4">
    <meta name="xmp:CreatorTool" content="Adobe InDesign 15.0 (Macintosh)">
    <meta name="dcterms:created" content="2020-01-28T22:16:01Z">
    <meta name="dcterms:modified" content="2020-01-28T22:16:02Z">
    ...
    <title>�</title>
</head>

<body>
    <div class="page">
        <p> </p>
        <p> </p>
        <p> </p>
        <p>Industry: Associations • Founded: 1970 • Employees: 17 </p>
        <p>Headquarters: Boca Raton, Florida, USA • usms.org </p>
        <p>....
        <p> </p>
        <div class="annotation"> <a href="https://usms.org">https://usms.org</a> </div>
    </div>
    <div class="page">
        <p> </p>
        <p> </p>
        <p>Sitecore is the global leader in experience management software that combines content management, commerce,
            and customer insights. The Sitecore® Experience Cloud™ empowers marketers </p>
       ...
        <div class="annotation"> <a href="https://sitecore.com">https://sitecore.com</a> </div>
        <div class="annotation"> <a href="https://buildabonfire.com ">https://buildabonfire.com </a> </div>
    </div>
</body>

Here's a sample JavaScript document extractor to extract the id, type, last_modified, name, description, and parent_url attributes from this PDF:

function extract(request, response) {
    // Named entities that decodeEntities converts back to characters.
    const translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    const translate = {
        'nbsp': ' ',
        'amp': '&',
        'quot': '"',
        'lt': '<',
        'gt': '>'
    };

    // Decode named entities and numeric character references.
    function decodeEntities(encodedString) {
        return encodedString.replace(translate_re, function(match, entity) {
            return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
            const num = parseInt(numStr, 10);
            return String.fromCharCode(num);
        });
    }

    // Trim and decode text; return empty or missing values unchanged.
    function sanitize(text) {
        return text ? decodeEntities(String(text).trim()) : text;
    }

    const $ = response.body;
    const url = request.url;
    const id = url.replace(/[.:/&?=%]/g, '_');
    let name = sanitize($('title').text());
    const description = $('body').text().substring(0, 7000);

    // Fall back to the parent page's title when the PDF's own title is too short.
    const $p = request.context.parent.response.body;
    if (name.length <= 4 && $p) {
        name = $p('title').text();
    }

    const parentUrl = request.context.parent.request.url;
    const last_modified = request.context.parent.documents[0].data.last_modified;

    return [{
        'id': id,
        'type': 'pdf',
        'parent_url': parentUrl,
        'last_modified': last_modified,
        'name': name,
        'description': description
    }];
}

This function uses the following logic to get attribute values:

id - replace the special characters . : / & ? = % in the URL with an underscore (_).
type - use a fixed value, pdf.
parent_url - within the parent context (request.context.parent), access the request object. Then, read its url property.
last_modified - within the parent context (request.context.parent), access the first item in the documents array (documents[0]). Then, read the last_modified attribute from its data object.
name - use either the <title> HTML element of the PDF or the <title> HTML element of the parent document, as follows:
- First, sanitize the text of the <title> HTML element.
- Then, check whether the sanitized text is too short, that is, 4 characters or fewer.
- If the text is too short AND the parent document has a defined body ($p), use the text of the <title> element from the parent.
description - use the text of the <body> HTML element, limited to the first 7000 characters.
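
The id, name, and parent_url rules above can be tried on plain values, outside the crawler. The following is a minimal sketch under stated assumptions: the helper names toId, chooseName, and getParentUrl are hypothetical and not part of the document extractor API, the example.com URL is a placeholder, and getParentUrl adds a guard for a missing parent context that the sample extractor does not have.

```javascript
// Hypothetical helpers for illustration only; not part of the extractor API.

// Replace the characters . : / & ? = % with underscores, as the sample does for id.
function toId(url) {
    return url.replace(/[.:/&?=%]/g, '_');
}

// Keep the document's own name unless it is 4 characters or shorter
// and a parent name is available to replace it.
function chooseName(ownName, parentName) {
    const own = String(ownName || '').trim();
    if (own.length <= 4 && parentName) {
        return parentName;
    }
    return own;
}

// Read request.context.parent.request.url, returning undefined instead of
// throwing when any part of the parent context is missing (an added guard).
function getParentUrl(request) {
    const parent = request && request.context && request.context.parent;
    return parent && parent.request ? parent.request.url : undefined;
}

console.log(toId('https://example.com/files/case-study.pdf?v=1')); // https___example_com_files_case-study_pdf_v_1
console.log(chooseName('\uFFFD', 'Parent page name')); // Parent page name
console.log(getParentUrl({ context: {} })); // undefined
```

A guard like the one in getParentUrl may be useful if the same extractor also processes PDFs that the crawler reaches without a parent page.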
