PDF content
We've provided some sample JavaScript document extractors to help you extract attribute values from PDF content. Each example has an extract of the HTML structure of the PDF and a corresponding document extractor.
Sample 1: Simple JavaScript extractor to get attribute values from PDF content
This example shows how to create a simple JavaScript extractor that uses conditional functions to get attribute values.
The PDF is available at https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. Here's what the HTML structure looks like, shortened for brevity:
Here's a sample JavaScript document extractor to extract the name, type, website, description, abstract, author, and last_modified attributes from this PDF:
This function uses the following logic to get attribute values:
- name - trim the text in the first <div> tag with the class of page. Then, use only the first 40 characters. If there is no text, set the value to No Name.
- type - use a fixed value, pdf.
- website - use a fixed value, Sitecore Documentation.
- description - use the text in the first <div> tag with the class of page, taking only the first 100 characters. If there is no text, set the value to No Description.
- author - use the content value of the first <meta> tag with the name of dc:creator. If there is no text, set the value to No Author.
- last_modified - use the text in the first <meta> tag with the name of dcterms:created. If there is no text, set the value to No Last Modified Date.
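The logic above can be sketched roughly as follows. This is a minimal illustration, not the original sample: it assumes the extractor is called as extract(request, response) and that response.body is a Cheerio-style selector function. The textOrDefault helper is hypothetical, and the sketch reads last_modified from the content attribute on the assumption that a <meta> tag carries its value there.

```javascript
// Hypothetical helper: trim the text, optionally cut it to maxLength,
// and return the fallback when nothing is left.
function textOrDefault(text, fallback, maxLength) {
  const trimmed = (text || '').trim();
  if (trimmed.length === 0) {
    return fallback;
  }
  return maxLength === undefined ? trimmed : trimmed.substring(0, maxLength);
}

// Assumption: response.body behaves like a Cheerio selector function.
function extract(request, response) {
  const $ = response.body;
  // Text of the first <div class="page">, shared by name and description.
  const pageText = $('div.page').first().text();
  const creator = $('meta[name="dc:creator"]').first().attr('content');
  // Assumption: the creation date is stored in the content attribute.
  const created = $('meta[name="dcterms:created"]').first().attr('content');

  return [{
    name: textOrDefault(pageText, 'No Name', 40),
    type: 'pdf',
    website: 'Sitecore Documentation',
    // This sketch trims the description as well, for simplicity.
    description: textOrDefault(pageText, 'No Description', 100),
    author: textOrDefault(creator, 'No Author'),
    last_modified: textOrDefault(created, 'No Last Modified Date'),
  }];
}
```

Keeping the fallback logic in one helper means each attribute states only its source, its default, and its length limit.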
Sample 2: Complex JavaScript extractor to get attribute values from PDF content
This example shows how to create a complex JavaScript extractor with many nested functions to get the attribute values you want. It also defines how to extract a parent_url attribute to track the parent page the PDF is on.
The PDF is available at https://www.sitecore.com/customers/associations/us-masters-swimming, by clicking Download case study. Here's what the HTML structure looks like, shortened for brevity:
Here's a sample JavaScript document extractor to extract the id, type, last_modified, name, description, and parent_url attributes from this PDF:
This function uses the following logic to get attribute values:
- id - replace special characters in the URL with an underscore (_).
- type - use a fixed value, pdf.
- parent_url - within the parent context (request.context.parent), access the request object. Then, access the url parameter.
- last_modified - within the parent context (request.context.parent), access the first item in the documents array (documents[0]). Then, access the last_modified attribute in its data object.
- name - use either the name HTML element or the parent document's name HTML element, as follows:
  - First, sanitize the text of the <name> HTML element.
  - Then, to check whether the sanitized name is too short, see if its length is less than or equal to 4 characters.
  - If the name is too short AND the parent document has a defined body ($p), use the name tag from the parent.
- description - sanitize the text of the <body> HTML element, limiting it to the first 7000 characters.
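The logic above can be sketched roughly as follows. This is an illustration under several assumptions, not the original sample: response.body is treated as a Cheerio-style selector function, the parent's parsed body ($p) is assumed to be at request.context.parent.response.body, "special characters" is taken to mean anything other than letters and digits, and the sanitize and toId helpers are hypothetical.

```javascript
// Hypothetical sanitizer: collapse runs of whitespace and trim the ends.
function sanitize(text) {
  return (text || '').replace(/\s+/g, ' ').trim();
}

// Assumption: every character that is not a letter or digit is "special".
function toId(url) {
  return url.replace(/[^a-zA-Z0-9]/g, '_');
}

function extract(request, response) {
  const $ = response.body;
  const parent = request.context.parent;
  // Assumption: the parent's parsed body is exposed at parent.response.body.
  const $p = parent.response && parent.response.body;

  // Start with the PDF's own name and fall back to the parent's name
  // when the sanitized value is 4 characters or shorter.
  let name = sanitize($('name').text());
  if (name.length <= 4 && $p) {
    // This sketch sanitizes the parent's name too.
    name = sanitize($p('name').text());
  }

  return [{
    id: toId(request.url),
    type: 'pdf',
    last_modified: parent.documents[0].data.last_modified,
    name: name,
    description: sanitize($('body').text()).substring(0, 7000),
    parent_url: parent.request.url,
  }];
}
```

With these assumptions, a URL such as https://x.com/a.pdf produces the id https___x_com_a_pdf.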