Configuring document extractors

PDF content

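We provide some sample JavaScript document extractors to help you extract attribute values from PDF content. Each example includes an excerpt of the PDF's HTML structure and the corresponding document extractor.

Sample 1: A simple JavaScript extractor to get attribute values from PDF content

This example shows how to create a simple JavaScript extractor that uses conditional fallbacks (the || operator) to get attribute values.

The PDF is available at https://archive.doc.sitecore.com/xp/en/legacy-docs/web-forms-for-marketers-8.0.pdf. Its HTML structure, shortened for brevity, looks like this: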

<head>
  <meta name="pdf:PDFVersion" content="1.6">
  <meta name="xmp:CreatorTool" content="Adobe Acrobat Standard DC 18.11.20058">
  <meta name="dc:description" content="All the official Sitecore documentation.">
  <meta name="dc:creator" content="John Doe">
  <meta name="dcterms:created" content="2018-09-13T09:53:31Z">
  <meta name="dcterms:modified" content="2018-09-13T09:53:31Z">
  <meta name="dc:format" content="application/pdf; version=1.6">
  ...
  <title>�</title>
</head>

<body>
  <div class="page">
    <p> </p>
    <p> </p>
    <p>Web Forms for Marketers 8.0 Rev: September 13, 2018 </p>
    <p> </p>
    <p> </p>
    <p>Web Forms for Marketers 8.0 </p>
    <p>All the official Sitecore documentation. </p>
  </div>
  <div class="page">
    <p> </p>
    <p>Add an ASCX control to the page In the Web Forms for Marketers module, you can convert and export a form to an
      .ascx file and then add it to your website as an ASCX control. For developers, this can make it easier to develop
      their custom form control. </p>
    <p>To add an ASCX control to the page: </p>
    <p>1. Using a text editor, in the \layouts folder of your Sitecore installation, create a new default.aspx page and
      insert the following code: </p>
    <p>&lt;%@ Page Language="C#" AutoEventWireup="true" %&gt; </p>
    <p>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" </p>
    <p>"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"&gt; </p>
    <p>&lt;html xmlns="http://www.w3.org/1999/xhtml" &gt; </p>
    <p>&lt;head runat="server"&gt; </p>
    <p> &lt;title&gt;Untitled Page&lt;/title&gt; </p>
    <p>&lt;/head&gt; </p>
    <p>&lt;body&gt; </p>
    <p> &lt;form id="form1" runat="server"&gt; </p>
    <p> &lt;/form&gt; </p>
    <p>&lt;/body&gt; </p>
    <p>&lt;/html&gt; </p>
    ....
    <p> </p>
  </div>
  </body>

Here is a sample JavaScript document extractor that extracts the name, type, website, description, author, and last_modified attributes from this PDF:

function extract(request, response) {
    const $ = response.body;

    return [{
        // First 40 characters of the first page's text, or a fallback value.
        'name': $('div.page:eq(0)').text().trim().substring(0, 40) || 'No Name',
        'type': 'pdf',
        'website': 'Sitecore Documentation',
        // First 100 characters of the first page's text, or a fallback value.
        'description': $('div.page:eq(0)').text().trim().substring(0, 100) || 'No Description',
        // Metadata values come from the <meta> tags in <head>.
        'author': $('meta[name="dc:creator"]').attr('content') || 'No Author',
        'last_modified': $('meta[name="dcterms:created"]').attr('content') || 'No Last Modified Date'
    }];
}

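This function uses the following logic to get attribute values:

name - trims the text of the first <div> tag with the page class, then uses only the first 40 characters. If there is no text, sets the value to No Name.
type - uses the fixed value pdf.
website - uses the fixed value Sitecore Documentation.
description - trims the text of the first <div> tag with the page class, then uses only the first 100 characters. If there is no text, sets the value to No Description.
author - uses the content value of the first meta tag with the name dc:creator. If there is no value, sets the value to No Author.
last_modified - uses the content value of the first meta tag with the name dcterms:created. If there is no value, sets the value to No Last Modified Date.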
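
The trim, truncate, and fallback steps in the list above can be tried outside the crawler. The following is a minimal sketch, not part of the sample extractor: firstChars is a hypothetical helper, and the input string is made up to resemble the first page of the PDF.

```javascript
// Hypothetical helper (not part of the sample extractor): mirrors the
// `text.trim().substring(0, n) || fallback` pattern used for name and description.
function firstChars(text, limit, fallback) {
    // Treat null/undefined as empty so trim() cannot throw.
    const trimmed = String(text == null ? '' : text).trim();
    // An empty string is falsy, so `||` substitutes the fallback value.
    return trimmed.substring(0, limit) || fallback;
}

// Illustrative input resembling the first page of the PDF above.
const pageText = '   Web Forms for Marketers 8.0 Rev: September 13, 2018   ';

console.log(firstChars(pageText, 40, 'No Name'));
console.log(firstChars('   ', 40, 'No Name')); // whitespace only, so the fallback is used
```

Because the fallback relies on `||`, any falsy result (here, an empty string) is replaced, which is why text is trimmed before it is truncated.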

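Sample 2: A complex JavaScript extractor to get attribute values from PDF content

This example shows how to create a more complex JavaScript extractor that uses nested helper functions to get the required attribute values. It also defines how to extract a parent_url attribute to track the parent page where the PDF is located.

To get the PDF, go to https://www.sitecore.com/customers/associations/us-masters-swimming and click Download case study. Its HTML structure, shortened for brevity, looks like this: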

<head>
    <meta name="pdf:PDFVersion" content="1.4">
    <meta name="xmp:CreatorTool" content="Adobe InDesign 15.0 (Macintosh)">
    <meta name="dcterms:created" content="2020-01-28T22:16:01Z">
    <meta name="dcterms:modified" content="2020-01-28T22:16:02Z">
    ...
    <title>�</title>
</head>

<body>
    <div class="page">
        <p> </p>
        <p> </p>
        <p> </p>
        <p>Industry: Associations • Founded: 1970 • Employees: 17 </p>
        <p>Headquarters: Boca Raton, Florida, USA • usms.org </p>
        <p>....
        <p> </p>
        <div class="annotation"> <a href="https://usms.org">https://usms.org</a> </div>
    </div>
    <div class="page">
        <p> </p>
        <p> </p>
        <p>Sitecore is the global leader in experience management software that combines content management, commerce,
            and customer insights. The Sitecore® Experience Cloud™ empowers marketers </p>
       ...
        <div class="annotation"> <a href="https://sitecore.com">https://sitecore.com</a> </div>
        <div class="annotation"> <a href="https://buildabonfire.com ">https://buildabonfire.com </a> </div>
    </div>
</body>

Here is a sample JavaScript document extractor that extracts the id, type, last_modified, name, description, and parent_url attributes from this PDF:

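function extract(request, response) {
    // Named HTML entities that decodeEntities() translates back to characters.
    // Mapping nbsp to a plain space is a choice made here.
    const translate_re = /&(nbsp|amp|quot|lt|gt);/g;
    const translate = {
        'nbsp': ' ',
        'amp': '&',
        'quot': '"',
        'lt': '<',
        'gt': '>'
    };

    // Decode the named entities above plus decimal numeric entities such as &#169;.
    function decodeEntities(encodedString) {
        return encodedString.replace(translate_re, function(match, entity) {
            return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
            const num = parseInt(numStr, 10);
            return String.fromCharCode(num);
        });
    }

    // Trim and decode non-empty text; pass empty values through unchanged.
    function sanitize(text) {
        return text ? decodeEntities(String(text).trim()) : text;
    }

    const $ = response.body;
    const url = request.url;
    const id = url.replace(/[.:/&?=%]/g, '_');
    let name = sanitize($('title').text());
    const description = $('body').text().substring(0, 7000);

    // Fall back to the parent page's title when the PDF's own title is too short.
    const $p = request.context.parent.response.body;
    if (name.length <= 4 && $p) {
        name = $p('title').text();
    }

    const parentUrl = request.context.parent.request.url;
    const last_modified = request.context.parent.documents[0].data.last_modified;

    return [{
        'id': id,
        'type': 'pdf',
        'parent_url': parentUrl,
        'last_modified': last_modified,
        'name': name,
        'description': description
    }];
}

This function uses the following logic to get attribute values:

id - replaces the special characters in the URL (. : / & ? = %) with underscores (_).
type - uses the fixed value pdf.
parent_url - within the parent context (request.context.parent), accesses the request object and then its url property.
last_modified - within the parent context (request.context.parent), accesses the first entry of the documents array (documents[0]), then the last_modified attribute of its data object.
name - uses either the PDF's <title> HTML element or the parent document's <title> HTML element, as follows:
- First, sanitizes the text of the <title> HTML element.
- Then, checks whether the sanitized name is too short, that is, whether it is 4 characters or fewer.
- If the name is too short and the parent document has a defined body ($p), uses the parent's <title> element.
description - uses the text of the <body> HTML element, limited to the first 7000 characters.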
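
The id, sanitizing, and name-fallback steps above are plain string operations, so they can be checked in isolation. The sketch below is an illustration under assumptions: makeId, decodeEntities, and pickName are hypothetical standalone helpers rather than crawler APIs, the nbsp-to-space mapping is a choice made here, and the example URL and titles are invented.

```javascript
// Hypothetical standalone helpers; none of these names come from the crawler.

// Build an id by replacing URL punctuation with underscores.
function makeId(url) {
    return url.replace(/[.:/&?=%]/g, '_');
}

// Assumed entity table; nbsp is mapped to a plain space for readability.
const ENTITIES = { nbsp: ' ', amp: '&', quot: '"', lt: '<', gt: '>' };

// Decode the named entities above and decimal numeric entities such as &#169;.
function decodeEntities(text) {
    return text
        .replace(/&(nbsp|amp|quot|lt|gt);/g, (match, entity) => ENTITIES[entity])
        .replace(/&#(\d+);/g, (match, digits) => String.fromCharCode(parseInt(digits, 10)));
}

// Prefer the document's own title unless it is 4 characters or fewer
// and a parent title is available. (Simplification: this checks the parent
// title itself rather than the presence of a parent body.)
function pickName(ownTitle, parentTitle) {
    const own = decodeEntities(String(ownTitle || '').trim());
    return own.length <= 4 && parentTitle ? parentTitle : own;
}

console.log(makeId('https://example.com/files/a.pdf?x=1'));
console.log(decodeEntities('Tom &amp; Jerry &lt;3 &#169;'));
console.log(pickName('\uFFFD', 'US Masters Swimming'));
```

The last call passes a one-character title like the one in the HTML excerpt, so the parent title is used instead.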
