4. Hippo Repository Configure Extractors
Added by Arjé Cahn, last edited by Jasha Joachimsthal on Aug 12, 2008

Using extractors

It is possible to extract content from different kinds of documents and attach it as properties to a document each time the document is written to Hippo Repository. This is useful when running DASL queries against the repository.

Defining extractors

Extractors are defined in the extractors.xml file within the Hippo Repository configuration.

In the example below you can see the definition of an extractor.

<extractor classname="o.a.SomeClass" uri="/some/path" content-type="text/xml | someother/type"/>
  • classname: this is the classname of the extractor (for example org.apache.slide.extractor.MSWordExtractor)
  • uri: this is the path in the repository (for example /files/default.preview/binaries)
  • content-type : this is the content type of the resource that the extractor should read (for example text/xml)

Multiple content types are separated by "|". When the content type of a resource is matched, everything after a ";" is discarded.
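The matching rule described above can be sketched in a few lines of Java. This is only an illustration, not the Slide implementation: the class name ContentTypeMatchSketch and the method matches are hypothetical, and trimming whitespace around each alternative and comparing case-sensitively are assumptions.

```java
class ContentTypeMatchSketch {
    /** Returns true if the resource's content type equals one of the configured alternatives. */
    static boolean matches(String configured, String actual) {
        // Discard everything after ';' (for example "; charset=UTF-8").
        int semicolon = actual.indexOf(';');
        String bare = (semicolon >= 0 ? actual.substring(0, semicolon) : actual).trim();
        // Alternatives in the content-type attribute are separated by '|'.
        for (String alternative : configured.split("\\|")) {
            // Assumption: surrounding whitespace is ignored, comparison is exact.
            if (alternative.trim().equals(bare)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("text/xml | someother/type", "text/xml; charset=UTF-8")); // true
        System.out.println(matches("application/msword", "application/pdf"));                // false
    }
}
```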

Example configuration

<extractors>
    <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/articles/">
        <configuration>
            <instruction property="title" xpath="/article/title/text()" />
            <instruction property="summary" xpath="/article/summary/text()" />
        </configuration>
    </extractor>
    <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files/docs/">
        <configuration>
            <instruction property="author" id="SummaryInformation-0-4" />
            <instruction property="application" id="SummaryInformation-0-18" />
        </configuration>
    </extractor>
</extractors>

Preview and live

A typical Hippo CMS / Hippo Repository setup defines a preview folder (e.g. /default/files/default.preview) and a live folder (e.g. /default/files/default.www). Most extractors only need to be defined for the preview context: property extractors extract parts of the document into document properties, and those properties are copied along with the document to the live repository on publication. Content extractors, however, extract parts of the document into a separate Lucene index. Content extractors must therefore always be configured explicitly for the live repository if you want to use <d:contains> in DASL queries against it.

Common extractors

SimpleXmlExtractor

<extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/default.www/content/bulk" content-type="text/xml">
  <configuration>
    <instruction property="caption" namespace="http://hippo.nl/cms/1.0" xpath="/document/meta/title/text()"/>
  </configuration>
</extractor>

We recommend using nl.hippo.slide.extractor.HippoSimpleXmlExtractor (see below) instead of this extractor. The default Slide extractor contains bugs and has erroneous XPath behavior.

If you want to use the SimpleXmlExtractor to extract the text of a comma-separated value, you need to use the string() function in your XPath. Given XML like:
<keywords keywords="key1,key2,etc"/>

this XPath will not work:

/document/meta/keywords/@keywords

but this one does:

string(/document/meta/keywords/@keywords)
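The difference between the two expressions is that the first selects an attribute node, while string() converts it to a plain string. The sketch below shows this with the JDK's javax.xml.xpath API. It does not use Slide's own XPath engine, so it cannot reproduce the SimpleXmlExtractor failure itself; the class name XPathStringSketch and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathStringSketch {
    // Hypothetical sample document built around the <keywords> element shown above.
    static final String XML =
        "<document><meta><keywords keywords=\"key1,key2,etc\"/></meta></document>";

    /** Evaluates the expression and returns its result as a string. */
    static String asString(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(expr, new InputSource(new StringReader(XML)),
                                           XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates the expression as a node and returns its DOM node type. */
    static short nodeType(String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(expr, new InputSource(new StringReader(XML)),
                                              XPathConstants.NODE);
            return node.getNodeType();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The plain path selects an attribute node, not text.
        System.out.println(nodeType("/document/meta/keywords/@keywords") == Node.ATTRIBUTE_NODE);
        // string() yields the attribute value as a string.
        System.out.println(asString("string(/document/meta/keywords/@keywords)"));
    }
}
```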

Available extractors

In combination with Hippo CMS the most commonly used extractors from Slide are:

  • org.apache.slide.extractor.SimpleXmlExtractor
  • org.apache.slide.extractor.MSWordExtractor
  • org.apache.slide.extractor.MSPowerPointExtractor
  • org.apache.slide.extractor.MSExcelExtractor
  • org.apache.slide.extractor.PDFExtractor

Hippo Repository also has some extra custom extractors available that add significant value when developing websites with Hippo Repository:

  • nl.hippo.slide.extractor.XMLDatePropertyExtractor
  • nl.hippo.slide.extractor.XMLContentExtractor
  • nl.hippo.slide.extractor.ImagePropertyExtractor
  • nl.hippo.slide.extractor.OfficeExtractor
  • nl.hippo.slide.extractor.ConstantExtractor
  • nl.hippo.slide.extractor.HippoSimpleXmlExtractor
  • nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor
  • nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor
  • nl.hippo.slide.extractor.HippoLastmodifiedExtractor
  • nl.hippo.slide.extractor.ConfigurableXMLContentExtractor

Most of the extractors listed above are discussed in the following sections.

Default Slide extractors

org.apache.slide.extractor.SimpleXmlExtractor

The SimpleXmlExtractor can be used to extract text from an XML element or attribute. Text is extracted only from the first matching element or attribute (see nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor for extracting from multiple elements or attributes).

Example configuration

<extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/default.www/content" content-type="text/xml">
    <configuration>
      <instruction property="title" namespace="http://hippo.nl/cms/1.0" xpath="/document/title/text()"/>
    </configuration>
</extractor>

org.apache.slide.extractor.MSWordExtractor

This extractor enables indexing content from .doc files.

Example configuration

<extractor classname="org.apache.slide.extractor.MSWordExtractor" uri="/files/default.preview/binaries"
                 content-type="application/msword"/>

org.apache.slide.extractor.MSPowerPointExtractor

This extractor enables indexing content from .ppt files.

Example configuration

<extractor classname="org.apache.slide.extractor.MSPowerPointExtractor" uri="/files/default.preview/binaries"
                 content-type="application/vnd.ms-powerpoint | application/ms-powerpoint"/>

org.apache.slide.extractor.MSExcelExtractor

This extractor enables indexing content from .xls files.

Example configuration

<extractor classname="org.apache.slide.extractor.MSExcelExtractor" uri="/files/default.preview/binaries"
                 content-type="application/vnd.ms-excel"/>

Hippo Repository custom extractors

nl.hippo.slide.extractor.XMLDatePropertyExtractor

The XMLDatePropertyExtractor can be used to extract a date from an XML document. This extractor is also able to reformat dates.

Example configuration

<extractor classname="nl.hippo.slide.extractor.XMLDatePropertyExtractor" uri="/files/default.preview/content"
                 content-type="text/xml">
  <configuration>

    <instruction property="documentdate" namespace="http://hippo.nl/cms/1.0"
              xpath="concat(/document/content/eventDate/@day,'-',/document/content/eventDate/@month,'-',/document/content/eventDate/@year)"
              inputFormat="dd-MM-yyyy" outputFormat="yyyyMMdd"/>
  </configuration>
</extractor>
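The inputFormat and outputFormat values look like java.text.SimpleDateFormat patterns. Assuming that is how the extractor interprets them (the source does not say so explicitly), the conversion in the example above can be sketched as follows. The class name DateFormatSketch and the fixed UTC time zone are choices made for this illustration only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    /** Parses value with inputFormat and re-formats it with outputFormat. */
    static String convert(String value, String inputFormat, String outputFormat) {
        // Assumption: a fixed time zone keeps the result reproducible.
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat in = new SimpleDateFormat(inputFormat, Locale.US);
        SimpleDateFormat out = new SimpleDateFormat(outputFormat, Locale.US);
        in.setTimeZone(utc);
        out.setTimeZone(utc);
        try {
            Date parsed = in.parse(value);
            return out.format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse '" + value + "' as " + inputFormat, e);
        }
    }

    public static void main(String[] args) {
        // Day, month and year concatenated with '-' as in the xpath above.
        System.out.println(convert("24-12-2008", "dd-MM-yyyy", "yyyyMMdd")); // 20081224
        // Pitfall: lowercase "mm" means minutes, not month.
        System.out.println(convert("24-12-2008", "dd-MM-yyyy", "yyyymmdd")); // 20080024
    }
}
```

Pattern letters are case-sensitive: MM is the month, mm is the minute. A pattern such as "yyyymmdd" therefore does not produce a sortable date.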

nl.hippo.slide.extractor.XMLContentExtractor

The XMLContentExtractor extracts only the character data from an XML stream. It is used for indexing XML content.

Example configuration

<extractor classname="nl.hippo.slide.extractor.XMLContentExtractor" uri="/files/default.www/content" content-type="text/xml"/>

nl.hippo.slide.extractor.ImagePropertyExtractor

The ImagePropertyExtractor will extract some information about an image that is located inside the assets folder of the CMS.

Properties that are extracted are:

  • width
  • height
  • bits-per-pixel
  • dpi

Example configuration

<extractor classname="nl.hippo.slide.extractor.ImagePropertyExtractor" uri="/files/default.preview/binaries"/>

nl.hippo.slide.extractor.OfficeExtractor

Apache Slide also has an org.apache.slide.extractor.OfficeExtractor, but it cannot set the namespace for a property and cannot transform date formats. That is why Hippo Repository has its own OfficeExtractor. This extractor can extract all available values from Office documents, such as author, creation date, last save date and application type. For more information about the internals, refer to http://jakarta.apache.org/poi/hpsf/internals.html

Example configuration

<extractor classname="nl.hippo.slide.extractor.OfficeExtractor" uri="/files/default.preview/binaries" content-type="application/msword" >
  <configuration>
    <instruction property="author" namespace="http://hippo.nl/cms/1.0" summary-information="4" />
    <instruction property="application" namespace="http://hippo.nl/cms/1.0" summary-information="18"/>
    <instruction property="lastsavedate" namespace="http://hippo.nl/cms/1.0"  date-format="yyyyMMdd" summary-information="13"/>
    <instruction property="newsdate" namespace="http://hippo.nl/cms/1.0" date-format="yyyyMMdd" summary-information="12" />
    <instruction property="caption" namespace="http://hippo.nl/cms/1.0" summary-information="2" />
  </configuration>
</extractor>

In the example above, summary-information="18" gives the application type, summary-information="13" gives the last save date, and so on.

nl.hippo.slide.extractor.ConstantExtractor

The nl.hippo.slide.extractor.ConstantExtractor is not really an extractor, because it does not extract anything. Instead, it sets a static property on a document. For example, when files are put from the file system into Hippo Repository through a webfolder (only for binaries), they do not get a CMS type. The ConstantExtractor solves this.

Example configuration

<extractor classname="nl.hippo.slide.extractor.ConstantExtractor" uri="/files/default.preview/binaries" content-type="application/msword" >
  <configuration>
    <instruction property="type" namespace="http://hippo.nl/cms/1.0" value="asset" />
  </configuration>
</extractor>

<extractor classname="nl.hippo.slide.extractor.ConstantExtractor" uri="/files/default.preview/binaries" content-type="application/pdf" >
  <configuration>
    <instruction property="type" namespace="http://hippo.nl/cms/1.0" value="asset" />
  </configuration>
</extractor>

The above example ensures that files put into the repository at the location "/files/default.preview/binaries" will get the hippo property "type='asset'". This will be needed by the CMS.

nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor

The nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor is the same as the default Slide org.apache.slide.extractor.SimpleXmlExtractor, except that it allows you to extract the text values of multiple elements or attributes. The values are stored comma-separated. This extractor is well suited for storing dependencies between documents (such as internal links).

Example configuration

<extractor classname="nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor" uri="/files/default.preview/content" content-type="text/xml">
    <configuration>
      <instruction property="dependencies" namespace="http://hippo.nl/cms/1.0" xpath="//a/@href"/>
      <instruction property="keywords" namespace="http://hippo.nl/cms/1.0" xpath="//keywords/keywords/text()"/>
   </configuration>
</extractor>
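To illustrate what "multiple values stored comma-separated" means for an expression such as //a/@href, here is a minimal sketch using the JDK's javax.xml.xpath API. It is not the extractor's code: the class name MultiValueSketch, the sample document and the exact separator (a comma without a space) are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MultiValueSketch {
    // Hypothetical document with two internal links.
    static final String XML =
        "<document><p><a href=\"/content/a.xml\">A</a> and "
        + "<a href=\"/content/b.xml\">B</a></p></document>";

    /** Collects the values of all nodes matched by expr and joins them with commas. */
    static String joinValues(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) {
                    joined.append(',');
                }
                // getNodeValue() returns the value of attribute and text nodes.
                joined.append(nodes.item(i).getNodeValue());
            }
            return joined.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(joinValues(XML, "//a/@href")); // /content/a.xml,/content/b.xml
    }
}
```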

nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor

The nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor is the same as nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor, except that it does not try to validate against references to DTDs inside a document. The values are stored comma-separated. This extractor is well suited for storing dependencies between documents (such as internal links).
This extractor was introduced in version 1.2.14 of Hippo Repository.

Example configuration

<extractor classname="nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor" uri="/files/default.preview/content" content-type="text/xml">
    <configuration>
      <instruction property="dependencies" namespace="http://hippo.nl/cms/1.0" xpath="string(//a/@href)"/>
      <instruction property="keywords" namespace="http://hippo.nl/cms/1.0" xpath="//keywords/keywords/text()"/>
   </configuration>
</extractor>

nl.hippo.slide.extractor.HippoSimpleXmlExtractor

This extractor is available from Hippo Repository version 1.2.10 and up.

This is similar to org.apache.slide.extractor.SimpleXmlExtractor but offers more powerful XPath functionality.
With nl.hippo.slide.extractor.HippoSimpleXmlExtractor the repository won't return an error if you use substring in your XPath with a length greater than the original string length. This can be useful to limit the size of the extracted value.

Example configuration

<extractor classname="nl.hippo.slide.extractor.HippoSimpleXmlExtractor" uri="/files/default.preview"
 content-type="text/xml | application/xml | text/xml; charset=UTF-8">
    <configuration>
      <instruction property="title" namespace="http://hippo.nl/cms/1.0" xpath="string(/document/meta/title)"/>
      <instruction property="lead" namespace="http://hippo.nl/cms/1.0"
                   xpath="substring(normalize-space(string(/document/content/lead/p[1])),1,250)"/>
      <instruction property="category" namespace="http://hippo.nl/cms/1.0" xpath="/document/meta/@category"/>
    </configuration>
  </extractor>
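The "lead" instruction above relies on standard XPath 1.0 behaviour: substring() with a length longer than the string simply returns the whole string. The sketch below evaluates the same expression with the JDK's javax.xml.xpath API rather than with the extractor; the class name LeadXPathSketch and the sample document are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class LeadXPathSketch {
    // Hypothetical document whose first lead paragraph has untidy whitespace.
    static final String XML =
        "<document><content><lead><p>  A short   lead.  </p></lead></content></document>";

    /** Returns the normalized text of the first lead paragraph, cut to maxLength characters. */
    static String lead(String xml, int maxLength) {
        String expr = "substring(normalize-space(string(/document/content/lead/p[1])),1,"
            + maxLength + ")";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lead(XML, 250)); // shorter than 250 characters: returned in full
        System.out.println(lead(XML, 7));   // cut to the first 7 characters
    }
}
```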

We recommend always using this extractor instead of org.apache.slide.extractor.SimpleXmlExtractor.

nl.hippo.slide.extractor.HippoLastmodifiedExtractor

This extractor is available from Hippo Repository version 1.2.11 and up.

This extractor can put the current date as a formatted property on a document. By default it uses the HTTP 1.1 spec's last modified date format "EEE, d MMM yyyy kk:mm:ss z". This can be overridden through the outputFormat attribute. Currently the locale used is Locale.US; if needed, this can be made configurable as well.

Example configuration

<extractor classname="nl.hippo.slide.extractor.HippoLastmodifiedExtractor" uri="/files/default.preview"
 content-type="text/xml | application/xml | text/xml; charset=UTF-8">
    <configuration>
      <instruction property="hippoLastModified" namespace="http://hippo.nl/cms/1.0" outputFormat="yyyyMMdd"/>
    </configuration>
  </extractor>
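The default pattern and the Locale.US setting mentioned above suggest java.text.SimpleDateFormat. Under that assumption, the formatting step can be sketched as follows. The class name LastModifiedSketch is hypothetical, and the time zone is fixed to UTC here only to make the output reproducible; the source does not state which time zone the extractor uses.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LastModifiedSketch {
    // Default pattern quoted in the text above.
    static final String DEFAULT_FORMAT = "EEE, d MMM yyyy kk:mm:ss z";

    /** Formats the date with outputFormat, or with the default pattern if outputFormat is null. */
    static String format(Date date, String outputFormat) {
        String pattern = (outputFormat == null) ? DEFAULT_FORMAT : outputFormat;
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        // Assumption for this sketch only: format in UTC.
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }

    public static void main(String[] args) {
        // 1218542400000 ms after the epoch is 12 Aug 2008, 12:00:00 UTC.
        Date sample = new Date(1218542400000L);
        System.out.println(format(sample, null));       // default pattern
        System.out.println(format(sample, "yyyyMMdd")); // 20080812
    }
}
```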

nl.hippo.slide.extractor.UrlListXMLPropertyExtractor

This extractor can be used to extract multiple URLs as whitespace-separated values into a property. URLs can contain commas, so storing them comma-separated with the MultiValueXMLPropertyExtractor is not an option, but URLs cannot contain whitespace.

Example configuration

<extractor classname="nl.hippo.slide.extractor.UrlListXMLPropertyExtractor" uri="/files/default.preview" content-type="text/xml">
  <configuration>
    <instruction property="links" namespace="http://hippo.nl/cms/1.0" xpath="//@href|//@src"/>
  </configuration>
</extractor>

In dasl-indexer.xml, configure the property as type="text" and use org.apache.lucene.analysis.WhitespaceAnalyzer.
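The reason for choosing whitespace over commas can be shown with a small round trip. This sketch is independent of the extractor and of Lucene; the class name UrlListSketch and the example URLs are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class UrlListSketch {
    /** Joins the URLs into a single property value using the given separator. */
    static String join(List<String> urls, String separator) {
        return String.join(separator, urls);
    }

    /** Splits a property value on runs of whitespace. */
    static List<String> splitOnWhitespace(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    /** Splits a property value on commas. */
    static List<String> splitOnComma(String value) {
        return Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        // Hypothetical URLs; the first one contains a comma.
        List<String> urls = Arrays.asList("http://example.com/a,b.html", "http://example.com/c.html");
        // Comma-separated storage breaks the first URL into two pieces.
        System.out.println(splitOnComma(join(urls, ",")));
        // Whitespace-separated storage returns the original two URLs.
        System.out.println(splitOnWhitespace(join(urls, " ")));
    }
}
```

An analyzer that splits only on whitespace therefore keeps each URL intact as a single term.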

nl.hippo.slide.extractor.HippoXmlPropertyExtractor

This extractor is available from Hippo Repository version 1.2.6 and up.

This is a general purpose extractor. Most importantly it can extract complete xml structures from the documents.

<!-- this extractor is a general purpose XML property extractor! same as SimpleXmlExtractor, plus extensions -->
  <extractor classname="nl.hippo.slide.extractor.HippoXmlPropertyExtractor" uri="/files/default.preview" content-type="text/xml">
    <configuration>
      <!-- same old thing, like SimpleXmlExtractor -->
      <instruction property="title" namespace="http://hippo.nl/cms/1.0" xpath="/document/meta/title/text()"/>
      <instruction property="author" namespace="http://hippo.nl/cms/1.0" xpath="string(/document/meta/author)"/>
      <!-- new behaviour: elements will be outputted as full xml! -->
      <instruction property="test" namespace="http://hippo.nl/cms/1.0" xpath="//p[1]"/>
      <!--
        new behaviour: the "multiple" type, this lets you evaluate relative xpaths against the result of a parent xpath
        IT'S IMPORTANT TO PUT THE . IN FRONT OF THE SUB-XPATHS! Otherwise, you will evaluate them from the document root!
      -->
      <instruction type="multiple" property="test2" namespace="http://hippo.nl/cms/1.0" xpath="//p">
        <sub name="bold" xpath=".//b/text()"/>
        <sub name="italic" xpath=".//i/text()"/>
      </instruction>
      <!-- new behaviour: also encompasses XMLDatePropertyExtractor! -->
      <instruction type="date" property="date" namespace="http://hippo.nl/cms/1.0"
                   xpath="/document/meta/date/text()"
                   inputFormat="yyyy-MM-dd" outputFormat="yyyyMMdd"/>
    </configuration>
  </extractor>
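The warning about the leading "." in sub-xpaths follows from standard XPath rules: an expression starting with "//" is evaluated from the document root, even when a context node is supplied. The sketch below shows this with the JDK's javax.xml.xpath API, not with the extractor itself; the class name RelativeXPathSketch and the sample document are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathSketch {
    // Hypothetical document with two paragraphs, each containing bold text.
    static final String XML = "<document><p><b>one</b></p><p><b>two</b></p></document>";

    /** Evaluates subExpr against the index-th node (0-based) matched by parentExpr. */
    static String evaluateSub(String parentExpr, int index, String subExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList parents = (NodeList) xpath.evaluate(parentExpr, doc, XPathConstants.NODESET);
            // The matched parent node is the context node for the sub-expression.
            return (String) xpath.evaluate(subExpr, parents.item(index), XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Relative: looks inside the second <p> only.
        System.out.println(evaluateSub("//p", 1, ".//b/text()")); // two
        // Absolute: starts at the document root and finds the first <b> in the document.
        System.out.println(evaluateSub("//p", 1, "//b/text()"));  // one
    }
}
```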

