SUNScholar/Harvesting/5.X

From Libopedia
Jump to navigation Jump to search
Back to Harvesting

Step 1 - Config

Edit the following file;

nano $HOME/source/dspace/config/modules/oai.cfg

  1. Select whether storage will be the SOLR database or the PostgreSQL database.
  2. Define OAI URL's.
  3. Define the OAI folder paths.
  4. Define harvester settings.

See sample below.

Example

#---------------------------------------------------------------#
#--------------------XOAI CONFIGURATIONS------------------------#
#---------------------------------------------------------------#
# These configs are used by the XOAI                            #
#---------------------------------------------------------------#

# Storage: solr | database 
storage=solr

# Base solr index
solr.url=http://localhost/solr/oai
# OAI persistent identifier prefix.
# Format - oai:PREFIX:HANDLE
identifier.prefix = scholar.sun.ac.za
# Base url for bitstreams
bitstream.baseUrl = http://scholar.sun.ac.za

# Base Configuration Directory
config.dir = /home/dspace/config/crosswalks/oai

# Description
description.file = /home/dspace/config/crosswalks/oai/description.xml

# Cache enabled?
cache.enabled = true

# Base Cache Directory
cache.dir = /home/dspace/var/oai

#---------------------------------------------------------------#
#--------------OAI HARVESTING CONFIGURATIONS--------------------#
#---------------------------------------------------------------#
# These configs are only used by the OAI-ORE related functions  #
#---------------------------------------------------------------#

### Harvester settings

# Crosswalk settings; the {name} value must correspond to a declated ingestion crosswalk
# harvester.oai.metadataformats.{name} = {namespace},{optional display name}
# The display name is only used in the xmlui for the jspui there are entries in the
# Messages.properties in the form jsp.tools.edit-collection.form.label21.select.{name}
harvester.oai.metadataformats.dc = http://www.openarchives.org/OAI/2.0/oai_dc/, Simple Dublin Core
harvester.oai.metadataformats.qdc = http://purl.org/dc/terms/, Qualified Dublin Core
harvester.oai.metadataformats.dim = http://www.dspace.org/xmlns/dspace/dim, DSpace Intermediate Metadata

# This field works in much the same way as harvester.oai.metadataformats.PluginName
# The {name} must correspond to a declared ingestion crosswalk, while the
# {namespace} must be supported by the target OAI-PMH provider when harvesting content.
# harvester.oai.oreSerializationFormat.{name} = {namespace}

# Determines whether the harvester scheduling process should be started
# automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=false

# Amount of time subtracted from the from argument of the PMH request to account
# for the time taken to negotiate a connection. Measured in seconds. Default value is 120.
#harvester.timePadding = 120

# How frequently the harvest scheduler checks the remote provider for updates,
# messured in minutes. The default vaule is 12 hours (or 720 minutes)
#harvester.harvestFrequency = 720

# The heartbeat is the frequency at which the harvest scheduler queries the local
# database to determine if any collections are due for a harvest cycle (based on
# the harvestFrequency) value. The scheduler is optimized to then sleep until the
# next collection is actually ready to be harvested. The minHeartbeat and
# maxHeartbeat are the lower and upper bounds on this timeframe. Measured in seconds.
# Default minHeartbeat is 30.  Default maxHeartbeat is 3600.
#harvester.minHeartbeat = 30
#harvester.maxHeartbeat = 3600

# How many harvest process threads the scheduler can spool up at once. Default value is 3.
#harvester.maxThreads = 3

# How much time passess before a harvest thread is terminated. The termination process
# waits for the current item to complete ingest and saves progress made up to that point.
# Measured in hours. Default value is 24.
#harvester.threadTimeout = 24

# When harvesting an item that contains an unknown schema or field within a schema what
# should the harvester do? Either add a new registry item for the field or schema, ignore
# the specific field or schema (importing everything else about the item), or fail with
# an error. The default value if undefined is: fail.
# Possible values: 'fail', 'add', or 'ignore'
harvester.unknownField  = add
harvester.unknownSchema = fail

# The webapp responsible for minting the URIs for ORE Resource Maps.
# If using oai, the dspace.oai.uri config value must be set.
# The URIs generated for ORE ReMs follow the following convention for both cases.
# format: [baseURI]/metadata/handle/[theHandle]/ore.xml
# Default value is oai
#ore.authoritative.source = oai

# A harvest process will attempt to scan the metadata of the incoming items
# (dc.identifier.uri field, to be exact) to see if it looks like a handle.
# If so, it matches the pattern against the values of this parameter.
# If there is a match the new item is assigned the handle from the metadata value
# instead of minting a new one. Default value: hdl.handle.net
#harvester.acceptedHandleServer = hdl.handle.net, handle.myu.edu

# Pattern to reject as an invalid handle prefix (known test string, for example)
# when attempting to find the handle of harvested items. If there is a match with
# this config parameter, a new handle will be minted instead. Default value: 123456789.
#harvester.rejectedHandlePrefix = 123456789, myTestHandle

Step 2 - Initialise OAI database

Execute one of the following tasks to update the OAI database initially and then click here to enable regular updates.

If using the SOLR DB (solr)

sudo $HOME/bin/dspace oai import -c -v

Please note: This may take a while on a system with many thousands of items!

See: https://wiki.duraspace.org/display/DSDOC5x/OAI+2.0+Server#OAI2.0Server-UsingSolr

If using the SQL DB (database)

sudo $HOME/bin/dspace oai compile-items

See: https://wiki.duraspace.org/display/DSDOC5x/OAI+2.0+Server#OAI2.0Server-UsingDatabase