SUNScholar/Harvesting/3.X
Back to Harvesting
Contents
Urgent Notice
Since the upgrade from DSpace 1.8.2 to DSpace 3.2, the following cannot be harvested:
http://scholar.sun.ac.za/oai/request?verb=ListRecords&metadataPrefix=oai_dc
Bug report submitted: https://jira.duraspace.org/browse/DS-1902
Please Note:
The command to completely clear out the cache does not work. So I manually cleared the cache as follows:
cd /home/dspace/var/oai/requests
sudo rm *
Then rebuild the index with the import command. (/home/dspace/bin/dspace import -o -v -c)
Config
Edit the following file;
nano /home/dspace/source/dspace/config/modules/oai.cfg
- Select whether storage will be the SOLR database or the PostgreSQL database
- Define OAI URL's.
- Define the OAI folder paths.
- Define harvester settings.
See sample below.
Sample
#---------------------------------------------------------------#
#--------------------XOAI CONFIGURATIONS------------------------#
#---------------------------------------------------------------#
# These configs are used by the XOAI #
#---------------------------------------------------------------#
# Storage: solr | database
storage=database
# Base solr index
solr.url=http://localhost/solr/oai
# OAI persistent identifier prefix.
# Format - oai:PREFIX:HANDLE
identifier.prefix = scholar.sun.ac.za
# Base url for bitstreams
bitstream.baseUrl = http://scholar.sun.ac.za
# Base Configuration Directory
config.dir = /home/dspace/config/crosswalks/oai
# Description
description.file = /home/dspace/config/crosswalks/oai/description.xml
# Cache enabled?
cache.enabled = true
# Base Cache Directory
cache.dir = /home/dspace/var/oai
#---------------------------------------------------------------#
#--------------OAI HARVESTING CONFIGURATIONS--------------------#
#---------------------------------------------------------------#
# These configs are only used by the OAI-ORE related functions #
#---------------------------------------------------------------#
### Harvester settings
# Crosswalk settings; the {name} value must correspond to a declated ingestion crosswalk
# harvester.oai.metadataformats.{name} = {namespace},{optional display name}
# The display name is only used in the xmlui for the jspui there are entries in the
# Messages.properties in the form jsp.tools.edit-collection.form.label21.select.{name}
harvester.oai.metadataformats.dc = http://www.openarchives.org/OAI/2.0/oai_dc/, Simple Dublin Core
harvester.oai.metadataformats.qdc = http://purl.org/dc/terms/, Qualified Dublin Core
harvester.oai.metadataformats.dim = http://www.dspace.org/xmlns/dspace/dim, DSpace Intermediate Metadata
# This field works in much the same way as harvester.oai.metadataformats.PluginName
# The {name} must correspond to a declared ingestion crosswalk, while the
# {namespace} must be supported by the target OAI-PMH provider when harvesting content.
# harvester.oai.oreSerializationFormat.{name} = {namespace}
# Determines whether the harvester scheduling process should be started
# automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=false
# Amount of time subtracted from the from argument of the PMH request to account
# for the time taken to negotiate a connection. Measured in seconds. Default value is 120.
#harvester.timePadding = 120
# How frequently the harvest scheduler checks the remote provider for updates,
# messured in minutes. The default vaule is 12 hours (or 720 minutes)
#harvester.harvestFrequency = 720
# The heartbeat is the frequency at which the harvest scheduler queries the local
# database to determine if any collections are due for a harvest cycle (based on
# the harvestFrequency) value. The scheduler is optimized to then sleep until the
# next collection is actually ready to be harvested. The minHeartbeat and
# maxHeartbeat are the lower and upper bounds on this timeframe. Measured in seconds.
# Default minHeartbeat is 30. Default maxHeartbeat is 3600.
#harvester.minHeartbeat = 30
#harvester.maxHeartbeat = 3600
# How many harvest process threads the scheduler can spool up at once. Default value is 3.
#harvester.maxThreads = 3
# How much time passess before a harvest thread is terminated. The termination process
# waits for the current item to complete ingest and saves progress made up to that point.
# Measured in hours. Default value is 24.
#harvester.threadTimeout = 24
# When harvesting an item that contains an unknown schema or field within a schema what
# should the harvester do? Either add a new registry item for the field or schema, ignore
# the specific field or schema (importing everything else about the item), or fail with
# an error. The default value if undefined is: fail.
# Possible values: 'fail', 'add', or 'ignore'
harvester.unknownField = add
harvester.unknownSchema = fail
# The webapp responsible for minting the URIs for ORE Resource Maps.
# If using oai, the dspace.oai.uri config value must be set.
# The URIs generated for ORE ReMs follow the following convention for both cases.
# format: [baseURI]/metadata/handle/[theHandle]/ore.xml
# Default value is oai
#ore.authoritative.source = oai
# A harvest process will attempt to scan the metadata of the incoming items
# (dc.identifier.uri field, to be exact) to see if it looks like a handle.
# If so, it matches the pattern against the values of this parameter.
# If there is a match the new item is assigned the handle from the metadata value
# instead of minting a new one. Default value: hdl.handle.net
#harvester.acceptedHandleServer = hdl.handle.net, handle.myu.edu
# Pattern to reject as an invalid handle prefix (known test string, for example)
# when attempting to find the handle of harvested items. If there is a match with
# this config parameter, a new handle will be minted instead. Default value: 123456789.
#harvester.rejectedHandlePrefix = 123456789, myTestHandle
Daily Task
Click here to define the following tasks to update the OAI database initially and then daily.
If using the SQL DB (database)
/home/dspace/bin/dspace oai compile-items
If using the SOLR DB (solr)
/home/dspace/bin/dspace oai import -o
References
- https://wiki.duraspace.org/display/DSDOC3x/OAI
- https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server
- https://wiki.duraspace.org/pages/viewpage.action?pageId=32478690
- https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server#OAI2.0Server-BasicConfiguration
- https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server#OAI2.0Server-UsingDatabase
- https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server#OAI2.0Server-OAIManager(SolrDataSource)
- https://wiki.duraspace.org/display/DSDOC3x/OAI#OAI-OAI-PMH/OAI-OREHarvesterConfiguration