SUNScholar/Harvesting/5.X

Step 1 - Config
Edit the following file. See the sample below.
nano $HOME/source/dspace/config/modules/oai.cfg
 1) Select whether storage will be the SOLR database or the PostgreSQL database.
 2) Define the OAI URLs.
 3) Define the OAI folder paths.
 4) Define the harvester settings.
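
The storage choice in the list above can be sanity-checked before rebuilding DSpace. The following is a minimal Python sketch and not part of DSpace: it only reads simple key = value lines, assumes # comments and no line continuations, and the helper names parse_cfg and check_storage are hypothetical.

```python
def parse_cfg(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def check_storage(settings):
    """Return the storage backend, or raise if it is not solr or database."""
    storage = settings.get("storage")
    if storage not in ("solr", "database"):
        raise ValueError("storage must be 'solr' or 'database', got %r" % storage)
    return storage
```

A possible use is to read the file edited above, pass its text to parse_cfg, and call check_storage on the result before continuing with Step 2.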

Example

# XOAI CONFIGURATIONS
# These configs are used by the XOAI

# Storage: solr | database
storage=solr

# Base solr index
solr.url=http://localhost/solr/oai
# OAI persistent identifier prefix.
# Format - oai:PREFIX:HANDLE
identifier.prefix = scholar.sun.ac.za
# Base url for bitstreams
bitstream.baseUrl = http://scholar.sun.ac.za

# Base Configuration Directory
config.dir = /home/dspace/config/crosswalks/oai

# Description
description.file = /home/dspace/config/crosswalks/oai/description.xml

# Cache enabled?
cache.enabled = true

# Base Cache Directory
cache.dir = /home/dspace/var/oai


# OAI HARVESTING CONFIGURATIONS
# These configs are only used by the OAI-ORE related functions

# Harvester settings

# Crosswalk settings; the {name} value must correspond to a declared ingestion crosswalk
# harvester.oai.metadataformats.{name} = {namespace},{optional display name}
# The display name is only used in the xmlui; for the jspui there are entries in
# Messages.properties in the form jsp.tools.edit-collection.form.label21.select.{name}
harvester.oai.metadataformats.dc = http://www.openarchives.org/OAI/2.0/oai_dc/, Simple Dublin Core
harvester.oai.metadataformats.qdc = http://purl.org/dc/terms/, Qualified Dublin Core
harvester.oai.metadataformats.dim = http://www.dspace.org/xmlns/dspace/dim, DSpace Intermediate Metadata


# This field works in much the same way as harvester.oai.metadataformats.PluginName
# The {name} must correspond to a declared ingestion crosswalk, while the
# {namespace} must be supported by the target OAI-PMH provider when harvesting content.
# harvester.oai.oreSerializationFormat.{name} = {namespace}

# Determines whether the harvester scheduling process should be started
# automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=false


# Amount of time subtracted from the from argument of the PMH request to account
# for the time taken to negotiate a connection. Measured in seconds. Default value is 120.
# harvester.timePadding = 120


# How frequently the harvest scheduler checks the remote provider for updates,
# measured in minutes. The default value is 12 hours (or 720 minutes)
# harvester.harvestFrequency = 720


# The heartbeat is the frequency at which the harvest scheduler queries the local
# database to determine if any collections are due for a harvest cycle (based on
# the harvestFrequency value). The scheduler is optimized to then sleep until the
# next collection is actually ready to be harvested. The minHeartbeat and
# maxHeartbeat are the lower and upper bounds on this timeframe. Measured in seconds.
# Default minHeartbeat is 30.  Default maxHeartbeat is 3600.
# harvester.minHeartbeat = 30
# harvester.maxHeartbeat = 3600


# How many harvest process threads the scheduler can spool up at once. Default value is 3.
# harvester.maxThreads = 3


# How much time passes before a harvest thread is terminated. The termination process
# waits for the current item to complete ingest and saves progress made up to that point.
# Measured in hours. Default value is 24.
# harvester.threadTimeout = 24

# When harvesting an item that contains an unknown schema or field within a schema what
# should the harvester do? Either add a new registry item for the field or schema, ignore
# the specific field or schema (importing everything else about the item), or fail with
# an error. The default value if undefined is: fail.
# Possible values: 'fail', 'add', or 'ignore'
harvester.unknownField = add
harvester.unknownSchema = fail


# The webapp responsible for minting the URIs for ORE Resource Maps.
# If using oai, the dspace.oai.uri config value must be set.
# The URIs generated for ORE ReMs follow the following convention for both cases.
# format: [baseURI]/metadata/handle/[theHandle]/ore.xml
# Default value is oai
# ore.authoritative.source = oai


# A harvest process will attempt to scan the metadata of the incoming items
# (dc.identifier.uri field, to be exact) to see if it looks like a handle.
# If so, it matches the pattern against the values of this parameter.
# If there is a match the new item is assigned the handle from the metadata value
# instead of minting a new one. Default value: hdl.handle.net
# harvester.acceptedHandleServer = hdl.handle.net, handle.myu.edu


# Pattern to reject as an invalid handle prefix (known test string, for example)
# when attempting to find the handle of harvested items. If there is a match with
# this config parameter, a new handle will be minted instead. Default value: 123456789.
# harvester.rejectedHandlePrefix = 123456789, myTestHandle
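
The last two settings (acceptedHandleServer and rejectedHandlePrefix) interact, so a small illustration may help. The Python sketch below follows the behaviour described in the config comments only; it is not DSpace's actual implementation, the function name handle_to_reuse is hypothetical, and the handle 12345/678 used in the usage note is an invented example value.

```python
from urllib.parse import urlparse


def handle_to_reuse(identifier_uri, accepted_servers, rejected_prefixes):
    """Return the 'prefix/suffix' handle to keep, or None to mint a new one.

    Mirrors the comments for harvester.acceptedHandleServer and
    harvester.rejectedHandlePrefix; an illustration, not DSpace code.
    """
    parsed = urlparse(identifier_uri)
    # The URI must point at one of the accepted handle servers.
    if parsed.hostname not in accepted_servers:
        return None
    # A handle has the form prefix/suffix; split on the first slash only.
    prefix, sep, suffix = parsed.path.strip("/").partition("/")
    if not sep or not prefix or not suffix:
        return None
    # Known test prefixes are rejected, so a new handle is minted instead.
    if prefix in rejected_prefixes:
        return None
    return prefix + "/" + suffix
```

With accepted_servers set to ["hdl.handle.net"] and rejected_prefixes set to ["123456789"], the URI http://hdl.handle.net/12345/678 keeps its handle, while http://hdl.handle.net/123456789/5 does not.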

Step 2 - Initialise OAI database
Execute one of the following tasks to populate the OAI database initially, then enable regular updates.

If using the SOLR DB (solr)
sudo $HOME/bin/dspace oai import -c -v

Please note: This may take a while on a system with many thousands of items!

See: https://wiki.duraspace.org/display/DSDOC5x/OAI+2.0+Server#OAI2.0Server-UsingSolr

If using the SQL DB (database)
sudo $HOME/bin/dspace oai compile-items

See: https://wiki.duraspace.org/display/DSDOC5x/OAI+2.0+Server#OAI2.0Server-UsingDatabase
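
Once either task has finished, one way to confirm that the OAI server answers is to send it an OAI-PMH Identify request. The Python sketch below assumes the OAI webapp is reachable under /oai/request on the given base URL, which may differ on your installation; the function names are hypothetical and the host is a placeholder.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace used by all OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def identify_url(base_url):
    """Build the Identify request URL (assumes the webapp path /oai/request)."""
    return base_url.rstrip("/") + "/oai/request?" + urlencode({"verb": "Identify"})


def repository_name(identify_xml):
    """Extract repositoryName from an Identify response, or None if absent."""
    root = ET.fromstring(identify_xml)
    node = root.find(OAI_NS + "Identify/" + OAI_NS + "repositoryName")
    return node.text if node is not None else None


def fetch_repository_name(base_url, timeout=10):
    """Send the Identify request over HTTP and return the repository name."""
    with urlopen(identify_url(base_url), timeout=timeout) as response:
        return repository_name(response.read())
```

Calling fetch_repository_name("http://localhost:8080") should return the repository name if the OAI webapp is deployed at that address; a connection error or a None result suggests the configuration or the import needs another look.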