SUNScholar/Harvesting/4.X

Config
Edit the following file:
nano $HOME/source/dspace/config/modules/oai.cfg
Work through these steps, using the sample below as a guide.
 * 1) Select whether storage will be the SOLR database or the PostgreSQL database.
 * 2) Define the OAI URLs.
 * 3) Define the OAI folder paths.
 * 4) Define the harvester settings.
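
Once the file has been edited, the values can be spot-checked before redeploying. The following is a minimal Python sketch and not part of DSpace: the helper names `parse_cfg` and `check_oai_cfg` are hypothetical, the parser handles only plain `key = value` lines with `#` comments, and the checks cover only the allowed values stated in the sample comments on this page.

```python
def parse_cfg(text):
    """Parse simple key = value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_oai_cfg(settings):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    if settings.get("storage") not in ("solr", "database"):
        problems.append("storage must be 'solr' or 'database'")
    for key in ("harvester.unknownField", "harvester.unknownSchema"):
        # Undefined is allowed: the sample comments give 'fail' as the default.
        value = settings.get(key, "fail")
        if value not in ("fail", "add", "ignore"):
            problems.append(f"{key} must be 'fail', 'add' or 'ignore'")
    return problems


# A shortened stand-in for oai.cfg, used here for illustration only.
sample = """
# Storage: solr | database
storage=solr
harvester.unknownField = add
harvester.unknownSchema = fail
"""
print(check_oai_cfg(parse_cfg(sample)))  # prints []
```

To check the real file, read it with `open(path).read()` and pass the text to `parse_cfg`.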

Example

# XOAI CONFIGURATIONS
# These configs are used by the XOAI

# Storage: solr | database
storage=solr

# Base solr index
solr.url=http://localhost/solr/oai

# OAI persistent identifier prefix.
# Format - oai:PREFIX:HANDLE
identifier.prefix = scholar.sun.ac.za

# Base url for bitstreams
bitstream.baseUrl = http://scholar.sun.ac.za

# Base Configuration Directory
config.dir = /home/dspace/config/crosswalks/oai

# Description
description.file = /home/dspace/config/crosswalks/oai/description.xml

# Cache enabled?
cache.enabled = true

# Base Cache Directory
cache.dir = /home/dspace/var/oai
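
The identifier.prefix comment above gives the identifier format as oai:PREFIX:HANDLE. As an illustration only, the Python sketch below builds and splits such an identifier. The function names are hypothetical and the handle `123456789/42` is a made-up placeholder, not a real item.

```python
def make_oai_identifier(prefix, handle):
    """Build an identifier in the documented oai:PREFIX:HANDLE form."""
    return f"oai:{prefix}:{handle}"


def split_oai_identifier(identifier):
    """Split oai:PREFIX:HANDLE back into (prefix, handle)."""
    # Split on the first two colons only; raises ValueError if parts are missing.
    scheme, prefix, handle = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    return prefix, handle


# The prefix matches identifier.prefix in the sample; the handle is a placeholder.
identifier = make_oai_identifier("scholar.sun.ac.za", "123456789/42")
print(identifier)  # prints oai:scholar.sun.ac.za:123456789/42
print(split_oai_identifier(identifier))  # prints ('scholar.sun.ac.za', '123456789/42')
```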


# OAI HARVESTING CONFIGURATIONS
# These configs are only used by the OAI-ORE related functions

# Harvester settings

# Crosswalk settings; the {name} value must correspond to a declared ingestion crosswalk
# harvester.oai.metadataformats.{name} = {namespace},{optional display name}
# The display name is only used in the xmlui; for the jspui there are entries in
# Messages.properties in the form jsp.tools.edit-collection.form.label21.select.{name}
harvester.oai.metadataformats.dc = http://www.openarchives.org/OAI/2.0/oai_dc/, Simple Dublin Core
harvester.oai.metadataformats.qdc = http://purl.org/dc/terms/, Qualified Dublin Core
harvester.oai.metadataformats.dim = http://www.dspace.org/xmlns/dspace/dim, DSpace Intermediate Metadata

# This field works in much the same way as harvester.oai.metadataformats.PluginName
# The {name} must correspond to a declared ingestion crosswalk, while the
# {namespace} must be supported by the target OAI-PMH provider when harvesting content.
# harvester.oai.oreSerializationFormat.{name} = {namespace}

# Determines whether the harvester scheduling process should be started
# automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=false


# Amount of time subtracted from the from argument of the PMH request to account
# for the time taken to negotiate a connection. Measured in seconds. Default value is 120.
# harvester.timePadding = 120


# How frequently the harvest scheduler checks the remote provider for updates,
# measured in minutes. The default value is 12 hours (or 720 minutes)
# harvester.harvestFrequency = 720


# The heartbeat is the frequency at which the harvest scheduler queries the local
# database to determine if any collections are due for a harvest cycle (based on
# the harvestFrequency value). The scheduler is optimized to then sleep until the
# next collection is actually ready to be harvested. The minHeartbeat and
# maxHeartbeat are the lower and upper bounds on this timeframe. Measured in seconds.
# Default minHeartbeat is 30. Default maxHeartbeat is 3600.
# harvester.minHeartbeat = 30
# harvester.maxHeartbeat = 3600


# How many harvest process threads the scheduler can spool up at once. Default value is 3.
# harvester.maxThreads = 3


# How much time passes before a harvest thread is terminated. The termination process
# waits for the current item to complete ingest and saves progress made up to that point.
# Measured in hours. Default value is 24.
# harvester.threadTimeout = 24

# When harvesting an item that contains an unknown schema or field within a schema what
# should the harvester do? Either add a new registry item for the field or schema, ignore
# the specific field or schema (importing everything else about the item), or fail with
# an error. The default value if undefined is: fail.
# Possible values: 'fail', 'add', or 'ignore'
harvester.unknownField = add
harvester.unknownSchema = fail


# The webapp responsible for minting the URIs for ORE Resource Maps.
# If using oai, the dspace.oai.uri config value must be set.
# The URIs generated for ORE ReMs follow the following convention for both cases.
# format: [baseURI]/metadata/handle/[theHandle]/ore.xml
# Default value is oai
# ore.authoritative.source = oai

# A harvest process will attempt to scan the metadata of the incoming items
# (dc.identifier.uri field, to be exact) to see if it looks like a handle.
# If so, it matches the pattern against the values of this parameter.
# If there is a match the new item is assigned the handle from the metadata value
# instead of minting a new one. Default value: hdl.handle.net
# harvester.acceptedHandleServer = hdl.handle.net, handle.myu.edu


# Pattern to reject as an invalid handle prefix (known test string, for example)
# when attempting to find the handle of harvested items. If there is a match with
# this config parameter, a new handle will be minted instead. Default value: 123456789.
# harvester.rejectedHandlePrefix = 123456789, myTestHandle
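
The interaction between harvester.acceptedHandleServer and harvester.rejectedHandlePrefix, as described in the comments above, can be pictured with a short Python sketch. This illustrates the described behaviour only and is not DSpace's actual code: the function name `handle_from_uri` is hypothetical, the sketch assumes URIs of the form http://SERVER/PREFIX/SUFFIX, and the example handles are placeholders.

```python
from urllib.parse import urlparse


def handle_from_uri(uri, accepted_servers, rejected_prefixes):
    """Return the handle to reuse, or None if a new handle should be minted."""
    parsed = urlparse(uri)
    if parsed.hostname not in accepted_servers:
        return None  # not hosted on an accepted handle server
    handle = parsed.path.lstrip("/")
    prefix, sep, suffix = handle.partition("/")
    if not (prefix and sep and suffix):
        return None  # does not look like PREFIX/SUFFIX
    if prefix in rejected_prefixes:
        return None  # known test prefix: mint a new handle instead
    return handle


# Values taken from the commented-out examples in the sample config.
accepted = ["hdl.handle.net", "handle.myu.edu"]
rejected = ["123456789", "myTestHandle"]

print(handle_from_uri("http://hdl.handle.net/2345.6/789", accepted, rejected))  # prints 2345.6/789
print(handle_from_uri("http://hdl.handle.net/123456789/42", accepted, rejected))  # prints None
```

A return value of None stands for the case where the harvester would mint a new handle for the item.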

Initialise OAI database
Execute one of the following tasks to populate the OAI database for the first time, then enable regular updates.

If using the SOLR DB (solr)
sudo $HOME/bin/dspace oai import -c -v

Please note: This may take a while on a system with many thousands of items!

See: https://wiki.duraspace.org/display/DSDOC4x/OAI+2.0+Server#OAI2.0Server-UsingSolr

If using the SQL DB (database)
sudo $HOME/bin/dspace oai compile-items

See: https://wiki.duraspace.org/display/DSDOC4x/OAI+2.0+Server#OAI2.0Server-UsingDatabase