SUNScholar/Harvesting/3.X

DS-1902
During the upgrade from 1.8.2 to 3.2, a bug report was submitted: https://jira.duraspace.org/browse/DS-1902

The command that completely clears the OAI cache does not work on our installation, because our Tomcat server runs as root (so that it has full access to all files in $HOME).

So I cleared the cache manually as follows:

cd $HOME/var/oai/requests
sudo rm *

Then I completely rebuilt the OAI SOLR DB with the import command:

$HOME/bin/dspace oai import -o -v

And it works.
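
The manual workaround above can be wrapped in a small helper. This is only a minimal sketch, not the fix from DS-1902: the function name clear_oai_cache is hypothetical, and the directory passed to it is assumed to be the cache path used above ($HOME/var/oai/requests). It removes only regular files directly inside that directory and leaves subdirectories alone.

```shell
#!/bin/sh
# Sketch: empty the OAI request cache directory.
# clear_oai_cache is a hypothetical helper, not part of the DSpace CLI.
clear_oai_cache() {
    cache_dir="$1"
    # Refuse to run on a missing directory rather than guessing a path.
    if [ ! -d "$cache_dir" ]; then
        echo "not a directory: $cache_dir" >&2
        return 1
    fi
    # Delete regular files directly inside the cache directory only.
    find "$cache_dir" -maxdepth 1 -type f -exec rm -f {} +
}
```

Because Tomcat runs as root on our server, the function would have to be called from a root shell. Afterwards the index still has to be rebuilt with the import command shown above.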

Another solution is to disable the cache; see the cache.enabled setting in the example below.

DS-1445
During the upgrade from 1.8.2 to 3.2, a bug report was submitted: https://jira.duraspace.org/browse/DS-1445

Email sent:

Hi All

Regarding the following, is another patch required? https://jira.duraspace.org/browse/DS-1445

We use DSpace 3.2 with a SOLR DB for OAI. See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Harvesting/3.2#Example

This query seems to work: http://scholar.sun.ac.za/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:scholar.sun.ac.za:10019.1/255

This is from Open Archives: http://www.openarchives.org/OAI/2.0/migration.htm#SelectiveHarvestingandDatestamps

From http://re.cs.uct.ac.za I get the following (inline image in the original email).

Error message from Open Archives:

[1] ListRecords response gave a noRecordsMatch error when it should have included at least the record with identifier oai:scholar.sun.ac.za:10019.1/255. The from and until parameters of the request were set to the datestamp of this record (2011-06-23T08:15:02Z). The from and until parameters are inclusive, see protocol spec section 2.7.1. The message included in the error response was: 'No matches for the query'

SUMMARY: Total exceptions improperly handled: 1 out of 15
Total error count: 1
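
To reproduce the validator's check by hand, ListRecords can be requested with from and until both set to the record's datestamp. The following is a minimal sketch: the helper name list_records_url is hypothetical, while the base URL, metadata prefix and datestamp are the ones quoted in the email above. According to the error message, both bounds are inclusive, so the response should contain the record rather than a noRecordsMatch error.

```shell
#!/bin/sh
# Sketch: build an OAI-PMH ListRecords URL with identical from/until bounds.
# list_records_url is a hypothetical helper name.
list_records_url() {
    base_url="$1"
    prefix="$2"
    stamp="$3"
    printf '%s?verb=ListRecords&metadataPrefix=%s&from=%s&until=%s\n' \
        "$base_url" "$prefix" "$stamp" "$stamp"
}

url=$(list_records_url "http://scholar.sun.ac.za/oai/request" "oai_dc" "2011-06-23T08:15:02Z")
echo "$url"
```

The printed URL can be fetched with any HTTP client or pasted into a browser. The response should then list the record oai:scholar.sun.ac.za:10019.1/255.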

Config
Edit the following file:

nano $HOME/source/dspace/config/modules/oai.cfg

See the sample below.
 * 1) Select whether storage will be the SOLR database or the PostgreSQL database.
 * 2) Define the OAI URLs.
 * 3) Define the OAI folder paths.
 * 4) Define harvester settings.

Example

# XOAI CONFIGURATIONS
# These configs are used by the XOAI

# Storage: solr | database
storage=solr

# Base solr index
solr.url=http://localhost/solr/oai
# OAI persistent identifier prefix.
# Format - oai:PREFIX:HANDLE
identifier.prefix = scholar.sun.ac.za
# Base url for bitstreams
bitstream.baseUrl = http://scholar.sun.ac.za

# Base Configuration Directory
config.dir = /home/dspace/config/crosswalks/oai

# Description
description.file = /home/dspace/config/crosswalks/oai/description.xml

# Cache enabled?
cache.enabled = true

# Base Cache Directory
cache.dir = /home/dspace/var/oai


# OAI HARVESTING CONFIGURATIONS
# These configs are only used by the OAI-ORE related functions

# Harvester settings

# Crosswalk settings; the {name} value must correspond to a declared ingestion crosswalk
# harvester.oai.metadataformats.{name} = {namespace},{optional display name}
# The display name is only used in the xmlui; for the jspui there are entries in the
# Messages.properties in the form jsp.tools.edit-collection.form.label21.select.{name}
harvester.oai.metadataformats.dc = http://www.openarchives.org/OAI/2.0/oai_dc/, Simple Dublin Core
harvester.oai.metadataformats.qdc = http://purl.org/dc/terms/, Qualified Dublin Core
harvester.oai.metadataformats.dim = http://www.dspace.org/xmlns/dspace/dim, DSpace Intermediate Metadata

# This field works in much the same way as harvester.oai.metadataformats.PluginName
# The {name} must correspond to a declared ingestion crosswalk, while the
# {namespace} must be supported by the target OAI-PMH provider when harvesting content.
# harvester.oai.oreSerializationFormat.{name} = {namespace}

# Determines whether the harvester scheduling process should be started
# automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=false


# Amount of time subtracted from the from argument of the PMH request to account
# for the time taken to negotiate a connection. Measured in seconds. Default value is 120.
# harvester.timePadding = 120

# How frequently the harvest scheduler checks the remote provider for updates,
# measured in minutes. The default value is 12 hours (or 720 minutes)
# harvester.harvestFrequency = 720

# The heartbeat is the frequency at which the harvest scheduler queries the local
# database to determine if any collections are due for a harvest cycle (based on
# the harvestFrequency value). The scheduler is optimized to then sleep until the
# next collection is actually ready to be harvested. The minHeartbeat and
# maxHeartbeat are the lower and upper bounds on this timeframe. Measured in seconds.
# Default minHeartbeat is 30.  Default maxHeartbeat is 3600.
# harvester.minHeartbeat = 30
# harvester.maxHeartbeat = 3600

# How many harvest process threads the scheduler can spool up at once. Default value is 3.
# harvester.maxThreads = 3

# How much time passes before a harvest thread is terminated. The termination process
# waits for the current item to complete ingest and saves progress made up to that point.
# Measured in hours. Default value is 24.
# harvester.threadTimeout = 24

# When harvesting an item that contains an unknown schema or field within a schema what
# should the harvester do? Either add a new registry item for the field or schema, ignore
# the specific field or schema (importing everything else about the item), or fail with
# an error. The default value if undefined is: fail.
# Possible values: 'fail', 'add', or 'ignore'
harvester.unknownField = add
harvester.unknownSchema = fail


# The webapp responsible for minting the URIs for ORE Resource Maps.
# If using oai, the dspace.oai.uri config value must be set.
# The URIs generated for ORE ReMs follow the following convention for both cases.
# format: [baseURI]/metadata/handle/[theHandle]/ore.xml
# Default value is oai
# ore.authoritative.source = oai

# A harvest process will attempt to scan the metadata of the incoming items
# (dc.identifier.uri field, to be exact) to see if it looks like a handle.
# If so, it matches the pattern against the values of this parameter.
# If there is a match the new item is assigned the handle from the metadata value
# instead of minting a new one. Default value: hdl.handle.net
# harvester.acceptedHandleServer = hdl.handle.net, handle.myu.edu

# Pattern to reject as an invalid handle prefix (known test string, for example)
# when attempting to find the handle of harvested items. If there is a match with
# this config parameter, a new handle will be minted instead. Default value: 123456789.
# harvester.rejectedHandlePrefix = 123456789, myTestHandle
Initialise OAI database
Run one of the following commands to populate the OAI database initially, and then enable regular updates.

If using the SOLR DB (solr)
sudo $HOME/bin/dspace oai import -c -v

See: https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server#OAI2.0Server-UsingSolr

If using the SQL DB (database)
sudo $HOME/bin/dspace oai compile-items

See: https://wiki.duraspace.org/display/DSDOC3x/OAI+2.0+Server#OAI2.0Server-UsingDatabase
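
Which of the two commands applies depends on the storage value in oai.cfg. The following is a minimal sketch that reads that value and prints the matching command. cfg_get and init_command are hypothetical helpers, and the parsing only handles simple "key = value" lines that are not commented out.

```shell
#!/bin/sh
# cfg_get KEY FILE: print the value of the last uncommented "KEY = value" line.
cfg_get() {
    # Strip "KEY =" and surrounding blanks; lines starting with "#" never match.
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" \
        | sed 's/[[:space:]]*$//' | tail -n 1
}

# init_command FILE: print the initial OAI update command for the configured storage.
init_command() {
    case "$(cfg_get storage "$1")" in
        solr)     echo 'dspace oai import -c -v' ;;
        database) echo 'dspace oai compile-items' ;;
        *)        echo "unknown storage setting in $1" >&2; return 1 ;;
    esac
}
```

For the sample config above (storage=solr), init_command prints the import command. It would still have to be run through sudo $HOME/bin/ as shown in this section.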