SUNScholar/Media Filters/Text Extraction


Step 1
Check the following settings in the "dspace.cfg" file:

 # Custom settings for PDFFilter
 # If true, all PDF extractions are written to temp files as they are indexed...this
 # is slower, but helps ensure that PDFBox software DSpace uses doesn't eat up
 # all your memory
 pdffilter.largepdfs = true
 # If true, PDFs which still result in an Out of Memory error from PDFBox
 # are skipped over...these problematic PDFs will never be indexed until
 # memory usage can be decreased in the PDFBox software
 pdffilter.skiponmemoryexception = true
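
As a quick way to confirm the two PDFFilter settings, the following is a minimal Python sketch, not part of DSpace. It assumes "dspace.cfg" uses plain "key = value" lines with "#" comments; the function names and the REQUIRED table are hypothetical, introduced here only for illustration.

```python
import sys
from pathlib import Path

# Settings this page asks you to check (hypothetical helper table).
REQUIRED = {
    "pdffilter.largepdfs": "true",
    "pdffilter.skiponmemoryexception": "true",
}


def read_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_pdf_settings(text):
    """Return {key: found value or None} for each setting that is not 'true'."""
    props = read_properties(text)
    return {
        key: props.get(key)
        for key, wanted in REQUIRED.items()
        if props.get(key, "").lower() != wanted
    }


# Only runs when a config path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    problems = check_pdf_settings(Path(sys.argv[1]).read_text())
    for key, found in problems.items():
        print(f"{key}: expected true, found {found!r}")
    if not problems:
        print("PDFFilter settings look fine")
```

Run it with the path to your own "dspace.cfg" as the only argument. A setting that is still commented out is reported as missing (found None).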

Step 2
Enable the daily media filter jobs so that newly added PDFs have their text extracted regularly. See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/Daily_Admin

News

 * http://onetransistor.blogspot.co.za/2015/12/ocr-searchable-pdf-linux.html