Difference between revisions of "SUNScholar/Digitisation"

From Libopedia
 
(164 intermediate revisions by 9 users not shown)
Line 1: Line 1:
=Objectives=
+
<center>
The objective is to convert to digital format any material using [[SUNScholar/DigitisationEquipment|digital equipment]].
+
'''[[SUNScholar/Practical guidelines for starting an institutional repository (IR)|Back to Guidelines]]'''
 +
'''[[SUNScholar/Repository_Preservation|Back to Repository Preservation]]'''
 +
</center>
  
The resultant digital object must adhere to the following:
+
==Introduction==
# '''<font color="red">Use an uncompressed bitstream for storage.</font>'''
+
In order to populate the digital research repository with print research material one has to digitise the print material first.
# '''<font color="red">Use open digital formats with no patent liability and which have open published standards.</font>'''
 
  
For more information, see: http://en.wikipedia.org/wiki/Digitizing
+
The question, therefore, is how to proceed with the digitisation process in an orderly and managed manner.
  
=Digital Format Registry=
+
<font color="red">'''Watch the short video below'''</font>
* http://www.udfr.org
 
* http://www.gdfr.info
 
  
=Common Closed Digital Formats=
+
<html5media width="560" height="315">File:Brewster_kahle-_digitize-everything.mp4</html5media>
See: http://patentabsurdity.com and http://en.swpat.org
 
==Documents==
 
All the Microsoft document formats are closed.
 
  
This is a huge problem for [[SUNScholar/Digital_Preservation|digital preservation]].
+
==Archival Digital Objects==
* http://www.digitalpreservation.gov/formats/intro/intro.shtml
+
To provide a high-quality service for current and future users, the digitisation process should produce the best possible digital copy of the original print item. These high-quality digital objects may be too large to deliver publicly, so you may need to store them on an internal platform and make only compressed versions publicly available. On this wiki help page, the high-quality digitised object is referred to as the "archival digital object".
* http://en.wikipedia.org/wiki/Comparison_of_Office_Open_XML_and_OpenDocument
 
  
==Multimedia==
+
==Digitisation process==
All the Microsoft media formats are closed.
+
===Step 1===
 +
''Determine the scope of the project by asking the following questions:''
 +
#What will be digitised?
 +
#[[SUNScholar/DigitisationEquipment|Who will perform the actual digitisation?]]
 +
#How long will the digitisation take?
 +
#[[SUNScholar/Digitisation/Digital_Formats|What standards will be applied to the resultant archival digital objects?]]
 +
#And finally, what will it cost to digitise the identified items, in the appropriate time with the appropriate digital object standards?
  
This is a huge problem for [[SUNScholar/Digital_Preservation|digital preservation]].
+
===Step 2===
* http://en.wikipedia.org/wiki/Windows_Media_Audio
+
''From the scope defined above, determine if there is capacity to store and manage the [[SUNScholar/Preservable_Digital_Objects|resultant archival digital objects]] in the long term by asking the following questions:''
* http://en.wikipedia.org/wiki/Windows_Media_Video
+
#How and where will the resultant archival digital objects be stored?
;Other closed media formats
+
#Is there enough archival storage capacity for the digital objects?
* http://en.wikipedia.org/wiki/Category:Open_formats_closed_by_software_patents
+
#Who will curate the provenance of the resultant archival digital objects, in the short and long term?
* http://en.wikipedia.org/wiki/Mp3 (Lossy audio codec, many patent trolls)
+
#[[SUNScholar/Disaster_Recovery|How will we implement a disaster recovery system for the archival digital objects in storage?]]
* http://en.wikipedia.org/wiki/Advanced_Audio_Coding (Lossy audio codec, many patent trolls)
 
* http://en.wikipedia.org/wiki/Mpeg4 (Lossy video codec, many patent trolls)
 
* http://en.wikipedia.org/wiki/Jpeg (Lossy image codec, many patent trolls)
 
* http://en.wikipedia.org/wiki/Tagged_Image_File_Format (Lossless image codec, many patent trolls)
 
* http://en.wikipedia.org/wiki/Flash_Video (Closed format multimedia container, many patent trolls)
 
  
=Multimedia=
+
===Step 3===
==Converter software==
+
''Now that the items identified have been digitised and stored, the next step is to determine how the digitised objects are made available to users of the library by asking the following questions:''
* http://www.longtailvideo.com/support/blog/12633/an-overview-of-audio-and-video-transcoding
+
#[[SUNScholar/Copyright|Are there any intellectual property concerns regarding the digital objects?]]
* http://www.nchsoftware.com/index.html
+
#[[SUNScholar/Digitisation/Digital_Formats|If the digital object can be made public, what digital format is appropriate for public consumption?]]
* http://www.gnomefiles.org/app.php/OggConvert
+
#[[List_of_Repository_Software|What platform will be used for public dissemination of the digitised items?]]
* http://www.linuxrising.org/transmageddon
+
#[[SUNScholar/Capacity_Building/Digital_Repository_Content_Management|Who will submit the digital objects to the public platform?]]
 +
#[[SUNScholar/Metadata|What metadata standards will be applied to the digital objects stored on the public platform?]]
 +
#[[SUNScholar/Capacity_Building/Digital_Repository_Systems_Management|Does the public platform have enough storage capacity for the digital objects?]]
 +
#[[SUNScholar/Capacity_Building/Digital_Repository_Systems_Management|Does the public platform have enough computing capacity to deal with the number of anticipated users who will visit and download or view the digital objects?]]
 +
#[[SUNScholar/Disaster_Recovery|Does the public platform implement a disaster recovery system?]]
  
==List of HTML5 compatible video formats==
+
===Step 4===
See: http://wiki.whatwg.org/wiki/Main_Page
+
<font color="red">'''To ensure these digital objects are preserved for future users of the library, the library top management should develop a [[SUNScholar/Repository_Preservation|digital preservation policy and action plans]].'''</font>
===Open Codecs===
 
;Dirac video and Vorbis audio in Matroska container
 
:<source src='video.mkv' type='video/x-matroska; codecs="dirac, vorbis"'>
 
;Theora video and Vorbis audio in Matroska container
 
:<source src='video.mkv' type='video/x-matroska; codecs="theora, vorbis"'>
 
;Dirac video and Vorbis audio in Ogg container
 
:<source src='video.ogv' type='video/ogg; codecs="dirac, vorbis"'>
 
:;http://diracvideo.org/wiki/index.php/Ffmpeg2dirac
 
;Theora video and Vorbis audio in Ogg container
 
:<source src='video.ogv' type='video/ogg; codecs="theora, vorbis"'>
 
:;http://v2v.cc/~j/ffmpeg2theora
 
;Theora video and Speex audio in Ogg container
 
:<source src='video.ogv' type='video/ogg; codecs="theora, speex"'>
 
;Vorbis audio alone in Ogg container
 
:<source src='audio.ogg' type='audio/ogg; codecs=vorbis'>
 
;Speex audio alone in Ogg container
 
:<source src='audio.spx' type='audio/ogg; codecs=speex'>
 
;FLAC audio alone in Ogg container
 
:<source src='audio.oga' type='audio/ogg; codecs=flac'>
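
The <code>type</code> values in the open-codec entries above all follow one pattern: a container MIME type followed by a <code>codecs</code> parameter. As a minimal sketch, the strings can be generated rather than typed by hand. The helper name <code>source_tag</code> is hypothetical and not part of any library, and it always quotes the codec list, which is also valid for a single codec.

```python
def source_tag(src, mime, codecs):
    """Build an HTML5 <source> element with a codecs parameter.

    Hypothetical helper for illustration only; the container MIME type
    and codec names are passed in by the caller.
    """
    codec_list = ", ".join(codecs)
    # Single-quoted attributes let the codecs parameter keep its double quotes
    return f"<source src='{src}' type='{mime}; codecs=\"{codec_list}\"'>"


theora_ogg = source_tag("video.ogv", "video/ogg", ["theora", "vorbis"])
flac_ogg = source_tag("audio.oga", "audio/ogg", ["flac"])
print(theora_ogg)
print(flac_ogg)
```

Generating the tags from one function keeps the container and codec names consistent across pages when several encodings of the same recording are offered.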
 
  
===Closed Codecs===
+
==Service Providers==
====H.264====
+
*http://heritage.blogs.africamediaonline.com/training/the-digital-campus
;H.264 Simple baseline profile video (main and extended video compatible) level 3 and Low-Complexity AAC audio in MP4 container
+
*https://en.wikipedia.org/wiki/Africa_Media_Online
:<source src='video.mp4' type='video/mp4; codecs="avc1.42E01E, mp4a.40.2"'>
+
==Training==
;H.264 Extended profile video (baseline-compatible) level 3 and Low-Complexity AAC audio in MP4 container
+
*http://dp.la/info/2015/10/07/new-self-guided-curriculum-for-digitization
:<source src='video.mp4' type='video/mp4; codecs="avc1.58A01E, mp4a.40.2"'>
 
;H.264 Main profile video level 3 and Low-Complexity AAC audio in MP4 container
 
:<source src='video.mp4' type='video/mp4; codecs="avc1.4D401E, mp4a.40.2"'>
 
;H.264 'High' profile video (incompatible with main, baseline, or extended profiles) level 3 and Low-Complexity AAC audio in MP4 container
 
:<source src='video.mp4' type='video/mp4; codecs="avc1.64001E, mp4a.40.2"'>
 
====MPEG-4====
 
;MPEG-4 Visual Simple Profile Level 0 video and Low-Complexity AAC audio in MP4 container
 
:<source src='video.mp4' type='video/mp4; codecs="mp4v.20.8, mp4a.40.2"'>
 
;MPEG-4 Advanced Simple Profile Level 0 video and Low-Complexity AAC audio in MP4 container
 
:<source src='video.mp4' type='video/mp4; codecs="mp4v.20.240, mp4a.40.2"'>
 
;MPEG-4 Visual Simple Profile Level 0 video and AMR audio in 3GPP container
 
:<source src='video.3gp' type='video/3gpp; codecs="mp4v.20.8, samr"'>
 
  
==Audio==
+
==[[SUNScholar/References|References]]==
* http://en.wikipedia.org/wiki/Comparison_of_audio_codecs
+
*https://wiki.diglib.org/Digitizing_Special_Formats
;Codecs
+
===[[SUNScholar/Digitisation/Digital Formats|Digital Formats]]===
* http://en.wikipedia.org/wiki/Vorbis (Lossy codec) See: http://xiph.org/vorbis/doc/Vorbis_I_spec.html
+
===[[SUNScholar/Preservable_Digital_Objects|Preservable Digital Objects]]===
* http://en.wikipedia.org/wiki/FLAC (Lossless codec) See: http://flac.sourceforge.net
+
===[[SUNScholar/DigitisationEquipment|Digitisation Equipment and Services]]===
* http://en.wikipedia.org/wiki/Speex (Lossy codec) See: http://speex.org/docs/manual/speex-manual
+
===[[SUNScholar/Digitisation/Guidelines for scanners|Guidelines For Scanners]]===
==Video==
 
* http://en.wikipedia.org/wiki/Comparison_of_video_codecs
 
;Codecs
 
* http://en.wikipedia.org/wiki/Dirac_(codec) (Lossy and Lossless codec). See: http://diracvideo.org
 
* http://en.wikipedia.org/wiki/Theora (Lossy codec). See: http://theora.org
 
:;MPEG-4
 
* http://en.wikipedia.org/wiki/MPEG-4 (Lossy codec) See: http://www.mpegla.com
 
* http://en.wikipedia.org/wiki/Xvid (GPL codec for patented MPEG-4 video). See: http://www.xvid.org
 
:;H.264/MPEG-4 AVC
 
* http://en.wikipedia.org/wiki/H.264/MPEG-4_AVC (Lossy codec) See: http://www.mpegla.com
 
* http://en.wikipedia.org/wiki/X264 (GPL codec for patented H.264/MPEG-4 video). See: http://www.videolan.org/developers/x264.html
 
 
 
==Images==
 
* http://en.wikipedia.org/wiki/Comparison_of_graphics_file_formats
 
* http://en.wikipedia.org/wiki/Portable_Network_Graphics (Lossless codec, patent free)
 
* http://en.wikipedia.org/wiki/Scalable_Vector_Graphics (Lossless codec, patent free)
 
* http://en.wikipedia.org/wiki/Graphics_Interchange_Format (Variable codec, patent expired)
 
 
 
==Container Formats==
 
* http://en.wikipedia.org/wiki/Comparison_of_container_formats
 
* http://en.wikipedia.org/wiki/Matroska (Patent free, open standard container) See: http://www.matroska.org
 
* http://en.wikipedia.org/wiki/Ogg (Patent free, open standard container)
 
 
 
=Documents=
 
* http://documentfreedom.org
 
* http://www.pdfa.org
 
* http://en.wikipedia.org/wiki/PDF/A
 
* http://www.iso.org/iso/catalogue_detail?csnumber=38920
 
* http://en.wikipedia.org/wiki/OpenDocument
 
* http://en.wikipedia.org/wiki/Office_Open_XML
 
* http://en.wikipedia.org/wiki/HTML
 
* http://en.wikipedia.org/wiki/Plain_text
 
* http://en.wikipedia.org/wiki/LaTeX
 
;Comments
 
<pre>
 
Dear Hilton,
 
 
 
I would advise that you adopt open (i.e. non-proprietary) standards, as these have the best chance of remaining readable in the long-term future.
 
Proprietary formats are dependent on the continuing existence of the firm that markets them, as well as on continued support by this firm, even if it continues to exist.
 
This is in my opinion very risky.
 
 
 
For documents I am aware of an ISO standard that is targeted at archival, known as PDF/A (see www.pdfa.org).
 
 
 
For audio and video the situation is less developed, and there are as far as I know no standards specifically for archival.
 
In both cases I would recommend that data be saved without lossy compression, and again that open standards be sought.
 
Hence mp3 and WMV should be avoided, both because they are based on lossy compression and because they are proprietary.
 
The audio format FLAC on the other hand is open and does not employ lossy compression.
 
 
 
I hope this is of help,
 
Best regards,
 
Thomas Niesler.
 
 
 
------------------------------------------------
 
Prof. Thomas Niesler
 
Digital Signal Processing Group
 
Department of Electronic Engineering
 
University of Stellenbosch
 
Private Bag X1, Stellenbosch 7602, South Africa
 
Phone: +27 21 8084118
 
Fax:  +27 21 8084981
 
Email: trn@dsp.sun.ac.za
 
</pre>
 
=Microfiche=
 
* http://www.fctec.co.za
 
=Software=
 
* http://www.fsf.org and http://www.opensource.org
 
 
 
=Data Sets=
 
* http://en.wikipedia.org/wiki/Sql
 
* http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
 
* http://en.wikipedia.org/wiki/Comparison_of_database_tools
 
 
 
=Engineering drawings=
 
See: http://www.opendesign.com
 
 
 
=[[SUNScholar/Metadata|Metadata]]=
 
Click on the heading above.
 
 
 
=Language=
 
* http://en.wikipedia.org/wiki/UTF-8
 
* http://en.wikipedia.org/wiki/Langauge_codes
 
 
 
=Digitisation Guidelines=
 
{| class="wikitable" border="1" style="text-align:center;1px"
 
! Media type !! Resolution !! Bit depth !! Enhancements Allowed
 
|-
 
| Printed text || 300 dpi || Bitonal || Sharpening, descreening, cropping, deskewing, despeckling
 
|-
 
| Rare/damaged printed text || 400 dpi || 8-bit grey or 24-bit colour || Contrast stretching; Minimal adjustments for tone and colour
 
|-
 
| Book illustrations  || 400 - 600 dpi with enhancement  || 8-bit grey or 24-bit colour; Bitonal || Contrast stretching; Minimal adjustments for tone and colour; Descreen/rescreen, sharpen
 
|-
 
| Manuscripts || 300 - 500 dpi with enhancement || 8-bit grey or 24-bit colour || Contrast stretching; Minimal adjustments for tone and colour
 
|-
 
| Maps and other oversized items || 300 - 400 dpi  || 8-bit grey or 24-bit colour  || Contrast stretching; Minimal adjustments for tone and colour
 
|-
 
| Graphic Art || 400 - 600 dpi  || 8-bit/ channel internal reduction  || Contrast stretching; Minimal adjustments for tone and colour
 
|-
 
|}
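
To see what the resolutions and bit depths in the table mean for storage, the uncompressed size of a scan can be estimated as width in pixels × height in pixels × bits per pixel ÷ 8. The sketch below assumes an A4 page (about 8.27 × 11.69 inches), which is an illustrative choice and not something this page prescribes. The function name is hypothetical.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the uncompressed size in bytes of one scanned image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes
    return (total_bits + 7) // 8


# Assumed page size: A4, roughly 8.27 x 11.69 inches
A4 = (8.27, 11.69)

bitonal_300 = uncompressed_scan_bytes(*A4, dpi=300, bits_per_pixel=1)
grey_400 = uncompressed_scan_bytes(*A4, dpi=400, bits_per_pixel=8)
colour_600 = uncompressed_scan_bytes(*A4, dpi=600, bits_per_pixel=24)

for label, size in [("300 dpi bitonal", bitonal_300),
                    ("400 dpi 8-bit grey", grey_400),
                    ("600 dpi 24-bit colour", colour_600)]:
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

The gap between a bitonal text page and a 24-bit colour scan at 600 dpi is roughly two orders of magnitude, which is why the choice of row in the table matters when planning archival storage.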
 
 
 
;Please note:
 
* All archival material is to be digitised in TIFF format
 
* The TIFF copy, together with the derived PNG and any additional copies, is to be submitted to SUNScholar
 
* Document provenance metadata:
 
** dc.description.provenance e.g. Original scanned in at 600 dpi, 100% DigiBook 10000 RGB colour, downsized to 840 pixels in width, resolution 250. Web version done automatically by PhotoShop 7 software. Downloading time approx. 26 seconds. Date done March - April 2007.
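
A provenance note like the example above can be assembled from a few recorded scan parameters, so that every item is described in the same way. The sketch below is only an illustration: the <code>provenance_note</code> helper and its parameter names are hypothetical, and the wording should follow whatever convention the repository adopts.

```python
def provenance_note(scan_dpi, scanner, colour_mode, width_px, software, period):
    """Compose a dc.description.provenance value from scan parameters."""
    return (
        f"Original scanned in at {scan_dpi} dpi, {scanner} {colour_mode}, "
        f"downsized to {width_px} pixels in width. "
        f"Web version created with {software}. Date done {period}."
    )


# Values taken from the example on this page
record = {
    "dc.description.provenance": provenance_note(
        scan_dpi=600,
        scanner="DigiBook 10000",
        colour_mode="RGB colour",
        width_px=840,
        software="PhotoShop 7",
        period="March - April 2007",
    )
}
print(record["dc.description.provenance"])
```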
 

Latest revision as of 12:28, 1 September 2016


Introduction

In order to populate the digital research repository with print research material one has to digitise the print material first.

The question, therefore, is how to proceed with the digitisation process in an orderly and managed manner.


Archival Digital Objects

To provide a high-quality service for current and future users, the digitisation process should produce the best possible digital copy of the original print item. These high-quality digital objects may be too large to deliver publicly, so you may need to store them on an internal platform and make only compressed versions publicly available. On this wiki help page, the high-quality digitised object is referred to as the "archival digital object".

Digitisation process

Step 1

Determine the scope of the project by asking the following questions:

  1. What will be digitised?
  2. Who will perform the actual digitisation?
  3. How long will the digitisation take?
  4. What standards will be applied to the resultant archival digital objects?
  5. And finally, what will it cost to digitise the identified items, in the appropriate time with the appropriate digital object standards?
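
Questions 3 and 5 above can be roughed out with simple arithmetic once the number of pages is known. The sketch below is a minimal illustration: the function name is hypothetical, and every figure (page count, scanning rate, hours per day, hourly cost) is a placeholder to be replaced with values measured in a pilot run.

```python
import math


def estimate_project(pages, pages_per_hour, hours_per_day, cost_per_hour):
    """Rough estimate of scanning time and labour cost for a project."""
    hours = pages / pages_per_hour
    working_days = math.ceil(hours / hours_per_day)
    cost = hours * cost_per_hour
    return hours, working_days, cost


# Placeholder figures, chosen only to show the calculation
hours, days, cost = estimate_project(
    pages=50_000, pages_per_hour=200, hours_per_day=6, cost_per_hour=100.0
)
print(f"{hours:.0f} scanning hours, about {days} working days, labour cost {cost:.2f}")
```

The estimate covers scanning labour only; quality checking, metadata capture and equipment would need their own line items.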

Step 2

From the scope defined above, determine if there is capacity to store and manage the resultant archival digital objects in the long term by asking the following questions:

  1. How and where will the resultant archival digital objects be stored?
  2. Is there enough archival storage capacity for the digital objects?
  3. Who will curate the provenance of the resultant archival digital objects, in the short and long term?
  4. How will we implement a disaster recovery system for the archival digital objects in storage?
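
Question 2 above can be checked by comparing the expected volume of archival digital objects, including every backup copy, against the storage that is available. The sketch below assumes placeholder figures throughout, and the function name is hypothetical.

```python
def archive_fits(item_count, avg_item_bytes, copies, capacity_bytes):
    """Return (required_bytes, fits) for the planned archival store.

    copies counts every stored replica, including the primary one.
    """
    required = item_count * avg_item_bytes * copies
    return required, required <= capacity_bytes


# Placeholder figures: 50,000 scans of roughly 100 MB each, kept in 3 copies,
# checked against 20 TB of available storage
required, fits = archive_fits(
    item_count=50_000,
    avg_item_bytes=100 * 10**6,
    copies=3,
    capacity_bytes=20 * 10**12,
)
print(f"Required: {required / 10**12:.1f} TB, fits: {fits}")
```

Counting the copies explicitly ties the capacity question to the disaster recovery question, since each additional replica multiplies the storage needed.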

Step 3

Now that the items identified have been digitised and stored, the next step is to determine how the digitised objects are made available to users of the library by asking the following questions:

  1. Are there any intellectual property concerns regarding the digital objects?
  2. If the digital object can be made public, what digital format is appropriate for public consumption?
  3. What platform will be used for public dissemination of the digitised items?
  4. Who will submit the digital objects to the public platform?
  5. What metadata standards will be applied to the digital objects stored on the public platform?
  6. Does the public platform have enough storage capacity for the digital objects?
  7. Does the public platform have enough computing capacity to deal with the number of anticipated users who will visit and download or view the digital objects?
  8. Does the public platform implement a disaster recovery system?

Step 4

To ensure these digital objects are preserved for future users of the library, the library top management should develop a digital preservation policy and action plans.

Service Providers

Training

References

Digital Formats

Preservable Digital Objects

Digitisation Equipment and Services

Guidelines For Scanners