SUNScholar/Repository Website Metrics/Robots

<center>
'''[[SUNScholar/Repository Website Metrics|BACK TO REPOSITORY WEBSITE METRICS]]'''
</center>
==Introduction==
A proper <tt>robots.txt</tt> file is essential to ensure that web robots (Google, Bing, etc.) have the right permissions to index your repository website.

For an example, visit: '''<tt>http://scholar.sun.ac.za/robots.txt</tt>'''.
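
For background: a <tt>robots.txt</tt> file is a plain-text list of per-crawler records, where each record names a <tt>User-agent</tt> and the paths that agent should not fetch. A minimal illustration only (the full recommended file appears in Step 1 below):

<pre>
# Applies to every crawler
User-agent: *
# Well-behaved crawlers will not fetch URLs under /search
Disallow: /search
</pre>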
==Procedure==
===Step 1===
To enable a "robots.txt" file for the XMLUI, type the following as the logged-in "dspace" user:

 mkdir $HOME/{{Source}}/dspace/modules/xmlui/src/main/webapp/static
 nano $HOME/{{Source}}/dspace/modules/xmlui/src/main/webapp/static/robots.txt

See example below.
<pre>
# The FULL URL to the DSpace sitemaps
# The ${dspace.url} will be auto-filled with the value in dspace.cfg
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: ${dspace.url}/sitemap
Sitemap: ${dspace.url}/htmlmap

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
#
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register


##############################
# Section for misbehaving bots
# The following directives to block specific robots were borrowed from Wikipedia's robots.txt
##############################

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

# Misbehaving: requests much too fast:
User-agent: fast
Disallow: /

#
# If your DSpace is going down because of someone using recursive wget,
# you can activate the following rule.
#
# If your own faculty is bringing down your dspace with recursive wget,
# you can advise them to use the --wait option to set the delay between hits.
#
#User-agent: wget
#Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
</pre>
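
Note that the two <tt>Sitemap:</tt> lines at the top of the file only help if DSpace is actually generating sitemaps. A minimal sketch of scheduling this, assuming a standard DSpace install at <tt>/dspace</tt> (adjust the path for your server):

<pre>
# Regenerate the XML and HTML sitemaps once a day at 01:00.
# Add this line to the crontab of the "dspace" user (crontab -e).
0 1 * * * /dspace/bin/dspace generate-sitemaps
</pre>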

===Step 2===
Then [[SUNScholar/Rebuild_DSpace|rebuild DSpace]] to activate.
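
Once the rebuilt XMLUI has been deployed, it is worth confirming that the file is actually being served. A quick check, assuming your repository runs at <tt>http://scholar.sun.ac.za</tt> (substitute your own hostname):

<pre>
# Confirm the Disallow rules are live
curl -s http://scholar.sun.ac.za/robots.txt | grep Disallow

# Confirm the ${dspace.url} placeholders were expanded to full URLs;
# if the literal string "${dspace.url}" still appears, check dspace.cfg.
curl -s http://scholar.sun.ac.za/robots.txt | grep Sitemap
</pre>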
  
 
==References==
 
* https://github.com/DSpace/DSpace/pull/498
* https://github.com/DSpace/DSpace/blob/master/dspace-xmlui/src/main/webapp/static/robots.txt
[[Category:Operations]]
