SUNScholar/Repository Website Metrics/Robots
<center>
'''[[SUNScholar/Repository Website Metrics|BACK TO REPOSITORY WEBSITE METRICS]]'''
</center>
==Introduction==
A proper <tt>robots.txt</tt> file is essential to ensure that web robots (Google, Bing, etc.) know which parts of your repository website they may crawl and index.

For a working example, visit: '''<tt>http://scholar.sun.ac.za/robots.txt</tt>'''.
==Procedure==
===Step 1===
To enable a <tt>robots.txt</tt> file for the XMLUI, type the following:

 mkdir $HOME/{{Source}}/dspace/modules/xmlui/src/main/webapp/static

 nano $HOME/{{Source}}/dspace/modules/xmlui/src/main/webapp/static/robots.txt

See the example below.
<pre>
# The FULL URL to the DSpace sitemaps
# The ${dspace.url} will be auto-filled with the value in dspace.cfg
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: ${dspace.url}/sitemap
Sitemap: ${dspace.url}/htmlmap

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
#
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register


##############################
# Section for misbehaving bots
# The following directives to block specific robots were borrowed from Wikipedia's robots.txt
##############################

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

# Misbehaving: requests much too fast:
User-agent: fast
Disallow: /

#
# If your DSpace is going down because of someone using recursive wget,
# you can activate the following rule.
#
# If your own faculty is bringing down your DSpace with recursive wget,
# you can advise them to use the --wait option to set the delay between hits.
#
#User-agent: wget
#Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
</pre>
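Note that the file above is a source template: as its opening comment says, <tt>${dspace.url}</tt> is auto-filled from <tt>dspace.cfg</tt> when DSpace is built. Assuming, for illustration, a <tt>dspace.url</tt> of <tt>http://scholar.sun.ac.za</tt>, the deployed file would therefore begin:

<pre>
Sitemap: http://scholar.sun.ac.za/sitemap
Sitemap: http://scholar.sun.ac.za/htmlmap
</pre>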

===Step 2===
Then [[SUNScholar/Rebuild_DSpace|rebuild DSpace]] to activate the new <tt>robots.txt</tt>.
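As a rough sketch only (the linked page is authoritative; this assumes a standard DSpace source-tree build and a Tomcat 7 service, so adjust paths and service names to your installation), the rebuild usually amounts to:

 cd $HOME/{{Source}}/dspace
 mvn package
 cd target/dspace-installer
 ant update
 sudo service tomcat7 restart
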
| + | |||
| + | ==References== | ||
| + | * https://github.com/DSpace/DSpace/pull/498 | ||
| + | * https://github.com/DSpace/DSpace/blob/master/dspace-xmlui/src/main/webapp/static/robots.txt | ||
| + | [[Category:Operations]] | ||