SUNScholar/Repository Website Metrics/Robots

Introduction
A proper robots.txt file is essential to ensure that web robots (Google, Bing, etc.) have the correct permissions to index your repository website.

For an example, visit http://scholar.sun.ac.za/robots.txt.
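You can also fetch a repository's robots.txt directly from the command line; the hostname below is just the example above, so substitute your own repository URL when checking your site:

curl http://scholar.sun.ac.za/robots.txt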

Step 1
To enable a robots.txt file for the XMLUI, create a "static" directory in your XMLUI overlay and then create the robots.txt file inside it:

mkdir $HOME/dspace/modules/xmlui/src/main/webapp/static

nano $HOME/dspace/modules/xmlui/src/main/webapp/static/robots.txt

See the example below.

# The FULL URL to the DSpace sitemaps
# The ${dspace.url} will be auto-filled with the value in dspace.cfg
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: ${dspace.url}/sitemap
Sitemap: ${dspace.url}/htmlmap

# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter

# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse

# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics

# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register

# Section for misbehaving bots
# The following directives to block specific robots were borrowed from Wikipedia's robots.txt

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

# Misbehaving: requests much too fast:
User-agent: fast
Disallow: /

# If your DSpace is going down because of someone using recursive wget,
# you can activate the following rule.
# If your own faculty is bringing down your DSpace with recursive wget,
# you can advise them to use the --wait option to set the delay between hits.
# User-agent: wget
# Disallow: /

# The 'grub' distributed client has been *very* poorly behaved.
User-agent: grub-client
Disallow: /

# Doesn't follow robots.txt anyway, but...
User-agent: k2spider
Disallow: /

# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
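Once ${dspace.url} has been auto-filled with the value of dspace.url from dspace.cfg (for example http://scholar.sun.ac.za, as used by the repository above), the sitemap lines in the deployed file would read:

Sitemap: http://scholar.sun.ac.za/sitemap
Sitemap: http://scholar.sun.ac.za/htmlmap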

Step 2
Rebuild and redeploy DSpace to activate the new robots.txt.
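A typical rebuild sequence is sketched below. The directory locations and the Tomcat service name are assumptions based on a standard DSpace source layout (with the overlay from Step 1 living under $HOME/dspace), so adjust them to match your own installation:

cd $HOME/dspace                  # directory containing the "modules" overlay from Step 1 (assumed location)
mvn package                      # rebuild the web applications, including the XMLUI overlay
cd target/dspace-installer
ant update                       # update the live DSpace installation
sudo service tomcat7 restart     # restart your servlet container (service name will vary)

Once the servlet container is back up, confirm that the new robots.txt is being served at http://your.repository.url/robots.txt.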