Googlebot ignores robots.txt? #20
Open
Last line in the Apache log file:
66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
$> cat robots.txt
Agent: *
Disallow: /abide
Disallow: /abide2
Disallow: /adhd200
Disallow: /allen-brain-observatory
Disallow: /balsa
Disallow: /corr
Disallow: /crcns
Disallow: /datapackage.json
Disallow: /dbic
Disallow: /devel
Disallow: /dicoms
Disallow: /.git
Disallow: /.gitattributes
Disallow: /.gitmodules
Disallow: /hbnssi
Disallow: /index.html
Disallow: /indi
Disallow: /kaggle
Disallow: /labs
Disallow: /neurovault
Disallow: /nidm
Disallow: /openfmri
Disallow: /singularity
Disallow: /workshops
$> whois 66.249.65.212
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/resources/whois_reporting/index.html
#
# Copyright 1997-2018, American Registry for Internet Numbers, Ltd.
#
NetRange: 66.249.64.0 - 66.249.95.255
CIDR: 66.249.64.0/19
NetName: GOOGLE
NetHandle: NET-66-249-64-0-1
Parent: NET66 (NET-66-0-0-0-0)
...
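As a quick cross-check of the whois result, the client addresses from the log can be tested against the reported CIDR block. This is only a sketch: the helper name `in_google_block` is made up for illustration, and range membership alone does not prove a request came from Googlebot (Google's own guidance is a reverse DNS lookup followed by a forward lookup).

```python
import ipaddress

# CIDR block copied from the whois output above.
GOOGLE_NET = ipaddress.ip_network("66.249.64.0/19")


def in_google_block(ip: str) -> bool:
    """Return True if `ip` lies inside the whois-reported GOOGLE range.

    Hypothetical helper, not part of any existing tool.
    """
    return ipaddress.ip_address(ip) in GOOGLE_NET


# Client addresses taken from the log lines in this report.
for ip in ["66.249.65.212", "66.249.79.206", "66.249.79.204", "66.249.79.97"]:
    print(ip, in_google_block(ip))
```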
And robots.txt is being fetched by Googlebot:
$> grep robots.txt datasets.datalad.org-access-comb.log | grep Google
66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [19/Sep/2018:05:34:02 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.97 - - [19/Sep/2018:18:08:17 -0400] "GET /robots.txt HTTP/1.1" 200 4030 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [20/Sep/2018:05:36:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
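The grep above could be extended into a per-directory tally of Googlebot requests, which might make the access pattern easier to see. A minimal sketch, assuming the log uses Apache's "combined" format as in the lines quoted here; the function name `googlebot_hits_by_top_dir` and the regular expression are illustrative, not taken from any existing script.

```python
import re
from collections import Counter

# Matches the fields of an Apache "combined" log line (assumed format).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def googlebot_hits_by_top_dir(lines):
    """Count requests whose user agent mentions Googlebot, keyed by the
    first path component (e.g. "/crcns/.git/..." is counted under "crcns")."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "Googlebot" not in m.group("agent"):
            continue
        top = m.group("path").lstrip("/").split("/", 1)[0]
        counts[top] += 1
    return counts


# Two lines copied from this report, used as sample input.
SAMPLE = [
    '66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for top, n in googlebot_hits_by_top_dir(SAMPLE).most_common():
    print(n, top)
```

For the real log, the same function can be fed an open file handle instead of `SAMPLE`.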
@aqw, any idea what is going on?
- Did I perhaps misspecify robots.txt?
- The access pattern from that host is interesting in that it selectively accesses only some datasets, but maybe that is just because the crawl is farmed out to a bunch of Googlebot instances.
The overall goal is to forbid bots from crawling .git/ directories, but I have found no way to block that.
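One thing that may be worth checking: the file above opens with `Agent: *`, whereas the field name robots.txt parsers look for is `User-agent`. The sketch below feeds both variants to Python's standard `urllib.robotparser`, using only two of the Disallow lines for brevity. How Googlebot itself treats the unrecognised `Agent` line is an assumption here; the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, path):
    """Parse robots.txt lines and report whether `agent` may fetch `path`.

    Hypothetical helper wrapping the standard-library parser.
    """
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, path)


# Path from the Googlebot request quoted at the top of this report.
PATH = "/crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933"

# Shortened copy of the file as posted, and the same rules under "User-agent".
as_posted = ["Agent: *", "Disallow: /crcns", "Disallow: /.git"]
with_user_agent = ["User-agent: *", "Disallow: /crcns", "Disallow: /.git"]

print("as posted:      ", allowed(as_posted, "Googlebot", PATH))
print("with User-agent:", allowed(with_user_agent, "Googlebot", PATH))
```

With the standard-library parser, the Disallow lines under `Agent: *` are never attached to any user agent, so the path is reported as fetchable, while the `User-agent: *` variant blocks it. Note also that `Disallow: /.git` is a prefix match on the site root only; covering nested `.git/` directories would probably need a wildcard rule such as `Disallow: /*/.git/`, which Google documents support for but which `urllib.robotparser` does not interpret.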