Skip to content

Googlebot ignores robots.txt? #20

@yarikoptic

Description

@yarikoptic

last line in apache log file:

66.249.65.212 - - [20/Sep/2018:12:29:58 -0400] "GET /crcns/.git/objects/29/f8a0ae8c2ad4e7534b12f3cb68b9e8247b1933 HTTP/1.1" 200 1745 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

$> cat robots.txt 
Agent: *
Disallow: /abide
Disallow: /abide2
Disallow: /adhd200
Disallow: /allen-brain-observatory
Disallow: /balsa
Disallow: /corr
Disallow: /crcns
Disallow: /datapackage.json
Disallow: /dbic
Disallow: /devel
Disallow: /dicoms
Disallow: /.git
Disallow: /.gitattributes
Disallow: /.gitmodules
Disallow: /hbnssi
Disallow: /index.html
Disallow: /indi
Disallow: /kaggle
Disallow: /labs
Disallow: /neurovault
Disallow: /nidm
Disallow: /openfmri
Disallow: /singularity
Disallow: /workshops

$> whois 66.249.65.212

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# https://www.arin.net/resources/whois_reporting/index.html
#
# Copyright 1997-2018, American Registry for Internet Numbers, Ltd.
#


NetRange:       66.249.64.0 - 66.249.95.255
CIDR:           66.249.64.0/19
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:         NET66 (NET-66-0-0-0-0)
...

and robots.txt is accessed by google bots:

$> grep robots.txt datasets.datalad.org-access-comb.log | grep Google
66.249.79.206 - - [18/Sep/2018:05:31:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [19/Sep/2018:05:34:02 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.97 - - [19/Sep/2018:18:08:17 -0400] "GET /robots.txt HTTP/1.1" 200 4030 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.79.204 - - [20/Sep/2018:05:36:22 -0400] "GET /robots.txt HTTP/1.1" 200 538 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

@aqw - have a clue what is going on?

  • did I misspecify robots.txt may be?
  • access pattern from that host is interesting in that it is selectively accessing only some datasets, but may be it is just because it is all farmed out to a bunch of Googlebot instances

Overall goal is to forbid bots to crawl .git/ directories, but I found no way to disable that.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions