Comment on Microsoft and Reddit Are Fighting About Why Bing’s Crawler Is Blocked on Reddit
Moonrise2473@feddit.it 3 months ago
A search engine shouldn’t have to pay a website for the privilege of bringing it visits and ad views.
Fuck reddit, get delisted, no problem.
Weird that google is ignoring their robots.txt though.
Even if they pay them for being able to say that glue is perfect on pizza, having
User-agent: *
Disallow: /
should block Googlebot too. That means Google programmed an exception into Googlebot to ignore robots.txt on that domain, and that shouldn’t be done. What’s the purpose of that file then?
Because robots.txt is just honor-based anyway, it should be
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
(crawlers use the most specific matching User-agent group, so Googlebot would get full access while every other bot stays blocked)
skullgiver@popplesburger.hilciferous.nl 3 months ago
[deleted]
Zoop@beehaw.org 3 months ago
User-Agent: bender
Disallow: /my_shiny_metal_ass
Ha!
tal@lemmy.today 3 months ago
I guessed in a previous comment that, given their new partnership, Reddit is probably feeding their comment database to Google directly, which reduces load for both of them and lets Google have real-time updates of the whole kit and caboodle rather than polling individual pages.
jarfil@beehaw.org 3 months ago
Google is paying for the use of Reddit’s API, not for scraping the site.
That’s the new Reddit business model: if you want “their” (the users’) content, you pay for API access.
MrSoup@lemmy.zip 3 months ago
I doubt Google respects any robots.txt
DaGeek247@fedia.io 3 months ago
My robots.txt has been respected by every bot that visited it in the past three months. I know this because I wrote a page that IP bans anything that visits it, and I also listed it as a disallowed path in the robots.txt file.
I've only gotten like, 20 visits in the past three months though, so, very small sample size.
mozz@mbin.grits.dev 3 months ago
This is fuckin GENIUS
Moonrise2473@feddit.it 3 months ago
only if you don’t want any visits except your own, because that removes your site from every search engine
better to write a “Disallow: /juicy-content” rule and then ban anything that requests that path (only bad bots would follow it); something like the sketch below
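A minimal sketch of that trap, assuming a small Flask app; the /juicy-content path, the banned_ips set, and the handler names are made up for illustration, and it would be paired with a robots.txt that has “Disallow: /juicy-content” under “User-agent: *”:

# hypothetical honeypot sketch: only clients ignoring robots.txt ever hit /juicy-content
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # a real setup would persist this or push it to a firewall

@app.before_request
def reject_banned():
    # refuse every request from an address that already fell into the trap
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/juicy-content")
def honeypot():
    # reaching this URL means the client ignored "Disallow: /juicy-content",
    # so remember its IP and refuse this and all later requests
    banned_ips.add(request.remote_addr)
    abort(403)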
MrSoup@lemmy.zip 3 months ago
Thank you for sharing
thingsiplay@beehaw.org 3 months ago
Interesting way of testing this. Another would be to search on the search engines with
site:your.domain
added, to show results from your site only. Not an exhaustive check, but another tool to test this behavior.
Moonrise2473@feddit.it 3 months ago
for normal sites they respect it, and they even warn a webmaster who submits a sitemap containing paths that are disallowed in robots.txt