
Even though I only do it for hobby projects, crawling pages is becoming increasingly difficult unless you are a big player like Google or Microsoft with a whitelisted IP range.

I've had some success in scraping lately with a similar project called FlareSolverr(1).

Its purpose is to get you access to sites which won't let you crawl unless you are using a real browser (e.g. Amazon, Instagram). It doesn't hide your IP but uses puppeteer with stealth mode to get you access to otherwise restricted URLs.

(1) https://github.com/FlareSolverr/FlareSolverr
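FlareSolverr runs as a local proxy service and exposes a small JSON API. A minimal sketch of talking to it from Python, assuming a default instance listening on port 8191 (the endpoint and `request.get` command are per the project's README; check your version's docs for the exact fields):

```python
import json
import urllib.request

FLARESOLVERR_URL = "http://localhost:8191/v1"  # default FlareSolverr endpoint

def build_payload(url, max_timeout_ms=60000):
    """Build the JSON command FlareSolverr expects for a browser-backed GET."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}

def fetch_via_flaresolverr(url):
    """Ask the local FlareSolverr instance to fetch `url` with a real browser."""
    req = urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(build_payload(url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # The rendered page HTML lives under the "solution" key.
    return result["solution"]["response"]
```
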



For one pet project I had to crawl a rather popular site. While that worked in general I would frequently get internal server errors. Turns out that this was the response of their CDN when it detected bot-like behavior. That left me wondering how they prevent search engine crawlers from being detected as bots and getting throttled this way as well. Turns out they just check the user agent for that. As soon as I put "Googlebot" in my user agent the frequent errors vanished. So sometimes it's not about using the right IP addresses, but just the right keywords in your user agent. ;-)
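The trick above is just a header change. A minimal sketch, assuming the site only pattern-matches the user agent string (the UA below is Googlebot's published desktop string, but any string containing "Googlebot" apparently worked for that CDN):

```python
import urllib.request

# Googlebot's published user agent string; per the comment above, the
# substring "Googlebot" alone was enough to pass the CDN's check.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def googlebot_headers():
    return {"User-Agent": GOOGLEBOT_UA}

def fetch(url):
    """Fetch a page while presenting a Googlebot user agent."""
    req = urllib.request.Request(url, headers=googlebot_headers())
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```
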


This probably means incompetence on the CDN's part - Cloudflare has a detection rule for fake Googlebots that does a reverse DNS lookup to check whether the request really comes from Google. Using this trick is more likely to get your IP marked as spam, at least if you crawl CF sites with it.

https://developers.google.com/search/docs/advanced/crawling/...
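Google's verification procedure from that doc is easy to replicate: reverse-resolve the client IP, check the hostname is under googlebot.com or google.com, then forward-resolve the hostname to confirm it maps back to the same IP. A sketch using only the standard library:

```python
import socket

def looks_like_google_host(hostname):
    """Google documents that real crawler hosts live under these domains."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def is_real_googlebot(ip):
    """Reverse DNS, domain check, then forward-confirm, per Google's docs."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not looks_like_google_host(hostname):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise anyone could fake their PTR record.
    try:
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False
```
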


How does Google get around cloaking? Don't they need to visit a site from time to time without coming in as Googlebot to make sure they're getting presented the same page as Googlebot?

Or is that always done via manual review?


They might pretend to be on other networks they own when they do things like ad/policy reviews (e.g. the Google Fi or Google Fiber ASNs[0]), but I don't know of anyone confirming that this happens.

0: https://bgp.he.net/AS16591


Changing your user agent to Googlebot or whatever also lets you past quite a few paywalls. Or at least it used to. These days I don't even bother.


Doesn't Amazon have an API? Why everyone wants to scrape Amazon? Honest question.


Amazon's marketplace APIs are available to developers who register through a seller account. Pricing is $40 per month.

The main benefit of using the API is that you can request a LOT of data without hitting their rate limit. Unless you need to get dozens of results per second, you are usually better off with a spider (or just use Huginn). And if you are hit with a 503 and a captcha, gluing a free captcha solver with some middleware is a trivial task.


But other comments say scraping Amazon is kind of complicated because they ban IPs? I'm not sure: if you have a seller / affiliate account and then use your home IP to do scraping, will that impact your seller / affiliate account?


You shouldn't use the same IP to continuously scrape Amazon. Personally I use a $8/month rotating proxy service that gives me a new proxy list every hour (webshare.io if it piques your interest, I'm in no way associated with them).

Also, in my experience Amazon's "ban" comes down to solving a captcha on every request, so it's more like some mild throttling than a real ban.
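The rotating-proxy setup described above is just round-robin over whatever list the provider hands you. A minimal sketch, assuming a plain HTTP proxy list (the proxy URLs here are placeholders, not real endpoints):

```python
import itertools
import urllib.request

def make_proxy_cycle(proxies):
    """Round-robin over a proxy list, e.g. one refreshed hourly from a provider."""
    return itertools.cycle(proxies)

def fetch_with_proxy(url, proxy_cycle):
    """Fetch `url` through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url) as resp:
        return resp.read()
```
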


> free captcha solver

I haven't looked into this space, but what is a free captcha solver? And what is the drawback? Wouldn't this defeat the purpose of a captcha?


I'm using a ML-based captcha solver that is free software available on Github. So far it has solved 100% of the Amazon captchas it has encountered.

The reason Amazon has a reputation for being good at blocking IPs is that their responses are (purposefully?) obscure. In effect this filters out script kiddies and lets engineers through, who are probably a small minority of the people scraping Amazon.


I also haven't looked into this space outside of a quick DDG search after reading your comment. It looks like the big hole is that Amazon rolled their own captcha a while ago and haven't kept up with what automation can do now.


Yes, all their product data is available for affiliates: https://webservices.amazon.com/paapi5/documentation/


Also, the document says "Product Advertising API is free". I am confused.


The access is only provided to active affiliates with sales. The amount of access is also tied to your number of sales: more sales = more API access.


Thanks. This is very helpful. This information is not provided in the documentation.



Thanks. This is really annoying. Now I understand why people just want to scrape.


Amazon has one for their affiliates, you need to be approved and get 3 sales attributed to you in order to apply.


There's a bot in an IRC channel I've been on for over a decade that announces the <title> of any link mentioned in the chan. It's becoming less and less useful as it's running on someone's VPS, and a lot of sites behind Cloudflare don't yield anything as they're returning the "checking your browser" page to the bot. Then there are pages that are pure JavaScript and don't even deliver a title tag, and others try to show a GDPR banner or paywall and thus yield some generic title and not whatever article the link is supposed to show.
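The core of such a bot is trivial, which is why the blocking is the whole problem. A sketch of the title-extraction part (a regex is crude but matches what these IRC bots typically do; it will return None on the pure-JavaScript pages described above):

```python
import html
import re
import urllib.request

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(page_html):
    """Pull the <title> text out of raw HTML, or None if the page has none
    (e.g. a pure-JavaScript site that renders its title client-side)."""
    match = TITLE_RE.search(page_html)
    return html.unescape(match.group(1)).strip() if match else None

def title_of(url):
    """Fetch a page and return its title; Cloudflare interstitials will
    yield the challenge page's title instead of the real one."""
    with urllib.request.urlopen(url) as resp:
        return extract_title(resp.read().decode("utf-8", errors="replace"))
```
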


I guess it's time for a user-side script that sends the HTTP request through their daily-driver browser to see what it is, but then they're getting their home computer to visit any and every link... Maybe only when it sees Cloudflare DNS...


> puppeteer with stealth mode

I think sites like Zillow can detect after your third or fourth interaction that your actions aren't very human-like and will prompt you with a captcha.



