
Even though I only do it for hobby projects, crawling pages is becoming increasingly difficult unless you are a big player like Google or Microsoft with a whitelisted IP range.

I've had some success in scraping lately with a similar project called FlareSolverr(1).

Its purpose is to get you access to sites which won't let you crawl unless you are using a real browser (e.g. Amazon, Instagram). It doesn't hide your IP but uses puppeteer with stealth mode to get you access to otherwise restricted URLs.

(1) https://github.com/FlareSolverr/FlareSolverr
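FlareSolverr runs as a local proxy service and exposes a small JSON API. A minimal sketch of talking to it from Python, assuming a default instance listening on port 8191 (the endpoint and `request.get` command are per the project's README; check your version's docs for the exact fields):

```python
import json
import urllib.request

FLARESOLVERR_URL = "http://localhost:8191/v1"  # default FlareSolverr endpoint

def build_payload(url, max_timeout_ms=60000):
    """Build the JSON command FlareSolverr expects for a browser-backed GET."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}

def fetch_via_flaresolverr(url):
    """Ask the local FlareSolverr instance to fetch `url` with a real browser."""
    req = urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(build_payload(url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # The rendered page HTML lives under the "solution" key.
    return result["solution"]["response"]
```
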



For one pet project I had to crawl a rather popular site. While that worked in general I would frequently get internal server errors. Turns out that this was the response of their CDN when it detected bot-like behavior. That left me wondering how they prevent search engine crawlers from being detected as bots and getting throttled this way as well. Turns out they just check the user agent for that. As soon as I put "Googlebot" in my user agent the frequent errors vanished. So sometimes it's not about using the right IP addresses, but just the right keywords in your user agent. ;-)
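The trick above is just a header change. A minimal sketch, assuming the site only pattern-matches the user agent string (the UA below is Googlebot's published desktop string, but any string containing "Googlebot" apparently worked for that CDN):

```python
import urllib.request

# Googlebot's published user agent string; per the comment above, the
# substring "Googlebot" alone was enough to pass the CDN's check.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def googlebot_headers():
    return {"User-Agent": GOOGLEBOT_UA}

def fetch(url):
    """Fetch a page while presenting a Googlebot user agent."""
    req = urllib.request.Request(url, headers=googlebot_headers())
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```
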


This probably means incompetence on the CDN's part - Cloudflare has a detection rule for fake Googlebots that does a reverse DNS lookup to check whether the request really comes from Google. Using this trick is more likely to get your IP marked as spam, at least if you crawl CF sites with it.

https://developers.google.com/search/docs/advanced/crawling/...
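Google's verification procedure from that doc is easy to replicate: reverse-resolve the client IP, check the hostname is under googlebot.com or google.com, then forward-resolve the hostname to confirm it maps back to the same IP. A sketch using only the standard library:

```python
import socket

def looks_like_google_host(hostname):
    """Google documents that real crawler hosts live under these domains."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def is_real_googlebot(ip):
    """Reverse DNS, domain check, then forward-confirm, per Google's docs."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not looks_like_google_host(hostname):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise anyone could fake their PTR record.
    try:
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False
```
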


How does Google get around cloaking? Don't they need to visit a site from time to time without coming in as Googlebot to make sure they're getting presented the same page as Googlebot?

Or is that always done via manual review?


They might pretend to be on other networks they own when they do things like ad/policy reviews (e.g. the Google Fi or Google Fiber ASNs[0]), but I don't know of anyone confirming that this happens.

0: https://bgp.he.net/AS16591


Changing your user agent to Googlebot or whatever also lets you past quite a few paywalls. Or at least it used to. These days I don't even bother.


Doesn't Amazon have an API? Why everyone wants to scrape Amazon? Honest question.


Amazon's marketplace APIs are available to developers who register through a seller account. Pricing is $40 per month.

The main benefit of using the API is that you can request a LOT of data without hitting their rate limit. Unless you need to get dozens of results per second, you are usually better off with a spider (or just use Huginn). And if you are hit with a 503 and a captcha, gluing a free captcha solver with some middleware is a trivial task.


But other comments say scraping Amazon is kind of complicated because they ban IPs? I'm not sure: if you have a seller / affiliate account and then use your home IP to do scraping, will that impact your seller / affiliate account?


You shouldn't use the same IP to continuously scrape Amazon. Personally I use a $8/month rotating proxy service that gives me a new proxy list every hour (webshare.io if it piques your interest, I'm in no way associated with them).

Also, in my experience Amazon's "ban" comes down to solving a captcha on every request, so it's more like some mild throttling than a real ban.
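The rotating-proxy setup described above is just round-robin over whatever list the provider hands you. A minimal sketch, assuming a plain HTTP proxy list (the proxy URLs here are placeholders, not real endpoints):

```python
import itertools
import urllib.request

def make_proxy_cycle(proxies):
    """Round-robin over a proxy list, e.g. one refreshed hourly from a provider."""
    return itertools.cycle(proxies)

def fetch_with_proxy(url, proxy_cycle):
    """Fetch `url` through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url) as resp:
        return resp.read()
```
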


> free captcha solver

I haven't looked into this space, but what is a free captcha solver? And what is the drawback? Wouldn't this defeat the purpose of a captcha?


I'm using a ML-based captcha solver that is free software available on Github. So far it has solved 100% of the Amazon captchas it has encountered.

The reason Amazon has a reputation for being good at blocking IPs is that their responses are (purposefully?) obscure. In effect this filters out script kiddies and lets engineers through, who are probably a small minority of the people scraping Amazon.


I also haven't looked into this space outside of a quick DDG search after reading your comment. It looks like the big hole is that Amazon rolled their own captcha a while ago and haven't kept up with what automation can do now.


Yes, all their product data is available for affiliates: https://webservices.amazon.com/paapi5/documentation/


Also, the document says "Product Advertising API is free". I am confused.


The access is only provided to active affiliates with sales. The amount of access is also tied to your number of sales: more sales = more API access.


Thanks. This is very helpful. This information is not provided in the documentation.



Thanks. This is really annoying. Now I understand why people just want to scrape.


Amazon has one for their affiliates, you need to be approved and get 3 sales attributed to you in order to apply.


There's a bot in an IRC channel I've been on for over a decade that announces the <title> of any link mentioned in the chan. It's becoming less and less useful as it's running on someone's VPS, and a lot of sites behind Cloudflare don't yield anything as they're returning the "checking your browser" page to the bot. Then there are pages that are pure JavaScript and don't even deliver a title tag, and others try to show a GDPR banner or paywall and thus yield some generic title and not whatever article the link is supposed to show.
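The core of such a bot is trivial, which is why the blocking is the whole problem. A sketch of the title-extraction part (a regex is crude but matches what these IRC bots typically do; it will return None on the pure-JavaScript pages described above):

```python
import html
import re
import urllib.request

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(page_html):
    """Pull the <title> text out of raw HTML, or None if the page has none
    (e.g. a pure-JavaScript site that renders its title client-side)."""
    match = TITLE_RE.search(page_html)
    return html.unescape(match.group(1)).strip() if match else None

def title_of(url):
    """Fetch a page and return its title; Cloudflare interstitials will
    yield the challenge page's title instead of the real one."""
    with urllib.request.urlopen(url) as resp:
        return extract_title(resp.read().decode("utf-8", errors="replace"))
```
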


I guess it's time for a user-side script that sends the HTTP request through their daily-driver browser to see what it is, but then they're getting their home computer to visit any and every link... Maybe only when it sees Cloudflare DNS...


> puppeteer with stealth mode

I think sites like Zillow can detect after your third or fourth interaction that your actions aren't very human-like and will prompt you with a captcha.



