Even though I only do it for hobby projects, crawling pages is becoming increasingly difficult unless you are a big player like Google or Microsoft with a whitelisted IP range.
I've had some success in scraping lately with a similar project called FlareSolverr(1).
Its purpose is to get you access to sites that won't let you crawl unless you are using a real browser (e.g. Amazon, Instagram). It doesn't hide your IP but uses Puppeteer with stealth mode to get you access to otherwise restricted URLs.
For one pet project I had to crawl a rather popular site. While that worked in general, I would frequently get internal server errors. It turned out this was the response their CDN sent when it detected bot-like behavior. That left me wondering how they prevent search engine crawlers from being detected as bots and throttled this way as well. Turns out they just check the user agent for that. As soon as I put "Googlebot" in my user agent, the frequent errors vanished. So sometimes it's not about using the right IP addresses, but just the right keywords in your user agent. ;-)
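The trick above is just a matter of setting one request header. A minimal sketch with the standard library (the target URL is a placeholder, and the User-Agent string shown is Googlebot's publicly documented one):

```python
import urllib.request

# Googlebot's well-known User-Agent string. Some CDNs apparently only
# check this header, so sending it can bypass their bot throttling.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

def build_request(url: str) -> urllib.request.Request:
    """Build a request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})

if __name__ == "__main__":
    # Placeholder URL -- substitute the site you are crawling.
    with urllib.request.urlopen(build_request("https://example.com/")) as resp:
        print(resp.status)
```

As the reply below notes, this only works against CDNs that don't verify the claim; better-configured ones will flag the IP instead.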
This probably means incompetence on the CDN's part. Cloudflare has a detection rule for fake Googlebots, and it checks by doing a reverse DNS lookup to see whether the request really comes from Google. Pulling this trick is more likely to get your IP marked for spam, at least if you crawl CF sites with it.
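The verification Google itself recommends is exactly this: reverse-resolve the client IP, check the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch of that check:

```python
import socket

# Domains Google documents for its crawler hosts.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def host_is_google(hostname: str) -> bool:
    """Pure suffix check on the reverse-DNS name."""
    return hostname.endswith(GOOGLE_DOMAINS)

def is_real_googlebot(ip: str) -> bool:
    """Reverse lookup, suffix check, then forward-confirm the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
    except socket.herror:
        return False
    if not host_is_google(hostname):
        return False
    try:
        # The name must resolve back to the same IP, otherwise the
        # PTR record could simply be forged.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

The forward-confirmation step is the important part: anyone controlling their own reverse DNS can make an IP *claim* to be crawl-x-x-x-x.googlebot.com, but they can't make that name resolve back to their IP.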
How does Google get around cloaking? Don't they need to visit a site from time to time without coming in as Googlebot to make sure they're getting presented the same page as Googlebot?
They might pretend to be on other networks they own when they do things like ad/policy reviews (eg. the Google Fi or Google Fiber ASNs[0]), but I don't know of anyone confirming that this happens.
Amazon's marketplace APIs are available to developers who register through a seller account. Pricing is $40 per month.
The main benefit of using the API is that you can request a LOT of data without hitting their rate limit. Unless you need to get dozens of results per second, you are usually better off with a spider (or just use Huginn). And if you are hit with a 503 and a captcha, gluing a free captcha solver with some middleware is a trivial task.
But other comments say scraping Amazon is kind of complicated because they ban IPs? I'm not sure: if you have a seller / affiliate account and then use your home IP to do scraping, will that impact your seller / affiliate account?
You shouldn't use the same IP to continuously scrape Amazon. Personally I use a $8/month rotating proxy service that gives me a new proxy list every hour (webshare.io if it piques your interest, I'm in no way associated with them).
Also, in my experience Amazon's "ban" comes down to solving a captcha on every request, so it's more like some mild throttling than a real ban.
I'm using a ML-based captcha solver that is free software available on Github. So far it has solved 100% of the Amazon captchas it has encountered.
The reason Amazon has a reputation for being good at blocking IPs is that their responses are (purposefully?) obscure. In practice this filters out script kiddies and lets engineers through, who are probably a small minority of the people scraping Amazon.
I also haven't looked into this space outside of a quick DDG search after reading your comment. It looks like the big hole is that Amazon rolled their own captcha a while ago and haven't kept up with what automation can do now.
There's a bot in an IRC channel I've been on for over a decade that announces the <title> of any link mentioned in the chan. It's becoming less and less useful as it's running on someone's VPS, and a lot of sites behind Cloudflare don't yield anything, as they return the "checking your browser" page to the bot. Then there are pages that are pure JavaScript and don't even deliver a title tag, and others try to show a GDPR banner or paywall and thus yield some generic title and not whatever article the link is supposed to show.
I guess it's time for a user-side script that sends the HTTP request through their daily-driver browser to see what it is, but then they're getting their home computer to visit any and every link... Maybe only when it sees Cloudflare DNS...
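The core of such a title-announcing bot is just fetch-and-extract; the failure modes above all happen in the fetch step. A minimal sketch of the extraction side (the regex approach is a common shortcut, not how that particular bot necessarily does it):

```python
import re
import urllib.request

# Naive but serviceable: grab the first <title>...</title> pair.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html: str):
    """Pull the first <title> out of an HTML document, or None."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

def announce_title(url: str):
    """Fetch a URL and return its title, as the IRC bot would."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_title(resp.read().decode("utf-8", errors="replace"))
```

For a Cloudflare-protected site, `extract_title` would happily return "Just a moment..." from the interstitial page, which is exactly the useless output described above; the fix has to happen on the fetching side (real browser, allowlisted IP), not in the parser.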
(1) https://github.com/FlareSolverr/FlareSolverr