Don't you hate morons who abuse RSS?
More and more incompetent morons in IT are putting their websites' RSS feeds behind Cloudflare Turnstile, which giddily greets robots with a captcha.
Not once have I managed to communicate to these imbeciles the sheer degree of idiocy involved. Personally, I blame Cloudflare for handing these fucking monkeys a live grenade to play with.
What’s a nerd to do? Huginn to the rescue!
First, we need to get the RSS feed. As I mentioned, it’s behind CF. So, FlareSolverr to the rescue? Nope - it stopped working with CF Turnstile quite some time ago, and the maintainers clearly don’t give a shit, so there. But we’re on the internet, and some unsung hero found a way to fix it, so we’re in business!
So, let’s get that XML (these are Huginn Agent configs):
Post Agent: 090-get-rss:
{
  "post_url": "http://my-flaresolverr.lan/v1",
  "expected_receive_period_in_days": "1",
  "content_type": "json",
  "method": "post",
  "payload": {
    "cmd": "request.get",
    "url": "https://some-idiot-admin-hides-behind-cf.com/feed/atom/"
  },
  "headers": {},
  "emit_events": "true",
  "parse_body": "true",
  "no_merge": "true",
  "output_mode": "clean"
}
Success! body.solution.response has our XML? Or is it HTML? Crap - we have to use Chrome’s user agent to get past CF, but the framework the idiot admin uses decides that if our browser is Chrome, we must want an HTML-wrapped XML!
Let’s cut it:
Website Agent: 091-cut-xml-from-html:
{
  "expected_update_period_in_days": "1",
  "data_from_event": "{{body.solution.response}}",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "xml": {
      "xpath": "/html/body/pre/text()",
      "value": "."
    }
  }
}
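The same cut can be checked outside Huginn with the stdlib html.parser - a sketch that mirrors the /html/body/pre/text() XPath, assuming the framework's HTML wrapper really is one <pre> around the escaped feed (yours may differ):

```python
from html.parser import HTMLParser

class PreExtractor(HTMLParser):
    """Grab the raw text of <pre>, keeping entities escaped --
    the same thing the agent's /html/body/pre/text() XPath returns."""
    def __init__(self):
        # convert_charrefs=False so &lt; stays escaped, like XPath text() sees it
        super().__init__(convert_charrefs=False)
        self._in_pre = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.text.append(data)

    def handle_entityref(self, name):
        if self._in_pre:
            self.text.append(f"&{name};")

    def handle_charref(self, name):
        if self._in_pre:
            self.text.append(f"&#{name};")

def cut_xml(wrapped: str) -> str:
    """Return the escaped XML payload hiding inside the HTML wrapper."""
    p = PreExtractor()
    p.feed(wrapped)
    return "".join(p.text)
```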
The framework even escapes the XML so it’s unreadable by robots; we need to un-escape it:
Website Agent: 092-parse-xml:
{
  "expected_update_period_in_days": "1",
  "data_from_event": "{{xml | unescape}}",
  "type": "xml",
  "mode": "on_change",
  "extract": {
    "url": {
      "xpath": "/feed/entry/id/text()",
      "value": "."
    }
  }
}
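The unescape-then-parse step looks like this in stdlib Python - a sketch of what the agent does, not Huginn's own code (the {*} wildcard is there in case the feed declares the Atom namespace, which the bare /feed/entry/id XPath would otherwise miss):

```python
from html import unescape
import xml.etree.ElementTree as ET

def entry_ids(escaped_atom: str) -> list[str]:
    """Unescape the entity-encoded Atom document and pull out
    /feed/entry/id, like agent 092 does. {*} matches elements in
    any namespace (or none), so a declared Atom xmlns doesn't break it."""
    root = ET.fromstring(unescape(escaped_atom))
    return [el.text for el in root.findall("{*}entry/{*}id")]
```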
We did it! Now we have a set of events, each containing a URL. Let’s scrape the website itself - the one that hides everything so users have to click-click-click and (not) see the ads.
Post Agent: 093-scrape-pages:
{
  "post_url": "http://my-flaresolverr.lan/v1",
  "expected_receive_period_in_days": "1",
  "content_type": "json",
  "method": "post",
  "payload": {
    "cmd": "request.get",
    "url": "{{url}}"
  },
  "headers": {},
  "emit_events": "true",
  "parse_body": "true",
  "no_merge": "true",
  "output_mode": "clean"
}
Now it’s time to use XPath to extract all that we need:
Website Agent: 094-parse-scraped-pages:
{
  "expected_update_period_in_days": "1",
  "data_from_event": "{{body.solution.response}}",
  "type": "html",
  "mode": "on_change",
  "extract": {
    "url": {
      "xpath": "/html/body/div[1]/div[3]/div[1]/div[2]/h2/a/@href",
      "value": "."
    },
    "title": {
      "xpath": "/html/head/title/text()",
      "value": "."
    },
    "content": {
      "xpath": "//*[@class=\"storycontent\"]",
      "value": "."
    }
  }
}
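For testing those extractions outside Huginn, here is a stdlib sketch that pulls the page title and the inner HTML of the first class="storycontent" element. The class name (and the brittle absolute div path for the URL, which I omit) is obviously specific to this one site; the parser also assumes reasonably well-formed tags:

```python
from html.parser import HTMLParser

class PageBits(HTMLParser):
    """Pulls out the page <title> and the inner HTML of the first
    element with class="storycontent" -- the same fields agent 094
    extracts with XPath."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.story = []
        self._in_title = False
        self._depth = 0   # >0 while inside the storycontent element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
            a = "".join(f' {k}="{v}"' for k, v in attrs)
            self.story.append(f"<{tag}{a}>")
        elif ("class", "storycontent") in attrs:
            self._depth = 1          # don't emit the container tag itself
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth:          # skip the container's own close tag
                self.story.append(f"</{tag}>")
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._depth:
            self.story.append(data)
```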
And, finally, output our own RSS.
Data Output Agent: 095-output-rss:
{
  "secrets": [
    "shitty-site-scraper"
  ],
  "expected_receive_period_in_days": "1",
  "template": {
    "title": "shitsite rss with all data",
    "description": "blablabla",
    "item": {
      "title": "{{title}}",
      "description": "{{content}}",
      "link": "{{url}}",
      "guid": "{{url}}"
    }
  },
  "ns_media": "false"
}
Hopefully, someone will feel good about not having to figure all of this out from scratch.