
No longer scraping past the first page. #1

Open
sparremberger opened this issue Feb 26, 2020 · 9 comments

Comments

@sparremberger

Hello my friend. I read about your tool on Medium and I must say it's very good; I've been in love with it. However, I came across a small problem which I just couldn't solve on my own. When scraping a single hashtag, it only scrapes the first 20 tweets, apparently because it's not able to fetch the next page. I'm using it on Windows 7, with Python properly set up and all its dependencies. I suspect it might be due to some update on Twitter's end, but I'm not sure. Any help?
Thanks in advance.
(screenshot attached)

@amitupreti
Owner

amitupreti commented Feb 27, 2020 via email

@sparremberger
Author

Sure thing. I was trying to scrape the hashtag #festabbb on Twitter, which is trending in my country. It has about 18k retweets as of now. I've tried different hashtags, but it still scrapes only the first 20 tweets, i.e. the first page.

@amitupreti
Owner

I just cloned the repo and ran the crawler for festabbb. It seems to be pulling data without any issues.
It pulled 699 items (I had concurrent requests set to 5).

Did you increase the crawler speed? Please try lowering the settings.
Did you put #festabbb in the input? (Note: you don't need to use the # sign; just put festabbb.)

@sparremberger
Author

sparremberger commented Feb 27, 2020

I see. In that case, the problem must be on my end. And no, I didn't put the #, and I used all the default settings.
Are you using Linux? I'll try running Linux in a virtual machine and see if it works; Python is usually tricky on Windows. Thanks a lot!

@amitupreti
Owner

amitupreti commented Feb 27, 2020

Yes, I am on Linux, but that shouldn't be an issue.

I will test on my windows when I am home and get back to you.

Could you try again with reduced concurrency?

I have noticed that sometimes Twitter limits results for the same hashtag when I run the crawler twice.

@sparremberger
Author

Yes, I just tried lowering concurrency to 4, and after that I tried increasing the download delay from 3 to 300 (is it in milliseconds or seconds?).

By the way, I just tried setting ROBOTSTXT_OBEY to false, and surprisingly it seems to be working now! I'm not sure if I was supposed to set it to false from the beginning, but still.
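For reference, the toggles discussed in this thread are standard Scrapy settings and live in the project's settings.py. A minimal sketch (the setting names are Scrapy's own; the values here just mirror what was tried in the thread and may need tuning):

```python
# settings.py -- Scrapy settings discussed in this thread.

ROBOTSTXT_OBEY = False    # stop obeying twitter.com/robots.txt, which seemed to block pagination
DOWNLOAD_DELAY = 2        # delay between requests, in seconds (not milliseconds)
CONCURRENT_REQUESTS = 5   # how many requests Scrapy issues in parallel
```

Scrapy defaults ROBOTSTXT_OBEY to True in projects generated by `scrapy startproject`, which is why the crawler obeyed robots.txt until this was flipped.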

@amitupreti
Owner

amitupreti commented Feb 27, 2020

The download delay is in seconds. Please set it to 0 or 1 (probably 0).

Yes, setting ROBOTSTXT_OBEY to false is a great idea.
The crawler might be obeying some new rules in Twitter's robots.txt.

Please let me know the crawl result.

Thanks
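Why obeying robots.txt would stop pagination can be sketched with Python's stdlib robotparser. The rules below are hypothetical (Twitter's actual robots.txt at the time is not shown in the thread); the point is just that a crawler with ROBOTSTXT_OBEY enabled will silently skip any URL a Disallow rule covers:

```python
from urllib import robotparser

# Hypothetical robots.txt content -- Twitter's real rules at the time are unknown.
rules = """\
User-agent: *
Disallow: /search
Disallow: /hashtag/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A robots.txt-obeying crawler would refuse the hashtag/search pages
# (so pagination requests never go out), while profile pages stay allowed.
print(rp.can_fetch("*", "https://twitter.com/hashtag/festabbb"))  # False
print(rp.can_fetch("*", "https://twitter.com/amitupreti"))        # True
```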

@sparremberger
Author

Just finished crawling that hashtag once again, with a download delay of 2 and 5 concurrent requests. I'm trying to be gentle with Twitter's servers so that they won't get mad and ban my IP.

It stopped after 621 tweets. That's way better than the 20 tweets I was getting previously, but still very far from the ~20k tweets there seem to be. I think I'll just tweak the settings until I find the sweet spot, now that I know it definitely works, and I'll let you know.

@amitupreti
Owner

From what I have experienced, the most data I could pull from Twitter for a single hashtag was around 5k.

It turns out that if you browse Twitter manually and keep scrolling, it stops showing tweets after a certain number (which seems to differ by hashtag). I verified this manually.

So you could use a large number of related hashtags, or crawl repeatedly, say every few days. If the tag is popular, say it gets 1,000 tweets per day, then you can end up with 10-20k in a week.

The best settings I found are concurrency=2 and download_delay=0.
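If you crawl the same hashtag repeatedly over several days as suggested, the runs will overlap, so the results should be deduplicated before counting. A sketch, assuming each crawl was exported as a JSON-lines file where every record carries a unique `id` field (the field name is an assumption; use whatever unique key your output actually has):

```python
import json

def merge_crawls(paths):
    """Merge several crawl output files, keeping one record per tweet id."""
    seen = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                tweet = json.loads(line)
                seen[tweet["id"]] = tweet  # later crawls overwrite earlier duplicates
    return list(seen.values())
```

With daily crawls of a popular tag, most of each new file duplicates the previous one; merging by id keeps only the net-new tweets.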
