- Introduction
- Features
- Installation and Basic Usage
- Parameters
- Rate Limits and Performance
- Returned JSON Object Structure
Warning
The usage (classes and functions) has changed drastically, and so has this README.
The old docs are in ./BAScraper_old/README_old.md.
The new v0.2.x-a has only been tested to the extent that I personally use it, so full-coverage testing has not been done. It also hasn't been published to PyPI (PyPI is still on v0.1.2), so download it manually to get the newest v0.2-a. Please report any unexpected issues that may occur.
An API wrapper for PullPush.io and Arctic-Shift, the third-party replacement APIs for Reddit. Nothing special.
After the 2023 Reddit API controversy, PushShift.io (and wrappers such as PSAW and PMAW) is now only available to Reddit admins, and Reddit's PRAW is honestly useless when you need a lot of data or data from a specific timeframe. This aims to help with that, since these third-party services didn't have any official or unofficial Python wrappers.
- Asynchronous operations for better performance. (updated from the old multithreaded approach)
- Support for PullPush.io and Arctic Shift APIs.
- Parameter customization for subreddit, comment, and submission searches.
- Integrated rate-limit management.
- Parameter schemes for data selection.
Also, please respect cool-down times and refrain from requesting very large amounts of data. It stresses the servers and can cause inconvenience for everyone.
For large amounts of data, head to Arctic-Shift's Academic Torrents zst dumps.
Links to the services:
You can install the package via pip:

    pip install BAScraper

Python 3.11+ is required (`asyncio.TaskGroup` is used).
    from BAScraper.BAScraper_async import PullPushAsync, ArcticShiftAsync
    import asyncio

    ppa = PullPushAsync(log_stream_level="DEBUG", task_num=2)
    asa = ArcticShiftAsync(log_stream_level="DEBUG", task_num=10)


    async def test1():
        print('TEST 1-1 - PullPushAsync basic fetching')
        result1 = await ppa.fetch(
            mode='submissions',
            subreddit='cars',
            get_comments=True,
            after='2024-07-01',
            before='2024-07-01T06:00:00',
            file_name='test1-1'
        )
        print('test 1 len:', len(result1))

        print('\nTEST 1-2 - PullPushAsync basic comment fetching')
        result2 = await ppa.fetch(
            mode='comments',
            subreddit='cars',
            after='2024-07-01',
            before='2024-07-01T06:00:00',
            file_name='test1-2'
        )
        print('test 2 len:', len(result2))


    async def test2():
        print('TEST 2-1 - ArcticShiftAsync basic fetching')
        result1 = await asa.fetch(
            mode='submissions_search',
            subreddit='cars',
            # get_comments=True,  # uncomment to also fetch the comments
            after='2024-07-01',
            before='2024-07-05T03:00:00',
            file_name='test2-1',
            fields=['created_utc', 'title', 'url', 'id'],
            limit=0  # auto
        )
        print('test 1 len:', len(result1))

        print('\nTEST 2-2 - ArcticShiftAsync basic comment fetching')
        result2 = await asa.fetch(
            mode='comments_search',
            subreddit='cars',
            body='bmw honda benz',
            after='2024-07-01',
            before='2024-07-01T12:00:00',
            file_name='test2-2',
            limit=100,
            fields=['created_utc', 'body', 'id'],
        )
        print('test 2 len:', len(result2))

        print('\nTEST 2-3 - ArcticShiftAsync subreddits_search')
        result3 = await asa.fetch(
            mode='subreddits_search',
            subreddit_prefix='what',
            file_name='test2-3',
            limit=1000
        )
        print('test 3 len:', len(result3))


    if __name__ == '__main__':
        if input('test pullpush?: ') == 'y':
            asyncio.run(test1())

        if input('test arcticshift?: ') == 'y':
            asyncio.run(test2())

    # each result is saved to a file named after its `file_name` value, since `file_name` was specified.
    # the files are saved in the current directory since `save_dir` wasn't specified.

Note
When making multiple requests (as in calling multiple functions under PullPushAsync),
it is highly recommended to call all the functions under the same class,
because all the request-pool variables are shared in that case.
Also, the pools that record request status are reset every time a script is re-run. So keep in mind that unexpected soft/hard rate limits may occur when scripts are (re-)run frequently. Consider waiting a few seconds or minutes before running a script again if needed.
For more info on each of the parameters as well as additional info (TOS, extra tools, etc) visit the following links:
For `PullPushAsync.__init__` & `ArcticShiftAsync.__init__`:

| Parameter | Type | Restrictions | Required | Default Value | Notes |
|---|---|---|---|---|---|
| `sleep_sec` | `int` | Positive int | No | `1` | Cooldown time between each request. |
| `backoff_sec` | `int` | Positive int | No | `3` | Backoff time for each failed request. |
| `max_retries` | `int` | Positive int | No | `5` | Number of retries for failed requests before it gives up. |
| `timeout` | `int` | Positive int | No | `10` | Time until a request is considered a timeout error. |
| `pace_mode` | `str` | One of `'auto-soft'`, `'auto-hard'`, `'manual'` | No | `'auto-hard'` | Sets the pace to mitigate rate-limiting. |
| `save_dir` | `str` | Valid path | No | `os.getcwd()` (current directory) | Directory to save the results. |
| `task_num` | `int` | Positive int | No | `3` | Number of async tasks to be made. |
| `log_stream_level` | `str` | One of `['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']` | No | `'INFO'` | Sets the log level for logs streamed on the terminal. |
| `log_level` | `str` | Same as `log_stream_level` | No | `'DEBUG'` | Sets the log level for logging (file). |
| `duplicate_action` | `str` | One of `'keep_newest'`, `'keep_oldest'`, `'remove'`, `'keep_original'`, `'keep_removed'` | No | `'keep_newest'` | Decides handling of duplicates. |
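To make the roles of `max_retries` and `backoff_sec` more concrete, here is a minimal sketch of a retry loop. It is not BAScraper's actual implementation: `fetch_with_retry` and `flaky_request` are hypothetical names, and the linear backoff (wait grows with each attempt) is an assumption made only for illustration.

```python
import asyncio


async def fetch_with_retry(request, max_retries=5, backoff_sec=3):
    """Call `request()` until it succeeds or `max_retries` retries are used up.

    Hypothetical helper: the linear backoff (backoff_sec * attempt number)
    is an assumption, not necessarily what BAScraper does internally.
    """
    for attempt in range(max_retries + 1):
        try:
            return await request()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries, give up
            await asyncio.sleep(backoff_sec * (attempt + 1))


calls = {"count": 0}


async def flaky_request():
    # Stand-in for a real HTTP call: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return {"status": "ok"}


# backoff_sec=0 keeps the demo instant; a real run would use a positive value.
result = asyncio.run(fetch_with_retry(flaky_request, max_retries=5, backoff_sec=0))
print(result, "after", calls["count"], "calls")
```

With the defaults in the table above (`max_retries=5`, `backoff_sec=3`), a loop like this would give up after the sixth failed attempt.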
| Parameter | Type | Restrictions | Required | Notes |
|---|---|---|---|---|
| `q` | `str` | Quoted string for phrases | No | Search query for comments or submissions. |
| `ids` | `list` | Maximum length: 100 | No | List of IDs to fetch. |
| `size` | `int` | Must be <= 100 | No | Number of results to return. |
| `sort` | `str` | Must be one of `"asc"`, `"desc"` | No | Sorting order. |
| `sort_type` | `str` | Must be one of `"score"`, `"num_comments"`, `"created_utc"` | No | Sorting criteria. |
| `author` | `str` | None | No | Filter by author. |
| `subreddit` | `str` | None | No | Filter by subreddit. |
| `after` | `str` | Must be in ISO8601 format, converted to epoch | No | Include results after this date. |
| `before` | `str` | Must be in ISO8601 format, converted to epoch | No | Include results before this date. |

- `'comments'`:

  | Parameter | Type | Restrictions | Required | Notes |
  |---|---|---|---|---|
  | `link_id` | `str` | None | No | Fetch comments under a specific post. |

- `'submissions'`:

  | Parameter | Type | Restrictions | Required | Notes |
  |---|---|---|---|---|
  | `title` | `str` | None | No | Search query for titles. |
  | `selftext` | `str` | None | No | Search query for selftext fields. |
  | `score` | `str` | Must satisfy a comparison operation (`>`, `<`, `=`) | No | Filter by score. |
  | `num_comments` | `str` | Must satisfy a comparison operation (`>`, `<`, `=`) | No | Filter by number of comments. |
  | `over_18` | `bool` | `True` or `False` | No | Include or exclude NSFW content. |
  | `is_video` | `bool` | `True` or `False` | No | Include or exclude video submissions. |
  | `locked` | `bool` | `True` or `False` | No | Include or exclude locked submissions. |
  | `stickied` | `bool` | `True` or `False` | No | Include or exclude stickied submissions. |
  | `spoiler` | `bool` | `True` or `False` | No | Include or exclude spoiler-tagged submissions. |
  | `contest_mode` | `bool` | `True` or `False` | No | Include or exclude contest-mode submissions. |
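The `after` and `before` parameters take ISO8601 strings that are converted to epoch seconds. As a rough illustration of what such a conversion looks like, here is a sketch using only the standard library. `iso_to_epoch` is a hypothetical helper, and treating timestamps without a timezone as UTC is an assumption; BAScraper's own conversion may differ.

```python
from datetime import datetime, timezone


def iso_to_epoch(value: str) -> int:
    """Convert an ISO 8601 date or datetime string to a Unix epoch (seconds).

    Assumption: timestamps without timezone info are treated as UTC.
    """
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(iso_to_epoch("2024-07-01"))           # 1719792000
print(iso_to_epoch("2024-07-01T06:00:00"))  # 1719813600
```

The two values correspond to the `after` and `before` strings used in the PullPush example, six hours (21600 seconds) apart.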
| Parameter | Type | Restrictions | Required | Notes |
|---|---|---|---|---|
| `mode` | `str` | Varies based on endpoint | Yes | Specifies the type of data to fetch. Options include `submissions_id_lookup`, `comments_id_lookup`, etc. |
| `get_comments` | `bool` | `True` or `False` | No | If `True`, fetch comments associated with submissions. |
| `file_name` | `str` | Valid filename or `None` | No | Saves the results to a specified file. |
| `**params` | `dict` | See mode-specific parameters. | Yes | Additional parameters dependent on the mode. |
- ID Lookup
  - Modes: `submissions_id_lookup`, `comments_id_lookup`, `subreddits_id_lookup`, `users_id_lookup`.
  - Parameters:

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `ids` | `list` | Maximum length: 500 | Yes | List of IDs to fetch. |
    | `md2html` | `bool` | `True` or `False` | No | If `True`, converts markdown to HTML. |
    | `fields` | `list` | Valid field names for entity | No | Specific fields to include in results. |

- Search
  - Modes: `submissions_search`, `comments_search`.
    - Common Parameters

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `author` | `str` | None | No | Filter results by author. |
      | `subreddit` | `str` | None | No | Filter results by subreddit. |
      | `author_flair_text` | `str` | None | No | Filter by author's flair text. |
      | `after` | `str` | ISO8601 | No | Include results after this date. |
      | `before` | `str` | ISO8601 | No | Include results before this date. |
      | `limit` | `int` | 0 <= x <= 100; 0 -> auto | No | Number of results per request; `0` automatically adjusts based on server capacity. |
      | `sort` | `str` | Must be `'asc'` or `'desc'` | No | Order results by the specified criteria. |
      | `md2html` | `bool` | `True` or `False` | No | If `True`, converts markdown to HTML. |

    - Submissions

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `author` | `str` | None | No | Filter by author. |
      | `subreddit` | `str` | None | No | Filter by subreddit. |
      | `query` | `str` | None | No | Search query (title and selftext). |
      | `limit` | `int` | 1-100 | No | Number of results per request. |
      | `sort` | `str` | `'asc'` or `'desc'` | No | Sorting order. |

    - Comments

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `body` | `str` | None | No | Filter by comment body text. |
      | `link_id` | `str` | None | No | Fetch comments under a specific post. |
      | `author` | `str` | None | No | Filter by comment author. |

  - `subreddits_search`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `subreddit` | `str` | None | No | Filter results by a specific subreddit. |
    | `subreddit_prefix` | `str` | None | No | Search for subreddits starting with a prefix. |
    | `after` | `str` | Must be in ISO8601 format | No | Include results after this date. |
    | `before` | `str` | Must be in ISO8601 format | No | Include results before this date. |
    | `min_subscribers` | `int` | Must be >= 0 | No | Minimum number of subscribers. |
    | `max_subscribers` | `int` | Must be >= 0 | No | Maximum number of subscribers. |
    | `over18` | `bool` | `True` or `False` | No | Include or exclude NSFW subreddits. |
    | `limit` | `int` | 1 <= x <= 1000 | No | Limit the number of results returned. |
    | `sort` | `str` | Must be `'asc'` or `'desc'` | No | Sort results in ascending or descending order. |
    | `sort_type` | `str` | Must be one of `'created_utc'`, `'subscribers'`, `'subreddit'` | No | Sorting criteria. |

  - `users_search`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `author` | `str` | None | No | Filter results by a specific user. |
    | `author_prefix` | `str` | None | No | Search for users starting with a prefix. |
    | `min_num_posts` | `int` | Must be >= 0 | No | Minimum number of posts by the user. |
    | `min_num_comments` | `int` | Must be >= 0 | No | Minimum number of comments by the user. |
    | `active_since` | `str` | Must be in ISO8601 format | No | Include users active since this date. |
    | `min_karma` | `int` | Must be >= 0 | No | Minimum karma required for users. |
    | `limit` | `int` | 1 <= x <= 1000 | No | Limit the number of results returned. |
    | `sort` | `str` | Must be `'asc'` or `'desc'` | No | Sort results in ascending or descending order. |
    | `sort_type` | `str` | Must be one of `'author'`, `'total_karma'` | No | Sorting criteria for users. |

  - `comments_tree_search`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `link_id` | `str` | None | Yes | Fetch comments under the specified post ID. |
    | `parent_id` | `str` | None | No | Fetch replies under a specific parent comment. |
    | `limit` | `int` | Must be >= 1 | No | Maximum number of comments to return. |
    | `start_breadth` | `int` | Must be >= 0 | No | Threshold for collapsing comments based on breadth. |
    | `start_depth` | `int` | Must be >= 0 | No | Threshold for collapsing comments based on depth. |
    | `md2html` | `bool` | `True` or `False` | No | If `True`, converts markdown to HTML in the results. |
    | `fields` | `list` | Valid field names for submissions | No | Include specific fields in the returned data. |

- Aggregations
  - Modes: `submissions_aggregation`, `comments_aggregation`.
    - Common Parameters

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `aggregate` | `str` | Must be one of `'created_utc'`, `'author'`, `'subreddit'` | Yes | Specifies the field to group by. |
      | `frequency` | `str` | None | No | Required only when `aggregate` is `'created_utc'`. Defines time intervals for aggregation. |
      | `limit` | `int` | Must be >= 1 | No | Limits the number of grouped results returned. |
      | `min_count` | `int` | Must be >= 0 | No | Minimum number of entries in a group; not applicable when `aggregate` is `'created_utc'`. |
      | `sort` | `str` | Must be `'asc'` or `'desc'` | No | Sorts the aggregation results. |

    - Submissions

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `crosspost_parent_id` | `str` | None | No | Filters by crosspost parent ID. |
      | `over_18` | `bool` | `True` or `False` | No | Includes or excludes NSFW content. |
      | `spoiler` | `bool` | `True` or `False` | No | Includes or excludes spoiler-tagged submissions. |
      | `title` | `str` | None | No | Filters by title. |
      | `selftext` | `str` | None | No | Filters by selftext field. |
      | `link_flair_text` | `str` | None | No | Filters by link flair text. |
      | `query` | `str` | None | No | Searches across title and selftext. |
      | `url` | `str` | None | No | Filters by URL prefix (e.g., YouTube). |
      | `url_exact` | `bool` | `True` or `False` | No | If `True`, requires exact URL match. |
      | `fields` | `list` | Valid field names for submissions | No | Filters the fields included in results. |

    - Comments

      | Parameter | Type | Restrictions | Required | Notes |
      |---|---|---|---|---|
      | `body` | `str` | None | No | Filters by comment body text. |
      | `link_id` | `str` | None | No | Filters by submission ID. |
      | `parent_id` | `str` | None | No | Filters by parent comment ID. |
      | `fields` | `list` | Valid field names for comments | No | Filters the fields included in results. |

  - `user_flairs_aggregation`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `author` | `str` | None | Yes | Specifies the user for whom to aggregate flairs across subreddits. |

- Interactions
  - `user_interactions`, `list_users_interactions`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `author` | `str` | None | Yes | Specifies the user for whom interactions are queried. |
    | `subreddit` | `str` | None | No | Filter interactions by a specific subreddit. |
    | `after` | `str` | ISO8601 | No | Include interactions after this date. |
    | `before` | `str` | ISO8601 | No | Include interactions before this date. |
    | `min_count` | `int` | Must be >= 0 | No | Minimum number of interactions to include. |
    | `limit` | `int` | Must be >= 1 | No | Maximum number of results to return. |

  - `subreddits_interactions`

    | Parameter | Type | Restrictions | Required | Notes |
    |---|---|---|---|---|
    | `author` | `str` | None | Yes | Specifies the user for whom interactions are queried. |
    | `weight_posts` | `float` | Must be >= 0 | No | Weight assigned to posts in interaction calculations. |
    | `weight_comments` | `float` | Must be >= 0 | No | Weight assigned to comments in interaction calculations. |
    | `before` | `str` | ISO8601 | No | Include interactions before this date. |
    | `after` | `str` | ISO8601 | No | Include interactions after this date. |
    | `min_count` | `int` | Must be >= 0 | No | Minimum number of interactions to include. |
    | `limit` | `int` | Must be >= 1 | No | Maximum number of results to return. |

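The ID lookup modes above accept at most 500 IDs per call. If you have more than that, one option is to split the list into batches before passing each batch as `ids`. The sketch below assumes you do the batching yourself (the source does not say whether BAScraper splits long lists automatically); `chunk_ids` is a hypothetical helper and the IDs are made-up placeholders.

```python
def chunk_ids(ids, max_len=500):
    """Split a list of IDs into consecutive batches of at most `max_len` items."""
    if max_len < 1:
        raise ValueError("max_len must be >= 1")
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]


# Made-up placeholder IDs, only for illustration.
all_ids = [f"id{n}" for n in range(1200)]
batches = chunk_ids(all_ids)
print([len(batch) for batch in batches])  # [500, 500, 200]
```

The same helper with `max_len=100` would fit the PullPush `ids` restriction listed earlier.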
Both services are free and don't have any SLO or SLA; they are provided "as good as it is now".
The PullPush API implemented rate limiting as of Feb. 2024.
The soft limit kicks in after 15 req/min and the hard limit after 30 req/min. There's also a long-term (hard) limit of 1000 req/hr.
Recommended request pacing:
- to prevent soft-limit: 4 sec sleep per request
- to prevent hard-limit: 2 sec sleep per request
- for 1000+ requests: 3.6 ~ 4 sec sleep per request
Due to lowered performance recently, using a single worker (`task_num=1`) is recommended unless it's for a short burst.
PullPush's rate limiting automatically paces the response time of your requests to keep them within the hard limits.
`pace_mode` still throttles just in case, and following the pacing times above is recommended.
Warning
The long-term hard rate limit of 1000 req/hr is not handled by the auto throttling.
You should manually set the sleep time to 3.6 ~ 4 sec, as mentioned above, using the `sleep_sec` param of `PullPushAsync.__init__`.
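The recommended pacing values follow directly from the limits quoted above: dividing the length of the window by the number of allowed requests gives the minimum gap between requests. A small sketch of that arithmetic, assuming requests are sent one at a time by a single worker (`min_sleep_sec` is a hypothetical helper name):

```python
def min_sleep_sec(max_requests: int, window_sec: int) -> float:
    """Smallest gap between sequential requests that stays within the limit."""
    return window_sec / max_requests


soft = min_sleep_sec(15, 60)            # soft limit: 15 req/min
hard = min_sleep_sec(30, 60)            # hard limit: 30 req/min
long_term = min_sleep_sec(1000, 3600)   # long-term limit: 1000 req/hr

print(soft, hard, long_term)  # 4.0 2.0 3.6
```

With more than one worker the per-worker gap would need to grow proportionally, which is one reason a single worker is the simpler choice for long runs.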
Arctic-Shift dynamically returns x-ratelimit related headers in the response data.
BAScraper will read these and throttle if needed.
As of writing, it has better performance and more generous rate limits than PullPush:
- The hard limit is usually 2000 requests per minute, though it may vary.
- If `auto` is used for the `limit`, it may return more than 100 results per request (though that also varies).
- Up to 10 ~ 20 workers (ex: `task_num=10`) still seem to hold up well.
- Response time is usually around a second, with complex or large queries taking over 5 seconds.
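As a rough picture of header-based throttling, the sketch below decides how long to wait from a response's rate-limit headers. The exact header names are assumptions (the source only says "x-ratelimit related headers"), `wait_from_headers` is a hypothetical helper, and BAScraper's real logic may differ.

```python
def wait_from_headers(headers: dict) -> float:
    """Return how many seconds to wait before sending the next request.

    Assumed header names (not confirmed by the source):
      x-ratelimit-remaining: requests left in the current window
      x-ratelimit-reset:     seconds until the window resets
    """
    remaining = int(headers.get("x-ratelimit-remaining", 1))
    reset_sec = float(headers.get("x-ratelimit-reset", 0))
    if remaining > 0:
        return 0.0  # budget left, no need to wait
    return max(reset_sec, 0.0)


# Plenty of budget left: no wait.
print(wait_from_headers({"x-ratelimit-remaining": "120", "x-ratelimit-reset": "30"}))  # 0.0
# Budget exhausted: wait until the window resets.
print(wait_from_headers({"x-ratelimit-remaining": "0", "x-ratelimit-reset": "12"}))    # 12.0
```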
While Arctic-Shift performs better for simple queries, PullPush performs better for complex ones. Also, PullPush can do a Reddit-wide full-text search (FTS), but Arctic-Shift can only do FTS within a user or a certain subreddit. So choose the service depending on your needs.
The `fetch` function returns a dict object keyed by each entry's unique Reddit submission/comment/user/subreddit ID.
It is sorted in the order you specified when scraping
(the `sort` parameter).
So the general structure looks like this (regardless of whether it is a submission or a comment):
    {
        "21jh54": {
            "approved_at_utc": null,
            "subreddit": "Cars",
            "selftext": "",
            "author_fullname": "t2_culcgvve",
            "saved": false,
            "mod_reason_title": null,
            "gilded": 0,
            "clicked": false,
            "title": "something something",
            ...
        },
        "54jp5i": {
            "approved_at_utc": null,
            "subreddit": "Cars",
            "selftext": "",
            "author_fullname": "t2_kdbbiwo",
            "saved": false,
            "mod_reason_title": null,
            "gilded": 0,
            "clicked": false,
            "title": "something something",
            ...
        },
        ...
    }

If the `get_comments` parameter is set to `True`, the returned result looks like this (for submissions):

    {
        "21jh54": {
            "approved_at_utc": null,
            "subreddit": "Cars",
            "selftext": "",
            "author_fullname": "t2_culcgvve",
            "saved": false,
            "mod_reason_title": null,
            "gilded": 0,
            "clicked": false,
            "title": "something something",
            "comments": [
                {
                    ...
                    info related to comments
                    ...
                },
                ...
            ],
            ...
        },
        ...
    }
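
Since the result is a plain dict keyed by ID, it can be processed with ordinary dict operations. The sketch below works on a small hand-written sample shaped like the structures above (not real data); the comment fields `id` and `body` are placeholders chosen for illustration.

```python
# Hand-written sample shaped like the structure above (not real data).
result = {
    "21jh54": {
        "subreddit": "Cars",
        "title": "something something",
        "comments": [{"id": "c1", "body": "first"}, {"id": "c2", "body": "second"}],
    },
    "54jp5i": {"subreddit": "Cars", "title": "something else", "comments": []},
}

# Keys are the Reddit IDs; values are the raw entries.
titles = {post_id: post["title"] for post_id, post in result.items()}

# Flatten every comment into one list, remembering its parent submission.
comments = [
    {"submission_id": post_id, **comment}
    for post_id, post in result.items()
    for comment in post.get("comments", [])
]

print(titles)
print(len(comments))  # 2
```

Using `post.get("comments", [])` keeps the loop working for results fetched without `get_comments`, where the `comments` key is absent.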