Discourse #45

Open
wants to merge 27 commits into
base: working
27 commits
1309c5e
Initial commit of Discourse crawler and sample data
jbaicoianu Oct 4, 2022
6a76f5b
Removed debug prints
jbaicoianu Oct 4, 2022
79feea3
Updated crawl site list
jbaicoianu Oct 5, 2022
229f19c
- Use target_dir for crawl output
jbaicoianu Oct 5, 2022
ea40840
Added "summary" and "topics" crawl types
jbaicoianu Oct 5, 2022
c5a247c
Crawl list cleanup
jbaicoianu Oct 10, 2022
b9f6b8a
Profiler support, set retry option based on crawl type
jbaicoianu Oct 10, 2022
3aab776
Improved crawl data structure, initial support for writing direct to …
jbaicoianu Oct 10, 2022
19dc7d0
Improved caching to make resuming a crawl more reliable and faster, i…
jbaicoianu Oct 19, 2022
959229c
Tweaked spider defaults, add quick summarize step for stats
jbaicoianu Oct 19, 2022
5f098ed
Fix for problematic URL structures, improved category caching, first-…
jbaicoianu Nov 2, 2022
ddecfc5
Reverted change to main codepile.py file
jbaicoianu Nov 4, 2022
3c6dd02
Whitespace
jbaicoianu Nov 4, 2022
9d414fa
First pass at discourse preprocessor to convert raw json into lm_data…
jbaicoianu Nov 8, 2022
dc026ce
Added processor support, added command line options
jbaicoianu Nov 8, 2022
6072e64
Updated dependencies
jbaicoianu Nov 8, 2022
5ef703a
Big improvements to pipeline flow, with examples
jbaicoianu Nov 9, 2022
73083fe
Improved usage examples
jbaicoianu Nov 9, 2022
50c77c1
Use temp sitepath for processing
jbaicoianu Nov 9, 2022
c123b55
Improved topic processor output
jbaicoianu Nov 11, 2022
aeca218
Improved subdir handling during processing
jbaicoianu Nov 12, 2022
344d5c5
Improved path handling, log missing posts and exceptions
jbaicoianu Nov 12, 2022
10de5f1
Handle sites with missing topic slugs
jbaicoianu Nov 12, 2022
419e48a
Improved metadata, support for paginated topics
jbaicoianu Nov 18, 2022
1dbbca3
Disable telnet server
jbaicoianu Dec 16, 2022
438e11d
Pass site parameter to get_additional_posts function
jbaicoianu Dec 16, 2022
145e77e
Handle licenses, use jsonl files instead of loose files
jbaicoianu Feb 17, 2023
Empty file modified codepile/codepile.py
100644 → 100755