Crawler support #138
Merged
Conversation
…d state management
Task/selfhosted fixes
* Refactor settings retrieval in requester and views using get_default_settings utility (see the sketch after this list)
* Fix my_gurus view 404 handling for empty guru queries
* Fix widget id interactivity on selfhosted + move its error messages to toasts
* Add missing file to previous commit
* Remove unused imports
* Refactor navigation items into a centralized configuration
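The first item above mentions a get_default_settings utility used to centralize settings retrieval. A minimal sketch of what such a helper could look like; the Settings model, its import path, and the single-row lookup are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch of a settings-retrieval helper; the model name and
# import path are assumptions, not taken from the repository.
from core.models import Settings  # assumed location of a Settings model


def get_default_settings():
    """Return the shared settings row, creating it with defaults if missing."""
    settings_obj, _created = Settings.objects.get_or_create(id=1)
    return settings_obj
```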
* Update changelog
* Update docker-compose
  - New version tag
  - discordListener service
* Update self-hosted installation sh & installation readme
  - Docker & docker compose version checks
* Fix indentations
* Clear cache before the build on docker buildx
- Update install.md to reflect the changes in v0.2.0
* Add basic support for crawling and rotating proxies
* Fix backend crawl auth
* Add crawling support for frontend
* Add crawl reset only for the crawled urls
* Enhance MonacoUrlEditor with improved crawl state management
  - Add state variables for crawl processing, starting, and stopping
  - Implement dynamic button content based on crawl state
  - Make editor read-only during crawling
  - Add error handling for crawl start and stop actions
* Fix crawler finishing case
* Improve crawler status handling and error reporting
  - Add handling for "COMPLETED", "STOPPED", and "FAILED" crawl statuses
  - Enhance toast notifications for different crawl outcomes
  - Clear polling interval on final crawl states
  - Improve user feedback with more descriptive status messages
* Refine internal link spider crawling logic
  - Add extensive proxy list for random selection
  - Remove unnecessary logging and sleep calls
  - Improve link limit handling with explicit status update
  - Optimize crawl state management when link limit is reached
* Refactor internal link spider link limit handling (see the sketch after this list)
  - Move link limit check before logging to prevent unnecessary processing
  - Explicitly set crawl state to FAILED when link limit is exceeded
  - Add detailed error message for link limit scenario
  - Simplify link limit handling logic
* Fix crawler error message handling
  - Update error message key from 'error' to 'error_message'
  - Ensure correct error reporting in crawler toast notifications
* Remove unused import
* Simplify useCrawler hook by removing unused parameters
  - Remove unnecessary state management parameters from useCrawler hook
  - Clean up unused dependencies in NewGuru component
  - Streamline crawler hook interface
* Add proxy management and synchronization features
  - Implement Proxy model for storing and managing proxy servers
  - Create WebshareRequester to interact with Webshare proxy API
  - Add tasks for syncing and checking proxy health
  - Implement proxy synchronization and validation methods
  - Add admin interface for Proxy model
  - Include periodic tasks for proxy management
* Fix "rendered more hooks than the previous render" issue on reload
* Remove unnecessary toasts
* Shorten "Start Crawling" and "Stop Crawling" button labels to "Crawl" and "Stop"
* Remove Proxy model and related functionality
  - Delete Proxy model from models.py
  - Remove ProxyAdmin from admin.py
  - Simplify proxy-related methods in proxy.py
  - Update data_sources.py to use new proxy retrieval method
  - Remove commented-out proxy management code
  - Update WebshareRequester to return only valid proxies
* Fix crawler state management in useCrawler hook: ensure crawler state is reset when an error occurs during crawling
* Fix crawl ui on mobile
* Revert "Crawl" button label to "Start Crawling"
* Enhance tooltip positioning with side prop: add support for left and right tooltip positioning in TooltipContent component
* Add crawl stop confirmation dialog to SourceDialog: when the dialog is closed during an active crawl, let the user either continue or stop the crawling process
* Add auto-scroll and editor ref for Monaco URL editor during crawling: automatically scroll the MonacoUrlEditor to the latest line when new URLs are added, using an editor ref for precise line scrolling
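Several of the commits above concern enforcing a link limit in the internal link spider and explicitly marking the crawl FAILED when it is exceeded. The following is a hedged Scrapy sketch of that logic only; the CrawlState model, its import path, the field names, and the status strings are assumptions inferred from the commit descriptions, not the PR's actual code:

```python
# Hedged sketch of link-limit handling in an internal link spider;
# model names, import paths, and fields are assumptions.
import scrapy
from scrapy.exceptions import CloseSpider

from core.models import CrawlState  # assumed import path


class InternalLinkSpider(scrapy.Spider):
    name = "internal_links"

    def __init__(self, start_url, crawl_state_id, link_limit=1500, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.crawl_state = CrawlState.objects.get(id=crawl_state_id)
        self.link_limit = link_limit
        self.discovered = set()  # deduplicate URLs found so far

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url in self.discovered:
                continue
            # Check the limit before any further processing and fail the
            # crawl explicitly with a descriptive error message.
            if len(self.discovered) >= self.link_limit:
                self.crawl_state.status = "FAILED"
                self.crawl_state.error_message = (
                    f"Link limit of {self.link_limit} exceeded"
                )
                self.crawl_state.save()
                raise CloseSpider("link_limit_exceeded")
            self.discovered.add(url)
            yield {"url": url}
            yield response.follow(url, callback=self.parse)
```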
* Improve URL discovery and deduplication in useCrawler hook
  - Prevent duplicate URL additions during discovery
  - Track discovered URLs in a Set to ensure unique URL collection
* Refactor web crawling mechanism using multiprocessing (see the sketch after this list)
  - Run the internal link spider in a separate process instead of via Crochet
  - Simplified crawl state management
  - Improved error handling
  - Removed complex deferred callback logic
* Remove unused proxy tasks
* Extract CrawlStopConfirmationDialog to a separate component
  - Move the CrawlStopConfirmationDialog out of SourceDialog into its own component file
  - Support different stop actions (close or stop crawling) with improved state management and flexibility
* Prevent duplicate URL additions in NewGuru component
  - Filter out duplicate URLs before adding them to the editor so only unique URLs are appended to the existing list
* Add environment-specific proxy handling for InternalLinkSpider
  - Conditionally use proxies based on the environment setting
  - Disable proxy usage and relax link limit constraints for self-hosted environments
* Conditionally apply proxy settings in InternalLinkSpider
  - Use environment-specific configuration; self-hosted environments skip proxies while keeping a consistent download timeout
* Remove console logs
* Remove log
* Add a crawling delay for selfhosted
* Add guru-specific crawling with user and guru type tracking
  - Update URLs to include guru slug for crawl endpoints
  - Add guru_type and user fields to CrawlState model
  - Modify crawl views to validate and associate crawls with specific gurus
  - Update serializers and frontend actions to support guru-specific crawling
  - Add corresponding database migrations for new model fields
* Add error handling and state management for InternalLinkSpider initialization
  - Wrap initialization in try-except block
  - Log detailed error information
  - Update CrawlState status to FAILED on initialization errors
  - Set error message and end time for failed crawl states
* Reduce crawling delay for selfhosted environment: decrease the sleep time between link crawls from 0.2 to 0.1 seconds
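The multiprocessing refactor above (replacing Crochet) can be sketched roughly as follows; the function names, settings, and the way the spider is parameterized are assumptions meant only to illustrate the "one process per crawl" approach:

```python
# Hedged sketch of running a Scrapy crawl in a separate process instead of
# via Crochet; names and parameters are illustrative assumptions.
import multiprocessing

from scrapy.crawler import CrawlerProcess

from core.crawler import InternalLinkSpider  # assumed path (see sketch above)


def _run_spider(start_url, crawl_state_id, link_limit):
    """Runs in the child process, which owns its own Twisted reactor."""
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(
        InternalLinkSpider,
        start_url=start_url,
        # Pass the id rather than the model instance so the child process
        # opens its own database connection.
        crawl_state_id=crawl_state_id,
        link_limit=link_limit,
    )
    process.start()  # blocks until the crawl finishes


def start_crawl_in_process(start_url, crawl_state_id, link_limit=1500):
    """Launch the crawl without blocking the caller or sharing a reactor."""
    proc = multiprocessing.Process(
        target=_run_spider,
        args=(start_url, crawl_state_id, link_limit),
        daemon=True,
    )
    proc.start()
    return proc
```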
* Support selfhosted crawling by allowing null user in CrawlState
  - Add null=True and blank=True to CrawlState user field
  - Update __str__ method to handle cases without a user
  - Modify start_crawl view to set user to None for unauthenticated requests
  - Add corresponding database migration
* Delete proxy code
* Add API endpoints for crawl management with CrawlService
  - Refactor crawl-related views to use a centralized CrawlService
  - Create separate API and admin endpoints for starting, stopping, and checking crawl status
  - Implement consistent validation and error handling across different authentication methods
  - Update URL routing to support both API key and JWT authentication for crawl operations
* Set default link_limit in start_crawl method (see the sketch after this list)
  - Remove the redundant link_limit parameter from views by defaulting it to 1500 in the CrawlService method
* Enhance crawl management with URL validation and error handling
  - Add URL format validation in start_crawl method using regex
  - Implement comprehensive error handling in crawl-related views
  - Update guru type validation to raise NotFoundError
  - Modify API and admin views to catch and return exceptions
  - Improve error response consistency across crawl endpoints
* Remove user field from CrawlState serializer, consistent with selfhosted crawling where the user can be null
* Fix spacing/formatting
* Remove unused package
* Fix useCrawler hook error handling and URL discovery logic
  - Reorder error handling and URL discovery checks to ensure proper state management and error propagation
* Refactor useCrawler hook to manage crawl input state
  - Add state management for crawl input, URL, and related UI interactions in the hook and components
* Hide sitemap input after successful URL validation
  - Close the sitemap input field after successfully validating a URL in the MonacoUrlEditor component
* Reset crawl input state when closing SourceDialog
  - Reset crawl input and URL when closing the dialog to prevent stale state between interactions
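The CrawlService changes above (a default link_limit of 1500, regex URL validation, and a nullable user for selfhosted requests) could look roughly like this; the regex, field names, exception type, and imports are assumptions rather than the PR's actual implementation:

```python
# Hedged sketch of a CrawlService entry point with URL validation and a
# default link_limit; field names, regex, and imports are assumptions.
import re

from core.models import CrawlState               # assumed import path
from core.crawler import start_crawl_in_process  # assumed helper (see sketch above)

URL_PATTERN = re.compile(r"^https?://[^\s/$.?#].[^\s]*$", re.IGNORECASE)


class CrawlService:
    @staticmethod
    def start_crawl(url, guru_type, user=None, link_limit=1500):
        """Validate the URL, record a CrawlState, and launch the crawl."""
        if not URL_PATTERN.match(url):
            raise ValueError(f"Invalid URL format: {url}")

        # user may be None for unauthenticated selfhosted requests, since the
        # CrawlState.user field allows null per the commits above.
        crawl_state = CrawlState.objects.create(
            url=url,
            guru_type=guru_type,
            user=user,
            link_limit=link_limit,
            status="RUNNING",
        )
        start_crawl_in_process(url, crawl_state.id, link_limit)
        return crawl_state
```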