Crawler support #138
Merged
Conversation
…d state management
Task/selfhosted fixes
* Refactor settings retrieval in requester and views using get_default_settings utility (see the sketch after this list)
* Fix my_gurus view 404 handling for empty guru queries
* Fix widget id interactivity on selfhosted + move its error messages to toasts
* Add missing file to previous commit
* Remove unused imports
* Refactor navigation items into a centralized configuration
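The first item above mentions a get_default_settings utility used to centralize settings retrieval. A minimal sketch of what such a helper could look like; the Settings model, its import path, and the single-row lookup are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch of a settings-retrieval helper; the model name and
# import path are assumptions, not taken from the repository.
from core.models import Settings  # assumed location of a Settings model


def get_default_settings():
    """Return the shared settings row, creating it with defaults if missing."""
    settings_obj, _created = Settings.objects.get_or_create(id=1)
    return settings_obj
```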
* Update changelog
* Update docker-compose
  - New version tag
  - discordListener service
* Update self-hosted installation sh & installation readme
  - Docker & docker compose version checks
* Fix indentations
* Clear cache before the build on docker buildx
- Update install.md to reflect the changes in v0.2.0
* Add basic support for crawling and rotating proxies
* Fix backend crawl auth
* Add crawling support for frontend
* Add crawl reset only for the crawled urls
* Enhance MonacoUrlEditor with improved crawl state management
  - Add state variables for crawl processing, starting, and stopping
  - Implement dynamic button content based on crawl state
  - Make editor read-only during crawling
  - Add error handling for crawl start and stop actions
* Fix crawler finishing case
* Improve crawler status handling and error reporting
  - Add handling for "COMPLETED", "STOPPED", and "FAILED" crawl statuses
  - Enhance toast notifications for different crawl outcomes
  - Clear polling interval on final crawl states
  - Improve user feedback with more descriptive status messages
* Refine internal link spider crawling logic
  - Add extensive proxy list for random selection
  - Remove unnecessary logging and sleep calls
  - Improve link limit handling with explicit status update
  - Optimize crawl state management when link limit is reached
* Refactor internal link spider link limit handling (see the sketch after this list)
  - Move link limit check before logging to prevent unnecessary processing
  - Explicitly set crawl state to FAILED when link limit is exceeded
  - Add detailed error message for link limit scenario
  - Simplify link limit handling logic
* Fix crawler error message handling
  - Update error message key from 'error' to 'error_message'
  - Ensure correct error reporting in crawler toast notifications
* Remove unused import
* Simplify useCrawler hook by removing unused parameters
  - Remove unnecessary state management parameters from useCrawler hook
  - Clean up unused dependencies in NewGuru component
  - Streamline crawler hook interface
* Add proxy management and synchronization features
  - Implement Proxy model for storing and managing proxy servers
  - Create WebshareRequester to interact with Webshare proxy API
  - Add tasks for syncing and checking proxy health
  - Implement proxy synchronization and validation methods
  - Add admin interface for Proxy model
  - Include periodic tasks for proxy management
* Fix "rendered more hooks than the previous render" issue on reload
* Remove unnecessary toasts
* Shorten "Start Crawling" and "Stop Crawling" button labels to "Crawl" and "Stop"
* Remove Proxy model and related functionality
  - Delete Proxy model from models.py
  - Remove ProxyAdmin from admin.py
  - Simplify proxy-related methods in proxy.py
  - Update data_sources.py to use new proxy retrieval method
  - Remove commented-out proxy management code
  - Update WebshareRequester to return only valid proxies
* Fix crawler state management in useCrawler hook: ensure crawler state is reset when an error occurs during crawling
* Fix crawl ui on mobile
* Revert "Crawl" button label to "Start Crawling"
* Enhance tooltip positioning with side prop: add support for left and right tooltip positioning in TooltipContent component
* Add crawl stop confirmation dialog to SourceDialog: when the dialog is closed during an active crawl, let the user either continue or stop the crawling process
* Add auto-scroll and editor ref for Monaco URL editor during crawling: automatically scroll the MonacoUrlEditor to the latest line when new URLs are added, using an editor ref for precise line scrolling
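Several of the commits above concern enforcing a link limit in the internal link spider and explicitly marking the crawl FAILED when it is exceeded. The following is a hedged Scrapy sketch of that logic only; the CrawlState model, its import path, the field names, and the status strings are assumptions inferred from the commit descriptions, not the PR's actual code:

```python
# Hedged sketch of link-limit handling in an internal link spider;
# model names, import paths, and fields are assumptions.
import scrapy
from scrapy.exceptions import CloseSpider

from core.models import CrawlState  # assumed import path


class InternalLinkSpider(scrapy.Spider):
    name = "internal_links"

    def __init__(self, start_url, crawl_state_id, link_limit=1500, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.crawl_state = CrawlState.objects.get(id=crawl_state_id)
        self.link_limit = link_limit
        self.discovered = set()  # deduplicate URLs found so far

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            if url in self.discovered:
                continue
            # Check the limit before any further processing and fail the
            # crawl explicitly with a descriptive error message.
            if len(self.discovered) >= self.link_limit:
                self.crawl_state.status = "FAILED"
                self.crawl_state.error_message = (
                    f"Link limit of {self.link_limit} exceeded"
                )
                self.crawl_state.save()
                raise CloseSpider("link_limit_exceeded")
            self.discovered.add(url)
            yield {"url": url}
            yield response.follow(url, callback=self.parse)
```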
* Improve URL discovery and deduplication in useCrawler hook
  - Prevent duplicate URL additions during discovery
  - Track discovered URLs in a Set to ensure unique URL collection
* Refactor web crawling mechanism using multiprocessing (see the sketch after this list)
  - Run the internal link spider in a separate process instead of via Crochet
  - Simplified crawl state management
  - Improved error handling
  - Removed complex deferred callback logic
* Remove unused proxy tasks
* Extract CrawlStopConfirmationDialog to a separate component
  - Move the CrawlStopConfirmationDialog out of SourceDialog into its own component file
  - Support different stop actions (close or stop crawling) with improved state management and flexibility
* Prevent duplicate URL additions in NewGuru component
  - Filter out duplicate URLs before adding them to the editor so only unique URLs are appended to the existing list
* Add environment-specific proxy handling for InternalLinkSpider
  - Conditionally use proxies based on the environment setting
  - Disable proxy usage and relax link limit constraints for self-hosted environments
* Conditionally apply proxy settings in InternalLinkSpider
  - Use environment-specific configuration; self-hosted environments skip proxies while keeping a consistent download timeout
* Remove console logs
* Remove log
* Add a crawling delay for selfhosted
* Add guru-specific crawling with user and guru type tracking
  - Update URLs to include guru slug for crawl endpoints
  - Add guru_type and user fields to CrawlState model
  - Modify crawl views to validate and associate crawls with specific gurus
  - Update serializers and frontend actions to support guru-specific crawling
  - Add corresponding database migrations for new model fields
* Add error handling and state management for InternalLinkSpider initialization
  - Wrap initialization in try-except block
  - Log detailed error information
  - Update CrawlState status to FAILED on initialization errors
  - Set error message and end time for failed crawl states
* Reduce crawling delay for selfhosted environment: decrease the sleep time between link crawls from 0.2 to 0.1 seconds
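The multiprocessing refactor above (replacing Crochet) can be sketched roughly as follows; the function names, settings, and the way the spider is parameterized are assumptions meant only to illustrate the "one process per crawl" approach:

```python
# Hedged sketch of running a Scrapy crawl in a separate process instead of
# via Crochet; names and parameters are illustrative assumptions.
import multiprocessing

from scrapy.crawler import CrawlerProcess

from core.crawler import InternalLinkSpider  # assumed path (see sketch above)


def _run_spider(start_url, crawl_state_id, link_limit):
    """Runs in the child process, which owns its own Twisted reactor."""
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    process.crawl(
        InternalLinkSpider,
        start_url=start_url,
        # Pass the id rather than the model instance so the child process
        # opens its own database connection.
        crawl_state_id=crawl_state_id,
        link_limit=link_limit,
    )
    process.start()  # blocks until the crawl finishes


def start_crawl_in_process(start_url, crawl_state_id, link_limit=1500):
    """Launch the crawl without blocking the caller or sharing a reactor."""
    proc = multiprocessing.Process(
        target=_run_spider,
        args=(start_url, crawl_state_id, link_limit),
        daemon=True,
    )
    proc.start()
    return proc
```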
* Support selfhosted crawling by allowing null user in CrawlState
  - Add null=True and blank=True to CrawlState user field
  - Update __str__ method to handle cases without a user
  - Modify start_crawl view to set user to None for unauthenticated requests
  - Add corresponding database migration
* Delete proxy code
* Add API endpoints for crawl management with CrawlService
  - Refactor crawl-related views to use a centralized CrawlService
  - Create separate API and admin endpoints for starting, stopping, and checking crawl status
  - Implement consistent validation and error handling across different authentication methods
  - Update URL routing to support both API key and JWT authentication for crawl operations
* Set default link_limit in start_crawl method (see the sketch after this list)
  - Remove the redundant link_limit parameter from views by defaulting it to 1500 in the CrawlService method
* Enhance crawl management with URL validation and error handling
  - Add URL format validation in start_crawl method using regex
  - Implement comprehensive error handling in crawl-related views
  - Update guru type validation to raise NotFoundError
  - Modify API and admin views to catch and return exceptions
  - Improve error response consistency across crawl endpoints
* Remove user field from CrawlState serializer, consistent with selfhosted crawling where the user can be null
* Fix spacing/formatting
* Remove unused package
* Fix useCrawler hook error handling and URL discovery logic
  - Reorder error handling and URL discovery checks to ensure proper state management and error propagation
* Refactor useCrawler hook to manage crawl input state
  - Add state management for crawl input, URL, and related UI interactions in the hook and components
* Hide sitemap input after successful URL validation
  - Close the sitemap input field after successfully validating a URL in the MonacoUrlEditor component
* Reset crawl input state when closing SourceDialog
  - Reset crawl input and URL when closing the dialog to prevent stale state between interactions
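The CrawlService changes above (a default link_limit of 1500, regex URL validation, and a nullable user for selfhosted requests) could look roughly like this; the regex, field names, exception type, and imports are assumptions rather than the PR's actual implementation:

```python
# Hedged sketch of a CrawlService entry point with URL validation and a
# default link_limit; field names, regex, and imports are assumptions.
import re

from core.models import CrawlState               # assumed import path
from core.crawler import start_crawl_in_process  # assumed helper (see sketch above)

URL_PATTERN = re.compile(r"^https?://[^\s/$.?#].[^\s]*$", re.IGNORECASE)


class CrawlService:
    @staticmethod
    def start_crawl(url, guru_type, user=None, link_limit=1500):
        """Validate the URL, record a CrawlState, and launch the crawl."""
        if not URL_PATTERN.match(url):
            raise ValueError(f"Invalid URL format: {url}")

        # user may be None for unauthenticated selfhosted requests, since the
        # CrawlState.user field allows null per the commits above.
        crawl_state = CrawlState.objects.create(
            url=url,
            guru_type=guru_type,
            user=user,
            link_limit=link_limit,
            status="RUNNING",
        )
        start_crawl_in_process(url, crawl_state.id, link_limit)
        return crawl_state
```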