
[Code Addition Request]: Automate Workflows through Web Scraping #736

Closed
3 tasks done
sanchitc05 opened this issue Oct 20, 2024 · 5 comments · Fixed by #738

@sanchitc05
Contributor

Have you completed your first issue?

  • I have completed my first issue

Guidelines

  • I have read the guidelines
  • I have the link to my latest merged PR

Latest Merged PR Link

UppuluriKalyani/ML-Nexus#324

Project Description

Description:
I propose developing a web scraping tool to automate workflows within PyVerse. This tool will extract data (e.g., prices, news, or stock data) from static and dynamic websites, store it for analysis, and run periodically using schedulers like cron or Task Scheduler.


Tech Stack:

  • Python Libraries: requests, BeautifulSoup, Selenium
  • Scheduling: cron (Linux) / Task Scheduler (Windows)
  • Error Handling and Logs: logging module

Approach:

  1. Identify websites to scrape based on user needs (e.g., financial or product data).
  2. Build modules for scraping both static and dynamic pages.
  3. Implement automated scheduling with logging for status tracking.
  4. Create detailed documentation and examples to guide users.
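
Step 2's static-page module could look like the following sketch, assuming `BeautifulSoup` (bs4) is installed as the tech stack lists. The HTML is inlined here so the example runs without network access; the `product`/`name`/`price` selectors are hypothetical, not taken from any real site.

```python
# Sketch of a static-page parsing module; the markup and selectors
# are invented for illustration.
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
</body></html>
"""

def extract_products(html):
    """Return a list of {name, price} dicts parsed from product markup."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for div in soup.select("div.product"):
        items.append({
            "name": div.select_one("span.name").get_text(strip=True),
            "price": float(div.select_one("span.price").get_text(strip=True)),
        })
    return items
```

In the real tool the `html` argument would come from `requests.get(url).text` for static pages, or from Selenium's `driver.page_source` for dynamic ones, keeping parsing separate from fetching.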

How This Helps Users:

  • Automates repetitive data extraction tasks.
  • Saves time and ensures users always have the latest data for analysis.
  • Encourages contributions from beginners and advanced users alike through modular code and documentation.

Please assign this issue to me so I can begin implementation.

Full Name

Sanchit Chauhan

Participant Role

GSSOC, HACKTOBERFEST


🙌 Thank you for bringing this issue to our attention! We appreciate your input and will investigate it as soon as possible.

Feel free to join our community on Discord to discuss more!

@UTSAVS26 UTSAVS26 added Contributor Denotes issues or PRs submitted by contributors to acknowledge their participation. Status: Assigned💻 Indicates an issue has been assigned to a contributor. level1 gssoc-ext hacktoberfest labels Oct 20, 2024
@sanchitc05
Contributor Author

Hi @UTSAVS26 ,

Thank you for assigning the issue! I’d like to kindly request an update to a level 2 label. The project involves web scraping, which requires handling both static and dynamic content using requests, BeautifulSoup, and Selenium. Additionally, I'll implement error handling and logging to ensure robustness. Given these complexities, I believe a level 2 label would be more appropriate.

Thank you for your understanding!

@UTSAVS26
Owner

Based on the work done, I will change the level.

@sanchitc05
Contributor Author

Hi again @UTSAVS26, I have finished the work on this issue and opened a pull request for it. Please review it whenever convenient, and if you have any questions about anything in the PR, feel free to ask.

Also, once you have gone through it, I would like to request at least a level 2 label based on the work.

Regards,
Sanchit Chauhan

UTSAVS26 added a commit that referenced this issue Oct 22, 2024
Fixes #736

## Pull Request for PyVerse 💡

### Requesting to submit a pull request to the PyVerse repository.

---

#### Issue Title
*Add Web Scraping Workflow Automation*

- [YES] I have provided the issue title.

---

#### Name 
*Sanchit Chauhan*

- [YES] I have provided my name.

---

#### GitHub ID 
*sanchitc05*

- [YES] I have provided my GitHub ID.

---

#### Email ID
*[email protected]*

- [YES] I have provided my email ID.

---

#### Identify Yourself
**Mention in which program you are contributing (e.g., WoB, GSSOC, SSOC,
SWOC).**
*GSSOC, HACKTOBERFEST*

- [YES] I have mentioned my participant role.

---

#### Closes  
*Closes: #736*

- [YES] I have provided the issue number.

---

#### Describe the Add-ons or Changes You've Made
*### **Description**  
This PR introduces an automated web scraping workflow to extract data
from static and dynamic web pages. The solution uses `requests` and
`BeautifulSoup` for static pages, and `Selenium` for dynamic content.
The scraped data is logged for easy tracking and error management. This
feature streamlines repetitive data collection tasks and enables
automated scheduling for regular scraping.

### **Technical Implementation**  
- **Libraries Used**:  
  - `requests`: Fetch web pages for static content.  
  - `BeautifulSoup`: Parse and extract relevant data from HTML.  
  - `Selenium`: Automate browser interaction for dynamic content.  
  - **Logging Module**: Tracks activities and errors in `scraper.log`.

- **Project Structure**:  
  - `scraper.py`: Main script containing scraping logic.
  - `requirements.txt`: Dependency list for easy setup.
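
The project structure above could be organized along these lines. This is a hypothetical sketch, not the actual PR code: the function names and control flow are illustrative, and the `requests`/`selenium` imports are deferred inside the functions so the module loads even where those dependencies are missing.

```python
# Hypothetical layout for scraper.py; names and flow are illustrative.
import logging

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def scrape_static(url):
    """Fetch a static page (requests imported lazily)."""
    import requests  # listed in requirements.txt
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def scrape_dynamic(url):
    """Render a JavaScript-heavy page with Selenium."""
    from selenium import webdriver  # listed in requirements.txt
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

def run(url, dynamic=False):
    """Scrape one URL, logging success or failure to scraper.log."""
    try:
        html = scrape_dynamic(url) if dynamic else scrape_static(url)
        logging.info("scraped %s (%d bytes)", url, len(html))
        return html
    except Exception:
        logging.exception("failed to scrape %s", url)
        raise
```

A cron or Task Scheduler job would then simply invoke `python scraper.py` with the configured URLs, and `scraper.log` records each run.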

### **Usage**  
1. Clone the repository and install dependencies:
   ```bash
   git clone https://github.com/yourusername/web_scraper.git
   cd web_scraper
   pip install -r requirements.txt
   ```
2. Update `static_url` and `dynamic_url` variables in `scraper.py`.
3. Run the scraper:
   ```bash
   python scraper.py
   ```
4. Check logs in `scraper.log` for activity status.

### **Benefits**  
- **Automates data collection**, saving time and effort.
- **Handles dynamic content**, making it adaptable to complex websites.
- **Error tracking** ensures smooth, continuous scraping.

### **Testing**  
- Successfully tested scraping both static and dynamic pages.  
- Verified proper logging of activities and error handling.*

- [YES] I have described my changes.

---

#### Type of Change
**Select the type of change:**  
- [NO] Bug fix (non-breaking change which fixes an issue)
- [YES] New feature (non-breaking change which adds functionality)
- [NO] Code style update (formatting, local variables)
- [NO] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [YES] This change requires a documentation update

---

#### How Has This Been Tested?
**Describe how your changes have been tested.**  
*Describe your testing process here.*

- [YES] I have described my testing process.

---

#### Checklist
**Please confirm the following:**  
- [YES] My code follows the guidelines of this project.
- [YES] I have performed a self-review of my code.
- [YES] I have commented on my code, particularly wherever it was hard
to understand.
- [YES] I have made corresponding changes to the documentation.
- [YES] My changes generate no new warnings.
- [YES] I have added things that prove my fix is effective or that my
feature works.
- [NO] Any dependent changes have been merged and published in
downstream modules.

✅ This issue has been closed. Thank you for your contribution! If you have any further questions or issues, feel free to join our community on Discord to discuss more!
