This repository contains a Python script that scrapes exam question links from the ExamTopics website for a given provider and exam code. The script is designed to run in Google Colab and uses parallel requests for efficient data fetching.
- Scrapes exam question links from ExamTopics for a specified provider.
- Filters the links based on a given exam code.
- Uses multithreading to speed up the scraping process.
- Groups and saves the exam question links in a structured text file.
To get started, open this script in Google Colab using the following link:
Make sure to install the necessary Python libraries before running the script:
```
!pip install requests beautifulsoup4 tqdm
```
Get the code from `main.ipynb` or `main.py` and run it in a Colab cell.
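For example, assuming `main.py` has been uploaded to the Colab runtime's working directory, one way to run it is:

```python
# Assumes main.py is in the Colab working directory.
# %run executes the file with __name__ == "__main__", so main() is invoked.
%run main.py
```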
The script is divided into several parts, each with a specific purpose:
```python
import requests
from bs4 import BeautifulSoup
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
```
- requests: For making HTTP requests.
- BeautifulSoup: For parsing HTML content.
- re: For regular expression operations.
- ThreadPoolExecutor: For parallel execution of tasks.
- tqdm: For displaying progress bars.
```python
class Scraper:
    def __init__(self, provider):
        self.session = requests.Session()
        self.provider = provider.lower()
        self.base_url = f"https://www.examtopics.com/discussions/{self.provider}/"
```
- Initializes a session for making requests.
- Sets the base URL based on the provided exam provider.
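For instance, instantiating the class with a provider name (`"Microsoft"` here is purely illustrative) would produce:

```python
scraper = Scraper("Microsoft")  # provider name is lowercased internally
print(scraper.base_url)
# https://www.examtopics.com/discussions/microsoft/
```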
```python
    def get_num_pages(self):
        try:
            response = self.session.get(self.base_url)
            soup = BeautifulSoup(response.content, "html.parser")
            return int(soup.find("span", {"class": "discussion-list-page-indicator"}).find_all("strong")[1].text.strip())
        except Exception as e:
            print(f"Error fetching page count: {e}")
            return 0
```
- Fetches and parses the number of pages available for the provider.
```python
    def fetch_page_links(self, page, search_string):
        try:
            response = self.session.get(f"{self.base_url}{page}/")
            soup = BeautifulSoup(response.content, "html.parser")
            discussions = soup.find_all("a", {"class": "discussion-link"})
            links = [
                discussion["href"].replace("/discussions", "https://www.examtopics.com/discussions", 1)
                for discussion in discussions if search_string in discussion.text
            ]
            return links
        except Exception as e:
            print(f"\nError on page {page}: {e}")
            return []
```
- Fetches and filters exam question links from a specific page.
```python
    def get_discussion_links(self, num_pages, search_string):
        links = []
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(self.fetch_page_links, page, search_string) for page in range(1, num_pages + 1)]
            with tqdm(total=num_pages, desc="Fetching Links", unit="page") as pbar:
                for future in as_completed(futures):
                    page_links = future.result()
                    links.extend(page_links)
                    pbar.update(1)
        return links
```
- Uses parallel requests to fetch exam question links from multiple pages simultaneously.
```python
def extract_topic_question(link):
    match = re.search(r'topic-(\d+)-question-(\d+)', link)
    return (int(match.group(1)), int(match.group(2))) if match else (None, None)
```
- Extracts topic and question numbers from the exam question link.
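A quick illustration (the URL below is made up; only the `topic-...-question-...` fragment matters to the regex):

```python
link = "https://www.examtopics.com/discussions/example/view/12345-exam-xy-900-topic-3-question-45/"
print(extract_topic_question(link))             # (3, 45)
print(extract_topic_question("no-match-here"))  # (None, None)
```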
```python
def write_grouped_links_to_file(filename, links):
    grouped_links = {}
    for link in sorted(links, key=extract_topic_question):
        topic, question = extract_topic_question(link)
        grouped_links.setdefault(topic, []).append(link)
    with open(filename, 'w') as f:
        for topic, topic_links in grouped_links.items():
            f.write(f'Topic {topic}:\n')
            for link in topic_links:
                f.write(f'  - {link}\n')
            print(f"Topic {topic} links added to file.")
```
- Groups and writes the exam question links to a file based on their topic.
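Schematically, the resulting file looks like this (URLs shortened for readability):

```
Topic 1:
  - https://www.examtopics.com/discussions/.../topic-1-question-1/
  - https://www.examtopics.com/discussions/.../topic-1-question-2/
Topic 2:
  - https://www.examtopics.com/discussions/.../topic-2-question-1/
```

Topics appear in ascending order because the links are sorted by `(topic, question)` before being grouped.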
```python
def main():
    provider = input("Enter provider name: ")
    scraper = Scraper(provider)
    num_pages = scraper.get_num_pages()
    print("Total Pages:", num_pages)
    if num_pages > 0:
        search_string = input("Enter exam code or enter QUIT to exit: ").upper()
        if search_string != 'QUIT':
            links = scraper.get_discussion_links(num_pages, search_string)
            filename = f'{search_string} dumps.txt'
            print(f"\nYour file will be named {filename}")
            write_grouped_links_to_file(filename, links)
            print("File generation complete.")
        else:
            return
    else:
        print("No pages found for the provider.")

if __name__ == "__main__":
    main()
```
- Prompts the user for the provider name and the exam code used to filter the links.
- Fetches and processes the exam question links.
- Saves the links to a text file.
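An illustrative console session (the provider, exam code, page count, and progress-bar rendering are placeholders, not real output):

```
Enter provider name: microsoft
Total Pages: 250
Enter exam code or enter QUIT to exit: az-104
Fetching Links: 100%|██████████| 250/250 [00:41<00:00,  6.05page/s]

Your file will be named AZ-104 dumps.txt
Topic 1 links added to file.
Topic 2 links added to file.
File generation complete.
```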