YiHao990416/University_website_scrapper


Project Description

This project extracts Chinese-language text from Taiwanese university websites.

Installation

Clone this GitHub repository with the command below:

git clone https://github.com/YiHao990416/University_website_scrapper.git

Usage

Create the required files and directories

To use this web scraper, create the following directories in the root of the repository:

mkdir output
mkdir output_json
mkdir checkpoint
mkdir input

Create an empty file named "error.txt" in the checkpoint folder (path: ./checkpoint/error.txt).
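As a cross-platform alternative to the commands above, the directories and the empty checkpoint/error.txt can be created with a short Python snippet (a convenience sketch, not part of the project's scripts):

```python
# Create the directories the scraper expects, then the empty error log.
from pathlib import Path

for name in ("output", "output_json", "checkpoint", "input"):
    Path(name).mkdir(exist_ok=True)

# Empty placeholder file for error reporting.
Path("checkpoint/error.txt").touch()
```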

The university names and links are stored as a JSONL file in the input folder (path: ./input/all_uni.jsonl).

An example of the all_uni.jsonl format is shown below:

{"學校名稱": "國立中正大學", "網址": "http://www.ccu.edu.tw/", "abbrev": "ccu"}
{"學校名稱": "國立宜蘭大學", "網址": "https://www.niu.edu.tw/", "abbrev": "niu"}
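A file in this format (one JSON object per line) can be loaded with a few lines of Python. This is an illustrative sketch, not the project's actual loading code; the function name is hypothetical:

```python
# Load a JSONL file such as input/all_uni.jsonl into a list of dicts,
# each with keys "學校名稱" (school name), "網址" (URL), and "abbrev".
import json

def load_universities(path="input/all_uni.jsonl"):
    """Return one dict per non-empty line of the JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```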

How to use the program

Execute the command below to extract all the Chinese text from the university websites. The amount of text extracted depends on hyperparameters that can be adjusted in main.py.

The defaults are depth = 3 and min_len = 50.

python main.py
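To give an intuition for the min_len hyperparameter: a minimal sketch, assuming min_len is a threshold on the number of Chinese characters a text segment must contain to be kept (the actual logic lives in main.py and may differ; `keep_segment` is a hypothetical name):

```python
# Keep a scraped text segment only if it contains enough Chinese
# characters; depth would separately bound how many link levels the
# crawler follows from each university homepage.
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs range

def keep_segment(text: str, min_len: int = 50) -> bool:
    """True if the segment has at least min_len Chinese characters."""
    return len(CJK.findall(text)) >= min_len
```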

To convert the extracted .txt files into JSONL format, execute tools.py:

python tools.py
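A hedged sketch of the kind of conversion tools.py performs: each output/*.txt file becomes one JSON record in output_json/. The field names ("abbrev", "text") and output layout here are assumptions for illustration, not the project's actual schema:

```python
# Convert each plain-text file in src_dir into a single-record JSONL
# file in dst_dir, keyed by the source file's stem (e.g. "ccu").
import json
from pathlib import Path

def txt_to_jsonl(src_dir="output", dst_dir="output_json"):
    Path(dst_dir).mkdir(exist_ok=True)
    for txt in Path(src_dir).glob("*.txt"):
        record = {"abbrev": txt.stem, "text": txt.read_text(encoding="utf-8")}
        out = Path(dst_dir) / f"{txt.stem}.jsonl"
        # ensure_ascii=False keeps the Chinese text readable in the output.
        out.write_text(json.dumps(record, ensure_ascii=False) + "\n",
                       encoding="utf-8")
```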

To check which university .txt files are absent from the output folder:

python error.py
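The check error.py performs can be sketched as follows: compare the abbreviations listed in the input JSONL against the .txt files present in output/. This is an assumed reconstruction (`missing_outputs` is a hypothetical name), not the script's actual code:

```python
# Return the abbreviations of universities whose <abbrev>.txt file is
# missing from the output directory.
import json
from pathlib import Path

def missing_outputs(input_file="input/all_uni.jsonl", output_dir="output"):
    missing = []
    with open(input_file, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            uni = json.loads(line)
            if not (Path(output_dir) / f"{uni['abbrev']}.txt").exists():
                missing.append(uni["abbrev"])
    return missing
```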

Dependencies

The virtual environment was created with conda. The dependencies are listed in requirements.txt. To install them, use the command below:

pip install -r requirements.txt

License

This project is licensed under the MIT License; see the LICENSE file for details. Please follow the code of conduct and contribution guidelines when contributing to this project.
