
Project Description

This project extracts Chinese text from the websites of universities in Taiwan.

Installation

Clone this GitHub repository with the command below:

git clone https://github.com/YiHao990416/University_website_scrapper.git/

Usage

Create the required files and directories

To use this website scraper, create the required files and directories with the commands below.

Create folders named "output", "output_json", "checkpoint" and "input" in the root of the repository:

mkdir output
mkdir output_json
mkdir checkpoint
mkdir input

Create an empty text file named "error.txt" in the checkpoint folder (path: ./checkpoint/error.txt).
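The same setup can also be done from Python, which may be convenient on systems where the shell commands differ. The sketch below is not part of the repository; the function name `prepare_workspace` is hypothetical, and only the folder and file names come from the steps above.

```python
from pathlib import Path

# Folder names taken from the setup steps above.
REQUIRED_DIRS = ("output", "output_json", "checkpoint", "input")


def prepare_workspace(root="."):
    """Create the required folders and an empty checkpoint/error.txt under root.

    Hypothetical helper, not part of the repository.
    """
    root_path = Path(root)
    for name in REQUIRED_DIRS:
        # exist_ok=True makes the call safe to repeat.
        (root_path / name).mkdir(parents=True, exist_ok=True)
    error_file = root_path / "checkpoint" / "error.txt"
    # touch() creates the file if it is missing and leaves existing content alone.
    error_file.touch(exist_ok=True)
    return error_file
```

Calling `prepare_workspace()` from the root of the repository creates anything that is missing and leaves existing folders untouched.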

The university names and links are stored as a JSON Lines file in the input folder (path: ./input/all_uni.json).

An example of the file's format is shown below:

{"學校名稱": "國立中正大學", "網址": "http://www.ccu.edu.tw/", "abbrev": "ccu"}
{"學校名稱": "國立宜蘭大學", "網址": "https://www.niu.edu.tw/", "abbrev": "niu"}
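Before starting a long scrape, it can help to check that the input file parses. The sketch below assumes the file holds one JSON object per line with the three keys shown in the example; the function name `load_universities` is hypothetical and does not come from the repository.

```python
import json
from pathlib import Path

# Keys taken from the example records above.
REQUIRED_KEYS = ("學校名稱", "網址", "abbrev")


def load_universities(path):
    """Read one JSON object per line and check that the expected keys exist.

    Hypothetical helper, not part of the repository.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            records.append(record)
    return records
```

A malformed line raises `json.JSONDecodeError`, and a record without one of the three keys raises `ValueError` with the offending line number.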

How to use the program

Run the command below to extract all the Chinese text from the university websites. The amount of text extracted depends on the hyperparameters, which can be adjusted in main.py.

The defaults are depth = 3 and min_len = 50.
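This README does not spell out what the two hyperparameters mean. A plausible reading is that depth limits how many links away from the home page the crawler follows and that min_len is the minimum length of a text segment that is kept. The sketch below illustrates only the second idea, under that assumption. It is not the code in main.py or scrapper.py, and the name `chinese_segments` is hypothetical.

```python
import re

# Runs of CJK Unified Ideographs (U+4E00..U+9FFF) and common Chinese punctuation.
# Assumption: this is one reasonable definition of "Chinese text"; the
# repository may use a different one.
CJK_RUN = re.compile(r"[\u4e00-\u9fff，。、；：？！]+")


def chinese_segments(text, min_len=50):
    """Return the Chinese runs in text that are at least min_len characters long.

    Hypothetical helper, not part of the repository.
    """
    return [run for run in CJK_RUN.findall(text) if len(run) >= min_len]
```

Under this reading, raising min_len discards short fragments such as menu labels and keeps longer passages, so less text is extracted overall.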

python main.py

To convert the extracted .txt files into JSONL format, run tools.py:

python tools.py
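The output schema of tools.py is not documented here. As a rough illustration of the conversion, the sketch below writes one JSON object per .txt file. The field names "name" and "text" and the function name `txt_to_jsonl` are assumptions, not the repository's actual format.

```python
import json
from pathlib import Path


def txt_to_jsonl(txt_dir, jsonl_path):
    """Write one JSON object per .txt file in txt_dir and return the count.

    Hypothetical helper, not part of the repository.
    """
    count = 0
    with Path(jsonl_path).open("w", encoding="utf-8") as out:
        for txt_file in sorted(Path(txt_dir).glob("*.txt")):
            record = {
                "name": txt_file.stem,  # assumed field name
                "text": txt_file.read_text(encoding="utf-8"),  # assumed field name
            }
            # ensure_ascii=False keeps Chinese characters readable in the output.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `txt_to_jsonl("output", "output_json/all.jsonl")` would collect every file in the output folder into a single file; the target file name here is also only an example.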

To check which university_name.txt files are missing from the output folder, run error.py:

python error.py
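The check can be pictured as comparing the input list against the files present in the output folder. The sketch below assumes that each output file is named after the record's "abbrev" value; the repository may name files differently, so the key is a parameter. The function name `missing_outputs` is hypothetical.

```python
import json
from pathlib import Path


def missing_outputs(input_path, output_dir, key="abbrev"):
    """List the values of key that have no matching <value>.txt in output_dir.

    Hypothetical helper, not part of the repository.
    """
    present = {p.stem for p in Path(output_dir).glob("*.txt")}
    missing = []
    with Path(input_path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            name = json.loads(line)[key]
            if name not in present:
                missing.append(name)
    return missing
```

If the files are named after the full university name instead, passing `key="學校名稱"` performs the same comparison on that field.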

Dependencies

The virtual environment is created with conda. The dependencies are listed in requirements.txt. To install them, use the command below:

pip install -r requirements.txt

License

This project is licensed under the MIT License; see the LICENSE file for details. Please follow the code of conduct and contribution guidelines when contributing to this project.
