Language Identification with NLTK

This Python script counts how many sentences of a document are written in each language. It reads text from a DOCX file with python-docx, splits the text into sentences with NLTK, identifies the language of each sentence with langid, and prints a count for each language it finds.
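The counting step described above can be sketched as follows. This is a minimal illustration, not the code in doclang.py: the function names count_sentence_languages and langid_code are hypothetical, and the classifier is passed in as a parameter so the counting logic can be tried without installing anything.

```python
from collections import Counter
from typing import Callable, Iterable


def count_sentence_languages(
    sentences: Iterable[str],
    classify: Callable[[str], str],
) -> Counter:
    """Count sentences per language code returned by `classify`."""
    counts = Counter()
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip empty strings left over from blank paragraphs
            counts[classify(sentence)] += 1
    return counts


def langid_code(sentence: str) -> str:
    """Return langid's best-guess language code for one sentence."""
    import langid  # imported lazily so the counting logic has no hard dependency

    # langid.classify returns a (language_code, score) tuple
    code, _score = langid.classify(sentence)
    return code
```

With NLTK and langid installed, a call such as count_sentence_languages(nltk.sent_tokenize(text), langid_code) would tie the pieces together. Very short sentences may be misclassified, so the counts are best treated as estimates.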

Requirements

Python 3.x
NLTK
langid
python-docx

Install the required Python packages using the following command:

pip install langid nltk python-docx
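NLTK's sentence tokenizer also needs its Punkt data, which pip does not install. If the script stops with a LookupError, a one-time download along these lines may help. This is a sketch under assumptions: the helper name ensure_sentence_tokenizer and its download parameter are hypothetical (the parameter exists only so the helper can be exercised without network access), and which resource name is required depends on the installed NLTK version, so both are requested.

```python
def ensure_sentence_tokenizer(download=None, resources=("punkt", "punkt_tab")):
    """Try to fetch the Punkt tokenizer data and report what succeeded.

    Older NLTK releases look for "punkt"; recent ones look for "punkt_tab".
    """
    if download is None:
        import nltk  # imported lazily; only needed for the real download

        download = nltk.download
    # nltk.download returns a truthy value on success and a falsy one on failure
    return {name: bool(download(name, quiet=True)) for name in resources}
```

Calling ensure_sentence_tokenizer() once before running the script should be enough; the data is cached in NLTK's data directory afterwards.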

Usage

Clone the repository:

git clone https://github.com/izadorapimenta/doclang.git
cd doclang

Set the docx_file_path variable in the script to the path of your DOCX file.
Then run the script:

python doclang.py
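For reference, the file named by docx_file_path might be read roughly like this. It is a sketch, not the script's actual code: read_docx_text and paragraphs_to_text are hypothetical helper names, while Document and its paragraphs attribute come from python-docx.

```python
from typing import Iterable


def paragraphs_to_text(paragraphs: Iterable[str]) -> str:
    """Join non-empty paragraph strings with newlines."""
    return "\n".join(p.strip() for p in paragraphs if p.strip())


def read_docx_text(docx_file_path: str) -> str:
    """Return the body text of a DOCX file as one string."""
    from docx import Document  # provided by the python-docx package

    document = Document(docx_file_path)
    # document.paragraphs covers body paragraphs only; text inside tables
    # is not included and would need to be collected separately.
    return paragraphs_to_text(p.text for p in document.paragraphs)
```

The resulting string can then be passed to the sentence tokenizer.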

Example

Suppose you have a DOCX file (doclang.docx) containing two Portuguese sentences and one Italian sentence. Running the script then prints:

Number of sentences in pt: 2
Number of sentences in it: 1

Feel free to customize the example text in the script for your specific use case.
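Output lines in the format shown above could be produced from a per-language count like this. The helper name format_counts is hypothetical, and the most-frequent-first ordering is an assumption chosen to match the example rather than something the script is known to do.

```python
from collections import Counter


def format_counts(counts: Counter) -> list:
    """Build one report line per language, most frequent language first."""
    return [
        f"Number of sentences in {language}: {count}"
        for language, count in counts.most_common()
    ]


# Example with the counts from the output above
for line in format_counts(Counter({"pt": 2, "it": 1})):
    print(line)
```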
