Language Identification with NLTK

This Python script uses the NLTK library along with langid to identify the languages in a given text and count the number of sentences in each language. It reads text from a DOCX file, tokenizes it into sentences, identifies the language of each sentence, and provides a count for each identified language.

Requirements

Python 3.x
NLTK
langid
python-docx

Install the required Python packages using the following command:

pip install langid nltk python-docx

Usage

Clone the repository:

git clone https://github.com/izadorapimenta/doclang.git
cd doclang

Replace the docx_file_path variable in the script with the path to your DOCX file.
Run the script:

python doclang.py

Example

Suppose you have a DOCX file (doclang.docx) with mixed Portuguese and Italian sentences. After running the script, it will output:

Number of sentences in pt: 2
Number of sentences in it: 1

Feel free to customize the example text in the script for your specific use case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Language Identification with NLTK

Requirements

Usage

Example

Files

README.md

Latest commit

History

README.md

File metadata and controls

Language Identification with NLTK

Requirements

Usage

Example