EISALab/SI_News_Classifier

Sustainability Indicators (SI) News Classifier

The source code in this repository was developed as part of the research project presented in "Advancing sustainability indicators through text mining: a feasibility demonstration". The project is licensed under the University of Illinois/NCSA Open Source License (see License.txt) and was last updated on June 25, 2014. The repository contains:

  • The Matlab code (m-files) for the classification algorithm presented in Rivera et al. (2013) (SI_News_Classifier.m and associated function files).
  • The RapidMiner 5 workflow used to pre-process the data.
  • An illustrative example of its use and implementation.

Further development of the algorithm, suggestions, corrections, and new examples are always welcome.

Citing this work

We kindly ask that any future publication using this software cite the following paper: Rivera, S., Minsker, B., Work, D. and Roth, D. (2013) “Advancing sustainability indicators through text mining: a feasibility demonstration.” submitted to Environmental Modeling and Software.

Note: A Python version of the code will be released later this year (2014). The documentation will be updated to address any concerns related to the Python version.

Installation

Requirements

  1. The latest version of the source code was developed using Matlab (R2008a); the code was also tested and validated in later versions. A valid copy of Matlab with a license (an educational license is sufficient) is required.
  2. As stated in Rivera et al. (2013), the code requires the Dataless Classification software developed by the Cognitive Computation Group, led by Professor Daniel Roth at the University of Illinois at Urbana-Champaign. The software is available at: Importance of Semantic Representation: Dataless Classification. Installation and setup instructions can be found at the provided link. The Dataless Classification software is integrated into the source code by providing the path to its installation folder (e.g. '../descartes-0.2/bin/DESCARTES').

Install

To begin using the classification algorithm, download the SI_News_Classifier.m function and integrate it into your code. Be aware that if the provided folder structure is changed, the paths hard-coded into the source code may need to be modified accordingly.

Example of implementation

An illustrative example is provided with the source code to help users understand the structure of the input and output data. The example contains a total of 26 news articles organized into the following hierarchical tree:

[Figure: hierarchical tree of the news articles in the illustrative example]

Objective: Classify news articles under the sustainability indicators. As part of the illustrative example, the following files are included:

  • News_Data.xls - Metadata of the news articles used for the illustrative example. It contains the following worksheets:
    • News_Labels_and_Sources - List of the news sources and their labels
    • Words - List of all the words in the set of news articles obtained after pre-processing (tokenization and elimination of stop-words)
  • Example Data - Word-bag (binary and tf-idf) matrices, news labels, and training and testing sets
  • Example_1.m - Main m-file for the implementation example
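
The binary and tf-idf word-bag matrices listed above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the repository's Matlab code or the RapidMiner 5 workflow. The toy documents, the vocabulary, and the function name `binary_and_tfidf` are hypothetical, and the weighting shown (raw term count times log(N/df)) is an assumption, since this README does not state which tf-idf variant the workflow uses.

```python
import math


def binary_and_tfidf(docs, vocabulary):
    """Return (binary, tfidf) word-bag matrices, one row per document.

    Assumed weighting: tf-idf = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # counts[i][j] = number of times vocabulary[j] occurs in docs[i]
    counts = [[doc.count(word) for word in vocabulary] for doc in docs]
    # Binary matrix: 1 if the word occurs at least once, else 0
    binary = [[1 if c > 0 else 0 for c in row] for row in counts]
    # Document frequency: number of documents containing each word
    df = [sum(row[j] for row in binary) for j in range(len(vocabulary))]
    # Inverse document frequency; words absent from every document get 0
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]
    return binary, tfidf


# Hypothetical, already tokenized articles with stop-words removed
docs = [
    ["water", "quality", "river", "water"],
    ["energy", "solar", "water"],
    ["energy", "wind"],
]
vocabulary = ["energy", "quality", "river", "solar", "water", "wind"]

binary, tfidf = binary_and_tfidf(docs, vocabulary)
```

Each row corresponds to one article and each column to one entry of the word list, which mirrors the role of the Words worksheet as the column index of the matrices.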

Note: The news articles were pre-processed using the provided RapidMiner 5 workflow. Pay special attention to the order in which RapidMiner 5 and Matlab read the news articles, as the two orders are not necessarily the same.
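
The ordering caveat in the note can be made concrete with a minimal sketch, written in Python for illustration only. The file names, the two orderings, and the helper `reorder_rows` are hypothetical; the sketch does not claim which order either tool actually uses. It only shows how matrix rows produced in one order could be realigned to a list of labels kept in another.

```python
def reorder_rows(rows, source_order, target_order):
    """Reorder rows (listed in source_order) so they follow target_order."""
    if sorted(source_order) != sorted(target_order):
        raise ValueError("both orderings must contain the same article names")
    # Map each article name to its row position in the source ordering
    position = {name: i for i, name in enumerate(source_order)}
    return [rows[position[name]] for name in target_order]


# Hypothetical example: plain string sorting places "news_10" before "news_2",
# whereas numeric ordering does not.
order_a = ["news_1.txt", "news_10.txt", "news_2.txt"]
order_b = ["news_1.txt", "news_2.txt", "news_10.txt"]

# One word-bag row per article, listed in order_a
rows_a = [[1, 0], [0, 1], [1, 1]]

# The same rows rearranged to match order_b
rows_b = reorder_rows(rows_a, order_a, order_b)
```

Checking that both lists contain the same names before reordering catches a missing or renamed article early, before labels and matrix rows silently drift apart.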

The main script is Example_1.m. Before running it, set the path to the folder containing the Dataless Classification software (e.g. '../descartes-0.2/bin/DESCARTES') in the script.

Outputs:

  • Classification labels of the test set at all levels of the hierarchical tree
  • Classification confusion matrices for the root and parent nodes
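
For readers unfamiliar with the second output, the following Python sketch shows how a confusion matrix for a single node can be built from true and predicted labels. It is an illustration only, not the repository's Matlab implementation; the class names and label lists are hypothetical and are not taken from the example data.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Return a matrix where entry [i][j] counts articles whose true class
    is classes[i] and whose predicted class is classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical child classes of one node and labels for five test articles
classes = ["environment", "economy", "society"]
true_labels = ["environment", "environment", "economy", "society", "society"]
predicted_labels = ["environment", "economy", "economy", "society", "environment"]

cm = confusion_matrix(true_labels, predicted_labels, classes)

# Accuracy is the share of articles on the diagonal (correct predictions)
correct = sum(cm[i][i] for i in range(len(classes)))
accuracy = correct / len(true_labels)
```

In a hierarchical setting, one such matrix would be computed per root or parent node, using only the articles and child classes that belong to that node.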
