
Michael Pajewski, Buckley Dowdle, Steve Morris, and Kip McCharen analyzed the UC Irvine Machine Learning Repo and developed an analysis dataset which anyone can access here.

kipmccharen/UC_Irvine_Dataset_MetaAnalysis


UC Irvine Datasets

UCI_MLR

class UC_Irvine_datasets()

The UC_Irvine_datasets() object contains a pandas dataframe of all the datasets available on the UC-Irvine Machine Learning Repository. Many methods make it easy to peruse, export, and even import the datasets inside the object.

See below for examples of ways to use methods of UC_Irvine_datasets():

from ucidata import UC_Irvine_datasets, df_first_row_to_header

# Create an instance of the class, which loads the dataframe of UC Irvine datasets
ucid = UC_Irvine_datasets()

string representation

The string representation lets users see the current state of the class object.

print(ucid)

[Image: output of print(ucid)]

object.list_all_datasets()

Look at what datasets are available with list_all_datasets().

ucid.list_all_datasets()

[Image: output of list_all_datasets()]

object.limit(fieldname, value_to_match)

If you want to select only a single kind of dataset, limit to a single value with limit().

ucid.limit("Area", "Business")
print(ucid)
ucid.list_all_datasets()
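For readers who want to see what this kind of filter amounts to in plain pandas, here is a minimal sketch. It does not reproduce the internals of limit(), which this README does not show. The toy dataframe, its values, and the helper name limit_rows are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for the catalog dataframe; these rows are invented for illustration.
catalog = pd.DataFrame(
    {
        "Name": ["toy-a", "toy-b", "toy-c"],
        "Area": ["Business", "Life", "Business"],
    }
)


def limit_rows(df, fieldname, value_to_match):
    """Hypothetical helper: keep only rows where df[fieldname] equals value_to_match."""
    return df[df[fieldname] == value_to_match].reset_index(drop=True)


business_only = limit_rows(catalog, "Area", "Business")
print(business_only)
```

This sketch returns a new dataframe and leaves the original untouched, which makes it easy to compare the filtered and unfiltered views side by side.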

[Image: output after limit("Area", "Business")]

object.show_me_dataset(ID)

Wow, that's too many datasets all at once.

Let's just look at one with show_me_dataset(ID).

ds = UC_Irvine_datasets()
ds = ds.show_me_dataset("wine-quality")
print(ds)

[Image: output of show_me_dataset("wine-quality")]

object.load_small_dataset_df(ID)

This dataset has a flag set on it: small = 1.

That means our team decided the dataset was small enough to import safely as a dataframe.

You can try to import any small dataset with load_small_dataset_df(ID).

Note that if there are multiple datasets available, only the first dataset is loaded.

test_load_df = ucid.load_small_dataset_df("wine-quality")
print(f"There are {len(test_load_df.index)} rows")
print(test_load_df.head())

[Image: output of load_small_dataset_df("wine-quality")]

df_first_row_to_header(df)

Sometimes the datasets come with headers, and sometimes they don't.

When the header ends up as the first row of data, df_first_row_to_header(df) resolves the issue.

test_load_df = df_first_row_to_header(test_load_df)
print("\n\n### WITH HEADERS CORRECTED ###\n")
print(test_load_df.head())
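To make the header problem concrete, here is a minimal sketch in plain pandas of how a header can end up as row 0 and how it can be promoted back. This is not the source of df_first_row_to_header(), which this README does not show. The CSV contents and the helper name promote_first_row are invented for illustration.

```python
import io

import pandas as pd

# A tiny CSV whose first line is really the header; contents invented for illustration.
raw = "alcohol,quality\n9.4,5\n9.8,6\n"

# Reading with header=None leaves the column names sitting in row 0.
df = pd.read_csv(io.StringIO(raw), header=None)


def promote_first_row(frame):
    """Hypothetical helper: use row 0 as column names and drop it from the data."""
    promoted = frame.iloc[1:].reset_index(drop=True)
    promoted.columns = frame.iloc[0].tolist()
    return promoted


fixed = promote_first_row(df)
print(fixed.columns.tolist())
```

Note that in this sketch the remaining values stay as strings, because pandas read each column together with its header text. A numeric conversion such as pd.to_numeric would be a separate step.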

[Image: output after df_first_row_to_header()]

object.small_datasets_only()

object.small_datasets_only() returns a new object containing only the "small" datasets.

Let's see what it returns.

small_ucid = UC_Irvine_datasets().small_datasets_only()
small_uci = small_ucid.list_all_datasets()
print(small_ucid)

[Image: output of small_datasets_only()]

The object can also produce simple plots from the dataframe:

ucid.print_distribution("NumberofInstances")

[Image: histogram of NumberofInstances]

ucid.print_barplot("year_donated","NumberofWebHits", colorcol="header")

[Image: bar plot of year_donated and NumberofWebHits]
