TEAM_03_INF550_HW_BIG_DATA

This is Team 03's submission of Assignment 1: Analysis of Media and Semantic Forensics in Scientific Literature from USC Viterbi's INF550: Data Science at Scale, Spring 2020 course. The contents of this project are listed below:

analysis: directory with similarity scores, visualizations, and R code cluster analysis
data-modification: directory with original Bik .tsv, and R code that joins in results from our other datasets and performs cleanup for analysis
rselenium-chrome: directory for our web-scraping scripts
requirements.R: an R script that installs all packages used to develop this project
TEAM_03_BIGDATA.pdf: our final report
TEAM_03_UPDATED_DATSET.tsv: our updated dataset
README.txt: .txt version of this readme
README.md: this file; readme for github

The rselenium-chrome folder has its own README as well, which details its configuration and contents (including test files to demonstrate the crawler).

Our project also involved modified files from Tika-Similarity, but we could not get those to appear in the github repository for this project (because Tika-Similarity has its own, most likely). The modifications are available in the pull request here: chrismattmann/tika-similarity#97

Team Members

Cherry, Carlin - [email protected] - 8211265507
Lee, Matthew - [email protected] - 4356300240
May, David - [email protected] - 5801939142

For any inquires on code, please contact Matt at [email protected]. Thank you!

Software

R 3.6.2
RStudio
Rtools 3.5.0.4 (if on Windows)
Docker Desktop
Python 2.7
Tika-Python
Tika-Similarity
D3 (built-in to Tika-Similarity)

R Packages

A list of all packaged installed into RStudio for the development of this project, and a brief description of what the package was for. The file requirements.R included in the main project directory contains code to install all the necessary packages. To install all packages from that file, open RStudio, change the working directory to the location of requirements.R, and run source("requirements.R")

tidyverse: functions to manipulate datasets, plotting through ggplot2, pipe operator
lubridate: for formatting dates
RSelenium: bindings for running Selenium through R (and Docker)
getProxy: for "free" proxies; experimental
keyring: for password management [1]
jsonlite: for reading and writing JSON
gtrendsR: for querying from Google Trends API
naniar: for missingness map plot
factoextra: for principal component analysis plot
cluster: for distance measures and clustering algorithms
Rtsne: for t-SNE plots, used to visualize clusters
arsenal: for summary tables of clusters
gridExtra: for plotting

[1] requires some configuration, but we only used it for scraping LinkedIn with an alternate account; see r-keyring-tutorial.R in the rselenium-chrome/keyring folder.

Glossary of Features

A list of features in our updated dataset. Features in the original dataset with no changes have the note "[no change]"; otherwise, there is a brief description of what we modified (if a modified original feature) or where the feature is from (if added from data blending).

A ^ before the name indicates a feature that we excluded from the final clustering results (i.e. not used to calculate the distance measure for determining clusters; we still examined these features in exploring the clusters). Some exclusions are for improving accuracy; others were due to time constraints and the need to push forward with the analysis.

^authors: [no change]
^title: [no change]
^citation: [no change]
^doi: [no change]
year: [no change]
^month: extracted text from citation
simple_duplication: renamed from 0; replaced NA with 0
reposition_duplication: renamed from 1; replaced NA with 0
alteration_duplication: renamed from 2; replaced NA with 0
cuts_and_beautification: renamed from 3; replaced NA with 0
findings: [no change]
^reported: replaced NA with 0
correction_date: [no change]
retraction: replaced NA with 0
correction: replaced NA with 0
no_action: replaced NA with 0
^completed: replaced NA with 0
^first_author: extracted text from authors
affiliation: from Microsoft Academic; author's current associated organization
publications: from Microsoft Academic; times author has published a paper
citations: from Microsoft Academic; times any of author's papers have been cited
year_start: from Microsoft Academic; year of first publication
year_end: from Microsoft Academic; year of last publication
top_topics: from Microsoft Academic; fields of study associated with given author
publication_types: from Microsoft Academic; mediums of publication from given author
top_authors: from Microsoft Academic; other authors associated with given author
top_journals: from Microsoft Academic; journals given author has published in
top_institutions: from Microsoft Academic; institutions associated with given author
top_conferences: from Microsoft Academic; conferences given author has presented at
lab_size_approx: count of distinct, non-NA values in top_authors
publication_variety: count of distinct, non-NA values in publication_types
journal_variety: count of distinct, non-NA values in top_journals
institution_variety: count of distinct, non-NA values in top_institutions
conference_variety: count of distinct, non-NA values in top_conferences
biology: 1 if top_topics contains "biology"; 0 otherwise
medicine: 1 if top_topics contains "medicine"; 0 otherwise
immunology: 1 if top_topics contains "immunology"; 0 otherwise
cancer: 1 if top_topics contains "cancer"; 0 otherwise
biochemistry: 1 if top_topics contains "biochemistry"; 0 otherwise
virology: 1 if top_topics contains "virology"; 0 otherwise
career_duration: 2020 - year_start
publication_rate: publications / career_duration
highest_degree: from LinkedIn; degree title of first education record
degree_area: from LinkedIn; field of study of first education record
degree_level: highest_degree expressed as a number, with more weight on higher honors
total_cites: from InCites; total citations of a given journal
impact_factor: from InCites; a measure of journal prestige within its field
eigenfactor: from InCites; importance of journal considering citations
web_interest: from Google Trends; interest score for web searches
images_interest: from Google Trends; interest score for image searches
youtube_interest: from Google Trends; interest score for video searches
^academic_reputation: from USNews; score for university reputation
^affiliation_funding: from manual research; whether a university is public or private

Other Material

Some of our analysis involve similarity/clustering techniques not mentioned in lecture (and accumulated gradually from years of reading stackexchange posts...). These resources help provide some context behind such terms.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
analysis		analysis
data-modification		data-modification
rselenium-chrome		rselenium-chrome
.DS_Store		.DS_Store
ID-104-Ethanol Extracts of Fruiting Bodies of Antrodia cinnamomea Suppress CL1-5 Human Lung Adenocarcinoma Cells Migration by Inhibiting Matrix Metalloproteinase-2:9 through ERK, JNK, p38 and PI3K:Akt Signaling Pathways.pdf		ID-104-Ethanol Extracts of Fruiting Bodies of Antrodia cinnamomea Suppress CL1-5 Human Lung Adenocarcinoma Cells Migration by Inhibiting Matrix Metalloproteinase-2:9 through ERK, JNK, p38 and PI3K:Akt Signaling Pathways.pdf
ID-111-Targeting breast cancerinitiating_stem cells with melanoma differentiationassociated gene7_interleukin24.pdf		ID-111-Targeting breast cancerinitiating_stem cells with melanoma differentiationassociated gene7_interleukin24.pdf
ID-113-HtrA1 in human urothelial bladder cancer_ A secreted protein and a potential novel biomarker.pdf		ID-113-HtrA1 in human urothelial bladder cancer_ A secreted protein and a potential novel biomarker.pdf
ID-136-Low titre autoantibodies against recoverin in sera of patients with small cell lung cancer but without a loss of vision.pdf		ID-136-Low titre autoantibodies against recoverin in sera of patients with small cell lung cancer but without a loss of vision.pdf
ID-153-TNF-a down-regulates the NaCeKC ATPase and the Na+-K+-2Cl- cotransporter in the rat colon via PGE2.pdf		ID-153-TNF-a down-regulates the NaCeKC ATPase and the Na+-K+-2Cl- cotransporter in the rat colon via PGE2.pdf
ID-154-Roles of p38 and ERK MAP kinases in IL-8 expression in TNF-a- and dexamethasone-stimulated human periodontal ligament cells.pdf		ID-154-Roles of p38 and ERK MAP kinases in IL-8 expression in TNF-a- and dexamethasone-stimulated human periodontal ligament cells.pdf
INF550_HW_BIGDATA_SEMAFOR.pdf		INF550_HW_BIGDATA_SEMAFOR.pdf
INF550_HW_EXTRACT_DISINFO_SEMAFOR.pdf		INF550_HW_EXTRACT_DISINFO_SEMAFOR.pdf
INF550_HW_WEBDATAVIZ_SEMAFOR.pdf		INF550_HW_WEBDATAVIZ_SEMAFOR.pdf
ProjectTeamID_03_v2.tsv		ProjectTeamID_03_v2.tsv
README.md		README.md
README.txt		README.txt
TEAM_03_BIGDATA.pdf		TEAM_03_BIGDATA.pdf
TEAM_03_DATAVIS.pdf		TEAM_03_DATAVIS.pdf
TEAM_03_EXTRACT.pdf		TEAM_03_EXTRACT.pdf
TEAM_03_INF550_HW_BIGDATA.zip		TEAM_03_INF550_HW_BIGDATA.zip
TEAM_03_UPDATED_DATSET.tsv		TEAM_03_UPDATED_DATSET.tsv
requirements.R		requirements.R

carlincherry/FalsifiedScientificPaperDetection

Folders and files

Latest commit

History

Repository files navigation

TEAM_03_INF550_HW_BIG_DATA

Team Members

Software

R Packages

Glossary of Features

Other Material

About

Resources

Stars

Watchers

Forks

Languages