R package for guessing gender given a name (with 99.3% hit rate).

cassiopagnoncelli/genderguess

Synopsis

genderguess is an R library to guess gender given a name and is part of a data enrichment suite.

Installation

library('devtools')
install_github('cassiopagnoncelli/genderguess')

Usage

gender_guess infers gender from the suffix of a first name. gender is a wrapper that accepts names as JSON and returns the predictions as JSON.

> gender_guess("Cassio")
1 
m 
Levels: f m
> gender('{"names":[
  "Cássius",
  "g4BRIéL",
  "EduArda",
  "Geyzebel Cardoso",
  "JULIAN",
  "  Pritham-Kumar Bora Bora",
  "NAARA Katheline Cilva",
  "SaMAnTa Fyorrentin"
]}')
["m","m","f","f","m","m","f","f"] 
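
The inputs above mix case, accents, padding, and surnames, which suggests some normalization before classification. The following is only a rough sketch of what such a step might look like, not the package's actual code: the helper name first_name_suffix and the suffix length n = 3 are assumptions made for illustration, and the result of iconv transliteration depends on the platform.

```r
# Hypothetical helper: reduce a raw name to a lowercase first-name suffix.
# The suffix length n = 3 is an assumption; the package may use another.
first_name_suffix <- function(x, n = 3) {
  x <- trimws(x)
  # Keep only the first whitespace-delimited token (the first name).
  first <- vapply(strsplit(x, "[[:space:]]+"), function(p) p[1], character(1))
  # Transliterate to ASCII; the output of //TRANSLIT is platform-dependent,
  # so fall back to the untouched name when the conversion fails.
  ascii <- iconv(enc2utf8(first), from = "UTF-8", to = "ASCII//TRANSLIT")
  ascii <- ifelse(is.na(ascii), first, ascii)
  ascii <- tolower(ascii)
  # Take the last n characters (or the whole name if it is shorter).
  len <- nchar(ascii)
  substr(ascii, pmax(1, len - n + 1), len)
}

first_name_suffix(c("JULIAN", "  Pritham-Kumar Bora Bora", "SaMAnTa Fyorrentin"))
```

Only the first token is kept because the description below states that the model works on the first name's suffix.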

Description

gender is a machine learning model that classifies names as either male or female. It works well for Brazilian names and ships with a names database, although the approach is more general and could be applied beyond national names.

The model follows a two-layer voters-judge architecture, preceded by an ETL phase:

  1. ETL phase: Converts the input name, whatever format it is given in, into the suffix of its first name, transliterated to the Latin alphabet.
  2. First layer phase: The suffix is fed to four different trained classifiers, called voters, each of which outputs its own vote on whether the name should be m or f. The classifiers used are SVM, Random Forest, Decision Tree, and Neural Network. (Individual voters average a hit rate of 80-85%.)
  3. Second layer phase: Given the individual votes from each classifier, an aggregator classifier, called the judge, decides which gender to assign to the given name. (The classification hit rate rises to over 97%, well above that of the individual classifiers.)
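
The second layer can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the package's implementation: the source does not say which model the judge uses, so here the judge is a simple lookup table that maps each pattern of four votes to the most frequent true label seen in training. The vote data, the function names, and the tie rule in the fallback are all made up for illustration.

```r
# Toy votes from four hypothetical voters on six training names.
votes <- data.frame(
  svm  = c("m", "m", "f", "f", "m", "f"),
  rf   = c("m", "f", "f", "f", "m", "m"),
  tree = c("m", "m", "f", "m", "f", "f"),
  nnet = c("m", "m", "f", "f", "m", "f"),
  stringsAsFactors = FALSE
)
truth <- c("m", "m", "f", "f", "m", "f")

# Encode each row of votes as one pattern string, e.g. "mfmm".
vote_pattern <- function(v) do.call(paste0, unname(as.list(v)))

# Train the judge: for each observed pattern, keep the most frequent true label.
train_judge <- function(votes, truth) {
  vapply(split(truth, vote_pattern(votes)),
         function(y) names(which.max(table(y))),
         character(1))
}

# Predict with the judge; unseen patterns fall back to a plain majority vote
# (ties go to "f", an arbitrary choice for this sketch).
judge_predict <- function(judge, votes) {
  out <- unname(judge[vote_pattern(votes)])
  fallback <- ifelse(rowSums(votes == "m") > ncol(votes) / 2, "m", "f")
  ifelse(is.na(out), fallback, out)
}

judge <- train_judge(votes, truth)
judge_predict(judge, data.frame(svm = "m", rf = "f", tree = "m", nnet = "m",
                                stringsAsFactors = FALSE))
```

A trained judge can learn which voters to trust for which vote patterns, which is how an aggregator can outperform each voter taken alone.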

Diagram

Caveats

Depending on machine conditions, bulk calls can classify upward of a few thousand instances.

Each independent call incurs the classifier's overhead, so it is recommended to group names into bulk classification calls.
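
One way to prepare bulk calls is to group names into batches and build one JSON payload per batch, in the {"names":[...]} shape shown under Usage. The sketch below is an illustration only: build_payloads and its batch_size default are hypothetical, and it assumes names contain no double quotes or backslashes because it performs no JSON escaping.

```r
# Hypothetical helper: split names into batches and build one JSON payload
# per batch. No JSON escaping is done, so quotes and backslashes are rejected.
build_payloads <- function(name_vec, batch_size = 1000) {
  stopifnot(!any(grepl('"', name_vec, fixed = TRUE)),
            !any(grepl("\\", name_vec, fixed = TRUE)))
  # Assign consecutive names to batches of at most batch_size elements.
  batches <- split(name_vec, ceiling(seq_along(name_vec) / batch_size))
  vapply(batches, function(b) {
    sprintf('{"names":[%s]}', paste0('"', b, '"', collapse = ","))
  }, character(1), USE.NAMES = FALSE)
}

payloads <- build_payloads(c("Ana", "Bruno", "Carla", "Diego", "Elisa"),
                           batch_size = 2)
# Each payload could then be passed to gender(), one call per batch:
# results <- lapply(payloads, gender)
```

A suitable batch size depends on the machine and is best found by measurement.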

Deployment

The package can be deployed as

  • a JSON webservice,
  • an R script running in a cron.
