Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability #32

Open
wants to merge 14 commits into base: dev

richardwu
Collaborator

@richardwu richardwu commented Nov 22, 2018

This PR introduces EM iterations to the repair process, where the current values are updated with the inferred values after every iteration, and adds support for multiple init values:

  • created separate column for init values (1 or more) and current value (singular value, old 'init_value')
  • all featurizers have been changed to reference current_value and renamed from e.g. InitFeaturizer to CurrentFeaturizer
  • update current_values in cell_domain with inferred values from inf_vals_dom
  • re-run featurization + training + inference with new current_values (featurizers such as CurrentAttrFeaturizer or CurrentXFeaturizer can take advantage of the updated current values)
  • fixed a bug in InitSimFeaturizer where it wasn't computing the similarity metrics correctly between the init_value and values in the domain
  • fixed a bug where we weren't properly detecting NULL values in NullDetector
  • current_value is initialized with the value from init_values with the highest sum of co-occurrence probabilities with the other init_values in the tuple
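The last bullet can be illustrated with a minimal sketch. It is not the code in this PR: the helper names (`collect_stats`, `cooccur_prob`, `pick_current_value`), the dictionary layout, and the direction of the conditional probability (candidate value given the other attribute's value) are all assumptions made for illustration.

```python
from collections import Counter


def collect_stats(rows):
    """Count single values and co-occurring value pairs.

    rows: list of dicts mapping attribute -> value (one value per cell).
    """
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())
        for attr, val in items:
            single[(attr, val)] += 1
        for attr_a, val_a in items:
            for attr_b, val_b in items:
                if attr_a != attr_b:
                    pair[(attr_a, val_a, attr_b, val_b)] += 1
    return single, pair


def cooccur_prob(single, pair, attr, val, other_attr, other_val):
    """Estimate P(attr=val | other_attr=other_val); 0.0 if other_val is unseen."""
    denom = single[(other_attr, other_val)]
    return pair[(attr, val, other_attr, other_val)] / denom if denom else 0.0


def pick_current_value(tuple_init_values, attr, single, pair):
    """Pick the init value of `attr` with the highest summed co-occurrence
    probability against every init value of the other attributes.

    tuple_init_values: dict mapping attribute -> list of init values (assumed layout).
    """
    def score(candidate):
        return sum(
            cooccur_prob(single, pair, attr, candidate, other_attr, other_val)
            for other_attr, other_vals in tuple_init_values.items()
            if other_attr != attr
            for other_val in other_vals
        )

    return max(tuple_init_values[attr], key=score)


# Toy data: "MA" co-occurs with "Boston" in every row where "Boston" appears.
rows = [
    {"city": "Boston", "state": "MA"},
    {"city": "Boston", "state": "MA"},
    {"city": "Austin", "state": "TX"},
]
single, pair = collect_stats(rows)
cell = {"city": ["Boston"], "state": ["TX", "MA"]}
print(pick_current_value(cell, "state", single, pair))  # MA
```

A cell with a single init value simply keeps that value, since `max` has only one candidate to choose from.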

I've tested this with 3 iterations with the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
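The iteration scheme described in this comment can be sketched roughly as follows. This is a toy under stated assumptions, not the PR's implementation: `run_em` and `majority_state_per_city` are hypothetical names, the "model" is a simple majority vote standing in for featurization + training + inference, and stopping early when nothing changes is an added convenience. As noted above, error detection stays outside the loop.

```python
from collections import Counter, defaultdict


def majority_state_per_city(current_values):
    """Toy stand-in for featurize + train + infer: predict each tuple's
    state as the most common current state among tuples with the same city."""
    tids = {tid for tid, _ in current_values}
    votes = defaultdict(Counter)
    for tid in tids:
        votes[current_values[(tid, "city")]][current_values[(tid, "state")]] += 1
    return {
        (tid, "state"): votes[current_values[(tid, "city")]].most_common(1)[0][0]
        for tid in tids
    }


def run_em(current_values, fit_and_infer, max_iters=3):
    """Repeatedly infer values and feed them back in as the new current values.

    current_values: dict mapping (tuple id, attribute) -> value (assumed layout).
    Returns the final current values and the number of changed cells per iteration.
    """
    changes_per_iter = []
    for _ in range(max_iters):
        inferred = fit_and_infer(current_values)
        changed = sum(
            1 for cell, val in inferred.items() if current_values.get(cell) != val
        )
        # Overwrite current values with the inferred ones for the next pass.
        current_values = {**current_values, **inferred}
        changes_per_iter.append(changed)
        if changed == 0:  # assumption: treat "no changes" as convergence
            break
    return current_values, changes_per_iter


cells = {
    (0, "city"): "Boston", (0, "state"): "MA",
    (1, "city"): "Boston", (1, "state"): "MA",
    (2, "city"): "Boston", (2, "state"): "TX",
}
final, changes = run_em(cells, majority_state_per_city)
print(final[(2, "state")], changes)  # MA [1, 0]
```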

@thodrek

thodrek commented Nov 22, 2018

How on earth are F1 and Recall > 1.0? See repairing F1 and repairing recall.

@richardwu
Collaborator Author

richardwu commented Nov 22, 2018

Pushed up a patch to update single/co-occur stats after each EM iteration for OccurFeaturizer. Interestingly enough, on the second iteration our recall goes up but our precision goes down (since we are making more repairs):

// After iteration 1
INFO:root:Precision = 0.93, Recall = 0.68, Repairing Recall = 0.76, F1 = 0.79, Repairing F1 = 0.84, Detected Errors = 458, Total Errors = 509, Correct Repairs = 347, Total Repairs = 372, Total Repairs (clean data) = 372

// After iteration 2
INFO:root:Precision = 0.89, Recall = 0.71, Repairing Recall = 0.79, F1 = 0.79, Repairing F1 = 0.83, Detected Errors = 458, Total Errors = 509, Correct Repairs = 361, Total Repairs = 407, Total Repairs (clean data) = 407
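For readers checking the arithmetic, the figures in the two log lines above are consistent with the definitions below. These formulas are inferred from the reported counts, not taken from the evaluation code, and `repair_metrics` is a hypothetical helper name.

```python
def repair_metrics(correct_repairs, total_repairs, total_errors, detected_errors):
    """Recompute the logged metrics from raw counts (definitions are inferred)."""
    def ratio(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    precision = ratio(correct_repairs, total_repairs)
    recall = ratio(correct_repairs, total_errors)              # over all errors
    repairing_recall = ratio(correct_repairs, detected_errors)  # over detected errors only
    return {
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "repairing_recall": round(repairing_recall, 2),
        "f1": round(f1(precision, recall), 2),
        "repairing_f1": round(f1(precision, repairing_recall), 2),
    }


# Counts from the "After iteration 1" and "After iteration 2" log lines.
print(repair_metrics(347, 372, 509, 458))
print(repair_metrics(361, 407, 509, 458))
```

Under these definitions, the 14 additional correct repairs in iteration 2 raise recall, while the 35 additional total repairs lower precision.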

I attempted to do more iterations but there is an issue with how we use Pools where we allocate a new pool of workers every time. I'll fix this in a separate PR.
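One common way to avoid allocating a new pool on every iteration is to create the pool once and reuse it across iterations. The sketch below is only an illustration of that pattern, not the fix planned for the separate PR: `featurize_chunk` and `run_iterations` are hypothetical names, and `multiprocessing.pool.ThreadPool` is used in place of a process pool to keep the example self-contained (it exposes the same `map` interface).

```python
from multiprocessing.pool import ThreadPool


def featurize_chunk(chunk):
    """Hypothetical stand-in for per-chunk featurization work."""
    return [len(str(value)) for value in chunk]


def run_iterations(chunks, num_iters, workers=2):
    """Run `num_iters` passes over `chunks`, sharing one pool across all passes."""
    results = []
    # The pool is created once here and torn down when the with-block exits.
    with ThreadPool(processes=workers) as pool:
        for _ in range(num_iters):
            results.append(pool.map(featurize_chunk, chunks))
    return results


print(run_iterations([["a", "bb"], ["ccc"]], num_iters=3))
```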

@richardwu
Collaborator Author

@thodrek I forgot to update current_value to init_values for total_repairs_clean. I've since fixed it (t1.init_values != t2.rv_value).

@thodrek

thodrek commented Nov 22, 2018

Sounds good.

@richardwu richardwu changed the title from "DO NOT MERGE: Added EM iterations to repair process" to "DO NOT MERGE: Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability" Nov 24, 2018
@richardwu richardwu changed the title from "DO NOT MERGE: Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability" to "Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability" Nov 24, 2018
@richardwu
Collaborator Author

Newest results with this patch, including the fix to InitAttrFeaturizer (now called CurrentAttrFeaturizer):

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

Latest changes:

  • multiple initial values in raw dataset (values separated by '|||') work now
  • current_stats=True will enable statistics to be re-collected on new current values after each EM iteration
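As a small illustration of the first bullet, a raw cell holding several initial values could be split like this. Only the '|||' separator comes from this PR; the helper name `parse_init_values`, the whitespace stripping, and the handling of empty or missing cells are assumptions.

```python
INIT_VALUE_SEPARATOR = "|||"  # separator stated in this PR


def parse_init_values(raw_cell):
    """Split a raw cell into its list of initial values.

    Assumptions: surrounding whitespace is stripped, empty pieces are
    dropped, and a missing cell (None) yields an empty list.
    """
    if raw_cell is None:
        return []
    pieces = (piece.strip() for piece in raw_cell.split(INIT_VALUE_SEPARATOR))
    return [piece for piece in pieces if piece]


print(parse_init_values("MA|||TX"))  # ['MA', 'TX']
print(parse_init_values("Boston"))   # ['Boston']
```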

Ready for another review 👀

@thodrek thodrek requested review from thodrek and removed request for ScarletGuo November 24, 2018 02:35

2 participants