Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability #32

richardwu · 2018-11-22T02:44:14Z

This PR introduces EM iterations to the repair process and adds support for multiple init values. Changes:

created separate column for init values (1 or more) and current value (singular value, old 'init_value')
all featurizers have been changed to reference current_value and renamed from e.g. InitFeaturizer to CurrentFeaturizer
update current_values in cell_domain with inferred values from inf_vals_dom
re-run featurization + training + inference with new current_values (featurizers such as CurrentAttrFeaturizer or CurrentXFeaturizer can take advantage of the updated current values)
fixed a bug in InitSimFeaturizer where it wasn't computing the similarity metrics correctly between the init_value and values in the domain
fixed a bug where we weren't properly detecting NULL values in NullDetector
current_value is initialized with the value from init_values with the highest sum of co-occurrence probabilities with the other init_values in the tuple
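The last item can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name pick_current_value, the cooccur_prob lookup, and the toy probabilities are all hypothetical. Each candidate init value is scored by summing its co-occurrence probability with the init values of the other attributes in the tuple, and the highest-scoring candidate becomes the current value.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: (attr, value, other_attr, other_value) -> co-occurrence probability.
CooccurProb = Dict[Tuple[str, str, str, str], float]


def pick_current_value(
    attr: str,
    init_values: Dict[str, List[str]],
    cooccur_prob: CooccurProb,
) -> str:
    """Return the init value of `attr` with the highest summed co-occurrence
    probability against the init values of every other attribute in the tuple."""
    def score(candidate: str) -> float:
        total = 0.0
        for other_attr, other_values in init_values.items():
            if other_attr == attr:
                continue
            for other_value in other_values:
                # Unseen pairs contribute zero probability.
                total += cooccur_prob.get((attr, candidate, other_attr, other_value), 0.0)
        return total

    # max() keeps the first candidate on ties, so init value order breaks ties.
    return max(init_values[attr], key=score)


# Toy tuple: `city` has two init values, `zip` and `state` have one each.
row = {"city": ["Birmingham", "Boston"], "zip": ["35233"], "state": ["AL"]}
probs = {
    ("city", "Birmingham", "zip", "35233"): 0.9,
    ("city", "Birmingham", "state", "AL"): 0.6,
    ("city", "Boston", "zip", "35233"): 0.05,
}
current_city = pick_current_value("city", row, probs)
```

Later commits in this PR mention a weight adjustment and a correction factor for multiple init values; the sketch leaves both out.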

I've tested this with 3 iterations on the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
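The iteration scheme described in this comment could look roughly like the following. All names here (run_em_repair, featurize_train_infer, the dict-based cell store) are hypothetical stand-ins rather than HoloClean's API, and the early exit when nothing changes is an addition of this sketch. The point is only the control flow: infer, write the inferred values back as current values, and repeat.

```python
from typing import Callable, Dict, Tuple

Cell = Tuple[int, str]    # (tuple id, attribute)
Values = Dict[Cell, str]  # cell -> value


def run_em_repair(
    current_values: Values,
    featurize_train_infer: Callable[[Values], Values],
    max_iterations: int = 3,
) -> Tuple[Values, int]:
    """Repeat featurization + training + inference, feeding each round's
    inferred values back in as the next round's current values."""
    current = dict(current_values)
    for iteration in range(1, max_iterations + 1):
        inferred = featurize_train_infer(current)
        updated = {**current, **inferred}
        if updated == current:
            # No cell changed, so further iterations would repeat this one.
            return current, iteration
        current = updated
    return current, max_iterations


# Toy stand-in for the model: repairs `state` first, then `city` once `state` is clean.
def toy_model(values: Values) -> Values:
    repairs = {}
    if values[(0, "state")] == "al":
        repairs[(0, "state")] = "AL"
    elif values[(0, "city")] == "birmingham":
        repairs[(0, "city")] = "Birmingham"
    return repairs


dirty = {(0, "state"): "al", (0, "city"): "birmingham"}
repaired, iterations_used = run_em_repair(dirty, toy_model)
```

In the toy run, the second repair only becomes possible after the first one has been written back, which is the benefit the EM iterations are meant to capture.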

thodrek · 2018-11-22T17:41:25Z

How on earth are F1 and Recall > 1.0? See Repairing F1 and Repairing Recall.

richardwu · 2018-11-22T17:42:21Z

Pushed up a patch to update single/co-occur stats after each EM iteration for OccurFeaturizer. Interestingly, on the second iteration our recall goes up but our precision goes down (since we are making more repairs):

// After iteration 1
INFO:root:Precision = 0.93, Recall = 0.68, Repairing Recall = 0.76, F1 = 0.79, Repairing F1 = 0.84, Detected Errors = 458, Total Errors = 509, Correct Repairs = 347, Total Repairs = 372, Total Repairs (clean data) = 372

// After iteration 2
INFO:root:Precision = 0.89, Recall = 0.71, Repairing Recall = 0.79, F1 = 0.79, Repairing F1 = 0.83, Detected Errors = 458, Total Errors = 509, Correct Repairs = 361, Total Repairs = 407, Total Repairs (clean data) = 407
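The figures in these log lines appear consistent with the following definitions, which are inferred here from the logged counts rather than quoted from eval.py: precision is correct repairs over total repairs, recall is correct repairs over total errors, and repairing recall is correct repairs over detected errors. A quick check against the iteration 2 line (function names are illustrative):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def repair_metrics(correct_repairs: int, total_repairs: int,
                   total_errors: int, detected_errors: int) -> dict:
    # Assumed definitions, inferred from the logged counts.
    precision = correct_repairs / total_repairs
    recall = correct_repairs / total_errors
    repairing_recall = correct_repairs / detected_errors
    return {
        "precision": precision,
        "recall": recall,
        "repairing_recall": repairing_recall,
        "f1": f1(precision, recall),
        "repairing_f1": f1(precision, repairing_recall),
    }


# Counts copied from the "After iteration 2" log line.
metrics = repair_metrics(correct_repairs=361, total_repairs=407,
                         total_errors=509, detected_errors=458)
rounded = {name: round(value, 2) for name, value in metrics.items()}
```

Rounded to two decimals, these reproduce the logged 0.89, 0.71, 0.79, 0.79 and 0.83.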

I attempted to do more iterations but there is an issue with how we use Pools where we allocate a new pool of workers every time. I'll fix this in a separate PR.

richardwu · 2018-11-22T17:44:01Z

@thodrek I forgot to update current_value to init_values for total_repairs_clean. I've since fixed it (holoclean/evaluate/eval.py, line 164 in 32ae5ef: t1.init_values != t2.rv_value).

thodrek · 2018-11-22T17:44:44Z

Sounds good.

richardwu · 2018-11-24T02:31:06Z

Newest results with this patch, including the fix to InitAttrFeaturizer (now called CurrentAttrFeaturizer):

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

Latest changes:

multiple initial values in raw dataset (values separated by '|||') work now
current_stats=True will enable statistics to be re-collected on new current values after each EM iteration
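As an illustration of the multi-value format, a raw cell could be split on the '|||' separator as below. The helper name parse_init_values and the whitespace trimming are assumptions of this sketch, not necessarily what the PR does.

```python
from typing import List

INIT_VALUE_SEPARATOR = "|||"  # separator named in this PR


def parse_init_values(raw_cell: str) -> List[str]:
    """Split a raw cell into its init values; a cell without the separator
    yields a single-element list."""
    # Trimming whitespace around each value is an assumption of this sketch.
    return [value.strip() for value in raw_cell.split(INIT_VALUE_SEPARATOR)]


single = parse_init_values("Birmingham")
multiple = parse_init_values("Birmingham|||Boston")
has_multiple = len(multiple) > 1  # such cells could then be flagged for repair
```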

Ready for another review 👀

thodrek requested review from minafarid and ScarletGuo November 22, 2018 17:41

richardwu force-pushed the em_for_repair branch from 32ae5ef to fb01e01 Compare November 23, 2018 23:59

richardwu changed the title ~~DO NOT MERGE: Added EM iterations to repair process~~ DO NOT MERGE: Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability Nov 24, 2018

richardwu added 8 commits November 23, 2018 19:03

Created separate column for init values (1 or more) and current value (singular value, old 'init_value').

3463b78

Fixed some typos for current columns.

394d0bd

Addressed PR comments.

5d67501

Use list comprehension over map-list.

7c7a59c

Cleaned up some private functions and accesses to aux_tables.

58574f1

Added method to update current values with inferred values and EM iterations for repair.

f34ba7c

Re-compute single and co-occur stats after every EM iteration.

d0fc0da

Add option to enable current stats updates. Updated code to allow multiple init values by specifying init values in raw data separated by |||.

39088f4

richardwu force-pushed the em_for_repair branch from fb01e01 to 39088f4 Compare November 24, 2018 00:06

richardwu added 2 commits November 23, 2018 20:20

Fixed report/status in get_featurizer_weights.

77514ec

Use weight-adjusted frequency and co-occur for multiple init values.

6a437c4

richardwu mentioned this pull request Nov 24, 2018

Created separate column for init values (1 or more) and current value (singular value, old 'init_value') #30

Closed

Reduce epoches for faster runs.

fd3ba08

thodrek requested review from thodrek and removed request for ScarletGuo November 24, 2018 02:35

richardwu added 3 commits November 23, 2018 21:45

Fixed correction factor for current value selection from multiple initial values.

38b95b8

Added MultiInitDetector for marking cells with multiple initial values as DK.

3542531

Fixed str matching for MultiInitDetector.

6413a30
