Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability #32
base: dev
How on earth are F1 and recall > 1.0? See repairing F1 and repairing recall.
Pushed up a patch to update single/co-occur stats after each EM iteration for
I attempted to do more iterations but there is an issue with how we use
Sounds good.
32ae5ef to fb01e01
(singular value, old 'init_value').
iterations for repair.
multiple init values by specifying init values in raw data separated by |||.
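As a minimal sketch of the multi-value convention mentioned in the commit message above, a raw cell holding several init values separated by `|||` could be split as follows. The helper name `parse_init_values` and the whitespace stripping are assumptions made for illustration; they are not taken from this PR.

```python
SEPARATOR = "|||"  # delimiter named in the commit message


def parse_init_values(raw_cell):
    """Split a raw cell into its candidate init values (hypothetical helper)."""
    if raw_cell is None:
        return []
    # str.split treats the separator as a literal string, so '|' needs no escaping.
    # Stripping whitespace and dropping empty pieces is an assumption of this sketch.
    return [v.strip() for v in raw_cell.split(SEPARATOR) if v.strip()]


print(parse_init_values("Birmingham|||Birminghm"))  # ['Birmingham', 'Birminghm']
print(parse_init_values("35233"))                   # ['35233']
```

A cell without the separator simply yields a one-element list, so single-valued data would pass through unchanged under this sketch.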
fb01e01 to 39088f4
Newest results with this patch with fix to
Latest changes:
Ready for another review 👀
This PR introduces EM iterations to the repair process and adds support for multiple init values:
- [...] `current_value` and renamed from e.g. `InitFeaturizer` to `CurrentFeaturizer`
- [...] `current_value`s in `cell_domain` with inferred values from `inf_vals_dom`
- [...] `current_value`s (featurizers such as `CurrentAttrFeaturizer` or `CurrentXFeaturizer`) can take advantage of the updated current values
- [...] `InitSimFeaturizer`, where it wasn't computing the similarity metrics correctly between the `init_value` and values in the domain
- [...] `NULL` values in `NullDetector`
- `current_value` is initialized with the value from `init_values` that has the highest sum of co-occurrence probabilities with the other `init_values` in the tuple

I've tested this with 3 iterations on the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.
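The initialization rule described above (pick the init value with the highest summed co-occurrence probability against the other init values in the tuple) can be sketched roughly as below. The function `pick_current_value`, the dict layout, and the `cooccur_prob` callable are hypothetical stand-ins chosen for illustration, and the probabilities are made-up toy numbers, not results from this PR.

```python
def pick_current_value(attr, init_values, cooccur_prob):
    """Choose the candidate for `attr` that co-occurs best with the rest of the tuple.

    init_values:  dict mapping attribute -> list of candidate init values (one tuple)
    cooccur_prob: callable (attr, val, other_attr, other_val) -> float
    Both structures are assumptions of this sketch.
    """
    def score(val):
        # Sum co-occurrence probabilities against every init value of every other attribute.
        return sum(
            cooccur_prob(attr, val, other_attr, other_val)
            for other_attr, other_vals in init_values.items()
            if other_attr != attr
            for other_val in other_vals
        )

    return max(init_values[attr], key=score)


# Toy co-occurrence table (illustrative numbers only).
toy_probs = {
    ("city", "Birmingham", "zip", "35233"): 0.9,
    ("city", "Birminghm", "zip", "35233"): 0.1,
}


def toy_prob(attr, val, other_attr, other_val):
    return toy_probs.get((attr, val, other_attr, other_val), 0.0)


tuple_init = {"city": ["Birminghm", "Birmingham"], "zip": ["35233"]}
print(pick_current_value("city", tuple_init, toy_prob))  # Birmingham
```

When an attribute has a single init value, `max` returns it regardless of score, so the rule only matters for cells with several candidates.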
NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
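For readers unfamiliar with the iteration scheme, here is a rough sketch of a repair loop that feeds inferred values back in as current values and stops once nothing changes. The names `run_em_repair` and `fit_and_infer`, and the toy majority-vote inference, are assumptions for illustration only; the actual training and inference steps in this PR are not shown here. Consistent with the note above, error detection is not re-run inside the loop.

```python
from collections import Counter


def run_em_repair(current_values, fit_and_infer, iterations=3):
    """Iteratively replace current values with inferred ones (illustrative sketch).

    current_values: dict mapping cell id -> current value
    fit_and_infer:  callable taking that dict and returning cell id -> inferred value
    Returns the final values and the number of changed cells per iteration.
    """
    current = dict(current_values)
    changes_per_iter = []
    for _ in range(iterations):
        # Detection is not repeated here; only repair inference is iterated.
        inferred = fit_and_infer(current)
        changed = sum(1 for cell, val in inferred.items() if current.get(cell) != val)
        current.update(inferred)
        changes_per_iter.append(changed)
        if changed == 0:  # no further repairs, treat as converged
            break
    return current, changes_per_iter


def majority_infer(current):
    """Toy inference: assign every cell the most common current value."""
    most_common = Counter(current.values()).most_common(1)[0][0]
    return {cell: most_common for cell in current}


cells = {"t1": "Birmingham", "t2": "Birmingham", "t3": "Birminghm"}
final, changes = run_em_repair(cells, majority_infer)
print(final)
print(changes)  # [1, 0]
```

In this toy run the second iteration makes no changes, so the loop exits early; a real model would need its own convergence criterion.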