Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text 158 #112

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions src/changes/changes.xml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ The <action> type attribute can be add,update,fix,remove.
<action issue="TEXT-156" type="update" dev="aherbert">Fix the RegexTokenizer to use a static Pattern</action>
<action issue="TEXT-157" type="update" dev="aherbert">Remove rounding from JaccardDistance and JaccardSimilarity</action>
<action issue="TEXT-151" type="fix" dev="aherbert">Fix the JaroWinklerSimilarity to use StringUtils.equals to test for CharSequence equality</action>
<action issue="TEXT-158" type="fix" dev="milaKisialiova" due-to="@CAPS50">Fix the JaccardSimilarity to return correct similarity for two empty strings</action>
</release>

<release version="1.6" date="2018-10-12" description="Release 1.6">
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ private Double calculateJaccardSimilarity(final CharSequence left, final CharSeq
final int leftLength = left.length();
final int rightLength = right.length();
if (leftLength == 0 || rightLength == 0) {
return 0d;
return leftLength == rightLength ? 1d : 0d;
}
final Set<Character> leftSet = new HashSet<>();
for (int i = 0; i < leftLength; i++) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ public static void setUp() {
@Test
public void testGettingJaccardDistance() {
// Expected Jaccard distance = 1.0 - (intersect / union)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this comment is out of sync with the code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, can you describe why this comment looks wrong? For me it's good, shows the formula, as well as in JaccardSimilarityTest.java
// Expected Jaccard similarity = (intersect / union)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the comment reads:

// Expected Jaccard distance = 1.0 - (intersect / union)

and the next line is in contradiction with this comment, expecting 0.0, not 1.0.

assertEquals(0.0, classBeingTested.apply("", ""));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @garydgregory , I'm still not convinced.
Intersection of two empty sets is empty set, union is also empty set. They are equal. So Jaccard similarity is 1. And therefore distance is 0.

assertEquals(1.0, classBeingTested.apply("", ""));
assertEquals(0.0, classBeingTested.apply("", ""));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works because the unit test passes in the same string on both sides, i.e. "" vs "".

Do not use == to measure equality. This measures the same object reference. Try this in your test:

// Change 
assertEquals(0.0, classBeingTested.apply("", ""));
// to 
assertEquals(0.0, classBeingTested.apply("", new StringBuilder()));

You should fix the code to use StringUtils.equals(CharSequence, CharSequence) to test for equality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, but

  1. Operator == is used to measure two int values (length).
  2. with new StringBuilder() also works well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you are right. It was my mistake carried over from fixing another similarity class to use StringUtils.equals.

However you should consider using StringUtils.equals before splitting all the sequences into a set to provide a fast exit in the case of equal input. This is done in JaroWinklerSimilarity.

assertEquals(1.0, classBeingTested.apply("left", ""));
assertEquals(1.0, classBeingTested.apply("", "right"));
assertEquals(1.0 - (3.0 / 4), classBeingTested.apply("frog", "fog"));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ public static void setUp() {
@Test
public void testGettingJaccardSimilarity() {
// Expected Jaccard similarity = (intersect / union)
assertEquals(0.0, classBeingTested.apply("", ""));
assertEquals(1.0, classBeingTested.apply("", ""));
assertEquals(0.0, classBeingTested.apply("left", ""));
assertEquals(0.0, classBeingTested.apply("", "right"));
assertEquals(3.0 / 4, classBeingTested.apply("frog", "fog"));
Expand Down