Commit af5b4c6

Merge pull request #20 from MpoMp/master
DL Optimal String Alignment implementation
2 parents b1eaf3f + 8e80da4 commit af5b4c6

4 files changed

+215
-2
lines changed

README.md

Lines changed: 25 additions & 1 deletion
@@ -45,7 +45,7 @@ The main characteristics of each implemented algorithm are presented below. The
 | [Normalized Levenshtein](#normalized-levenshtein) |distance<br>similarity | Yes | No | | O(m*n) <sup>1</sup> |
 | [Weighted Levenshtein](#weighted-levenshtein) |distance | No | No | | O(m*n) <sup>1</sup> |
 | [Damerau-Levenshtein](#damerau-levenshtein) <sup>3</sup> |distance | No | Yes | | O(m*n) <sup>1</sup> |
-| Optimal String Alignment <sup>3</sup> |not implemented yet | No | No | | O(m*n) <sup>1</sup> |
+| [Optimal String Alignment](#optimal-string-alignment) <sup>3</sup> |distance | No | No | | O(m*n) <sup>1</sup> |
 | [Jaro-Winkler](#jaro-winkler) |similarity<br>distance | Yes | No | | O(m*n) |
 | [Longest Common Subsequence](#longest-common-subsequence) |distance | No | No | | O(m*n) <sup>1,2</sup> |
 | [Metric Longest Common Subsequence](#metric-longest-common-subsequence) |distance | Yes | Yes | | O(m*n) <sup>1,2</sup> |
@@ -210,7 +210,31 @@ Will produce:
 6.0
 ```

+## Optimal String Alignment
+The Optimal String Alignment variant of Damerau–Levenshtein (sometimes called the restricted edit distance) computes the number of edit operations needed to make the strings equal under the condition that **no substring is edited more than once**, whereas the true Damerau–Levenshtein presents no such restriction.
+The difference from the algorithm for Levenshtein distance is the addition of one recurrence for the transposition operations.

+Note that for the optimal string alignment distance, the triangle inequality does not hold and so it is not a true metric.
+
+```java
+import info.debatty.java.stringsimilarity.*;
+
+public class MyApp {
+
+
+    public static void main(String[] args) {
+        OptimalStringAlignment osa = new OptimalStringAlignment();
+
+        System.out.println(osa.distance("CA", "ABC"));
+    }
+}
+```
+
+Will produce:
+
+```
+3.0
+```

 ## Jaro-Winkler
 Jaro-Winkler is a string edit distance that was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The Jaro–Winkler distance metric is designed and best suited for short strings such as person names, and to detect typos.
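
The README note above says the triangle inequality does not hold for the optimal string alignment distance. A minimal sketch of how that can be checked on the same `"CA"` / `"ABC"` pair follows. The class `OsaTriangleDemo` and its `osa` method are hypothetical stand-ins written for this illustration, not the library's `OptimalStringAlignment` class, and the sketch assumes a fixed transposition cost of 1 (the common textbook form) rather than the `cost` variable used in the commit.

```java
// Illustrative sketch only: a stand-alone OSA distance (hypothetical helper,
// not the library class) used to check a triangle-inequality counterexample.
final class OsaTriangleDemo {

    // Restricted edit distance: insertion, deletion, substitution,
    // and transposition of two adjacent characters.
    static int osa(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters of s1
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters of s2
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i][j - 1] + 1, d[i - 1][j] + 1));
                // Assumption of this sketch: a transposition always costs 1.
                if (i > 1 && j > 1
                        && s1.charAt(i - 1) == s2.charAt(j - 2)
                        && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        int direct = osa("CA", "ABC");                  // 3
        int viaAc = osa("CA", "AC") + osa("AC", "ABC"); // 1 + 1
        // Prints: direct = 3, via AC = 2
        System.out.println("direct = " + direct + ", via AC = " + viaAc);
    }
}
```

Going through the intermediate string `"AC"` costs 2 in total, while the direct distance is 3, so d(CA, ABC) > d(CA, AC) + d(AC, ABC) and the inequality fails for this triple.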
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+/*
+ * The MIT License
+ *
+ * Copyright 2016 Thibault Debatty.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+package info.debatty.java.stringsimilarity;
+
+import info.debatty.java.stringsimilarity.interfaces.StringDistance;
+import net.jcip.annotations.Immutable;
+
+/**
+ * Implementation of the Optimal String Alignment (sometimes called the
+ * restricted edit distance) variant of the Damerau-Levenshtein distance.
+ *
+ * The difference between the two algorithms consists in that the Optimal String
+ * Alignment algorithm computes the number of edit operations needed to make the
+ * strings equal under the condition that no substring is edited more than once,
+ * whereas Damerau-Levenshtein presents no such restriction.
+ *
+ * @author Michail Bogdanos
+ */
+@Immutable
+public final class OptimalStringAlignment implements StringDistance {
+
+    /**
+     * Compute the distance between strings: the minimum number of operations
+     * needed to transform one string into the other (insertion, deletion,
+     * substitution of a single character, or a transposition of two adjacent
+     * characters) while no substring is edited more than once.
+     *
+     * @param s1 the first input string
+     * @param s2 the second input string
+     * @return the OSA distance
+     */
+    public final double distance(final String s1, final String s2) {
+        int n = s1.length(), m = s2.length();
+        if (n == 0) return m;
+        if (m == 0) return n;
+
+
+        // Create the distance matrix d[0 .. s1.length+1][0 .. s2.length+1]
+        int[][] d = new int[s1.length() + 2][s2.length() + 2];
+
+        //initialize top row and leftmost column
+        for (int i = 0; i <= n; i++) {
+            d[i][0] = i;
+        }
+        for (int j = 0; j <= m; j++) {
+            d[0][j] = j;
+        }
+
+        //fill the distance matrix
+        int cost;
+
+        for (int i = 1; i <= n; i++) {
+            for (int j = 1; j <= m; j++) {
+
+                //if s1[i - 1] = s2[j - 1] then cost = 0, else cost = 1
+                cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
+
+                d[i][j] = min(
+                        d[i - 1][j - 1] + cost, // substitution
+                        d[i][j - 1] + 1,        // insertion
+                        d[i - 1][j] + 1         // deletion
+                );
+
+                //transposition check
+                if (i > 1 && j > 1
+                        && s1.charAt(i - 1) == s2.charAt(j - 2)
+                        && s1.charAt(i - 2) == s2.charAt(j - 1)
+                        ){
+                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
+                }
+            }
+        }
+
+        return d[n][m];
+    }
+
+    private static int min(
+            final int a, final int b, final int c) {
+        return Math.min(a, Math.min(b, c));
+    }
+}
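
The class Javadoc above contrasts OSA with the unrestricted Damerau-Levenshtein distance, which may edit a substring more than once. As a rough illustration of that difference, the sketch below computes the unrestricted distance for the same `"CA"` / `"ABC"` pair that the README reports as `3.0` under OSA. `DamerauContrastDemo` and `damerauLevenshtein` are hypothetical names for this example; the code follows the well-known last-occurrence formulation of the algorithm and is not taken from the library's `Damerau` class.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (hypothetical class, not the library's Damerau):
// unrestricted Damerau-Levenshtein, where a transposed pair may have other
// edits between its two characters.
final class DamerauContrastDemo {

    static int damerauLevenshtein(String a, String b) {
        int n = a.length();
        int m = b.length();
        int inf = n + m; // larger than any real distance

        // d[i + 1][j + 1] holds the distance between a[0..i) and b[0..j);
        // row 0 and column 0 are sentinels filled with inf.
        int[][] d = new int[n + 2][m + 2];
        d[0][0] = inf;
        for (int i = 0; i <= n; i++) {
            d[i + 1][0] = inf;
            d[i + 1][1] = i;
        }
        for (int j = 0; j <= m; j++) {
            d[0][j + 1] = inf;
            d[1][j + 1] = j;
        }

        // Last row of a in which each character was seen (0 = not seen yet).
        Map<Character, Integer> lastRow = new HashMap<>();

        for (int i = 1; i <= n; i++) {
            int lastMatchCol = 0; // last column in this row where a[i-1] matched
            for (int j = 1; j <= m; j++) {
                int i1 = lastRow.getOrDefault(b.charAt(j - 1), 0);
                int j1 = lastMatchCol;
                int cost = 1;
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    cost = 0;
                    lastMatchCol = j;
                }
                int best = Math.min(d[i][j] + cost,              // substitution
                        Math.min(d[i + 1][j] + 1,                // insertion
                                d[i][j + 1] + 1));               // deletion
                // Transposition plus the deletions and insertions needed
                // between the two transposed characters.
                int transposition = d[i1][j1] + (i - i1 - 1) + 1 + (j - j1 - 1);
                d[i + 1][j + 1] = Math.min(best, transposition);
            }
            lastRow.put(a.charAt(i - 1), i);
        }
        return d[n + 1][m + 1];
    }

    public static void main(String[] args) {
        // Prints 2: CA -> AC (transposition), then AC -> ABC (insert B).
        System.out.println(damerauLevenshtein("CA", "ABC"));
    }
}
```

The second step inserts `B` inside the pair that was just transposed. OSA forbids editing that substring again, which is why it reports 3 for the same input.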

src/main/java/info/debatty/java/stringsimilarity/examples/Examples.java

Lines changed: 24 additions & 1 deletion
@@ -26,6 +26,7 @@
 import info.debatty.java.stringsimilarity.CharacterSubstitutionInterface;
 import info.debatty.java.stringsimilarity.Cosine;
 import info.debatty.java.stringsimilarity.Damerau;
+import info.debatty.java.stringsimilarity.OptimalStringAlignment;
 import info.debatty.java.stringsimilarity.Jaccard;
 import info.debatty.java.stringsimilarity.JaroWinkler;
 import info.debatty.java.stringsimilarity.KShingling;
@@ -49,13 +50,15 @@ public class Examples {
     public static void main(String[] args) {
         // Levenshtein
         // ===========
+        System.out.println("\nLevenshtein");
         Levenshtein levenshtein = new Levenshtein();
         System.out.println(levenshtein.distance("My string", "My $tring"));
         System.out.println(levenshtein.distance("My string", "M string2"));
         System.out.println(levenshtein.distance("My string", "My $tring"));

         // Jaccard index
         // =============
+        System.out.println("\nJaccard");
         Jaccard j2 = new Jaccard(2);
         // AB BC CD DE DF
         // 1 1 1 1 0
@@ -65,6 +68,7 @@ public static void main(String[] args) {

         // Jaro-Winkler
         // ============
+        System.out.println("\nJaro-Winkler");
         JaroWinkler jw = new JaroWinkler();

         // substitution of s and t : 0.9740740656852722
@@ -75,6 +79,7 @@ public static void main(String[] args) {

         // Cosine
         // ======
+        System.out.println("\nCosine");
         Cosine cos = new Cosine(3);

         // ABC BCE
@@ -93,6 +98,7 @@ public static void main(String[] args) {

         // Damerau
         // =======
+        System.out.println("\nDamerau");
         Damerau damerau = new Damerau();

         // 1 substitution
@@ -109,9 +115,20 @@ public static void main(String[] args) {

         // All different
         System.out.println(damerau.distance("ABCDEF", "POIU"));
-
+
+
+        // Optimal String Alignment
+        // =======
+        System.out.println("\nOptimal String Alignment");
+        OptimalStringAlignment osa = new OptimalStringAlignment();
+
+        //Will produce 3.0
+        System.out.println(osa.distance("CA", "ABC"));
+
+
         // Longest Common Subsequence
         // ==========================
+        System.out.println("\nLongest Common Subsequence");
         LongestCommonSubsequence lcs = new LongestCommonSubsequence();

         // Will produce 4.0
@@ -123,6 +140,7 @@ public static void main(String[] args) {
         // NGram
         // =====
         // produces 0.416666
+        System.out.println("\nNGram");
         NGram twogram = new NGram(2);
         System.out.println(twogram.distance("ABCD", "ABTUIO"));

@@ -134,6 +152,7 @@ public static void main(String[] args) {

         // Normalized Levenshtein
         // ======================
+        System.out.println("\nNormalized Levenshtein");
         NormalizedLevenshtein l = new NormalizedLevenshtein();

         System.out.println(l.distance("My string", "My $tring"));
@@ -142,6 +161,7 @@ public static void main(String[] args) {

         // QGram
         // =====
+        System.out.println("\nQGram");
         QGram dig = new QGram(2);

         // AB BC CD CE
@@ -158,6 +178,7 @@ public static void main(String[] args) {

         // Sorensen-Dice
         // =============
+        System.out.println("\nSorensen-Dice");
         SorensenDice sd = new SorensenDice(2);

         // AB BC CD DE DF FG
@@ -168,6 +189,7 @@ public static void main(String[] args) {

         // Weighted Levenshtein
         // ====================
+        System.out.println("\nWeighted Levenshtein");
         WeightedLevenshtein wl = new WeightedLevenshtein(
                 new CharacterSubstitutionInterface() {
                     public double cost(char c1, char c2) {
@@ -188,6 +210,7 @@ public double cost(char c1, char c2) {
         System.out.println(wl.distance("String1", "Srring2"));

         // K-Shingling
+        System.out.println("\nK-Shingling");
         s1 = "my string, \n my song";
         s2 = "another string, from a song";
         KShingling ks = new KShingling(4);
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+/*
+ * The MIT License
+ *
+ * Copyright 2016 Thibault Debatty.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+package info.debatty.java.stringsimilarity;
+
+import static org.junit.Assert.assertEquals;
+import org.junit.Test;
+
+/**
+ *
+ * @author Michail Bogdanos
+ */
+public class OptimalStringAlignmentTest {
+
+    /**
+     * Test of distance method, of class OptimalStringAlignment.
+     */
+    @Test
+    public final void testDistance() {
+        System.out.println("distance");
+        OptimalStringAlignment instance = new OptimalStringAlignment();
+
+        //zero length
+        assertEquals(6.0, instance.distance("", "ABDCEF"), 0.0);
+        assertEquals(6.0, instance.distance("ABDCEF", ""), 0.0);
+        assertEquals(0.0, instance.distance("", ""), 0.0);
+
+        //equality
+        assertEquals(0.0, instance.distance("ABDCEF", "ABDCEF"), 0.0);
+
+        //single operation
+        assertEquals(1.0, instance.distance("ABDCFE", "ABDCEF"), 0.0);
+        assertEquals(1.0, instance.distance("BBDCEF", "ABDCEF"), 0.0);
+        assertEquals(1.0, instance.distance("BDCEF", "ABDCEF"), 0.0);
+        assertEquals(1.0, instance.distance("ABDCEF", "ADCEF"), 0.0);
+
+        //other
+        assertEquals(3.0, instance.distance("CA", "ABC"), 0.0);
+        assertEquals(2.0, instance.distance("BAC", "CAB"), 0.0);
+        assertEquals(4.0, instance.distance("abcde", "awxyz"), 0.0);
+        assertEquals(5.0, instance.distance("abcde", "vwxyz"), 0.0);
+
+    }
+}
