How do you compute the similarity of two strings in Java?
2021-03-13 13:42
1. Cosine similarity
   For further explanation about the Cosine Similarity, refer to
   http://en.wikipedia.org/wiki/Cosine_similarity.
2. JaccardSimilarity
   For further explanation about Jaccard Similarity, refer to
   https://en.wikipedia.org/wiki/Jaccard_index
3. LevenshteinDistance
4. JaroWinklerDistance

References:
[1] https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java
[2] https://zatackcoder.com/java-program-to-check-two-strings-similarity/
[3] https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/index.source.html

Original article: https://blog.51cto.com/15015181/2556388

package org.apache.commons.text.similarity;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
/**
 * Measures the Cosine similarity of two vectors of an inner product space and
 * compares the angle between them.
 */
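The post only keeps the header of the Cosine similarity class, so here is a minimal sketch of the idea rather than the Commons Text implementation. It assumes each string is turned into a term-frequency vector by splitting on whitespace; the class name `CosineSketch` and the helper `termFrequencies` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Hypothetical helper: count how often each whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (Integer count : vector.values()) {
            sumOfSquares += count.doubleValue() * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // cos(theta) = dot(left, right) / (|left| * |right|)
    static double cosineSimilarity(Map<String, Integer> left, Map<String, Integer> right) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            Integer other = right.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue().doubleValue() * other;
            }
        }
        double normLeft = norm(left);
        double normRight = norm(right);
        // Assumption of this sketch: an empty vector is similar to nothing.
        if (normLeft == 0.0 || normRight == 0.0) {
            return 0.0;
        }
        return dot / (normLeft * normRight);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFrequencies("the quick brown fox");
        Map<String, Integer> b = termFrequencies("the lazy brown dog");
        System.out.println(cosineSimilarity(a, b));
    }
}
```

The result is 1 for vectors pointing in the same direction and 0 when the two strings share no tokens. Tokenizing by characters or n-grams instead of words would be an equally reasonable choice.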
package org.apache.commons.text.similarity;
import java.util.HashSet;
import java.util.Set;
/**
 * Measures the Jaccard similarity (aka Jaccard index) of two sets of character
 * sequence. Jaccard similarity is the size of the intersection divided by the
 * size of the union of the two sets.
 */
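Since the class body is missing here as well, the definition in the Javadoc (size of the intersection divided by size of the union) can be sketched as follows. This is an illustrative version that treats each string as a set of characters; the class name `JaccardSketch` is hypothetical, and returning 1.0 for two empty inputs is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {

    // Collect the distinct characters of a sequence.
    static Set<Character> charSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    // |A intersect B| / |A union B| over the character sets of both inputs.
    static double jaccard(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two empty inputs count as identical
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // {n,i,g,h,t} and {n,a,c,h,t} share 3 of 7 distinct characters.
        System.out.println(jaccard("night", "nacht"));
    }
}
```

Because only the set of characters matters, character order and repetition are ignored, so anagrams score 1.0 under this measure.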
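/**
 * LevenshteinDistance
 * Copied from https://commons.apache.org/proper/commons-lang/javadocs/api-2.5/src-html/org/apache/commons/lang/StringUtils.html#line.6162
 */
public static int getLevenshteinDistance(String s, String t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    int n = s.length(); // length of s
    int m = t.length(); // length of t
    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }
    if (n > m) {
        // swap the input strings to consume less memory
        String tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }
    int[] p = new int[n + 1]; // 'previous' cost array, horizontally
    int[] d = new int[n + 1]; // cost array, horizontally
    int[] _d; // placeholder to assist in swapping p and d
    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t
    char t_j; // jth character of t
    int cost; // cost
    for (i = 0; i <= n; i++) {
        p[i] = i;
    }
    for (j = 1; j <= m; j++) {
        t_j = t.charAt(j - 1);
        d[0] = j;
        for (i = 1; i <= n; i++) {
            cost = s.charAt(i - 1) == t_j ? 0 : 1;
            // minimum of cell to the left + 1, to the top + 1, diagonally left and up + cost
            d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
        }
        // swap the current distance counts into the 'previous row' array
        _d = p;
        p = d;
        d = _d;
    }
    // the last action in the loop above was to swap d and p,
    // so p now holds the most recent cost counts
    return p[n];
}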
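Levenshtein distance is an edit count, not a similarity score, so it usually needs normalizing before it can be compared with the other measures. The sketch below shows one common convention, 1 - distance / max(length); the class name `LevenshteinSimilaritySketch` and this normalization are assumptions of the example, not part of the Commons Lang code.

```java
class LevenshteinSimilaritySketch {

    // Two-row dynamic programming: prev holds the costs for the previous character of s.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means every position differs.
    static double similarity(String s, String t) {
        int maxLength = Math.max(s.length(), t.length());
        if (maxLength == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(s, t) / maxLength;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Dividing by the longer length keeps the score in the range 0 to 1, because the distance can never exceed the length of the longer string.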
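import java.util.Arrays;

/**
 * JaroWinklerDistance
 * Copied from https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/JaroWinklerDistance.java.html
 * The apply method was renamed to similarity.
 */
public static Double similarity(final CharSequence left, final CharSequence right) {
    final double defaultScalingFactor = 0.1;
    final double percentageRoundValue = 100.0;
    if (left == null || right == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    // mtp = {matches, half transpositions, common prefix length, length of the longer input}
    int[] mtp = matches(left, right);
    double m = mtp[0];
    if (m == 0) {
        return 0D;
    }
    double j = ((m / left.length() + m / right.length() + (m - mtp[1]) / m)) / 3;
    double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
    return Math.round(jw * percentageRoundValue) / percentageRoundValue;
}

protected static int[] matches(final CharSequence first, final CharSequence second) {
    CharSequence max, min;
    if (first.length() > second.length()) {
        max = first;
        min = second;
    } else {
        max = second;
        min = first;
    }
    int range = Math.max(max.length() / 2 - 1, 0);
    int[] matchIndexes = new int[min.length()];
    Arrays.fill(matchIndexes, -1);
    boolean[] matchFlags = new boolean[max.length()];
    int matches = 0;
    for (int mi = 0; mi < min.length(); mi++) {
        final char c1 = min.charAt(mi);
        // look for an unused matching character within the allowed window
        for (int xi = Math.max(mi - range, 0), xn = Math.min(mi + range + 1, max.length()); xi < xn; xi++) {
            if (!matchFlags[xi] && c1 == max.charAt(xi)) {
                matchIndexes[mi] = xi;
                matchFlags[xi] = true;
                matches++;
                break;
            }
        }
    }
    // The post is cut off at this point; the remaining steps (counting transpositions
    // and the common prefix, then returning the array) are in the linked source.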
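Because the listing above stops partway through the matching step, here is a self-contained sketch of Jaro-Winkler similarity for reference. It follows the textbook formulation (boost threshold 0.7, scaling factor 0.1, common prefix capped at 4 characters) and does not round the result, so it may differ in detail from the Commons Text code. The class name `JaroWinklerSketch` is hypothetical.

```java
import java.util.Arrays;

class JaroWinklerSketch {

    static double similarity(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        CharSequence max = left.length() > right.length() ? left : right;
        CharSequence min = left.length() > right.length() ? right : left;

        // Characters match only if they are equal and at most 'range' positions apart.
        int range = Math.max(max.length() / 2 - 1, 0);
        int[] matchIndexes = new int[min.length()];
        Arrays.fill(matchIndexes, -1);
        boolean[] matchFlags = new boolean[max.length()];
        int matches = 0;
        for (int mi = 0; mi < min.length(); mi++) {
            int from = Math.max(mi - range, 0);
            int to = Math.min(mi + range + 1, max.length());
            for (int xi = from; xi < to; xi++) {
                if (!matchFlags[xi] && min.charAt(mi) == max.charAt(xi)) {
                    matchIndexes[mi] = xi;
                    matchFlags[xi] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }

        // Extract the matched characters of each string in their original order.
        char[] ms1 = new char[matches];
        char[] ms2 = new char[matches];
        for (int i = 0, si = 0; i < min.length(); i++) {
            if (matchIndexes[i] != -1) {
                ms1[si++] = min.charAt(i);
            }
        }
        for (int i = 0, si = 0; i < max.length(); i++) {
            if (matchFlags[i]) {
                ms2[si++] = max.charAt(i);
            }
        }
        // Each out-of-order pair is counted twice, hence the division by 2.
        int outOfOrder = 0;
        for (int i = 0; i < matches; i++) {
            if (ms1[i] != ms2[i]) {
                outOfOrder++;
            }
        }
        int transpositions = outOfOrder / 2;

        double m = matches;
        double jaro = (m / left.length() + m / right.length() + (m - transpositions) / m) / 3.0;
        if (jaro < 0.7) {
            return jaro; // no prefix boost for weak matches
        }
        // Winkler boost: reward a shared prefix of up to 4 characters.
        int prefix = 0;
        while (prefix < Math.min(4, min.length()) && left.charAt(prefix) == right.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
    }
}
```

For the classic pair MARTHA and MARHTA there are 6 matches and 1 transposition, giving a Jaro score of about 0.944, and the shared prefix MAR lifts the Jaro-Winkler score to about 0.961.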