How do you compute the similarity of two strings in Java?

2021-03-13 13:42


It turns out Apache Commons already provides ready-made solutions.

1. Cosine similarity

package org.apache.commons.text.similarity;

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Measures the Cosine similarity of two vectors of an inner product space and
 * compares the angle between them.
 *
 * <p>
 * For further explanation about the Cosine Similarity, refer to
 * http://en.wikipedia.org/wiki/Cosine_similarity.
 * </p>
 *
 * @since 1.0
 */
public class CosineSimilarity {

    /**
     * Calculates the cosine similarity for two given vectors.
     *
     * @param leftVector left vector
     * @param rightVector right vector
     * @return cosine similarity between the two vectors
     */
    public Double cosineSimilarity(final Map<CharSequence, Integer> leftVector,
            final Map<CharSequence, Integer> rightVector) {
        if (leftVector == null || rightVector == null) {
            throw new IllegalArgumentException("Vectors must not be null");
        }

        final Set<CharSequence> intersection = getIntersection(leftVector, rightVector);

        final double dotProduct = dot(leftVector, rightVector, intersection);
        double d1 = 0.0d;
        for (final Integer value : leftVector.values()) {
            d1 += Math.pow(value, 2);
        }
        double d2 = 0.0d;
        for (final Integer value : rightVector.values()) {
            d2 += Math.pow(value, 2);
        }
        // The scraped post lost the text between "d1" and "getIntersection(";
        // this branch is restored from the cosine formula: dot / (|left| * |right|).
        double cosineSimilarity;
        if (d1 <= 0.0 || d2 <= 0.0) {
            cosineSimilarity = 0.0;
        } else {
            cosineSimilarity = dotProduct / (Math.sqrt(d1) * Math.sqrt(d2));
        }
        return cosineSimilarity;
    }

    /**
     * Returns the keys that are common to the two given vectors.
     *
     * @param leftVector left vector
     * @param rightVector right vector
     * @return common keys
     */
    private Set<CharSequence> getIntersection(final Map<CharSequence, Integer> leftVector,
            final Map<CharSequence, Integer> rightVector) {
        final Set<CharSequence> intersection = new HashSet<>(leftVector.keySet());
        intersection.retainAll(rightVector.keySet());
        return intersection;
    }

    /**
     * Computes the dot product of two vectors. It ignores remaining elements. It means
     * that if a vector is longer than other, then a smaller part of it will be used to compute
     * the dot product.
     *
     * @param leftVector left vector
     * @param rightVector right vector
     * @param intersection common elements
     * @return the dot product
     */
    private double dot(final Map<CharSequence, Integer> leftVector,
            final Map<CharSequence, Integer> rightVector, final Set<CharSequence> intersection) {
        long dotProduct = 0;
        for (final CharSequence key : intersection) {
            dotProduct += leftVector.get(key) * rightVector.get(key);
        }
        return dotProduct;
    }
}
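The class above works on term-count vectors, not on raw strings, so the strings have to be turned into vectors first. Below is a minimal, self-contained sketch of that step using only the JDK. The class name CosineSketch, the helper termFrequencies, and the choice to split on whitespace are my own assumptions for illustration and are not part of Commons Text.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {

    // Hypothetical helper: counts whitespace-separated tokens in the text.
    static Map<CharSequence, Integer> termFrequencies(String text) {
        Map<CharSequence, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity: dot(left, right) / (|left| * |right|); 0 if either vector is empty.
    static double cosine(Map<CharSequence, Integer> left, Map<CharSequence, Integer> right) {
        long dot = 0;
        for (Map.Entry<CharSequence, Integer> entry : left.entrySet()) {
            Integer other = right.get(entry.getKey());
            if (other != null) {
                dot += (long) entry.getValue() * other;
            }
        }
        double squaredLeft = 0.0;
        for (int value : left.values()) {
            squaredLeft += (double) value * value;
        }
        double squaredRight = 0.0;
        for (int value : right.values()) {
            squaredRight += (double) value * value;
        }
        if (squaredLeft <= 0.0 || squaredRight <= 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(squaredLeft) * Math.sqrt(squaredRight));
    }

    // Convenience wrapper: tokenize both strings, then compare the vectors.
    static double cosine(String left, String right) {
        return cosine(termFrequencies(left), termFrequencies(right));
    }

    public static void main(String[] args) {
        // Three of the four words are shared, so the result is 3 / (2 * 2) = 0.75.
        System.out.println(cosine("the quick brown fox", "the quick red fox"));
    }
}
```

Splitting on whitespace suits sentence-like input; for short strings such as names, character n-grams may be a better choice of vector.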

2. JaccardSimilarity

package org.apache.commons.text.similarity;

import java.util.HashSet;
import java.util.Set;

/**
 * Measures the Jaccard similarity (aka Jaccard index) of two sets of character
 * sequence. Jaccard similarity is the size of the intersection divided by the
 * size of the union of the two sets.
 *
 * <p>
 * For further explanation about Jaccard Similarity, refer to
 * https://en.wikipedia.org/wiki/Jaccard_index
 * </p>
 *
 * @since 1.0
 */
public class JaccardSimilarity implements SimilarityScore<Double> {

    /**
     * Calculates Jaccard Similarity of two set character sequence passed as
     * input.
     *
     * @param left first character sequence
     * @param right second character sequence
     * @return index
     * @throws IllegalArgumentException
     *             if either String input {@code null}
     */
    @Override
    public Double apply(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        // Round the result to two decimal places.
        return Math.round(calculateJaccardSimilarity(left, right) * 100d) / 100d;
    }

    /**
     * Calculates Jaccard Similarity of two character sequences passed as
     * input. Does the calculation by identifying the union (characters in at
     * least one of the two sets) of the two sets and intersection (characters
     * which are present in set one which are present in set two)
     *
     * @param left first character sequence
     * @param right second character sequence
     * @return index
     */
    private Double calculateJaccardSimilarity(CharSequence left, CharSequence right) {
        Set<String> intersectionSet = new HashSet<>();
        Set<String> unionSet = new HashSet<>();
        boolean unionFilled = false;
        int leftLength = left.length();
        int rightLength = right.length();
        if (leftLength == 0 || rightLength == 0) {
            return 0d;
        }

        // The scraped post is cut off inside this loop header; the rest is
        // reconstructed from the Javadoc above and the variables already declared.
        for (int leftIndex = 0; leftIndex < leftLength; leftIndex++) {
            unionSet.add(String.valueOf(left.charAt(leftIndex)));
            for (int rightIndex = 0; rightIndex < rightLength; rightIndex++) {
                if (!unionFilled) {
                    unionSet.add(String.valueOf(right.charAt(rightIndex)));
                }
                if (left.charAt(leftIndex) == right.charAt(rightIndex)) {
                    intersectionSet.add(String.valueOf(left.charAt(leftIndex)));
                }
            }
            unionFilled = true;
        }
        return Double.valueOf(intersectionSet.size()) / Double.valueOf(unionSet.size());
    }
}
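To make the definition concrete (size of the intersection divided by size of the union, taken over the distinct characters of each string), here is a small standalone sketch using only the JDK. The class name JaccardSketch and its helpers are hypothetical names of mine. Unlike the class above, this sketch does not round the result to two decimals.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardSketch {

    // Collects the distinct characters of a sequence.
    static Set<Character> distinctChars(CharSequence text) {
        Set<Character> chars = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            chars.add(text.charAt(i));
        }
        return chars;
    }

    // Jaccard index: |A intersect B| / |A union B|; 0 if either input is empty.
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> leftChars = distinctChars(left);
        Set<Character> rightChars = distinctChars(right);
        if (leftChars.isEmpty() || rightChars.isEmpty()) {
            return 0.0;
        }
        Set<Character> intersection = new HashSet<>(leftChars);
        intersection.retainAll(rightChars);
        Set<Character> union = new HashSet<>(leftChars);
        union.addAll(rightChars);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // "night" -> {n,i,g,h,t}, "nacht" -> {n,a,c,h,t}
        // intersection {n,h,t} has 3 elements, union has 7, so the index is 3/7.
        System.out.println(jaccard("night", "nacht"));
    }
}
```

Because only distinct characters count, character order and repetition are ignored: "abc" and "cba" score 1.0 under this measure.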

3. LevenshteinDistance

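/**
 * LevenshteinDistance
 * Copied from https://commons.apache.org/proper/commons-lang/javadocs/api-2.5/src-html/org/apache/commons/lang/StringUtils.html#line.6162
 */
public static int getLevenshteinDistance(String s, String t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        String tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int[] p = new int[n + 1]; // 'previous' cost array, horizontally
    int[] d = new int[n + 1]; // cost array, horizontally
    int[] _d; // placeholder to assist in swapping p and d

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    // The scraped post is cut off inside this loop header; the rest is the
    // standard two-row Levenshtein recurrence, written with the variables declared above.
    for (i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (j = 1; j <= m; j++) {
        t_j = t.charAt(j - 1);
        d[0] = j;

        for (i = 1; i <= n; i++) {
            cost = s.charAt(i - 1) == t_j ? 0 : 1;
            // minimum of cell to the left + 1, to the top + 1, diagonally left and up + cost
            d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
        }

        // swap the current and previous rows
        _d = p;
        p = d;
        d = _d;
    }

    // the last action in the loop was to swap d and p, so p holds the most recent costs
    return p[n];
}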
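Levenshtein distance is an edit count, not a similarity score, so a common follow-up step is to normalize it into the range 0 to 1. The sketch below shows one way to do that: 1 - distance / length of the longer string. The class name LevenshteinSimilaritySketch and the normalization choice are my own assumptions, not something Commons Lang provides. The distance method is a compact two-row version included only so the example runs on its own.

```java
final class LevenshteinSimilaritySketch {

    // Two-row dynamic-programming Levenshtein distance.
    static int distance(CharSequence s, CharSequence t) {
        int[] previous = new int[t.length() + 1];
        int[] current = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[t.length()];
    }

    // Normalized similarity in [0, 1]; two empty strings count as identical.
    static double similarity(CharSequence s, CharSequence t) {
        int longest = Math.max(s.length(), t.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(s, t) / longest;
    }

    public static void main(String[] args) {
        // "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Dividing by the longer length keeps the score at 0 when every position has to change and at 1 when the strings are equal.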

4. JaroWinklerDistance

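import java.util.Arrays; // needed by matches() below

/**
 * JaroWinklerDistance
 * Copied from https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/JaroWinklerDistance.java.html
 * The apply method was renamed to similarity.
 */
public static Double similarity(final CharSequence left, final CharSequence right) {
    final double defaultScalingFactor = 0.1;
    final double percentageRoundValue = 100.0;

    if (left == null || right == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    int[] mtp = matches(left, right);
    double m = mtp[0];
    if (m == 0) {
        return 0D;
    }
    double j = ((m / left.length() + m / right.length() + (m - mtp[1]) / m)) / 3;
    // The scraped post lost the text between "j" and "second.length())";
    // the next lines up to that comparison are restored after the linked source.
    double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
    return Math.round(jw * percentageRoundValue) / percentageRoundValue;
}

// Returns {matches, transpositions / 2, common prefix length, length of the longer input}.
protected static int[] matches(final CharSequence first, final CharSequence second) {
    CharSequence max, min;
    if (first.length() > second.length()) {
        max = first;
        min = second;
    } else {
        max = second;
        min = first;
    }
    int range = Math.max(max.length() / 2 - 1, 0);
    int[] matchIndexes = new int[min.length()];
    Arrays.fill(matchIndexes, -1);
    boolean[] matchFlags = new boolean[max.length()];
    int matches = 0;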
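    // The scraped post is cut off inside this loop header; the rest of the
    // method is reconstructed after the linked Commons Text source.
    for (int mi = 0; mi < min.length(); mi++) {
        char c1 = min.charAt(mi);
        for (int xi = Math.max(mi - range, 0), xn = Math.min(mi + range + 1, max.length()); xi < xn; xi++) {
            if (!matchFlags[xi] && c1 == max.charAt(xi)) {
                matchIndexes[mi] = xi;
                matchFlags[xi] = true;
                matches++;
                break;
            }
        }
    }
    // Collect the matched characters of each string in their original order.
    char[] ms1 = new char[matches];
    char[] ms2 = new char[matches];
    for (int i = 0, si = 0; i < min.length(); i++) {
        if (matchIndexes[i] != -1) {
            ms1[si++] = min.charAt(i);
        }
    }
    for (int i = 0, si = 0; i < max.length(); i++) {
        if (matchFlags[i]) {
            ms2[si++] = max.charAt(i);
        }
    }
    int transpositions = 0;
    for (int mi = 0; mi < ms1.length; mi++) {
        if (ms1[mi] != ms2[mi]) {
            transpositions++;
        }
    }
    int prefix = 0;
    for (int mi = 0; mi < min.length(); mi++) {
        if (first.charAt(mi) == second.charAt(mi)) {
            prefix++;
        } else {
            break;
        }
    }
    return new int[] { matches, transpositions / 2, prefix, max.length() };
}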
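The arithmetic behind the score is easier to follow with concrete numbers. The sketch below does not reimplement the matching step. It only takes the counts that matches() is meant to return (matches, half transpositions, common prefix length, longer length) and applies the Jaro formula from the listing, followed by the Winkler prefix boost. The class and method names are hypothetical. The scaling factor 0.1 and the rounding to two decimals come from the constants in the listing, while the 0.7 boost threshold is an assumption based on the linked Commons Text source.

```java
final class JaroWinklerArithmetic {

    // Jaro score from counts: (m/|left| + m/|right| + (m - t)/m) / 3,
    // where t is the number of half transpositions.
    static double jaro(int leftLength, int rightLength, int matches, int halfTranspositions) {
        if (matches == 0) {
            return 0.0;
        }
        double m = matches;
        return (m / leftLength + m / rightLength + (m - halfTranspositions) / m) / 3;
    }

    // Winkler boost: scores at or above the assumed 0.7 threshold are raised
    // in proportion to the common prefix length, then rounded to two decimals.
    static double jaroWinkler(double jaro, int prefixLength, int longerLength) {
        double scaling = Math.min(0.1, 1.0 / longerLength);
        double boosted = jaro < 0.7 ? jaro : jaro + scaling * prefixLength * (1.0 - jaro);
        return Math.round(boosted * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // "MARTHA" vs "MARHTA": 6 matches, T/H swapped (1 half transposition), prefix "MAR".
        double j = jaro(6, 6, 6, 1);   // (1 + 1 + 5/6) / 3, about 0.9444
        System.out.println(j);
        System.out.println(jaroWinkler(j, 3, 6));
    }
}
```

For this pair the boost adds 0.1 * 3 * (1 - 0.9444), which lifts the score to about 0.9611 before rounding.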

References:

[1] https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java

[2] https://zatackcoder.com/java-program-to-check-two-strings-similarity/

[3] https://commons.apache.org/sandbox/commons-text/jacoco/org.apache.commons.text.similarity/index.source.html


Original article: https://blog.51cto.com/15015181/2556388

