[similarity/dedupe] adding Soft-TFIDF implementation with several fallback qualifiers for the max-sim function (Damerau-Levenshtein and libpostal's new bucketed affine gap method for detecting abbreviations), while keeping Jaro-Winkler as the secondary similarity function in the final distance metric. Overall this should result in higher similarity values when a token does not quite meet the secondary Jaro-Winkler threshold but does match on one of the other criteria.
46
src/soft_tfidf.h
Normal file
@@ -0,0 +1,46 @@
#ifndef SOFT_TFIDF_H
#define SOFT_TFIDF_H

#include <stdlib.h>
#include "collections.h"
#include "libpostal.h"

/*
This is a variant of Soft-TFIDF as described in:

Cohen, Ravikumar, and Fienberg. A comparison of string distance
metrics for name-matching tasks. (2003)
https://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf

Soft-TFIDF is a hybrid similarity function for strings, typically names,
which combines both global statistics (TF-IDF) and a local similarity
function (e.g. Jaro-Winkler, which the authors suggest performs best).

Given two strings, s1 and s2, each token t1 in s1 is matched with its most
similar counterpart t2 in s2 according to the local similarity function.

The Soft-TFIDF similarity is then the sum, over all tokens t1 whose max
similarity is >= a given threshold theta, of the product of the TF-IDF
weights of t1 and its match t2 and that max similarity.

This version is a modified Soft-TFIDF. Jaro-Winkler is used as the secondary
similarity metric. However, two tokens are considered "similar" if any of
the following holds:

1. Jaro-Winkler similarity >= theta
2. Damerau-Levenshtein edit distance <= max_edit_distance
3. Affine gap edit counts indicate a possible abbreviation (# matches == min(len1, len2))
*/

typedef struct soft_tfidf_options {
    double jaro_winkler_min;
    size_t damerau_levenshtein_max;
    size_t damerau_levenshtein_min_length;
    bool use_abbreviations;
} soft_tfidf_options_t;

soft_tfidf_options_t soft_tfidf_default_options(void);

double soft_tfidf_similarity(size_t num_tokens1, char **tokens1, double *token_scores1, size_t num_tokens2, char **tokens2, double *token_scores2, soft_tfidf_options_t options);

#endif
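
The scoring step described in the header comment can be illustrated with a small, self-contained sketch. This is not the code from this commit: `soft_tfidf_sketch`, `token_sim_fn`, and `exact_match_sim` are hypothetical names, the local similarity is passed in as a callback, and the token scores are assumed to be L2-normalized TF-IDF weights supplied by the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: local similarity in [0, 1] for two tokens. */
typedef double (*token_sim_fn)(const char *t1, const char *t2);

/* Stand-in local similarity for demonstration only: exact match or nothing.
   A real caller would plug in something like Jaro-Winkler here. */
static double exact_match_sim(const char *t1, const char *t2) {
    return strcmp(t1, t2) == 0 ? 1.0 : 0.0;
}

/* Simplified Soft-TFIDF. Assumes scores1 and scores2 hold L2-normalized
   TF-IDF weights aligned with tokens1 and tokens2. */
static double soft_tfidf_sketch(size_t n1, const char **tokens1, const double *scores1,
                                size_t n2, const char **tokens2, const double *scores2,
                                token_sim_fn sim, double theta) {
    double total = 0.0;

    for (size_t i = 0; i < n1; i++) {
        double best = 0.0;
        size_t best_j = 0;
        int found = 0;

        /* Find the most similar counterpart of tokens1[i] in tokens2. */
        for (size_t j = 0; j < n2; j++) {
            double s = sim(tokens1[i], tokens2[j]);
            if (!found || s > best) {
                best = s;
                best_j = j;
                found = 1;
            }
        }

        /* Only sufficiently close pairs contribute to the score. */
        if (found && best >= theta) {
            total += scores1[i] * scores2[best_j] * best;
        }
    }

    return total;
}
```

With identical token lists and normalized weights the sketch returns a value close to 1.0, and with no pair reaching `theta` it returns 0.0.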
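
The three-way definition of "similar" tokens from the header comment could look roughly like the sketch below. Several things here are assumptions rather than the commit's actual code: the Jaro-Winkler score is computed elsewhere and passed in as `jw`; `damerau_levenshtein_min_length` is read as the minimum token length at which the edit-distance criterion applies; the edit distance is the optimal string alignment variant of Damerau-Levenshtein, computed on bytes rather than Unicode characters; and `maybe_abbreviation` is a crude subsequence check standing in for libpostal's bucketed affine gap method. All names prefixed or suffixed with `sketch` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the fields of soft_tfidf_options_t; hypothetical type name. */
typedef struct {
    double jaro_winkler_min;
    size_t damerau_levenshtein_max;
    size_t damerau_levenshtein_min_length;
    bool use_abbreviations;
} sketch_options_t;

#define SKETCH_MAX_LEN 63

static size_t min3(size_t a, size_t b, size_t c) {
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Damerau-Levenshtein distance (optimal string alignment variant), bytewise.
   Returns (size_t)-1 for tokens longer than SKETCH_MAX_LEN. */
static size_t osa_distance(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    if (la > SKETCH_MAX_LEN || lb > SKETCH_MAX_LEN) return (size_t)-1;

    size_t d[SKETCH_MAX_LEN + 1][SKETCH_MAX_LEN + 1];
    for (size_t i = 0; i <= la; i++) d[i][0] = i;
    for (size_t j = 0; j <= lb; j++) d[0][j] = j;

    for (size_t i = 1; i <= la; i++) {
        for (size_t j = 1; j <= lb; j++) {
            size_t cost = a[i - 1] == b[j - 1] ? 0 : 1;
            size_t best = min3(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
            /* Adjacent transposition counts as a single edit. */
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                size_t t = d[i - 2][j - 2] + 1;
                if (t < best) best = t;
            }
            d[i][j] = best;
        }
    }
    return d[la][lb];
}

/* Stand-in abbreviation test: the shorter token shares its first character
   with the longer one and is a subsequence of it, so every character of the
   shorter token is matched (# matches == min(len1, len2)). */
static bool maybe_abbreviation(const char *a, const char *b) {
    const char *shorter = strlen(a) <= strlen(b) ? a : b;
    const char *longer = shorter == a ? b : a;
    if (*shorter == '\0' || *shorter != *longer) return false;
    while (*shorter && *longer) {
        if (*shorter == *longer) shorter++;
        longer++;
    }
    return *shorter == '\0';
}

/* Two tokens are "similar" if any of the three criteria holds. */
static bool tokens_similar(const char *t1, const char *t2, double jw, sketch_options_t opts) {
    if (jw >= opts.jaro_winkler_min) return true;

    size_t l1 = strlen(t1), l2 = strlen(t2);
    size_t min_len = l1 < l2 ? l1 : l2;
    if (min_len >= opts.damerau_levenshtein_min_length &&
        osa_distance(t1, t2) <= opts.damerau_levenshtein_max) {
        return true;
    }

    return opts.use_abbreviations && maybe_abbreviation(t1, t2);
}
```

Under these assumptions, a transposed token such as "aevnue" still qualifies against "avenue" through the edit-distance criterion, and "st" qualifies against "street" only when `use_abbreviations` is enabled.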