[dedupe] For strict abbreviations (defined as sharing a prefix and a suffix, and containing only matches and gaps by the subtotaling affine gap measure), use the greater of the two scores. This accounts for cases where the abbreviated version has a much higher weight in one string than the non-abbreviated version does in the other. The same applies to acronym alignments. Also make sure there's a common prefix in regular abbreviation detection, and cap the Soft-TFIDF similarity at 1.0.
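
A minimal sketch of the scoring idea described in the message, not the code from this commit. It assumes each aligned token pair contributes the product of the two token weights times the pair similarity, and that for a strict abbreviation both weights are replaced by the larger one. The names `pair_contribution` and `cap_similarity` are hypothetical. Under that assumption the running total can exceed 1.0, which is why the cap is needed.

```c
#include <stdbool.h>

/* Hypothetical helper: contribution of one aligned token pair to the
 * running Soft-TFIDF total. w1 and w2 are the token weights in each
 * string, sim is the similarity of the pair. */
static double pair_contribution(double w1, double w2, double sim,
                                bool strict_abbreviation) {
    if (strict_abbreviation) {
        /* Assumption: use the greater of the two weights for both sides,
         * so a low weight on one side does not drag the pair down. */
        double w = w1 > w2 ? w1 : w2;
        return w * w * sim;
    }
    return w1 * w2 * sim;
}

/* Hypothetical helper: clamp the accumulated similarity to at most 1.0. */
static double cap_similarity(double total) {
    return total > 1.0 ? 1.0 : total;
}
```

With weights 0.2 and 0.8 and a pair similarity of 1.0, the ordinary product is 0.16, while the strict-abbreviation path gives 0.64. The same substitution would apply to acronym alignments.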

Al
2018-01-25 14:19:44 -05:00
parent b4cc7395a2
commit d0fe31d359
3 changed files with 34 additions and 5 deletions


@@ -39,6 +39,8 @@ typedef struct soft_tfidf_options {
size_t damerau_levenshtein_max;
size_t damerau_levenshtein_min_length;
bool possible_affine_gap_abbreviations;
size_t strict_abbreviation_min_length;
double strict_abbreviation_sim;
} soft_tfidf_options_t;
soft_tfidf_options_t soft_tfidf_default_options(void);