Commit Graph

6 Commits

Author SHA1 Message Date
Al
b2dcb18d7e [dedupe] account for missing ordinal suffixes in Soft-TFIDF deduping i.e. to count 1st Place and 1 Plce as the same where there might be a misspelling and the phrase wouldn't match under exact expansions 2018-02-23 23:44:05 -05:00
Al
b03fbdd681 [dedupe] adding multi-word phrase alignments to deduping 2018-02-23 01:28:02 -05:00
Al
d0fe31d359 [dedupe] for strict abbreviations (defined as sharing a prefix and a suffix, and containing matches+gaps only by the subtotaling affine gap measure), using the greater of the two scores. This accounts for cases where the abbreviated version may have a much higher weight in one string than the non-abbreviated version does in the other. Same for acronym alignments. Making sure there's a common prefix in regular abbeviation detection Capping the Soft-TFIDF similarity at 1.0. 2018-01-25 14:23:18 -05:00
Al
4356174630 [similarity] adding a match count in Soft-TFIDF to allow answering questions about subsets i.e. the set of tokens in "Park Pl" contain the set of tokens in "Park". Setting Jaro-Winkler minimum length of 4 chars on, more specific option name for possible abbeviation detection 2018-01-06 03:50:03 -05:00
Al
f1e6886536 [similarity/dedupe] adding options for acronym alignments and address phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to NYU vs. "New York University" whereas phrase matches would match known phrases that share the same canonical like "Cty Rd" vs. "C.R." vs. "County Road" within the Soft-TFIDF similarity calculation. 2017-12-29 02:39:49 -05:00
Al
b90c3dab4b [similarity/dedupe] adding Soft-TFIDF implementation with several different fallback qualifiers for the max-sim function (Damerau-Levenshtein and libpostal's new bucketed affine gap method for detecting abbreviations), but keeping Jaro-Winkler as the secondary similarity function in the final distance metric. Overall this should results in higher similarity values when one of the tokens may not quite match the pure secondary threshold in terms of Jaro-Winkler but may match on one of the other criteria. 2017-12-28 04:34:46 -05:00