8 Commits

Author SHA1 Message Date
Al
b4cc7395a2 [fix] was missing some shorter tokens that are unicode equal in Soft-TFIDF 2018-01-25 04:23:35 -05:00
Al
c4aaee7dbf [dedupe/similarity] also utilizing the L2 norm in similarity when acronyms are detected. Similarity in this case should be the acronym token's score * the L2 norm of the expanded tokens' scores in the longer string 2018-01-23 01:20:14 -05:00
Al
eb3fb37ad4 [similarity/dedupe] normalizing by the product of the L2 norms in soft token similarity function, as in cosine similarity. Score vectors should be passed in unnormalized, and typically with unit length. Also, for aligned phrases that share the same canonical phrase, contribute the product of the two norms of the phrase vectors to the similarity's numerator (maximum value, as if each token in both strings had matched exactly). The previous version over-counted the importance of aligned multi-word phrases by doing a cross product, which could overshadow other more important terms. 2018-01-22 01:38:12 -05:00
Al
e935f2a036 [fix] need to calculate max Jaro-Winkler for other methods, so only test whether we should use it after we've cycled through all the tokens 2018-01-06 03:59:34 -05:00
Al
4356174630 [similarity] adding a match count in Soft-TFIDF to allow answering questions about subsets, e.g. whether the set of tokens in "Park Pl" contains the set of tokens in "Park". Setting a Jaro-Winkler minimum length of 4 chars, and using a more specific option name for possible abbreviation detection 2018-01-06 03:50:03 -05:00
Al
434bbd4dc2 [fix] removing unused vars 2017-12-30 02:31:43 -05:00
Al
f1e6886536 [similarity/dedupe] adding options for acronym alignments and address phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to "NYU" vs. "New York University", whereas phrase matches would match known phrases that share the same canonical form, like "Cty Rd" vs. "C.R." vs. "County Road", within the Soft-TFIDF similarity calculation. 2017-12-29 02:39:49 -05:00
Al
b90c3dab4b [similarity/dedupe] adding Soft-TFIDF implementation with several different fallback qualifiers for the max-sim function (Damerau-Levenshtein and libpostal's new bucketed affine gap method for detecting abbreviations), but keeping Jaro-Winkler as the secondary similarity function in the final distance metric. Overall this should result in higher similarity values when one of the tokens may not quite match the pure secondary threshold in terms of Jaro-Winkler but may match on one of the other criteria. 2017-12-28 04:34:46 -05:00
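
The commits above describe a Soft-TFIDF similarity that is normalized by the product of the L2 norms of the two score vectors (eb3fb37ad4) and that reports a match count for subset questions (4356174630). The following Python sketch illustrates that general idea only; it is not the code from these commits. The names `soft_tfidf` and `token_sim` and the `threshold` default are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in for Jaro-Winkler. The acronym, phrase and abbreviation fallbacks mentioned in the messages are left out.

```python
import math
from difflib import SequenceMatcher


def token_sim(a, b):
    """Stand-in secondary similarity in [0, 1] (NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(tokens1, scores1, tokens2, scores2, threshold=0.9, sim=token_sim):
    """Soft-TFIDF-style similarity, normalized as in cosine similarity.

    Scores are passed in unnormalized. Returns (similarity, match_count).
    """
    norm1 = math.sqrt(sum(s * s for s in scores1))
    norm2 = math.sqrt(sum(s * s for s in scores2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0, 0

    total = 0.0
    matches = 0
    for t1, s1 in zip(tokens1, scores1):
        # Find the most similar token in the second string (the "max-sim").
        best_sim, best_score = 0.0, 0.0
        for t2, s2 in zip(tokens2, scores2):
            current = 1.0 if t1 == t2 else sim(t1, t2)
            if current > best_sim:
                best_sim, best_score = current, s2
        # Only tokens at or above the secondary threshold contribute.
        if best_sim >= threshold:
            total += s1 * best_score * best_sim
            matches += 1

    # Divide by the product of the L2 norms of the two score vectors.
    return total / (norm1 * norm2), matches


similarity, match_count = soft_tfidf(["park", "pl"], [1.0, 1.0], ["park"], [1.0])
print(similarity, match_count)
```

With uniform scores, "Park Pl" vs. "Park" yields one matched token, so comparing `match_count` against the token count of the shorter string indicates that its tokens are all contained in the longer one.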