Al
|
b2dcb18d7e
|
[dedupe] account for missing ordinal suffixes in Soft-TFIDF deduping, i.e. count "1st Place" and "1 Plce" as the same where there might be a misspelling and the phrase wouldn't match under exact expansions
|
2018-02-23 23:44:05 -05:00 |
|
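A minimal sketch of the ordinal-suffix idea in the commit above, assuming English suffixes only. The helper names `numeric_root` and `ordinal_tokens_match` are hypothetical and are not libpostal functions; the actual implementation is in C and may differ.

```python
import re

# Hypothetical pattern: a run of digits with an optional English ordinal suffix.
# It does not check that the suffix is the grammatically correct one for the number.
_ORDINAL_RE = re.compile(r"^(\d+)(st|nd|rd|th)?$", re.IGNORECASE)


def numeric_root(token):
    """Return the digits of a numeric or ordinal token, or None for other tokens."""
    m = _ORDINAL_RE.match(token)
    return m.group(1) if m else None


def ordinal_tokens_match(a, b):
    """Treat "1st" and "1" as the same token by comparing their numeric roots."""
    root_a, root_b = numeric_root(a), numeric_root(b)
    return root_a is not None and root_a == root_b
```

With a check like this, "1st" and "1" can be counted as an exact token match, which leaves the fuzzy comparison to handle only the misspelled word ("Place" vs. "Plce").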
Al
|
b03fbdd681
|
[dedupe] adding multi-word phrase alignments to deduping
|
2018-02-23 01:28:02 -05:00 |
|
Al
|
7cb85aa23c
|
[dedupe] to make soft token similarity order invariant, we swap the order so the shorter token sequence comes first. In the case of a tie, pick the shorter full string length
|
2018-01-26 18:04:45 -05:00 |
|
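The ordering rule in the commit above can be sketched as follows. This assumes whitespace tokenization, which is a simplification, and `order_for_similarity` is a hypothetical name rather than a libpostal function.

```python
def order_for_similarity(s1, s2):
    """Return the pair ordered so the string with fewer tokens comes first.

    Ties on token count are broken by the shorter full string length.
    If both counts are equal, the input order is kept.
    """
    key1 = (len(s1.split()), len(s1))
    key2 = (len(s2.split()), len(s2))
    if key2 < key1:
        return s2, s1
    return s1, s2
```

Because both argument orders map to the same ordered pair (except when token count and string length are both tied), a similarity function applied after this step gives the same result for `(a, b)` and `(b, a)`.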
Al
|
af5a5c3039
|
[dedupe] in the case of abbreviations and acronyms, where we use the higher of the two scores, calculate an offset to the norm of the other string's scores, i.e. since we're replacing the score(s) in the lower-scoring vector with the higher one in the dot product for the numerator, do the same for the L2-norm product in the denominator. This way we don't accidentally inflate the similarity value simply because e.g. an acronym token was more rare than the same acronym spelled out as multiple individual letters (which tend to be low-information/common tokens).
|
2018-01-25 16:32:47 -05:00 |
|
Al
|
d0fe31d359
|
[dedupe] for strict abbreviations (defined as sharing a prefix and a suffix, and containing matches+gaps only by the subtotaling affine gap measure), using the greater of the two scores. This accounts for cases where the abbreviated version may have a much higher weight in one string than the non-abbreviated version does in the other. Same for acronym alignments. Making sure there's a common prefix in regular abbreviation detection. Capping the Soft-TFIDF similarity at 1.0.
|
2018-01-25 14:23:18 -05:00 |
|
Al
|
b4cc7395a2
|
[fix] was missing some shorter tokens that are unicode equal in Soft-TFIDF
|
2018-01-25 04:23:35 -05:00 |
|
Al
|
c4aaee7dbf
|
[dedupe/similarity] also utilizing the L2 norm in similarity when acronyms are detected. Similarity in this case should be the acronym token's score * the L2 norm of the expanded tokens' scores in the longer string
|
2018-01-23 01:20:14 -05:00 |
|
Al
|
eb3fb37ad4
|
[similarity/dedupe] normalizing by the product of the L2 norms in soft token similarity function, as in cosine similarity. Score vectors should be passed in unnormalized, and typically with unit length. Also, for aligned phrases that share the same canonical phrase, contribute the product of the two norms of the phrase vectors to the similarity's numerator (maximum value, as if each token in both strings had matched exactly). The previous version over-counted the importance of aligned multi-word phrases by doing a cross product, which could overshadow other more important terms.
|
2018-01-22 01:38:12 -05:00 |
|
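A rough sketch of the normalization described in the commit above, under stated assumptions: token matches arrive as `(i, j, sim)` triples, phrase alignments arrive as pairs of index lists, and the function names are hypothetical. It illustrates the arithmetic only and is not the libpostal C code.

```python
import math


def l2_norm(scores):
    """Euclidean length of a score vector."""
    return math.sqrt(sum(s * s for s in scores))


def normalized_similarity(scores1, scores2, token_matches, phrase_matches=()):
    """Cosine-style similarity over aligned tokens and phrases.

    token_matches: iterable of (i, j, sim), where token i of string 1 aligns
        with token j of string 2 with secondary similarity sim in [0, 1].
    phrase_matches: iterable of (indices1, indices2) for multi-word phrases
        that share the same canonical phrase.
    """
    n1, n2 = l2_norm(scores1), l2_norm(scores2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0

    numerator = sum(scores1[i] * scores2[j] * sim for i, j, sim in token_matches)

    # An aligned phrase contributes the product of the two phrase-vector norms,
    # the value it would have if every token in both phrases matched exactly.
    for idx1, idx2 in phrase_matches:
        numerator += l2_norm([scores1[i] for i in idx1]) * l2_norm([scores2[j] for j in idx2])

    return numerator / (n1 * n2)
```

For example, with `scores1 = [3.0, 4.0]` for a two-token phrase and `scores2 = [5.0]` for its one-token abbreviation, a single phrase alignment yields a numerator of 5 × 5 and a denominator of 5 × 5, so the similarity is 1.0 rather than the inflated value a cross product of all token pairs would give.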
Al
|
e935f2a036
|
[fix] need to calculate max Jaro-Winkler for other methods, so only test whether we should use it after we've cycled through all the tokens
|
2018-01-06 03:59:34 -05:00 |
|
Al
|
4356174630
|
[similarity] adding a match count in Soft-TFIDF to allow answering questions about subsets, i.e. the set of tokens in "Park Pl" contains the set of tokens in "Park". Setting a Jaro-Winkler minimum length of 4 chars, and using a more specific option name for possible abbreviation detection
|
2018-01-06 03:50:03 -05:00 |
|
Al
|
434bbd4dc2
|
[fix] removing unused vars
|
2017-12-30 02:31:43 -05:00 |
|
Al
|
f1e6886536
|
[similarity/dedupe] adding options for acronym alignments and address phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to NYU vs. "New York University" whereas phrase matches would match known phrases that share the same canonical like "Cty Rd" vs. "C.R." vs. "County Road" within the Soft-TFIDF similarity calculation.
|
2017-12-29 02:39:49 -05:00 |
|
Al
|
b90c3dab4b
|
[similarity/dedupe] adding Soft-TFIDF implementation with several different fallback qualifiers for the max-sim function (Damerau-Levenshtein and libpostal's new bucketed affine gap method for detecting abbreviations), but keeping Jaro-Winkler as the secondary similarity function in the final distance metric. Overall this should result in higher similarity values when one of the tokens may not quite match the pure secondary threshold in terms of Jaro-Winkler but may match on one of the other criteria.
|
2017-12-28 04:34:46 -05:00 |
|
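For orientation, here is a minimal sketch of the basic Soft-TFIDF measure that the commits above build on. It makes several assumptions: `difflib.SequenceMatcher` stands in for Jaro-Winkler as the secondary similarity, the fallback qualifiers (Damerau-Levenshtein, affine gap abbreviation detection) are omitted, and the function names and the default threshold are hypothetical.

```python
import math
from difflib import SequenceMatcher


def secondary_sim(a, b):
    """Stand-in for Jaro-Winkler (an assumption): difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(tokens1, scores1, tokens2, scores2, threshold=0.9, sim=secondary_sim):
    """Soft-TFIDF: each token in string 1 pairs with its most similar token in
    string 2, and the pair counts only if the similarity reaches the threshold."""
    n1 = math.sqrt(sum(s * s for s in scores1))
    n2 = math.sqrt(sum(s * s for s in scores2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0

    total = 0.0
    for t1, s1 in zip(tokens1, scores1):
        best_sim, best_score = 0.0, 0.0
        for t2, s2 in zip(tokens2, scores2):
            x = 1.0 if t1 == t2 else sim(t1, t2)
            if x > best_sim:
                best_sim, best_score = x, s2
        if best_sim >= threshold:
            # Weighted contribution, scaled by how close the best match was.
            total += s1 * best_score * best_sim
    return total / (n1 * n2)
```

A call such as `soft_tfidf(["park", "place"], [1.0, 1.0], ["park", "plaec"], [1.0, 1.0], threshold=0.6)` lets the transposed token contribute partially, whereas with the default threshold only the exact "park" match counts.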