Al
|
c553fe81ee
|
[dedupe] using 4-grams with no edge disambiguation in near-dupe hashing of names instead of full tokens (uses the double metaphone for Latin script, 2-grams for ideographic scripts, and 4-grams of Unicode characters for other scripts such as Arabic or Cyrillic). The fully concatenated name string with no whitespace and the acronyms/sub-acronyms now also use the double metaphone in Latin script, and are split into 4-grams. Overall this reduces the number of keys and accounts for more misspellings, for languages with longer words such as German, and for spacing/concatenation differences in general, while still being relatively selective. Most words in Latin script will resolve to fewer than 4 characters, so this mostly affects longer words with many consonants. 4-gram blocking tends to be what's used in the literature, and works well in practice on human and venue names. This is a slight departure from said literature in that we use 4-grams of the phonetic normalization for Latin scripts.
|
2018-01-14 19:12:28 -05:00 |
|
Al
|
f5e41a1f57
|
[fix] logic in sub-acronym generation for near-dupe hashes
|
2018-01-11 13:15:19 -05:00 |
|
Al
|
6ba0403748
|
[dedupe] adding a near-dupe hash for acronyms both with and without stopwords. This will create basic acronyms for institutions like MoMA, UCLA, the NAACP, as well as human initials, etc. It also handles sub-acronyms: at every other non-contiguous stopword (University of Texas at Austin) or at punctuation (University of Texas, Austin), it cuts a new sub-acronym (so UT). All of the acronyms for Latin script use a double metaphone as well, so they can potentially catch many cases. It does not handle all possible acronyms (e.g. where some of the letters are word-internal, as in medical acronyms), but should do relatively well on many common variations.
|
2018-01-10 22:23:40 -05:00 |
|
Al
|
e6edf54adb
|
[dedupe] adding a near-dupe hash for the entire name without spaces.
|
2018-01-08 19:02:19 -05:00 |
|
Al
|
7651a7b9b9
|
[fix] fixing a couple of warnings in dedupe/near_dupe
|
2017-12-31 19:20:17 -05:00 |
|
Al
|
c48c2b778c
|
[dedupe] fixes to near dupe hashing, geohash lengths, cutting off name hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present
|
2017-12-30 02:28:38 -05:00 |
|
Al
|
cadf52d19f
|
[fix] making a few internal functions static
|
2017-12-29 04:50:08 -05:00 |
|
Al
|
acfdb50d7c
|
[dedupe] adding near-dupe hashing function, which can be thought of as the blocking function in record linkage or as a form of locality-sensitive hashing in general document deduping. The goal is that if two addresses/names are the same, they should share at least one hash. These hashes can also be used as keys in an inverted index (DB, ES, hashtable, etc.). Uses the double metaphone for name words in Latin script; otherwise uses each individual token, and sequences of two tokens in the case of ideograms (e.g. Chinese, Japanese, Korean).
|
2017-12-24 02:47:45 -05:00 |
|
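
For illustration only, the n-gram blocking idea described in the commits above could be sketched roughly as follows. This is not the code from these commits: the function names (`char_ngrams`, `name_keys`, `share_key`) are hypothetical, tokenization is plain whitespace splitting, and the double metaphone step for Latin script is left out entirely, so the keys here are n-grams of the lowercased text rather than of a phonetic normalization.

```python
def char_ngrams(s: str, n: int) -> list[str]:
    """Unique character n-grams of s, in order of first appearance.

    No edge markers are added. A string no longer than n yields itself
    as its only key (an assumption made for this sketch).
    """
    if not s:
        return []
    if len(s) <= n:
        return [s]
    seen: dict[str, None] = {}
    for i in range(len(s) - n + 1):
        seen.setdefault(s[i:i + n], None)
    return list(seen)


def name_keys(name: str, n: int = 4) -> set[str]:
    """Hypothetical blocking keys for a name.

    Keys are the n-grams of each lowercased token plus the n-grams of
    the whole name concatenated without whitespace. Phonetic
    normalization (double metaphone) is deliberately omitted.
    """
    tokens = name.lower().split()
    keys: set[str] = set()
    for token in tokens:
        keys.update(char_ngrams(token, n))
    keys.update(char_ngrams("".join(tokens), n))
    return keys


def share_key(a: str, b: str, n: int = 4) -> bool:
    """True if the two names share at least one blocking key."""
    return not name_keys(a, n).isdisjoint(name_keys(b, n))


if __name__ == "__main__":
    # Spacing/concatenation variants should land in a common block.
    print(share_key("Hauptbahnhof", "Haupt Bahnhof"))
    # 2-grams, as described for ideographic scripts.
    print(char_ngrams("东京大学", 2))
```

Only pairs for which `share_key` is true would then need a full (more expensive) comparison, which is the usual role of a blocking function; the key sets could equally be stored in an inverted index.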