17 Commits

Author SHA1 Message Date
Al
0540d7c7e3 [api/compat] PR #465 redefined the language classifier response struct in the API and was casting between incompatible pointer types. Using the exported struct throughout. 2025-01-30 01:45:18 -05:00
Al
893745f09b [near_dupes] using quadgrams in Latin scripts as well for near dupe hashes 2022-03-25 14:05:03 -04:00
Al
b7052caf6b [dedupe] allow near-dupe hashes if only a small containing boundary is present (e.g. county/state district). 2019-02-16 22:26:13 -05:00
Al
835de327c3 [dedupe] for near-dupe hashing, remove whitespace from root expansions so something like "Ocean Walk Dr" and "Oceanwalk Dr" will have a chance of matching downstream 2018-02-24 00:34:09 -05:00
Al
156c8bed40 [fix] check that second double metaphone alternative is not the empty string 2018-02-06 03:08:37 -05:00
Al
4d3619d493 [dedupe] moving name-only near-dupe hashes to a separate if block so they can be used in conjunction with name+address 2018-02-03 14:03:17 -05:00
Al
3c5713ef59 [fix] check for sub-acronyms with no stopwords in near-dupe hashing 2018-01-21 17:49:46 -05:00
Al
7121642c62 [dedupe] fixing sub-acronym near-dupe hashes with punctuation, and making sure to add the current token after a new sub-acronym has been cut 2018-01-18 00:11:21 -05:00
Al
03e5e25240 [dedupe] adding a near-dupe hash which takes into account existing acronyms that may already appear in the string. For known acronyms defined in the dictionaries, like "HS", the full token is included in the acronym. This feature is particularly useful for public schools or other cases where the canonical string may be used, e.g. "Foo High School", "Foo HS" and "FHS". It also does the same thing for other acronyms that are identified by the tokenizer from the internal period structure, like A.B.C. Also now allowing mixed alpha-numeric tokens to use the double metaphone encoding as well, and for numeric tokens with script=Common (digits but may also contain hyphens, etc.), the full token is included as one of the words rather than quadgrams, which don't make sense for numerics. 2018-01-16 03:38:22 -05:00
Al
c553fe81ee [dedupe] using 4-grams with no edge disambiguation in near dupe hashing of names instead of full tokens (uses the double metaphone for Latin script, 2-grams for ideographic scripts and 4-gram unicode chars for other scripts like Arabic or Cyrillic). The fully concatenated name string with no whitespace + acronyms/subacronyms now also use double-metaphone in Latin script, and are split into 4-grams. Overall this reduces the number of keys, accounts for more misspellings as well as languages with longer words such as German, and various spacing/concatenation differences in general, while still being relatively selective. Most words in Latin scripts will resolve to fewer than 4 characters, so this mostly affects longer words with many consonants. 4-gram blocking tends to be what's used in the literature, and works well in practice on human and venue names. This is a slight departure from said literature in that we use 4-grams of the phonetic normalization for Latin scripts. 2018-01-14 19:12:28 -05:00
Al
f5e41a1f57 [fix] logic in sub-acronym generation for near-dupe hashes 2018-01-11 13:15:19 -05:00
Al
6ba0403748 [dedupe] adding a near-dupe hash for acronyms both with and without stopwords. This will create basic acronyms for institutions like MoMA, UCLA, the NAACP, as well as human initials, etc. It also handles sub-acronyms: at every other non-contiguous stopword (University of Texas at Austin) or at punctuation (University of Texas, Austin), it cuts a new sub-acronym (so UT). All of the acronyms for Latin script use a double metaphone as well, so can potentially catch many cases. It does not handle all possible acronyms (e.g. where some of the letters are word-internal as in medical acronyms), but should do relatively well on many common variations. 2018-01-10 22:23:40 -05:00
Al
e6edf54adb [dedupe] adding a near-dupe hash for the entire name without spaces. 2018-01-08 19:02:19 -05:00
Al
7651a7b9b9 [fix] fixing a couple of warnings in dedupe/near_dupe 2017-12-31 19:20:17 -05:00
Al
c48c2b778c [dedupe] fixes to near dupe hashing, geohash lengths, cutting off name hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present 2017-12-30 02:28:38 -05:00
Al
cadf52d19f [fix] making a few internal functions static 2017-12-29 04:50:08 -05:00
Al
acfdb50d7c [dedupe] adding near-dupe hashing function, which can be thought of as the blocking function in record linkage or as a form of locality-sensitive hashing in general document deduping. The goal is, if two addresses/names are the same, they should share at least one hash. These hashes can also be used as an inverted index (DB, ES, hashtable, etc.). Uses the double metaphone for name words in Latin script (otherwise each individual token, and sequences of two tokens in the case of ideograms for e.g. Chinese, Japanese, Korean, etc.) 2017-12-24 02:47:45 -05:00
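
The commit messages above describe blocking keys built from character 4-grams, the whitespace-free concatenated name, and acronyms with and without stopwords. The Python sketch below illustrates only that general idea and is not the project's implementation: the function names, the key prefixes, and the stopword list are all hypothetical, and it leaves out the double metaphone encoding, script detection, sub-acronyms, and geohash/address components that the commits mention.

```python
import re

# Hypothetical stopword list, for illustration only; the commits refer to dictionaries.
STOPWORDS = {"of", "the", "at", "and", "for"}


def char_ngrams(s, n=4):
    """Return the set of overlapping character n-grams of s.

    Strings no longer than n are returned whole so short tokens still yield a key.
    """
    if not s:
        return set()
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def acronym(name, keep_stopwords=False):
    """Join the first letter of each token, optionally skipping stopwords."""
    tokens = re.findall(r"\w+", name.lower())
    return "".join(t[0] for t in tokens if keep_stopwords or t not in STOPWORDS)


def near_dupe_keys(name):
    """Build a set of blocking keys; two names are candidate dupes if they share one."""
    tokens = re.findall(r"\w+", name.lower())
    keys = set()
    # Per-token 4-grams (the commits apply these to a phonetic encoding for Latin script).
    for t in tokens:
        keys |= {"ng:" + g for g in char_ngrams(t)}
    # 4-grams of the whole name with whitespace removed.
    keys |= {"cat:" + g for g in char_ngrams("".join(tokens))}
    # Acronyms without and with stopwords.
    for keep in (False, True):
        a = acronym(name, keep_stopwords=keep)
        if len(a) > 1:
            keys.add("ac:" + a)
    return keys


if __name__ == "__main__":
    shared = near_dupe_keys("Ocean Walk Dr") & near_dupe_keys("Oceanwalk Dr")
    print(sorted(shared))
```

Under these assumptions, "Ocean Walk Dr" and "Oceanwalk Dr" share keys from the concatenated form, which is the effect the whitespace-removal commit aims for. The key sets could be stored in an inverted index so that only records sharing a key are compared in full.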