Commit Graph

12 Commits

Author SHA1 Message Date
Al 071aee0e85 [fix] in root expansions, removing phrases that are invalid for the given components if there are other ignorable components 2018-01-02 03:49:52 -05:00
Al cadf52d19f [fix] making a few internal functions static 2017-12-29 04:50:08 -05:00
Al cabdbfccd2 [fix] using same order in root expansions 2017-12-28 23:55:41 -05:00
Al d731339811 [expand] fixing case where too many permutations were getting added for longer strings due to the new-ish ordinal suffix handling, using string_tree_num_tokens instead of string_tree_num_strings throughout to check for previously added words, using new is_likely_roman_numeral API 2017-12-27 21:48:54 -05:00
Al 152761fcbc [expand] adding improvements to root expansions (using possible phrase roots even if they're abbreviated e.g. "E Ctr St", adding special valid components check for root expansions beyond what's stored in the build address dictionaries), removing spaces before checking unique strings, only splitting numeric from alpha in the case of non-ordinals, using cstring_array internally and char ** in the public API 2017-12-25 01:37:42 -05:00
Al d03ce4e058 [expand] remove blank expansions and strip spaces 2017-12-18 18:17:16 -05:00
Al f63a9cc579 [expand] adding number phrases as ignorable in PO boxes 2017-12-17 22:12:12 -05:00
Al 727469b736 [expand] no longer delete phrases in cases like "PH 1" for units, where there's a phrase that can accompany numbered units and thus be ignored similar to "Apt 1" but that phrase may also be a qualifier (i.e. Apt 1 and Penthouse 1 are not the same) 2017-12-17 21:57:25 -05:00
Al a1db4d7734 [expand/normalize] the split_alpha_from_numeric option now applies to both e.g. A1 and 1A since we now strip out ordinal suffixes prior to normalization 2017-12-17 19:53:15 -05:00
Al 9eef46adee [expand] in cases like "Avenue D" where there are two phrases, one is ambiguous (and canonical) but not necessarily edge-ignorable (pre/post-directional), allow deletion of the other token (so "Avenue" in this case). Also allows skipping in cases where the language classifier may predict a second language with some small probability, such as French for a short string like "Avenue D" (in addition to English). If the token was ignorable in the highest probability language, ignore it in both. 2017-12-17 17:24:27 -05:00
Al 3f7abd5b24 [expand] adding a method that allows hash/equality comparisons of addresses like "100 Main" with "100 S Main St." or units like "Apt 101" vs. "#101". Instead of expanding the phrase abbreviations, this version tries its best to delete all but the root words in a string for a specific component. It's probably not perfect, but does handle a number of edge cases related to pre/post directionals in English e.g. "E St" will have a root word of simply "E", "Avenue E" => "E", etc. Also handles a variety of cases where the phrase could be a thoroughfare type but is really a root word such as "Park Pl" or the famous "Avenue Rd". This can be used for near dupe hashing to catch possible dupes for later analysis. Note that it will normalize "St Marks Pl" and "St Marks Ave" to the same thing, which is sometimes warranted (if the user typed the wrong thoroughfare), but can also be reconciled at deduping time. 2017-12-17 15:48:11 -05:00
Al 8968a6c966 [expand] moving expand to its own module so the internal methods can be exposed, calling from libpostal.c 2017-12-08 16:26:13 -05:00