Commit Graph

24 Commits

Author SHA1 Message Date
Al
3610ffaa05 [expand/dedupe] expansion with multiple languages (or multiple predicted languages) can sometimes produce weird string trees and thus either too many results or incorrect results, particularly for root expansions which we depend on for matching/deduping. Making one call per language identified. This may slightly affect performance on languages that are highly ambiguous (even that's doubtful, as libpostal usually identifies one or two languages with high accuracy and/or people are using a known geography) but should improve the results and was simpler implementation-wise than trying to use a single string tree for multiple languages where, say, a two word phrase in one language might simply be token-space-token in another. 2019-02-16 22:20:36 -05:00
Al
283be99b44 [numex] helper function to retrieve ordinal suffix lengths from a tokenized string for use in deduping 2018-02-24 00:31:26 -05:00
Al
0f20613c13 [expand] using street name dictionaries as a possible root component instead of having to pollute the synonyms dictionary, which also affects the parser and might be a better place for general purpose synonyms affecting all components. 2018-02-21 22:16:07 -05:00
Al
09408b1075 [fix] for regular expansion, use gazetteer components or overrides 2018-02-15 18:55:37 -05:00
Al
78d621ac85 [fix] adding street type gazetteer to name component as well for things like "24th St Cheese Co" 2018-02-15 18:13:27 -05:00
Al
9c12a11fd7 [fix] check expansion address components for regular expansion, overrides for root expansion 2018-02-15 16:19:50 -05:00
Al
9390e638ae [fix] for regular non-root expansion, check that components are valid (for near-dupe expansions or other cases where component options are passed in) 2018-02-08 13:49:24 -05:00
Al
3a5c048419 [fix] in root expansions, if the current phrase has at least one valid expansion, and the current expansion is not valid, ignore it 2018-02-06 02:36:05 -05:00
Al
0286a2fef3 [expand] for root expansions, delete ambiguous tokens only when there's a non-numeric non-phrase token present. This applies to all name components, but not to components where numerics can be the root (house numbers, units, streets, etc.) 2018-01-16 03:02:26 -05:00
Al
c29557c16b [expand] adding another check in root expansions, making sure we don't ignore the unmodified ambiguous phrase 2018-01-08 19:03:50 -05:00
Al
66aee0fffa [expand] make street type dictionaries ignorable for venue names as well (many company names mention their address, so sort of have to apply the same rules) 2018-01-07 01:39:10 -05:00
Gregory Oschwald
2f6749fe03 Fix segfault in expand_alternative_phrase_option
string_tree_get_alternative can return NULL
2018-01-02 13:28:51 -08:00
Al
071aee0e85 [fix] in root expansions, removing phrases that are invalid for the given components if there are other ignorable components 2018-01-02 03:49:52 -05:00
Al
cadf52d19f [fix] making a few internal functions static 2017-12-29 04:50:08 -05:00
Al
cabdbfccd2 [fix] using same order in root expansions 2017-12-28 23:55:41 -05:00
Al
d731339811 [expand] fixing case where too many permutations were getting added for longer strings due to the new-ish ordinal suffix handling, using string_tree_num_tokens instead of string_tree_num_strings throughout to check for previously added words, using new is_likely_roman_numeral API 2017-12-27 21:48:54 -05:00
Al
152761fcbc [expand] adding improvements to root expansions (using possible phrase roots even if they're abbreviated e.g. "E Ctr St", adding special valid components check for root expansions beyond what's stored in the build address dictionaries), removing spaces before checking unique strings, only splitting numeric from alpha in the case of non-ordinals, using cstring_array internally and char ** in the public API 2017-12-25 01:37:42 -05:00
Al
d03ce4e058 [expand] remove blank expansions and strip spaces 2017-12-18 18:17:16 -05:00
Al
f63a9cc579 [expand] adding number phrases as ignorable in PO boxes 2017-12-17 22:12:12 -05:00
Al
727469b736 [expand] no longer delete phrases in cases like "PH 1" for units, where there's a phrase that can accompany numbered units and thus be ignored similar to "Apt 1" but that phrase may also be a qualifier (i.e. Apt 1 and Penthouse 1 are not the same) 2017-12-17 21:57:25 -05:00
Al
a1db4d7734 [expand/normalize] the split_alpha_from_numeric option now applies to both e.g. A1 and 1A since we now strip out ordinal suffixes prior to normalization 2017-12-17 19:53:15 -05:00
Al
9eef46adee [expand] in cases like "Avenue D" where there are two phrases, one is ambiguous (and canonical) but not necessarily edge-ignorable (pre/post-directional), allow deletion of the other token (so "Avenue" in this case). Also allows skipping in cases where the language classifier may predict a second language with some small probability, such as French for a short string like "Avenue D" (in addition to English). If the token was ignorable in the highest probability language, ignore it in both. 2017-12-17 17:24:27 -05:00
Al
3f7abd5b24 [expand] adding a method that allows hash/equality comparisons of addresses like "100 Main" with "100 S Main St." or units like "Apt 101" vs. "#101". Instead of expanding the phrase abbreviations, this version tries its best to delete all but the root words in a string for a specific component. It's probably not perfect, but does handle a number of edge cases related to pre/post directionals in English e.g. "E St" will have a root word of simply "E", "Avenue E" => "E", etc. Also handles a variety of cases where the phrase could be a thoroughfare type but is really a root word such as "Park Pl" or the famous "Avenue Rd". This can be used for near dupe hashing to catch possible dupes for later analysis. Note that it will normalize "St Marks Pl" and "St Marks Ave" to the same thing, which is sometimes warranted (if the user typed the wrong thoroughfare), but can also be reconciled at deduping time. 2017-12-17 15:48:11 -05:00
Al
8968a6c966 [expand] moving expand to its own module so the internal methods can be exposed, calling from libpostal.c 2017-12-08 16:26:13 -05:00