Commit Graph

48 Commits

Author SHA1 Message Date
Al
ab67e0864a [dictionaries] adding new street_names.txt dictionary and moving all the synonyms to there, generating the new dictionary type in address_dictionaries.py 2018-02-21 22:13:51 -05:00
Al
85b402063b [fix] escape literal backslash in address dictionaries 2016-12-24 16:05:45 -05:00
Al
cca80b046c [abbreviation] fixing abbreviations within hyphenated phrases, particularly for prefix/suffix matches 2016-12-03 17:55:11 -05:00
Al
e15036fcce [fix] if there are street types that are not venue words and not vice versa, then call the venue invalid as a standalone term 2016-11-19 04:11:33 -05:00
Al
5140db536a [phrases] additions to venue names dictionaries and a more restrictive version of street types dictionaries 2016-11-19 02:58:27 -05:00
Al
71be0fdfbc [fix] sets 2016-11-19 02:30:40 -05:00
Al
b6f7b5b577 [fix] name 2016-11-19 01:38:15 -05:00
Al
1df1b60a9f [phrases] adding extract_phrases method to gazetteers, which returns a set of gazetteer phrases found in a given string 2016-11-18 23:35:44 -05:00
Al
1d25f08b52 [expand] adding a function to check if two place names/addresses are equivalent after token normalization (replacing hyphens, deleting final periods, lowercasing, simple transliteration, etc.) and taking into account abbreviations from any specified libpostal dictionaries. In conjunction with place name affixes, useful in data sets like GeoPlanet or GeoNames to determine if a name variant is related to the original or not 2016-10-12 14:55:59 -04:00
Al
14c20091f4 [fix] abbreviations in hyphenated phrases like Saint-Germaine. Hyphenation should use the phrase length not the token length 2016-09-12 22:20:25 -04:00
Al
551cce8cb1 [fix] making a separate gazetteer for toponym abbreviations 2016-09-10 01:08:58 -04:00
Al
bae04eb543 [fix] int 2016-08-28 14:11:25 -04:00
Al
de0a7bfe4f [fix] /or/and/ 2016-08-28 14:09:30 -04:00
Al
44e59e8daf [fix] return the original for already abbreviated tokens 2016-08-28 14:05:58 -04:00
Al
3cf3e401db [fix] abbreviation recasing 2016-08-28 12:04:36 -04:00
Al
2e7f8f1ae7 [abbreviations] Adding toponyms gazetteer for probabilistically abbreviating things like Mount=>Mt, Saint=>St, Fort=>Ft in place names 2016-08-24 18:52:00 -04:00
Al
dfa5c8e0a6 [abbreviations] Adding ability to abbreviate within hyphenated phrases e.g. Sint-Maarten => St.-Maarten 2016-08-24 18:50:24 -04:00
Al
8b57a7acf2 [osm] abbreviate toponyms (qualifiers) with some probability so we get those versions in the model's phrase dictionaries 2016-08-22 20:55:35 -04:00
Al
dd7ef6fabf [dictionaries] Making new component for near/nearby prepositions 2016-07-21 17:04:57 -04:00
Al
9561f771ce [dictionaries] Adding new dictionary types to generator script 2016-07-21 17:04:57 -04:00
Al
4e4686fbfe [gazetteers] Street and synonym dictionary for catching other abbreviations that occur in street names 2016-07-21 17:04:57 -04:00
Al
38607b0a50 [fix] var name for error case 2016-07-21 17:04:57 -04:00
Al
b50120f45c [chains] Adding chains gazetteer 2016-07-21 17:04:57 -04:00
Al
771a360a85 [phrases] Using safe_encode/safe_decode as default trie serializer/deserializer 2016-07-21 17:04:57 -04:00
Al
3a9ac9d96f [fix] six.u 2016-07-21 17:04:57 -04:00
Al
7b42e52c6a [fix] token_types.PHRASE 2016-07-21 17:04:57 -04:00
Al
d5dc34ec1d [gazetteers] moving PHRASE to a token type 2016-07-21 17:04:57 -04:00
Al
62748b4644 [dictionaries] /house_number/house_numbers/ 2016-07-21 17:04:57 -04:00
Al
6d4e54cd7a [dictionaries] making entrances/postcodes plural for consistency 2016-07-21 17:04:57 -04:00
Al
410eb0006a [dictionaries] Moving intersections to cross streets 2016-07-21 17:04:57 -04:00
Al
2f9a58f37b [expansion] Add postcode dictionary to gazetteer types 2016-07-21 17:04:57 -04:00
Al
e1f1e34dca [expansion] Modifying the Python gazetteers to use new dictionaries API 2016-07-21 17:04:57 -04:00
Al
80089099e9 [expansion] Adding number and intersections to dictionary types 2016-07-21 17:04:57 -04:00
Al
3d3aacae67 [addresses] Adding abbreviations as a separate module so it can be used with multiple data sets 2016-07-21 17:04:57 -04:00
Al
9dd5d5c210 [dictionaries] encapsulating reading address dictionaries so it's easy to implement sampling for the address training data 2016-07-21 17:04:57 -04:00
Al
f3a9f4a257 [fix] removing init_gazetteers, doing it at the module level 2016-07-21 17:04:57 -04:00
Al
0162194dbc [dictionaries] Adding dictionary type enums to the generator script 2016-07-21 17:04:57 -04:00
Al
18e2c7519e [fix] Absolute dir check in generating expansion data files 2016-03-13 23:23:46 -04:00
Al
1003832b9c [fix] README should not be included in building address dictionaries 2016-03-09 11:18:19 -05:00
Al
52ebc9fc46 [fix] Paths relative to the current file in address_dictionaries.py so it can be run from anywhere 2016-02-24 13:10:44 -05:00
Al
b22646ee30 [mv] Moving gazetteers into their own module 2016-01-22 03:15:56 -05:00
Al
35db855819 [fix] canonical index in address expansion data, should be -1 for all canonical phrases 2015-12-08 15:09:51 -05:00
Al
a5ce1f12dd [fix] stdint header in address expansion rule generation script 2015-08-08 23:28:11 -04:00
Al
b27af13f8a [expansion] Adding an array of dictionaries to each (phrase, canonical) pair 2015-07-22 20:24:14 -04:00
Al
64a63fdf51 [mv] Moving all repo data files to a resources dir, data is only for runtime files 2015-07-21 18:11:36 -04:00
Al
7f67ed7dc0 [fix] less ambiguous variable name in the generated expansions data file 2015-07-20 02:58:26 -04:00
Al
b9103a39fa [expansion] Moving filename=>dictionary type mapping to the Python generation script and validating there 2015-07-16 03:51:11 -04:00
Al
f181c04e7a [expansion] expansion rule structs and Python script to generate rules from dictionaries tree. Note that a canonical_index of -1 indicates that a given phrase is the canonical (saves space) 2015-07-16 02:49:53 -04:00