Commit Graph

927 Commits

Author SHA1 Message Date
Al
588cf1df86 [build] Changing options to libpostal_data script to allow downloading geodb, uploaded first version to S3 2015-10-11 22:25:37 -05:00
Al
39d3af20cf [build] Checking for shuf/gshuf 2015-10-11 11:13:53 -05:00
Al
372e952cd3 [geodb] Adding some logging to geodb 2015-10-11 01:00:08 -05:00
Al
cb334b9fb1 [geodisambig] Shaving a few hundred more megabytes off of the geodb by only adding a single geohash prefix and not indexing the neighbors (query can use its neighbors) 2015-10-11 00:45:26 -05:00
Al
2394f817e4 [phrases] Fixing fallback at the end of a string in trie search 2015-10-11 00:13:21 -05:00
Al
29bc0fd11e [build] Makefile changes for the new geodb 2015-10-09 15:54:44 -04:00
Al
a6fbd48bec [geodb] geodb builder changes to support the new, more compact geodb 2015-10-09 15:53:56 -04:00
Al
bf596b9184 [utils] integer string sizes 2015-10-09 15:40:47 -04:00
Al
4dad121334 [fix] Initializing booleans in postal code constructor 2015-10-09 15:40:28 -04:00
Al
44da2e446b [geodb] Additional filenames and struct members in geodb.h 2015-10-09 15:37:10 -04:00
Al
67d128c386 [graph] graph_load and graph_save 2015-10-09 15:36:14 -04:00
Al
9fe2250521 [geodb] Using a trie for geo disambiguation features rather than the sparkey hashtable, sparkey simply contains the ids or code/country pairs in the case of postal codes 2015-10-09 15:35:50 -04:00
Al
cd6a0ab90b [geodb] Prefixing features with name for geo disambiguation (better trie compression) and removing the longer geohash prefix features 2015-10-09 15:16:08 -04:00
Al
77c4bb10c6 [utils] Adding kh_foreach_key 2015-10-09 11:51:32 -04:00
Al
151161cab3 [fix] Raising error in geonames output if a country cannot be localized 2015-10-07 03:45:56 -04:00
Al
1917816b80 [countries] Not relying on pycountry alpha 2 codes for localized country names as it doesn't contain Kosovo which was causing problems 2015-10-07 03:44:49 -04:00
Al
1e98932b82 [fix] setting array->n after reading in both graph and sparse_matrix implementations 2015-10-06 19:28:28 -04:00
Al
5a231fb709 [graph] Builder for graphs not constructed in vertex-sorted order 2015-10-06 19:03:10 -04:00
Al
4984352eda [graph] Simple sparse graph implementation, essentially a sparse matrix with no values array 2015-10-06 18:58:18 -04:00
Al
3084fc929b [geodb] Was missing country boundary type in GeoDB causing some misses in parsing 2015-10-06 16:01:22 -04:00
Al
5af6dc77d1 [dictionaries] Adding a few additional abbreviated names of political leaders that come up, a missing abbreviation 2015-10-06 15:09:50 -04:00
Al
5f03bc9369 [fix] Unit dictionaries apply to ADDRESS_UNIT component 2015-10-06 12:04:31 -04:00
Al
91f4e477ad [fix] typo 2015-10-06 12:04:07 -04:00
Al
0eb9ef5bdf [tokenization] Regenerating scanner.c 2015-10-05 01:41:48 -04:00
Al
50a36cc595 [parser] using trie_new_from_hash instead of an inline implementation in averaged perceptron training 2015-10-04 18:31:16 -04:00
Al
ff8986a287 [phrases] trie_new_from_hash compresses a {str: uint32_t} hashtable into a trie in sorted order 2015-10-04 18:28:21 -04:00
Al
55a5a79b4b [tokenization] tokenized string with source 2015-10-04 18:27:04 -04:00
Al
aa39c45b87 [tokenization] skipping control characters in tokenization, comes up in OSM surprisingly 2015-10-04 18:25:50 -04:00
Al
d6480d2902 [utils] Adding ksort for strings by default in collections.h 2015-10-04 18:23:42 -04:00
Al
db63e6dbc3 [fix] making ksort methods static 2015-10-04 18:23:09 -04:00
Al
ed51fce291 [fix] Safe to assume Bokmål for Norwegian street addresses 2015-10-04 11:19:43 -04:00
Al
cfa57c96a3 [fix] untagged formatted addresses 2015-10-04 02:02:59 -04:00
Al
89d0fd5718 [fix] Alpha-numeric splitting 2015-10-03 16:40:10 -04:00
Al
6428c0ae20 [utils] cstring_array_cat 2015-10-03 16:00:13 -04:00
Al
5d2a24872a [osm] Adding dependencies so single street names are not valid without at least one of {house, number, suburb, city, postcode} 2015-10-03 15:22:26 -04:00
Al
77be2fe433 [osm] Adjusting priors for country code expansion 2015-10-03 15:13:16 -04:00
Al
0b98a26426 [fix] keeping name tag in address components 2015-10-03 15:10:14 -04:00
Al
0f9ad259dc [osm] Doing initial formatting after replacing country/state 2015-10-03 14:40:38 -04:00
Al
71233c9c02 [fix] import, initialization 2015-10-03 14:37:08 -04:00
Al
85b17d9b27 [fix] file encoding 2015-10-03 14:34:29 -04:00
Al
1948aa87ea [fix] typo 2015-10-03 14:33:45 -04:00
Al
22efce7337 [osm/parsing] Randomly replacing country codes with local and foreign language expansions as well as randomly expanding state abbreviations to make parser more robust to different input 2015-10-03 14:31:51 -04:00
Al
8920812055 [expansion] Adding state abbreviations for US, Canada and Australia for expansion while generating OSM training data 2015-10-03 14:25:30 -04:00
Al
7eb18f3538 [languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.) 2015-10-03 13:20:23 -04:00
Al
0aa6950b6c [fix] abbreviations 2015-10-02 23:48:21 -04:00
Al
db71b65412 [fix] checking validity of component combination 2015-10-02 20:28:45 -04:00
Al
a2fd6e25f8 [fix] import 2015-10-02 20:25:48 -04:00
Al
49abb70b59 [fix] dictionary 2015-10-02 20:24:21 -04:00
Al
521f33d892 [fix] bitset for address components, only looking at valid component keys 2015-10-02 20:21:59 -04:00
Al
528285f735 [fix] only OSM tagged addresses need extra logic 2015-10-02 20:18:30 -04:00