cfa57c96a3[fix] untagged formatted addresses
Al
2015-10-04 02:02:59 -04:00
89d0fd5718[fix] Alpha-numeric splitting
Al
2015-10-03 16:40:06 -04:00
6428c0ae20[utils] cstring_array_cat
Al
2015-10-03 16:00:13 -04:00
5d2a24872a[osm] Adding dependencies so single street names are not valid without at least one of {house, number, suburb, city, postcode}
Al
2015-10-03 15:22:21 -04:00
77be2fe433[osm] Adjusting priors for country code expansion
Al
2015-10-03 15:13:16 -04:00
0b98a26426[fix] keeping name tag in address components
Al
2015-10-03 15:10:14 -04:00
0f9ad259dc[osm] Doing initial formatting after replacing country/state
Al
2015-10-03 14:40:38 -04:00
71233c9c02[fix] import, initialization
Al
2015-10-03 14:37:08 -04:00
85b17d9b27[fix] file encoding
Al
2015-10-03 14:34:29 -04:00
1948aa87ea[fix] typo
Al
2015-10-03 14:33:45 -04:00
22efce7337[osm/parsing] Randomly replacing country codes with local and foreign language expansions as well as randomly expanding state abbreviations to make parser more robust to different input
Al
2015-10-03 14:31:51 -04:00
8920812055[expansion] Adding state abbreviations for US, Canada and Australia for expansion while generating OSM training data
Al
2015-10-03 14:25:30 -04:00
7eb18f3538[languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.)
Al
2015-10-03 13:20:19 -04:00
0aa6950b6c[fix] abbreviations
Al
2015-10-02 23:48:21 -04:00
db71b65412[fix] checking validity of component combination
Al
2015-10-02 20:28:45 -04:00
a2fd6e25f8[fix] import
Al
2015-10-02 20:25:48 -04:00
49abb70b59[fix] dictionary
Al
2015-10-02 20:24:21 -04:00
521f33d892[fix] bitset for address components, only looking at valid component keys
Al
2015-10-02 20:21:52 -04:00
528285f735[fix] only OSM tagged addresses need extra logic
Al
2015-10-02 20:18:30 -04:00
83aecb9f2c[osm/parsing] Making tagged training data for address parser more robust to the types of partial input we see in geocoding by randomly eliminating components subject to some constraints (e.g. house number cannot be used without a street name)
Al
2015-10-02 19:52:13 -04:00
c790a2b87f[fix] spoken/official
Al
2015-10-02 19:50:11 -04:00
db3364be30[geonames] Using official country languages in GeoNames
Al
2015-10-01 00:45:34 -04:00
01856dd36d[fix] acronyms
Al
2015-10-01 00:24:04 -04:00
562aeb497d[tokenization] Regenerating scanner.c
Al
2015-09-30 11:32:38 -04:00
689b830ad2[tokenization] Acronym vs abbreviation
Al
2015-09-30 04:10:04 -04:00
7dfbcce9ec[languages] options for get_country_languages
Al
2015-09-30 04:09:07 -04:00
86e9166ae8[doc] documentation for country_names module, fixing variable name
Al
2015-09-30 03:08:04 -04:00
42e77cb570[countries] Making country official names align better with OSM/Wikipedia, plugging holes
Al
2015-09-30 01:02:59 -04:00
0cedc68a97[languages] Changing Arabic to default in North African countries with two official languages. Making Danish secondary in the US Virgin Islands
Al
2015-09-30 01:01:42 -04:00
40cf247655[formatting] Constants for field names, a few options in format_address
Al
2015-09-29 23:03:37 -04:00
22e8178a97[countries] Adding module for getting official country names in every language from CLDR + a dictionary of local language names
Al
2015-09-29 21:08:52 -04:00
c3c6a18df8[geodb] Renaming geodb
Al
2015-09-29 13:07:50 -04:00
8ca22247f9[fix] labels in averaged perceptron trainer
Al
2015-09-29 13:07:07 -04:00
6666f0baf8[fix] Labels in averaged perceptron tagger
Al
2015-09-29 13:06:34 -04:00
05da2ee6bd[dictionaries] Adding commonly used colon form No: for Turkish addresses
Al
2015-09-28 17:48:19 -04:00
daad1a1313[geonames] Removing alternate names from geonames data set which are digits-only (most are not legitimate)
Al
2015-09-28 17:46:53 -04:00
12816d0e95[api] Setting global objects to NULL on teardown
Al
2015-09-28 17:27:57 -04:00
abfa744d59[build] Adding libpostal_data script for downloading data from S3, Makefile uses that now as part of the all-local target. Can be run periodically after install
Al
2015-09-28 17:26:11 -04:00
f29f2f091b[fix] PEBCAK
Al
2015-09-27 22:49:27 -04:00
93b3110a49[fix] only commas and hyphens need to be eliminated at the end of phrases in untagged address formatting
Al
2015-09-27 19:25:28 -04:00
d3bfaf6b43[osm/formatting] Fixing formatting tagged addresses with comma separated fields
Al
2015-09-27 03:19:23 -04:00
d512201e2c[fix] removing space from tokens in address formatting
Al
2015-09-27 02:18:34 -04:00
a3214b7914[readme] Readme fixes and additions
Al
2015-09-26 23:32:19 -04:00
5b829cd5a7[fix] blank values containing punctuation in formatting
Al
2015-09-26 21:49:28 -04:00
dac0440be8[fix] rsplit
Al
2015-09-26 21:07:54 -04:00
e255ae0e09[dictionaries] Luxembourgish dictionaries
Al
2015-09-26 18:31:07 -04:00
3fe56d029d[dictionaries] German Swiss dictionaries
Al
2015-09-26 18:30:55 -04:00
ae93552455[osm/formatting] Moving back to openvenues repo pending resolution of the Turkish address issue
Al
2015-09-26 03:56:50 -04:00
0c792a2cc3[osm/formatting] Changing the way the formatter eliminates inter-component separators, changing repo back to OpenCageData after pull request merge
Al
2015-09-26 03:21:26 -04:00
856198a352[tokenization] Regenerated scanner.c
Al
2015-09-26 02:27:45 -04:00
07f1f361e2[transliteration] Regenerating transliteration data with new categories
Al
2015-09-26 00:07:39 -04:00
172263af58[tokenization] Adding updated token classes to scanner.re
Al
2015-09-26 00:05:23 -04:00
5417b4e602[unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories
Al
2015-09-25 23:59:38 -04:00
8fe791a14a[fix] ensure_dir in file downloads
Al
2015-09-25 17:05:22 -04:00
646b9f7248[osm/formatting] Continuing to use openvenues formatter for the India fix
Al
2015-09-25 13:35:54 -04:00
5a6b47d0fd[api] Adding LIBPOSTAL_DEFAULT_OPTIONS to libpostal.h
Al
2015-09-25 01:53:29 -04:00
f5bb72c6f5[readme] missed a dictionary type
Al
2015-09-24 23:32:36 -04:00
f243b9cfa6[fix] phrasing
Al
2015-09-24 23:21:28 -04:00
dc31019604[readme] Heading
Al
2015-09-24 23:20:23 -04:00
cfef3059bb[readme] Moving paragraph
Al
2015-09-24 23:19:53 -04:00
f62cfb9551[readme] README changes
Al
2015-09-24 23:16:07 -04:00
3e256404b9[readme] More informative README
Al
2015-09-24 23:02:09 -04:00
9901dd2aac[fix] Switching address formatter back to OpenCageData repo
Al
2015-09-24 18:42:17 -04:00
accd8a57e7[expansion] Regenerating expansion data
Al
2015-09-24 16:38:16 -04:00
fa320defb7[dictionaries] Afrikaans dictionaries for better disambiguation in South Africa
Al
2015-09-24 16:37:16 -04:00
050a850fb9[dictionaries] Dutch directionals, separating out the west vs westen forms
Al
2015-09-24 16:36:52 -04:00
fe5d665533[dictionaries] Arc in English needn't always expand to Arcade
Al
2015-09-24 16:36:21 -04:00
bcac6a41be[dictionaries] Separating out Austrian toponym abbreviations
Al
2015-09-24 16:35:56 -04:00
3ce1669c30[fix] import
Al
2015-09-24 01:25:00 -04:00
c85ce0b11d[osm/formatting] Tagging separators as well in tagged output of the address formatter
Al
2015-09-24 01:22:49 -04:00
f6c30778bf[normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling.
Al
2015-09-23 19:40:51 -04:00
a1d272077d[doc] Averaged perceptron tagger
Al
2015-09-23 19:37:55 -04:00
4a0da67aa1[fix] warning
Al
2015-09-23 04:06:54 -04:00
88bd0cd158[unicode] better segmentation on script breaks
Al
2015-09-23 04:06:34 -04:00
377c947541[transliteration] Regenerating transliteration data files
Al
2015-09-23 04:04:38 -04:00
abfb1d4a60[transliteration] Wide char support in transliteration data generator
Al
2015-09-23 03:56:12 -04:00
7e057b0fb8[utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration)
Al
2015-09-23 00:42:48 -04:00
8562c7a5cb[unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren.
Al
2015-09-23 00:37:53 -04:00
19e5457a0f[unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness
Al
2015-09-23 00:36:24 -04:00
4ad3fac627[unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes)
Al
2015-09-23 00:34:58 -04:00
13bcc35523[unicode] Allowing wide chars in unicode properties
Al
2015-09-23 00:34:07 -04:00
f13e9fad90[tokenization] Regenerated scanner.c
Al
2015-09-23 00:33:27 -04:00
b4593b6f88[unicode/tokenization] Using new character classes including wide chars in scanner
Al
2015-09-23 00:33:14 -04:00
a76831df7a[unicode] Wide version of word breaks
Al
2015-09-22 18:55:33 -04:00
25917cfb17[fix] scripts
Al
2015-09-22 15:15:30 -04:00
b405a53fe1[fix] chars out of range in get_string_script Python version
Al
2015-09-22 08:14:27 -04:00
ca25b48687[fix] Not writing empty fields in formatted addresses
Al
2015-09-22 08:13:55 -04:00
747de1944b[fix] Accounting for unknown scripts in disambiguation
Al
2015-09-21 18:05:28 -04:00
3ac89d7ed9[setup] fixing packaging
Al
2015-09-21 17:31:15 -04:00
236737eab3[tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer
Al
2015-09-21 17:27:43 -04:00
134cf616d6[osm] Using street for language disambiguation in training data
Al
2015-09-21 04:09:15 -04:00
ccac4a5a90[fix] package directory
Al
2015-09-21 03:50:05 -04:00
5f912ddcd3[fix] std=c99
Al
2015-09-21 03:25:32 -04:00
5b2fd0be50[fix] pytokenize compilation on Ubuntu/gcc
Al
2015-09-21 03:24:14 -04:00
cffa5a4a20[fix] stdint include
Al
2015-09-20 20:10:47 -04:00
25b3338600[setup] setup.py for pypostal so it can be installed from the Github url
Al
2015-09-20 20:07:59 -04:00
84cf21df88[osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples
Al
2015-09-20 19:23:13 -04:00
5485ea2197[python] Adding initial pypostal bindings for tokenize so we can remove address_normalizer dependency. Not tested on Python 3.
Al
2015-09-20 14:59:33 -04:00
3fab0f984f[fix] fixing some compiler warnings, using type-specific abs functions for vector_math
Al
2015-09-19 16:10:47 -04:00
6731395ca0[osm] Separating tagged from untagged output
Al
2015-09-19 14:11:47 -04:00