libpostal

tommy/libpostal

cfa57c96a3 [fix] untagged formatted addresses Al 2015-10-04 02:02:59 -04:00
89d0fd5718 [fix] Alpha-numeric splitting Al 2015-10-03 16:40:06 -04:00
6428c0ae20 [utils] cstring_array_cat Al 2015-10-03 16:00:13 -04:00
5d2a24872a [osm] Adding dependencies so single street names are not valid without at least one of {house, number, suburb, city, postcode} Al 2015-10-03 15:22:21 -04:00
77be2fe433 [osm] Adjusting priors for country code expansion Al 2015-10-03 15:13:16 -04:00
0b98a26426 [fix] keeping name tag in address components Al 2015-10-03 15:10:14 -04:00
0f9ad259dc [osm] Doing initial formatting after replacing country/state Al 2015-10-03 14:40:38 -04:00
71233c9c02 [fix] import, initialization Al 2015-10-03 14:37:08 -04:00
85b17d9b27 [fix] file encoding Al 2015-10-03 14:34:29 -04:00
1948aa87ea [fix] typo Al 2015-10-03 14:33:45 -04:00
22efce7337 [osm/parsing] Randomly replacing country codes with local and foreign language expansions as well as randomly expanding state abbreviations to make parser more robust to different input Al 2015-10-03 14:31:51 -04:00
8920812055 [expansion] Adding state abbreviations for US, Canada and Australia for expansion while generating OSM training data Al 2015-10-03 14:25:30 -04:00
7eb18f3538 [languages] Function to sample a random language from a discrete distribution (e.g. languages on the Internet, languages in a country, etc.) Al 2015-10-03 13:20:19 -04:00
0aa6950b6c [fix] abbreviations Al 2015-10-02 23:48:21 -04:00
db71b65412 [fix] checking validity of component combination Al 2015-10-02 20:28:45 -04:00
a2fd6e25f8 [fix] import Al 2015-10-02 20:25:48 -04:00
49abb70b59 [fix] dictionary Al 2015-10-02 20:24:21 -04:00
521f33d892 [fix] bitset for address components, only looking at valid component keys Al 2015-10-02 20:21:52 -04:00
528285f735 [fix] only OSM tagged addresses need extra logic Al 2015-10-02 20:18:30 -04:00
83aecb9f2c [osm/parsing] Making tagged training data for address parser more robust to the types of partial input we see in geocoding by randomly eliminating components subject to some constraints (e.g. house number cannot be used without a street name) Al 2015-10-02 19:52:13 -04:00
c790a2b87f [fix] spoken/official Al 2015-10-02 19:50:11 -04:00
db3364be30 [geonames] Using official country languages in GeoNames Al 2015-10-01 00:45:34 -04:00
01856dd36d [fix] acronyms Al 2015-10-01 00:24:04 -04:00
562aeb497d [tokenization] Regenerating scanner.c Al 2015-09-30 11:32:38 -04:00
689b830ad2 [tokenization] Acronym vs abbreviation Al 2015-09-30 04:10:04 -04:00
7dfbcce9ec [languages] options for get_country_languages Al 2015-09-30 04:09:07 -04:00
86e9166ae8 [doc] documentation for country_names module, fixing variable name Al 2015-09-30 03:08:04 -04:00
42e77cb570 [countries] Making country official names align better with OSM/Wikipedia, plugging holes Al 2015-09-30 01:02:59 -04:00
0cedc68a97 [languages] Changing Arabic to default in North African countries with two official languages. Making Danish secondary in the US Virgin Islands Al 2015-09-30 01:01:42 -04:00
40cf247655 [formatting] Constants for field names, a few options in format_address Al 2015-09-29 23:03:37 -04:00
22e8178a97 [countries] Adding module for getting official country names in every language from CLDR + a dictionary of local language names Al 2015-09-29 21:08:52 -04:00
c3c6a18df8 [geodb] Renaming geodb Al 2015-09-29 13:07:50 -04:00
8ca22247f9 [fix] labels in averaged perceptron trainer Al 2015-09-29 13:07:07 -04:00
6666f0baf8 [fix] Labels in averaged perceptron tagger Al 2015-09-29 13:06:34 -04:00
05da2ee6bd [dictionaries] Adding commonly used colon form No: for Turkish addresses Al 2015-09-28 17:48:19 -04:00
daad1a1313 [geonames] Removing alternate names from geonames data set which are digits-only (most are not legitimate) Al 2015-09-28 17:46:53 -04:00
12816d0e95 [api] Setting global objects to NULL on teardown Al 2015-09-28 17:27:57 -04:00
abfa744d59 [build] Adding libpostal_data script for downloading data from S3, Makefile uses that now as part of the all-local target. Can be run periodically after install Al 2015-09-28 17:26:11 -04:00
f29f2f091b [fix] PEBCAK Al 2015-09-27 22:49:27 -04:00
93b3110a49 [fix] only commas and hyphens need to be eliminated at the end of phrases in untagged address formatting Al 2015-09-27 19:25:28 -04:00
d3bfaf6b43 [osm/formatting] Fixing formatting tagged addresses with comma separated fields Al 2015-09-27 03:19:23 -04:00
d512201e2c [fix] removing space from tokens in address formatting Al 2015-09-27 02:18:34 -04:00
a3214b7914 [readme] Readme fixes and additions Al 2015-09-26 23:32:19 -04:00
5b829cd5a7 [fix] blank values containing punctuation in formatting Al 2015-09-26 21:49:28 -04:00
dac0440be8 [fix] rsplit Al 2015-09-26 21:07:54 -04:00
e255ae0e09 [dictionaries] Luxembourgish dictionaries Al 2015-09-26 18:31:07 -04:00
3fe56d029d [dictionaries] German Swiss dictionaries Al 2015-09-26 18:30:55 -04:00
ae93552455 [osm/formatting] Moving back to openvenues repo pending resolution of the Turkish address issue Al 2015-09-26 03:56:50 -04:00
0c792a2cc3 [osm/formatting] Changing the way the formatter eliminates inter-component separators, changing repo back to OpenCageData after pull request merge Al 2015-09-26 03:21:26 -04:00
856198a352 [tokenization] Regenerated scanner.c Al 2015-09-26 02:27:45 -04:00
07f1f361e2 [transliteration] Regenerating transliteration data with new categories Al 2015-09-26 00:07:39 -04:00
172263af58 [tokenization] Adding updated token classes to scanner.re Al 2015-09-26 00:05:23 -04:00
5417b4e602 [unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories Al 2015-09-25 23:59:38 -04:00
8fe791a14a [fix] ensure_dir in file downloads Al 2015-09-25 17:05:22 -04:00
646b9f7248 [osm/formatting] Continuing to use openvenues formatter for the India fix Al 2015-09-25 13:35:54 -04:00
5a6b47d0fd [api] Adding LIBPOSTAL_DEFAULT_OPTIONS to libpostal.h Al 2015-09-25 01:53:29 -04:00
f5bb72c6f5 [readme] missed a dictionary type Al 2015-09-24 23:32:36 -04:00
f243b9cfa6 [fix] phrasing Al 2015-09-24 23:21:28 -04:00
dc31019604 [readme] Heading Al 2015-09-24 23:20:23 -04:00
cfef3059bb [readme] Moving paragraph Al 2015-09-24 23:19:53 -04:00
f62cfb9551 [readme] README changes Al 2015-09-24 23:16:07 -04:00
3e256404b9 [readme] More informative README Al 2015-09-24 23:02:09 -04:00
9901dd2aac [fix] Switching address formatter back to OpenCageData repo Al 2015-09-24 18:42:17 -04:00
accd8a57e7 [expansion] Regenerating expansion data Al 2015-09-24 16:38:16 -04:00
fa320defb7 [dictionaries] Afrikaans dictionaries for better disambiguation in South Africa Al 2015-09-24 16:37:16 -04:00
050a850fb9 [dictionaries] Dutch directionals, separating out the west vs westen forms Al 2015-09-24 16:36:52 -04:00
fe5d665533 [dictionaries] Arc in English needn't always expand to Arcade Al 2015-09-24 16:36:21 -04:00
bcac6a41be [dictionaries] Separating out Austrian toponym abbreviations Al 2015-09-24 16:35:56 -04:00
3ce1669c30 [fix] import Al 2015-09-24 01:25:00 -04:00
c85ce0b11d [osm/formatting] Tagging separators as well in tagged output of the address formatter Al 2015-09-24 01:22:49 -04:00
f6c30778bf [normalize] New token normalization option for replacing digits with 'D' for masking numbers e.g. when learning patterns (so 1234 and 5678 both normalize to DDDD). Shouldn't be used by libpostal API, just by the feature extractors in the machine learning models. Also adding better possessive handling. Al 2015-09-23 19:40:51 -04:00
a1d272077d [doc] Averaged perceptron tagger Al 2015-09-23 19:37:55 -04:00
4a0da67aa1 [fix] warning Al 2015-09-23 04:06:54 -04:00
88bd0cd158 [unicode] better segmentation on script breaks Al 2015-09-23 04:06:34 -04:00
377c947541 [transliteration] Regenerating transliteration data files Al 2015-09-23 04:04:38 -04:00
abfb1d4a60 [transliteration] Wide char support in transliteration data generator Al 2015-09-23 03:56:12 -04:00
7e057b0fb8 [utils] basic functions for wide char support for narrow Python builds (unichr, ord, unicode iteration) Al 2015-09-23 00:42:48 -04:00
8562c7a5cb [unicode] Adding wide char support for language disambiguation (comes up in venue names), despite the likelihood of running on a narrow Python build. Rolling back common script chars at a script break, so in the case of e.g. Cyrillic name (Latin name), the segmentation is done at the space before the paren. Al 2015-09-23 00:37:53 -04:00
19e5457a0f [unicode] Regenerated unicode scripts data file, using simple integers instead of repeating the enum types for succinctness Al 2015-09-23 00:36:24 -04:00
4ad3fac627 [unicode] Regenerated unicode script types (ignore extraneous scripts, they're not used, just reside in the upper unicode planes) Al 2015-09-23 00:34:58 -04:00
13bcc35523 [unicode] Allowing wide chars in unicode properties Al 2015-09-23 00:34:07 -04:00
f13e9fad90 [tokenization] Regenerated scanner.c Al 2015-09-23 00:33:27 -04:00
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner Al 2015-09-23 00:33:14 -04:00
a76831df7a [unicode] Wide version of word breaks Al 2015-09-22 18:55:33 -04:00
25917cfb17 [fix] scripts Al 2015-09-22 15:15:30 -04:00
b405a53fe1 [fix] chars out of range in get_string_script Python version Al 2015-09-22 08:14:27 -04:00
ca25b48687 [fix] Not writing empty fields in formatted addresses Al 2015-09-22 08:13:55 -04:00
747de1944b [fix] Accounting for unknown scripts in disambiguation Al 2015-09-21 18:05:28 -04:00
3ac89d7ed9 [setup] fixing packaging Al 2015-09-21 17:31:15 -04:00
236737eab3 [tokenization/osm] Using utf8 encoded version of string for tokens in python tokenizer Al 2015-09-21 17:27:43 -04:00
134cf616d6 [osm] Using street for language disambiguation in training data Al 2015-09-21 04:09:15 -04:00
ccac4a5a90 [fix] package directory Al 2015-09-21 03:50:05 -04:00
5f912ddcd3 [fix] std=c99 Al 2015-09-21 03:25:32 -04:00
5b2fd0be50 [fix] pytokenize compilation on Ubuntu/gcc Al 2015-09-21 03:24:14 -04:00
cffa5a4a20 [fix] stdint include Al 2015-09-20 20:10:47 -04:00
25b3338600 [setup] setup.py for pypostal so it can be installed from the Github url Al 2015-09-20 20:07:59 -04:00
84cf21df88 [osm] Separating address formatter into its own module, adding some documentation of the various training sets with examples Al 2015-09-20 19:23:13 -04:00
5485ea2197 [python] Adding initial pypostal bindings for tokenize so we can remove address_normalizer dependency. Not tested on Python 3. Al 2015-09-20 14:59:33 -04:00
3fab0f984f [fix] fixing some compiler warnings, using type-specific abs functions for vector_math Al 2015-09-19 16:10:47 -04:00
6731395ca0 [osm] Separating tagged from untagged output Al 2015-09-19 14:11:47 -04:00
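
Note on commit f6c30778bf: the digit-masking normalization it describes (1234 and 5678 both normalize to DDDD) can be illustrated with a minimal Python sketch. This is not libpostal's implementation. The function name mask_digits and its mask parameter are hypothetical, chosen only for this illustration.

```python
import re

# Hypothetical helper for illustration only; not part of libpostal's API.
_DIGIT = re.compile(r"\d")


def mask_digits(token, mask="D"):
    """Replace every digit in a token with a mask character.

    Tokens that differ only in their digits collapse to the same
    pattern, which is the effect the commit message describes for
    feature extraction.
    """
    return _DIGIT.sub(mask, token)


if __name__ == "__main__":
    for tok in ("1234", "5678", "221B"):
        print(tok, mask_digits(tok))
```

As the commit message says, this kind of masking is meant for the feature extractors in the machine learning models rather than for the public API, since it discards the actual numbers.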
