Al
7a31802a04
[fix] also fix german-ascii transliteration on uppercase U with umlaut
2017-01-05 04:07:29 -05:00
Al
25723fcea2
[transliteration] making the custom rules in transliteration less repetitious and accessible from elsewhere, removing string names for common transliterators and using constants
2017-01-05 04:06:51 -05:00
Al
3fcaae3dbc
[openaddresses] add Canton of Solothurn, Switzerland
2017-01-05 02:23:20 -05:00
Al
4182123fa6
[openaddresses] adding Schaffhausen, also adding language=de for the last few cantons
2017-01-05 01:40:30 -05:00
Al
72e6bf043b
[openaddresses] add Basel-Stadt, Switzerland
2017-01-05 01:26:20 -05:00
Al
3d16c20d24
[openaddresses] add Boyd County, KY
2017-01-05 01:25:41 -05:00
Al
c5cca4c82f
[openaddresses] add Canton of Basel-Landschaft, Switzerland
2017-01-04 02:34:15 -05:00
Al
3e7042597e
[openaddresses] adding Jamaica countrywide to OpenAddresses config
2017-01-04 02:32:41 -05:00
Al
bcd61ffbe8
[formatting] moving postcode to the beginning of the address only in countries using the continental European conventions. Creates more ambiguity than is worthwhile in the US, etc. when, say, house_number is removed from a training example and the postcode is inserted first (could very easily be a house_number)
2017-01-03 03:39:16 -05:00
Al
38e147d210
[fix] address configs for Greek/Hebrew
2017-01-03 03:07:53 -05:00
Al
de2dffa315
[addresses] adding Calle to purely numeric Spanish street names in OSM as well
2017-01-02 23:41:01 -05:00
Al
ccd555d020
[transliteration] regenerated transliteration_scripts_data.c
2017-01-02 13:52:48 -05:00
Al
600b40d2f6
[transliteration] adding german-ascii transliteration to Estonian to handle umlauts (ä => ae, etc.)
2017-01-02 13:51:56 -05:00
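As a rough illustration of the umlaut handling this commit describes (ä => ae, etc.), here is a minimal Python sketch. It is not the project's transliterator. The mapping table, the function name `german_ascii`, the title-case choice for uppercase letters ("Ü" => "Ue"), and the "ß" entry are all assumptions made for the example.

```python
import unicodedata

# Hypothetical mapping for illustration; the real rule set may differ,
# especially in how it cases uppercase umlauts.
GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def german_ascii(text):
    """Replace German-style umlauts with their two-letter ASCII spellings."""
    # Compose first so that "u" + combining diaeresis is treated like "ü".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(GERMAN_ASCII)


print(german_ascii("Pärnu"))
```

Characters outside the table (for example the Estonian õ) pass through unchanged in this sketch.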
Al
b2b7f6f155
[osm] add wikipedia:* to rail station exception
2017-01-02 13:13:42 -05:00
Al
a99a1e759e
[openaddresses] adding Rio de Janeiro, Stockholm, and Liechtenstein. Adding higher CLDR country probability for smaller countries
2017-01-02 03:29:36 -05:00
Al
77035fbdbd
[strings] adding utf8_is_whitespace to the header so it can be referenced from multiple files
2017-01-02 02:23:21 -05:00
Al
400ea589ef
[normalize] add NORMALIZE_STRING_SIMPLE_LATIN_ASCII option to pynormalize
2017-01-02 02:08:54 -05:00
Al
182976214c
[logging] converting most of the steps in building the transliteration table to use debug logging
2017-01-02 00:41:11 -05:00
Al
d8d3840700
[transliteration] constant for the html-escape transliterator
2017-01-02 00:40:12 -05:00
Al
4ad3a52fe1
[strings] fix lowercasing in string_utils.c
2017-01-01 20:08:34 -05:00
Al
a78937f265
[normalize] use the new utf8proc lowercasing (as opposed to case folding), free copies since none of the string functions operate in-place any more, add minimal HTML escaping transliterator even to ASCII text
2017-01-01 20:06:32 -05:00
Al
5c56a44faa
[strings] reverting to utf8proc v1.3.1, as 2.0 and above can chop off certain sequences
2017-01-01 20:03:23 -05:00
Al
fe88630f78
[dictionaries] regenerating address_expansion_data.c from upstream changes
2017-01-01 14:26:54 -05:00
Al
101bbcc02d
Merge remote-tracking branch 'origin/master' into parser-data
2017-01-01 14:25:37 -05:00
Travis
d61e90a33d
[auto][ci skip] Adding data files from Travis build #188
2017-01-01 19:20:54 +00:00
Al Barrentine
6048d6a71e
Merge pull request #149 from iestynpryce/master
Enhanced the Welsh (cy) language dictionaries.
2017-01-01 14:11:16 -05:00
Al
0b5cc96654
[transliteration] add decompose option when stripping accents
2017-01-01 13:54:20 -05:00
Al
7d6c85aeec
[fix] new string tree iterator, don't decrement permutations on rollovers
2017-01-01 13:34:08 -05:00
Al
1780c5e053
[fix] moving enum
2016-12-31 13:01:57 -05:00
Iestyn Pryce
d8ee43156e
Enhanced the Welsh (cy) language dictionaries.
2016-12-31 09:46:58 +00:00
Al
475aa3dbfa
[strings] fixing and simplifying string tree iterator. This version is inspired by Python's itertools.product (itertoolsmodule.c has so many goodies)
2016-12-31 03:22:27 -05:00
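The itertools.product-style traversal mentioned in this commit can be sketched in Python as an odometer over per-position indices. This is an assumed reconstruction of the general idea, not the C code in the repository; the name `product_iter` and the list-of-lists input are hypothetical.

```python
def product_iter(alternatives):
    """Yield every combination that picks one item from each list, in order."""
    if any(len(options) == 0 for options in alternatives):
        return  # an empty position means there are no complete paths
    indices = [0] * len(alternatives)
    while True:
        yield tuple(options[i] for options, i in zip(alternatives, indices))
        # Advance like an odometer: bump the rightmost index and carry
        # to the left whenever a position rolls over to zero.
        pos = len(indices) - 1
        while pos >= 0:
            indices[pos] += 1
            if indices[pos] < len(alternatives[pos]):
                break
            indices[pos] = 0
            pos -= 1
        if pos < 0:
            return  # every position rolled over, so the iterator is done


for combo in product_iter([["st", "street"], ["n", "north"]]):
    print(" ".join(combo))
```

The "done" condition here is simply the carry running off the left edge, which keeps the remaining-path bookkeeping out of the rollover step.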
Al
261ec3888a
[strings] header changes for new utf8 lower/upper functions
2016-12-31 03:20:43 -05:00
Al
58b063b632
[strings] making string_tree_iterator_done more meaningful (returns true if the iterator has no paths left to traverse)
2016-12-31 00:54:36 -05:00
Al
8978000320
[strings] adding latest utf8proc, new functions for utf8_lower (instead of case folding) and utf8_upper, and a utf8_is_whitespace that takes things like tabs into account
2016-12-31 00:52:12 -05:00
Al
db16e656ca
[parser/cli] adding .print_features option in address_parser client for debugging
2016-12-31 00:20:35 -05:00
Al
bdb51a244e
[phrases] fix case in trie search when searching for tokens in a string tail. If we're on the last token in a sequence and the token matches the tail, check that the tail is complete, and if so return the match before exiting the loop. Affects multiword phrases that tend to appear toward the end of a sequence (long country names like "United States of America", etc.)
2016-12-29 16:17:09 -05:00
Al
2d077699e6
[places] adding is_in property to the set of tags for the places index. This may allow us to make more granular exceptions for node-based places that are actually suburbs but classified as {hamlet, village, locality, town}, etc. if the is_in contains a city that's also a boundary or nearby point
2016-12-29 14:04:13 -05:00
Al
cad57b94b2
[boundaries] mapping place=hamlet to suburb for all of Malaysia. place=village becomes suburb as well in the urban core
2016-12-29 14:01:57 -05:00
Al
21a2a7419a
[addresses] only add village as city component if no city can be found in the area
2016-12-29 13:41:05 -05:00
Al
8080e16791
[openaddresses] adding Joinville, Brazil and adding OSM boundaries for Brazilian address data sets
2016-12-29 13:27:49 -05:00
Al
0b6947840c
[dictionaries] removing Belarusian place_names.txt
2016-12-29 03:24:57 -05:00
Al
05732f6718
[build] Makefile changes for new parser feature extraction
2016-12-29 02:39:29 -05:00
Al
091167ed3c
[api] remove geodb from libpostal.c
2016-12-29 02:35:43 -05:00
Al
acd953ce51
[parser] first pass at new parser feature extraction
- removing geodb phrases
- use Latin-ASCII-simple transliteration (no umlauts, etc.)
- no digit normalization for admin component phrases and postcodes
- tag = START + word, special feature for first word in the sequence
- add the new admin boundary categories
- for hyphenated non-phrase words, add each sub-word
- for rare and unknown words, add ngram features of 3-6 characters with
  underscores to indicate beginnings and endings (similar to language
  classifier features)
- defines notion of "rare words" (known words with a frequency <= n where
  n > the unknown word threshold), so known words can share
  statistical strength with artificial and real unknown words
2016-12-29 02:17:35 -05:00
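The character n-gram features for rare and unknown words listed in this commit could look roughly like the Python sketch below. The exact underscore convention is not spelled out in the message, so this example assumes one: an underscore marks a side where the word continues, so an n-gram with no leading underscore starts the word and one with no trailing underscore ends it. The function name `char_ngrams` and its parameters are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of a word with underscore edge markers."""
    features = []
    for n in range(min_n, max_n + 1):
        if len(word) <= n:
            # The word is no longer than n, so there is nothing to split.
            break
        last = len(word) - n  # start index of the final n-gram
        for i in range(last + 1):
            gram = word[i:i + n]
            # Assumed convention: "_" on a side means the word continues there.
            prefix = "_" if i > 0 else ""
            suffix = "_" if i < last else ""
            features.append(prefix + gram + suffix)
    return features


print(char_ngrams("street", 3, 3))
```

Because rare known words and unknown words both fall back to these sub-word features, they end up sharing weights, which is the "statistical strength" the commit message refers to.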
Al
e62101b8bf
[parser] remove geodb from address_parser_test, sort confusion matrix
2016-12-29 02:14:40 -05:00
Al
174529e8d0
[parser] remove geodb and fix small memory leak in address_parser_train
2016-12-29 02:12:06 -05:00
Al
bde5fdfaad
[merge] merging in master
2016-12-29 02:00:31 -05:00
Al
646d96e13e
Merge remote-tracking branch 'origin/master' into parser-data
2016-12-29 01:58:38 -05:00
Al
a26a01ece3
[openaddresses] adding SEMCOG counties, MI
2016-12-28 19:37:44 -05:00
Al
22b4a215f4
[places] additional form for West Indies
2016-12-28 17:58:32 -05:00