libpostal

Author	SHA1	Message	Date
Al	308ceb5a5f	[fix] convert UTF8 slices back to unicode before using with the Python trie	2016-01-23 20:20:23 -05:00
Al	5eb6bb309b	[fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string	2016-01-23 20:09:45 -05:00
Al	d61207e95a	[fix] var name	2016-01-23 18:01:02 -05:00
Al	e44cba1d06	[fix] geonames db not required in OSM training data	2016-01-23 17:59:55 -05:00
Al	4f03711e60	[osm] Adding abbreviated training examples to ways language training data	2016-01-23 14:10:47 -05:00
Al	c9fb4ee69d	[osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used	2016-01-22 17:58:24 -05:00
Al	ea9bb3f2d5	[fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled	2016-01-22 15:48:21 -05:00
Al	f9f6558e06	[fix] simple whitespace field splits for the limited format training data (used for language classification)	2016-01-22 04:34:42 -05:00
Al	cd1db7b288	[fix] Making sure rare components are dropped first, adding state and country back in	2016-01-22 04:17:19 -05:00
Al	adc3a00264	[fix] var name	2016-01-22 04:10:16 -05:00
Al	261beffa36	[fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities	2016-01-22 04:00:45 -05:00
Al	a6cc3d0114	[fix] Adding state to the more frequently dropped components	2016-01-22 03:56:38 -05:00
Al	bca3dae004	[fix] state full name probabilities for limited vs. full formatted OSM training sets	2016-01-22 03:54:20 -05:00
Al	d1cf253092	[osm/formatting] Higher probability of dropout for rare components like counties, etc.	2016-01-22 03:39:35 -05:00
Al	9dd965a6fa	[fix] removing gazetteer configuration from disambiguation module	2016-01-22 03:18:18 -05:00
Al	b22646ee30	[mv] Moving gazetteers into their own module	2016-01-22 03:15:56 -05:00
Al	5a68e7aeef	[fix] import	2016-01-22 03:00:43 -05:00
Al	6ac72576bc	[osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK	2016-01-22 02:56:39 -05:00
Al	f4995d4f0f	[languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM	2016-01-22 00:51:32 -05:00
Al	26cbb1eb8d	[languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes	2016-01-21 04:29:14 -05:00
Al	0269d92e3d	[languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms	2016-01-21 02:30:59 -05:00
Al	2e15db06dd	[text] making normalize_string directly callable from Python geodata	2016-01-21 02:07:46 -05:00
Al	71e01e6133	[fix] prefix/suffix phrase search in Python trie search	2016-01-19 03:43:54 -05:00
Al	39667b73a2	[build] std=gnu99 in geodata build	2016-01-19 03:23:56 -05:00
Al	8b94a018e6	[languages] encoding in language disambiguation	2016-01-19 03:22:03 -05:00
Al	3262d2ccd3	[fix] arg count	2016-01-19 03:16:14 -05:00
Al	fe8f3158f6	[fix] missing file in geodata	2016-01-17 22:23:44 -05:00
Al	5fd9dc7e2b	[scripts] relative dirs in setup.py for geodata	2016-01-17 22:22:50 -05:00
Al	da62ff309e	[transliteration] Fixing Malayalam script	2016-01-17 22:15:56 -05:00
Al	8030b235e6	[languages] Changing the definition in script languages so only languages that appear on street signs will be used	2016-01-17 22:03:41 -05:00
Al	3d7dd8966e	[languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer	2016-01-17 18:28:28 -05:00
Al	fa32eacdd1	[phrases] Adding Python phrase filter from address_normalizer until a Python wrapper around libpostal's trie_search is available	2016-01-17 15:45:02 -05:00
Al	f79a3c5bf4	[osm/polygons] Allowing polygons that GEOS claims are invalid in OSM polygon index (there were some glaring omissions from the index like the polygons for the UK or Berlin). For some reason .buffer(0) creates weird multipolygons that no longer contain their centroids, etc. and aren't useful in reverse geocoding	2016-01-17 15:43:21 -05:00
Al	04f251c1cc	[polygons] Don't call fix_polygon (force polygon validity) by default	2016-01-16 21:21:27 -05:00
Al	19a5541a85	[polygons/osm] append polygon nodes by vertices that connect to each other	2016-01-16 21:20:49 -05:00
Al	58e53cab1c	[scripts] Adding the tokenize/normalize wrappers directly into the internal geodata package so pypostal can be maintained in an independent repo	2016-01-12 13:29:31 -05:00
Al	e9e05bb929	[transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules	2015-12-23 13:07:44 -05:00
Al	e55ff54be1	[fix] Adding Korean-Latin-BGN to excluded transliterators	2015-12-21 16:24:50 -05:00
Al	682c316775	[transliteration] Removing Korean-Latin-BGN, not a great transliterator and AFAICT, ICU doesn't use it either	2015-12-21 12:45:45 -05:00
Al	ccf509edb1	[fix] update to control characters for generating the transliteration rules	2015-12-20 15:40:38 -05:00
Al	b2a944830a	[transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format	2015-12-19 00:34:30 -05:00
Al	1d288954d7	[osm] Fixing an issue in the training data with house numbers in OSM (seen mostly in Uruguay) where a comma-separated list of house numbers is entered	2015-12-10 18:46:28 -05:00
Al	779298360c	[osm] In cases with more than one official language and where the address language can be determined, use it for looking up language-specific OSM polygons	2015-12-09 01:00:59 -05:00
Al	aeb72d7d26	[osm] Randomly select up to n components for state_district OSM boundaries. For all other fields select one name at random	2015-12-09 00:20:20 -05:00
Al	69a469d9d3	[osm] Choosing a language at random in countries with multilingual addresses for the parser training data so we get some monolingual examples	2015-12-08 20:38:32 -05:00
Al	35db855819	[fix] canonical index in address expansion data, should be -1 for all canonical phrases	2015-12-08 15:09:51 -05:00
Al	f8a3081d0f	[fix] city name in OSM formatting	2015-12-07 02:33:12 -05:00
Al	b25a738000	[osm] Doing more deduping in the OSM training data to avoid confusing the parser when city, state, district all have the same name	2015-12-06 16:14:02 -05:00
Al	dd8f8b4d7b	[fix] prefix/suffix regexes	2015-12-05 18:41:22 -05:00
Al	5fcb6d2c30	[fix] typo	2015-12-05 16:23:58 -05:00

492 Commits