libpostal

Author	SHA1	Message	Date
Karthik Janarthanan	5c361eef7d	Remove unused regex that can cause exponential backtracking when used	2025-02-25 15:17:06 -06:00
Al	579425049b	[fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators	2017-03-17 18:28:15 -04:00
Al	a0b508caf6	[transliteration] adding no-args option for transliteration_rules script	2017-02-15 13:22:33 -05:00
Al	293587bae9	[addresses] adding new config for postal codes around the world. Allows appending the ISO alpha-2 country code to the beginning of the postcode as in e.g. SI-1000 (only used if the postcode begins with a digit). This system was used for postal codes in continental Europe as a recommendation from the CEPT. Now 7 member states still use it, so in those countries add the country-code with higher probability. The config also contains the license plate codes for countries where e.g. L-1234 might be used instead of LU-1234. Allows configuring in which countries postcodes should be validated using Google's per-country validation regexes (and the ability to override with a custom regex), and in which countries other admin component names should be stripped.	2017-02-10 23:53:50 -05:00
Al	321f2034d2	[fix] unidata file	2017-01-05 04:24:33 -05:00
Al	25723fcea2	[transliteration] making the custom rules in transliteration less repetitious and accessible from elsewhere, removing string names for common transliterators and using constants	2017-01-05 04:06:51 -05:00
Al	600b40d2f6	[transliteration] adding german-ascii transliteration to Estonian to handle umlauts (ä => ae, etc.)	2017-01-02 13:51:56 -05:00
Al	77efcb3f89	[fix] only accept language suffixes that are valid scripts or transliterations of CJK languages. Set language to language suffix so Romaji forms get used, etc.	2016-12-24 17:17:09 -05:00
Al	6081df0cd1	[osm] adding admin1 ids to the OSM country rtree	2016-10-04 23:12:15 -04:00
Al	cb4408fea8	[transliteration] Adding language-specific transliterators for handling umlauts in German + special transliterations in the Nordic languages. It may still result in some wrong transliterations if the language classifier is wrong, but generally it's accurate enough that its predictions can be relied upon. Also adding a Latin-ASCII-Simple transform which only does the punctuation portion of Latin-ASCII so it won't change anything substantial about the input string.	2016-08-20 18:17:46 -04:00
Al	93586c2592	[fix] aliasing all_languages	2016-08-18 02:24:59 -04:00
Al	1ef57ee7d2	[i18n/postcodes] Fetching postcode regexes from the data source used by Google's libaddressinput, caches requests for the length of the running program (e.g. generating parser data, so the regexes will get updated over time).	2016-07-26 17:42:50 -04:00
Al	cdf8829942	[fix] no longer requiring argv for unicode_properties script	2016-07-21 17:04:57 -04:00
Al	6703da8fc3	[fix] languages and disambiguation do initialization by default	2016-07-21 17:04:57 -04:00
Al	c506649252	[fix] languages_intialized	2016-07-21 17:04:57 -04:00
Al	5e2d9f371e	[numex] Moving numex script to a different subpackage, adding function for creating ordinals	2016-07-21 17:04:57 -04:00
Al	1bc92d6995	[fix] output path in numex.py	2016-03-29 11:25:36 -04:00
Al	2a2d1738a3	[fix] path for running numex.py	2016-03-29 11:15:24 -04:00
Al	da62ff309e	[transliteration] Fixing Malayalam script	2016-01-17 22:15:56 -05:00
Al	8030b235e6	[languages] Changing the definition in script languages so only languages that appear on street signs will be used	2016-01-17 22:03:41 -05:00
Al	e9e05bb929	[transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules	2015-12-23 13:07:44 -05:00
Al	e55ff54be1	[fix] Adding Korean-Latin-BGN to excluded transliterators	2015-12-21 16:24:50 -05:00
Al	682c316775	[transliteration] Removing Korean-Latin-BGN, not a great transliterator and AFAICT, ICU doesn't use it either	2015-12-21 12:45:45 -05:00
Al	ccf509edb1	[fix] update to control characters for generating the transliteration rules	2015-12-20 15:40:38 -05:00
Al	b2a944830a	[transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format	2015-12-19 00:34:30 -05:00
Al	7f5cf89e84	[transliteration] Not escaping right side transliteration rules	2015-10-27 12:24:38 -04:00
Al	7dfbcce9ec	[languages] options for get_country_languages	2015-09-30 04:09:07 -04:00
Al	5417b4e602	[unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories	2015-09-25 23:59:38 -04:00
Al	abfb1d4a60	[transliteration] Wide char support in transliteration data generator	2015-09-23 03:56:12 -04:00
Al	13bcc35523	[unicode] Allowing wide chars in unicode properties	2015-09-23 00:34:07 -04:00
Al	b4593b6f88	[unicode/tokenization] Using new character classes including wide chars in scanner	2015-09-23 00:33:14 -04:00
Al	a76831df7a	[unicode] Wide version of word breaks	2015-09-22 18:55:33 -04:00
Al	a916668f28	[i18n] Local file for ISO 15924	2015-09-01 23:58:36 -04:00
Al	b8e4c19146	[mv] Moving the get regional/country languages logic out of language polygons	2015-08-23 14:25:33 -04:00
Al	122a81b610	[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib	2015-08-23 02:26:06 -04:00
Al	0701bb6f08	[fix] import	2015-08-22 23:19:43 -04:00
Al	d97c725bbc	[languages] Allowing specification of multiple regional languages	2015-08-18 03:18:52 -04:00
Al	03febc7e20	[scripts] Better script code aliasing	2015-08-13 18:25:55 -04:00
Al	b54ff95ecc	[mv] csv_utils	2015-08-13 18:19:54 -04:00
Al	cf70615850	[transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps	2015-08-11 23:10:55 -04:00
Al	51addec5f2	[fix] check for local CLDR in unicode properties	2015-08-11 20:23:48 -04:00
Al	882e4c2ab8	[fix] ensure CLDR dir	2015-08-11 20:04:42 -04:00
Al	48566bf097	[fix] cldr languages dir	2015-08-11 20:04:25 -04:00
Al	dd391eabe5	[numex] Separating rules from keys for Linux gcc compilation	2015-08-09 01:00:57 -04:00
Al	1d39916aaa	[fix] Fixing warnings in unicode script data	2015-08-02 21:30:54 -06:00
Al	87566bb6a5	[numex] Adding validation checks for numex JSON	2015-07-24 15:22:07 -04:00
Al	64a63fdf51	[mv] Moving all repo data files to a resources dir, data is only for runtime files	2015-07-21 18:11:36 -04:00
Al	076c07e21f	[fix] Add minor languages to the language set	2015-07-16 00:58:58 -04:00
Al	95a6845a85	[i18n] Adding regional languages as valid country languages	2015-07-08 14:54:00 -04:00
Al	a580ed0b1b	[transliteration] Adding numeric HTML escapes e.g. '&'	2015-06-29 15:02:34 -04:00

106 Commits