106 Commits

Author SHA1 Message Date
Karthik Janarthanan
5c361eef7d Remove unused regex that can cause exponential backtracking when used 2025-02-25 15:17:06 -06:00
Al
579425049b [fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators 2017-03-17 18:28:15 -04:00
Al
a0b508caf6 [transliteration] adding no-args option for transliteration_rules script 2017-02-15 13:22:33 -05:00
Al
293587bae9 [addresses] adding new config for postal codes around the world. Allows appending the ISO alpha-2 country code to the beginning of the postcode as in e.g. SI-1000 (only used if the postcode begins with a digit). This system was used for postal codes in continental Europe as a recommendation from the CEPT. Now 7 member states still use it, so in those countries add the country-code with higher probability. The config also contains the license plate codes for countries where e.g. L-1234 might be used instead of LU-1234. Allows configuring in which countries postcodes should be validated using Google's per-country validation regexes (and the ability to override with a custom regex), and in which countries other admin component names should be stripped. 2017-02-10 23:53:50 -05:00
Al
321f2034d2 [fix] unidata file 2017-01-05 04:24:33 -05:00
Al
25723fcea2 [transliteration] making the custom rules in transliteration less repetitious and accessible from elsewhere, removing string names for common transliterators and using constants 2017-01-05 04:06:51 -05:00
Al
600b40d2f6 [transliteration] adding german-ascii transliteration to Estonian to handle umlauts (ä => ae, etc.) 2017-01-02 13:51:56 -05:00
Al
77efcb3f89 [fix] only accept language suffixes that are valid scripts or transliterations of CJK languages. Set language to language suffix so Romaji forms get used, etc. 2016-12-24 17:17:09 -05:00
Al
6081df0cd1 [osm] adding admin1 ids to the OSM country rtree 2016-10-04 23:12:15 -04:00
Al
cb4408fea8 [transliteration] Adding language-specific transliterators for handling umlauts in German + special transliterations in the Nordic languages. It may still result in some wrong transliterations if the language classifier is wrong, but generally it's accurate enough that its predictions can be relied upon. Also adding a Latin-ASCII-Simple transform which only does the punctuation portion of Latin-ASCII so it won't change anything substantial about the input string. 2016-08-20 18:17:46 -04:00
Al
93586c2592 [fix] aliasing all_languages 2016-08-18 02:24:59 -04:00
Al
1ef57ee7d2 [i18n/postcodes] Fetching postcode regexes from the data source used by Google's libaddressinput and caching requests for the lifetime of the running program (e.g. while generating parser data, so the regexes will get updated over time). 2016-07-26 17:42:50 -04:00
Al
cdf8829942 [fix] no longer requiring argv for unicode_properties script 2016-07-21 17:04:57 -04:00
Al
6703da8fc3 [fix] languages and disambiguation do initialization by default 2016-07-21 17:04:57 -04:00
Al
c506649252 [fix] languages_intialized 2016-07-21 17:04:57 -04:00
Al
5e2d9f371e [numex] Moving numex script to a different subpackage, adding function for creating ordinals 2016-07-21 17:04:57 -04:00
Al
1bc92d6995 [fix] output path in numex.py 2016-03-29 11:25:36 -04:00
Al
2a2d1738a3 [fix] path for running numex.py 2016-03-29 11:15:24 -04:00
Al
da62ff309e [transliteration] Fixing Malayalam script 2016-01-17 22:15:56 -05:00
Al
8030b235e6 [languages] Changing the definition in script languages so only languages that appear on street signs will be used 2016-01-17 22:03:41 -05:00
Al
e9e05bb929 [transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules 2015-12-23 13:07:44 -05:00
Al
e55ff54be1 [fix] Adding Korean-Latin-BGN to excluded transliterators 2015-12-21 16:24:50 -05:00
Al
682c316775 [transliteration] Removing Korean-Latin-BGN, not a great transliterator and AFAICT, ICU doesn't use it either 2015-12-21 12:45:45 -05:00
Al
ccf509edb1 [fix] update to control characters for generating the transliteration rules 2015-12-20 15:40:38 -05:00
Al
b2a944830a [transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format 2015-12-19 00:34:30 -05:00
Al
7f5cf89e84 [transliteration] Not escaping right side transliteration rules 2015-10-27 12:24:38 -04:00
Al
7dfbcce9ec [languages] options for get_country_languages 2015-09-30 04:09:07 -04:00
Al
5417b4e602 [unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories 2015-09-25 23:59:38 -04:00
Al
abfb1d4a60 [transliteration] Wide char support in transliteration data generator 2015-09-23 03:56:12 -04:00
Al
13bcc35523 [unicode] Allowing wide chars in unicode properties 2015-09-23 00:34:07 -04:00
Al
b4593b6f88 [unicode/tokenization] Using new character classes including wide chars in scanner 2015-09-23 00:33:14 -04:00
Al
a76831df7a [unicode] Wide version of word breaks 2015-09-22 18:55:33 -04:00
Al
a916668f28 [i18n] Local file for ISO 15924 2015-09-01 23:58:36 -04:00
Al
b8e4c19146 [mv] Moving the get regional/country languages logic out of language polygons 2015-08-23 14:25:33 -04:00
Al
122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib 2015-08-23 02:26:06 -04:00
Al
0701bb6f08 [fix] import 2015-08-22 23:19:43 -04:00
Al
d97c725bbc [languages] Allowing specification of multiple regional languages 2015-08-18 03:18:52 -04:00
Al
03febc7e20 [scripts] Better script code aliasing 2015-08-13 18:25:55 -04:00
Al
b54ff95ecc [mv] csv_utils 2015-08-13 18:19:54 -04:00
Al
cf70615850 [transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps 2015-08-11 23:10:55 -04:00
Al
51addec5f2 [fix] check for local CLDR in unicode properties 2015-08-11 20:23:48 -04:00
Al
882e4c2ab8 [fix] ensure CLDR dir 2015-08-11 20:04:42 -04:00
Al
48566bf097 [fix] cldr languages dir 2015-08-11 20:04:25 -04:00
Al
dd391eabe5 [numex] Separating rules from keys for Linux gcc compilation 2015-08-09 01:00:57 -04:00
Al
1d39916aaa [fix] Fixing warnings in unicode script data 2015-08-02 21:30:54 -06:00
Al
87566bb6a5 [numex] Adding validation checks for numex JSON 2015-07-24 15:22:07 -04:00
Al
64a63fdf51 [mv] Moving all repo data files to a resources dir, data is only for runtime files 2015-07-21 18:11:36 -04:00
Al
076c07e21f [fix] Add minor languages to the language set 2015-07-16 00:58:58 -04:00
Al
95a6845a85 [i18n] Adding regional languages as valid country languages 2015-07-08 14:54:00 -04:00
Al
a580ed0b1b [transliteration] Adding numeric HTML escapes e.g. '&#38;' 2015-06-29 15:02:34 -04:00