Commit Graph

83 Commits

Author SHA1 Message Date
Al
b55db5fcda [fix] usage text 2015-05-12 12:15:51 -04:00
Al
d5f9d8a29a [mv] unicode_scripts => unicode_properties 2015-05-12 12:14:59 -04:00
Al
ff0e7cb9e1 [i18n] downloading several files from the Unicode Character Database 2015-05-12 12:12:17 -04:00
Al
3814af52ec [transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie 2015-05-12 12:10:15 -04:00
Al
fe044cebef [transliteration] char set mapping for some of the more complicated sets found in CLDR 2015-05-10 18:34:53 -04:00
Al
2a69488f9b [fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file. 2015-05-08 17:14:26 -04:00
Al
10ebaf147a [transliteration] literal ^ and $ escaped 2015-05-01 19:16:36 -04:00
Al
ff851a464c [fix] escaping curly braces for regex compilation 2015-04-30 13:27:17 -04:00
Al
fa43abd8d9 [transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key 2015-04-29 14:31:15 -04:00
Al
1c25238af7 [fix] string lengths on the various transliteration rules 2015-04-27 13:51:35 -04:00
Al
6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters 2015-04-26 19:47:54 -04:00
Al
be29874f13 [transliteration] Parser for CLDR transforms to generate (simple) C transform rules 2015-04-25 15:42:21 -04:00
Al
24e62b1c6c [tokenization] Script to generate TR-29 ranges for re2c scanner 2015-04-14 15:50:50 -04:00
Al
5fa03587fb [cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing 2015-04-14 15:49:24 -04:00
Al
6e9295154a [fix] local dirs for cldr data 2015-04-14 15:46:15 -04:00
Al
744231c148 [fix] cldr supplemental uses local copy 2015-04-13 19:03:44 -04:00
Al
a8b9981c9b [fix] vars 2015-04-13 19:03:14 -04:00
Al
d1267145f7 [fix] args to wget 2015-04-13 19:02:50 -04:00
Al
d771da7c78 [i18n] unicode scripts file downloaded and cached locally 2015-04-13 19:02:29 -04:00
Al
cc4d2d08eb [cldr] Adding script to download latest cldr release instead of pulling from the repo 2015-04-13 01:03:15 -04:00
Al
acb575c84c [fix] splitting out methods for unicode scripts 2015-04-12 15:21:23 -04:00
Al
d50d7d182e [fix] geonames import script for admin 1 codes 2015-04-12 12:16:08 -04:00
Al
fdd0c489f3 [fix] refactoring unicode script fetching into more reusable functions 2015-04-09 02:18:13 -04:00
Al
e03c1f21a7 [unicode] generate C headers/data files from unicode.org scripts 2015-03-18 16:59:58 -04:00
Al
6c8e5b45a4 [fix] removing building alias (for OSM it means building category), fix to fetch script 2015-03-18 08:40:07 -04:00
Al
88554c1ef7 [i18n] adding CLDR languages script to this repo 2015-03-18 08:01:36 -04:00
Al
2cf909c01e [utils] script utils 2015-03-17 18:39:08 -04:00
Al
aeac0fe8c0 [geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo. 2015-03-17 18:11:07 -04:00
Al
0437271c92 [geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets 2015-03-17 16:51:17 -04:00
Al
621b25c964 [geodata] script to fetch/transform OSM planet (needs about 100GB of disk free) for training language models 2015-03-16 00:45:14 -04:00
Al
26c2823208 [fix] comma 2015-03-14 18:58:18 -04:00
Al
3e20b4f600 [fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream 2015-03-14 18:02:14 -04:00
Al
284af74ba4 [geodisambig] Python scripts to prep GeoNames records for trie insertion 2015-03-13 11:56:48 -04:00