Al
|
a278cfd12c
|
[transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string; removing any duplicate keys in the table builder so that if any rules overlap within a step, the first takes precedence
|
2015-05-29 16:54:05 -04:00 |
|
Al
|
a9d5b91ac0
|
[transliteration] Not counting repeat character in group capture
|
2015-05-28 19:36:25 -04:00 |
|
Al
|
c00ecf6ea8
|
[fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string
|
2015-05-22 18:11:54 -04:00 |
|
Al
|
b2d15b29cf
|
[fix] greek_latin_ungegn => greek-latin-ungegn
|
2015-05-22 09:52:08 -04:00 |
|
Al
|
d65f7747f0
|
[transliteration] Adding html escapes as the first step in the Latin-ASCII transformation
|
2015-05-20 14:44:55 -04:00 |
|
Al
|
4694371cdc
|
[fix] unicode escaping the German transliterations
|
2015-05-18 13:55:57 -04:00 |
|
Al
|
e25f039ee4
|
[transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff
|
2015-05-17 18:31:35 -04:00 |
|
Al
|
d72348d47e
|
[transliteration] Using a restricted set of diacritical marks relevant to Greek; variants stand in for transliterator dependencies, e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found
|
2015-05-17 17:42:37 -04:00 |
|
Al
|
30db201e8a
|
[fix] NUM_CHARS => NUM_CODEPOINTS
|
2015-05-17 13:53:19 -04:00 |
|
Al
|
99115fa53c
|
[transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators.
|
2015-05-16 23:13:01 -04:00 |
|
Al
|
5983cb6af0
|
[i18n] Adding NUM_SCRIPTS to the end of the scripts enum
|
2015-05-16 12:19:40 -04:00 |
|
Al
|
1f3ac0c3f9
|
[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals
|
2015-05-14 16:34:03 -04:00 |
|
Al
|
304dc9525a
|
[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han
|
2015-05-13 16:20:52 -04:00 |
|
Al
|
5bbf71ccbb
|
[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already
|
2015-05-12 18:57:57 -04:00 |
|
Al
|
b55db5fcda
|
[fix] usage text
|
2015-05-12 12:15:51 -04:00 |
|
Al
|
d5f9d8a29a
|
[mv] unicode_scripts => unicode_properties
|
2015-05-12 12:14:59 -04:00 |
|
Al
|
ff0e7cb9e1
|
[i18n] downloading several files from the Unicode Character Database
|
2015-05-12 12:12:17 -04:00 |
|
Al
|
3814af52ec
|
[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie
|
2015-05-12 12:10:15 -04:00 |
|
Al
|
fe044cebef
|
[transliteration] char set mapping for some of the more complicated sets found in CLDR
|
2015-05-10 18:34:53 -04:00 |
|
Al
|
2a69488f9b
|
[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.
|
2015-05-08 17:14:26 -04:00 |
|
Al
|
10ebaf147a
|
[transliteration] literal ^ and $ escaped
|
2015-05-01 19:16:36 -04:00 |
|
Al
|
ff851a464c
|
[fix] escaping curly braces for regex compilation
|
2015-04-30 13:27:17 -04:00 |
|
Al
|
fa43abd8d9
|
[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key
|
2015-04-29 14:31:15 -04:00 |
|
Al
|
1c25238af7
|
[fix] string lengths on the various transliteration rules
|
2015-04-27 13:51:35 -04:00 |
|
Al
|
6ebea11640
|
[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters
|
2015-04-26 19:47:54 -04:00 |
|
Al
|
be29874f13
|
[transliteration] Parser for CLDR transforms to generate (simple) C transform rules
|
2015-04-25 15:42:21 -04:00 |
|
Al
|
24e62b1c6c
|
[tokenization] Script to generate TR-29 ranges for re2c scanner
|
2015-04-14 15:50:50 -04:00 |
|
Al
|
5fa03587fb
|
[cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing
|
2015-04-14 15:49:24 -04:00 |
|
Al
|
6e9295154a
|
[fix] local dirs for cldr data
|
2015-04-14 15:46:15 -04:00 |
|
Al
|
744231c148
|
[fix] cldr supplemental uses local copy
|
2015-04-13 19:03:44 -04:00 |
|
Al
|
a8b9981c9b
|
[fix] vars
|
2015-04-13 19:03:14 -04:00 |
|
Al
|
d1267145f7
|
[fix] args to wget
|
2015-04-13 19:02:50 -04:00 |
|
Al
|
d771da7c78
|
[i18n] unicode scripts file downloaded and cached locally
|
2015-04-13 19:02:29 -04:00 |
|
Al
|
cc4d2d08eb
|
[cldr] Adding script to download latest cldr release instead of pulling from the repo
|
2015-04-13 01:03:15 -04:00 |
|
Al
|
acb575c84c
|
[fix] splitting out methods for unicode scripts
|
2015-04-12 15:21:23 -04:00 |
|
Al
|
d50d7d182e
|
[fix] geonames import script for admin 1 codes
|
2015-04-12 12:16:08 -04:00 |
|
Al
|
fdd0c489f3
|
[fix] refactoring unicode script fetching into more reusable functions
|
2015-04-09 02:18:13 -04:00 |
|
Al
|
e03c1f21a7
|
[unicode] generate C headers/data files from unicode.org scripts
|
2015-03-18 16:59:58 -04:00 |
|
Al
|
6c8e5b45a4
|
[fix] removing building alias (for OSM it means building category), fix to fetch script
|
2015-03-18 08:40:07 -04:00 |
|
Al
|
88554c1ef7
|
[i18n] adding CLDR languages script to this repo
|
2015-03-18 08:01:36 -04:00 |
|
Al
|
2cf909c01e
|
[utils] script utils
|
2015-03-17 18:39:08 -04:00 |
|
Al
|
aeac0fe8c0
|
[geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo.
|
2015-03-17 18:11:07 -04:00 |
|
Al
|
0437271c92
|
[geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets
|
2015-03-17 16:51:17 -04:00 |
|
Al
|
621b25c964
|
[geodata] script to fetch/transform OSM planet (needs about 100GB of free disk space) for training language models
|
2015-03-16 00:45:14 -04:00 |
|
Al
|
26c2823208
|
[fix] comma
|
2015-03-14 18:58:18 -04:00 |
|
Al
|
3e20b4f600
|
[fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream
|
2015-03-14 18:02:14 -04:00 |
|
Al
|
284af74ba4
|
[geodisambig] Python scripts to prep GeoNames records for trie insertion
|
2015-03-13 11:56:48 -04:00 |
|