libpostal

Author	SHA1	Message	Date
Al	579425049b	[fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators	2017-03-17 18:28:15 -04:00
Al	a0b508caf6	[transliteration] adding no-args option for transliteration_rules script	2017-02-15 13:22:33 -05:00
Al	25723fcea2	[transliteration] making the custom rules in transliteration less repetitious and accessible from elsewhere, removing string names for common transliterators and using constants	2017-01-05 04:06:51 -05:00
Al	600b40d2f6	[transliteration] adding german-ascii transliteration to Estonian to handle umlauts (ä => ae, etc.)	2017-01-02 13:51:56 -05:00
Al	cb4408fea8	[transliteration] Adding language-specific transliterators for handling umlauts in German + special transliterations in the Nordic languages. It may still result in some wrong transliterations if the language classifier is wrong, but generally it's accurate enough that its predictions can be relied upon. Also adding a Latin-ASCII-Simple transform which only does the punctuation portion of Latin-ASCII so it won't change anything substantial about the input string.	2016-08-20 18:17:46 -04:00
Al	da62ff309e	[transliteration] Fixing Malayalam script	2016-01-17 22:15:56 -05:00
Al	e9e05bb929	[transliteration] Distinguishing between variables with numbers and backreferences in transliteration rules	2015-12-23 13:07:44 -05:00
Al	e55ff54be1	[fix] Adding Korean-Latin-BGN to excluded transliterators	2015-12-21 16:24:50 -05:00
Al	682c316775	[transliteration] Removing Korean-Latin-BGN, not a great transliterator and AFAICT, ICU doesn't use it either	2015-12-21 12:45:45 -05:00
Al	ccf509edb1	[fix] update to control characters for generating the transliteration rules	2015-12-20 15:40:38 -05:00
Al	b2a944830a	[transliteration] Making sure the Python script to generate transliteration data works on the new CLDR format	2015-12-19 00:34:30 -05:00
Al	7f5cf89e84	[transliteration] Not escaping right side transliteration rules	2015-10-27 12:24:38 -04:00
Al	5417b4e602	[unicode] Downloading latest UnicodeData.txt instead of using builtin Python module (out of date) e.g. for getting unicode codepoint categories	2015-09-25 23:59:38 -04:00
Al	abfb1d4a60	[transliteration] Wide char support in transliteration data generator	2015-09-23 03:56:12 -04:00
Al	b4593b6f88	[unicode/tokenization] Using new character classes including wide chars in scanner	2015-09-23 00:33:14 -04:00
Al	cf70615850	[transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps	2015-08-11 23:10:55 -04:00
Al	a580ed0b1b	[transliteration] Adding numeric HTML escapes e.g. '&'	2015-06-29 15:02:34 -04:00
Al	8fb6a28e9c	[fix] using empty string instead of NULL for script languages so we can use fixed length arrays	2015-06-23 15:20:09 -05:00
Al	b21c3a3a2f	[transliteration] using different struct in script data header file	2015-06-22 22:06:16 -05:00
Al	c2b4744f55	[transliteration] Using a data file instead of a header for transliteration scripts	2015-06-21 05:37:56 -05:00
Al	84b9a6ff33	[transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group	2015-06-17 23:42:31 -04:00
Al	b2fe9d4db0	[transliteration] Adding uppercase umlauts and Scandinavian a-ring	2015-06-03 22:55:45 -04:00
Al	6ac4ff6021	[transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin	2015-05-31 02:07:36 -04:00
Al	9547c93a38	[fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements	2015-05-29 19:47:49 -04:00
Al	a278cfd12c	[transliteration] Using revisit strings instead of keeping a backtrack count so we don't have to later map logical characters to the actual string, removing any duplicate keys in the table builder so that if any rules happen to overlap within a step, the first will take precedence	2015-05-29 16:54:05 -04:00
Al	a9d5b91ac0	[transliteration] Not counting repeat character in group capture	2015-05-28 19:36:25 -04:00
Al	c00ecf6ea8	[fix] minimizing c* into (c|'')+, using empty transition instead of zero-length string	2015-05-22 18:11:54 -04:00
Al	b2d15b29cf	[fix] greek_latin_ungegn => greek-latin-ungegn	2015-05-22 09:52:08 -04:00
Al	d65f7747f0	[transliteration] Adding html escapes as the first step in the Latin-ASCII transformation	2015-05-20 14:44:55 -04:00
Al	4694371cdc	[fix] unicode escaping the German transliterations	2015-05-18 13:55:57 -04:00
Al	e25f039ee4	[transliteration] Escaped single quotes in rules + ignoring rules with codepoints > \uffff	2015-05-17 18:31:35 -04:00
Al	d72348d47e	[transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found	2015-05-17 17:42:37 -04:00
Al	99115fa53c	[transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators.	2015-05-16 23:13:01 -04:00
Al	1f3ac0c3f9	[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals	2015-05-14 16:34:03 -04:00
Al	304dc9525a	[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han	2015-05-13 16:20:52 -04:00
Al	5bbf71ccbb	[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already	2015-05-12 18:57:57 -04:00
Al	d5f9d8a29a	[mv] unicode_scripts => unicode_properties	2015-05-12 12:14:59 -04:00
Al	3814af52ec	[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie	2015-05-12 12:10:15 -04:00
Al	fe044cebef	[transliteration] char set mapping for some of the more complicated sets found in CLDR	2015-05-10 18:34:53 -04:00
Al	2a69488f9b	[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.	2015-05-08 17:14:26 -04:00
Al	10ebaf147a	[transliteration] literal ^ and $ escaped	2015-05-01 19:16:36 -04:00
Al	ff851a464c	[fix] escaping curly braces for regex compilation	2015-04-30 13:27:17 -04:00
Al	fa43abd8d9	[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key	2015-04-29 14:31:15 -04:00
Al	1c25238af7	[fix] string lengths on the various transliteration rules	2015-04-27 13:51:35 -04:00
Al	6ebea11640	[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters	2015-04-26 19:47:54 -04:00
Al	be29874f13	[transliteration] Parser for CLDR transforms to generate (simple) C transform rules	2015-04-25 15:42:21 -04:00

46 Commits