libpostal

Author	SHA1	Message	Date
Al	c39a19a352	[transliteration] New data file with the Greek/Katakana additions	2015-05-17 17:59:39 -04:00
Al	d72348d47e	[transliteration] Using a restricted set of diacritical marks relevant to Greek, variants stand in for transliterator dependencies e.g. use Katakana-Latin-BGN if Katakana-Latin cannot be found	2015-05-17 17:42:37 -04:00
Al	30db201e8a	[fix] NUM_CHARS => NUM_CODEPOINTS	2015-05-17 13:53:19 -04:00
Al	1348cc8906	[transliteration] Switching the begin/end set chars	2015-05-17 12:02:46 -04:00
Al	f1cfb30209	[transliteration] generated scripts file	2015-05-17 00:00:14 -04:00
Al	b983a83a89	[transliteration] transliteration struct definitions, memory allocation, builder methods and I/O, stubbing transliterate method for the moment	2015-05-16 23:23:25 -04:00
Al	3a74a8c179	[transliteration] script to build transliteration table, trie, C structures, etc. from the rules	2015-05-16 23:22:16 -04:00
Al	65624c8985	[fix] vector_*_pop returns the element	2015-05-16 23:20:28 -04:00
Al	4a67294fbf	[phrases] adding get_prefix methods for tries, remove add_nodes_only, fixing a few things and inlining a few functions	2015-05-16 23:19:59 -04:00
Al	e8fdd4564d	[utils] adding string_tree for listing sets of token alternatives and string_tree_iterator to generate permutations over the strings, needed for transliteration and ambiguous address elements/place names	2015-05-16 23:16:10 -04:00
Al	f151a2232c	[transliteration] new transliteration rules data file	2015-05-16 23:14:47 -04:00
Al	99115fa53c	[transliteration] converting one of the more complicated and frequently used rules to its utf8proc equivalent, adding better support for escaped unicode characters and set differences, generating a header file indicating which unicode script/language pairs warrant various transliterators.	2015-05-16 23:13:01 -04:00
Al	5983cb6af0	[i18n] Adding NUM_SCRIPTS to the end of the scripts enum	2015-05-16 12:19:40 -04:00
Al	8699409f15	[transliteration] resulting data file	2015-05-14 16:34:49 -04:00
Al	1f3ac0c3f9	[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals	2015-05-14 16:34:03 -04:00
Al	2d49369e78	[utils] Adding read/write for 64-bit ints to file_utils	2015-05-13 17:51:03 -04:00
Al	6898f8ecd9	[transliteration] Adding context types back to transliteration rule struct since they don't matter in the actual transliteration table	2015-05-13 16:51:07 -04:00
Al	b777b60e07	[transliteration] new data file	2015-05-13 16:21:16 -04:00
Al	304dc9525a	[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han	2015-05-13 16:20:52 -04:00
Al	cbe83376f2	[transliteration] Adding new, even smaller, generated data file	2015-05-12 18:58:38 -04:00
Al	5bbf71ccbb	[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already	2015-05-12 18:57:57 -04:00
Al	b55db5fcda	[fix] usage text	2015-05-12 12:15:51 -04:00
Al	d5f9d8a29a	[mv] unicode_scripts => unicode_properties	2015-05-12 12:14:59 -04:00
Al	0984fb9ea4	[transliteration] new, more compact transliteration data file	2015-05-12 12:13:11 -04:00
Al	ff0e7cb9e1	[i18n] downloading several files from the Unicode Character Database	2015-05-12 12:12:17 -04:00
Al	3814af52ec	[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie	2015-05-12 12:10:15 -04:00
Al	fe044cebef	[transliteration] char set mapping for some of the more complicated sets found in CLDR	2015-05-10 18:34:53 -04:00
Al	2a69488f9b	[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.	2015-05-08 17:14:26 -04:00
Al	10ebaf147a	[transliteration] literal ^ and $ escaped	2015-05-01 19:16:36 -04:00
Al	ff851a464c	[fix] escaping curly braces for regex compilation	2015-04-30 13:27:17 -04:00
Al	fa43abd8d9	[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key	2015-04-29 14:31:15 -04:00
Al	1c25238af7	[fix] string lengths on the various transliteration rules	2015-04-27 13:51:35 -04:00
Al	1373843b86	[fix] setting last_node in tokenized trie search in the case where a prefix phrase matches but the longer string doesn't.	2015-04-27 01:49:08 -04:00
Al	b2ba629f95	[fix] trie_get methods just return node index rather than data value	2015-04-27 01:28:05 -04:00
Al	8fb9bacfa6	[phrases] New trie_add_nodes_only method for concatenating strings to the trie, plus boolean return values on trie_add_* APIs	2015-04-27 01:01:43 -04:00
Al	8bc77372ef	[phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries	2015-04-26 22:24:02 -04:00
Al	6ebea11640	[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters	2015-04-26 19:47:54 -04:00
Al	ff9b6735f8	[transliteration] Adding header + generated C data file for simplified transliteration rules	2015-04-25 15:44:36 -04:00
Al	be29874f13	[transliteration] Parser for CLDR transforms to generate (simple) C transform rules	2015-04-25 15:42:21 -04:00
Al	1b33744956	[tokenization] Numeric tokens must end in number or letter	2015-04-22 14:55:18 -04:00
Al	9c0126a01c	[utils] two set types in collections.h	2015-04-19 09:32:53 -04:00
Al	908e3dc03c	[phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search	2015-04-19 09:32:20 -04:00
Al	606a669c01	[tokenization] breaking dashes or double hyphens break a word while other dashes don't	2015-04-17 19:14:42 -04:00
Al	6718182443	[tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words	2015-04-17 15:21:22 -04:00
Al	e21873635c	[utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions	2015-04-15 20:17:03 -04:00
Al	24e62b1c6c	[tokenization] Script to generate TR-29 ranges for re2c scanner	2015-04-14 15:50:50 -04:00
Al	5fa03587fb	[cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing	2015-04-14 15:49:24 -04:00
Al	efdcbc9eef	[project] adding a Python .gitignore for scripts, Python lib, etc.	2015-04-14 15:48:43 -04:00
Al	6e9295154a	[fix] local dirs for cldr data	2015-04-14 15:46:15 -04:00
Al	744231c148	[fix] cldr supplemental uses local copy	2015-04-13 19:03:44 -04:00

181 Commits