libpostal

Author	SHA1	Message	Date
Al	5983cb6af0	[i18n] Adding NUM_SCRIPTS to the end of the scripts enum	2015-05-16 12:19:40 -04:00
Al	1f3ac0c3f9	[transliteration] using a proper lexer on the entire rule to correct some parses, allowing bracketed multiple characters in sets, fixing optionals	2015-05-14 16:34:03 -04:00
Al	304dc9525a	[transliteration] fixing variable assignments, literal wide characters (for narrow Python builds), ignoring rules related to spaced Han	2015-05-13 16:20:52 -04:00
Al	5bbf71ccbb	[transliteration] Using breadth-first search for tracking dependencies between transforms, removing Han-Spacedhan since our tokenizer does the equivalent already	2015-05-12 18:57:57 -04:00
Al	b55db5fcda	[fix] usage text	2015-05-12 12:15:51 -04:00
Al	d5f9d8a29a	[mv] unicode_scripts => unicode_properties	2015-05-12 12:14:59 -04:00
Al	ff0e7cb9e1	[i18n] downloading several files from the Unicode Character Database	2015-05-12 12:12:17 -04:00
Al	3814af52ec	[transliteration] Python script now implements the full TR-35 spec, including filter rules, which cuts down significantly on the size of the data file and complexity of generating the trie	2015-05-12 12:10:15 -04:00
Al	fe044cebef	[transliteration] char set mapping for some of the more complicated sets found in CLDR	2015-05-10 18:34:53 -04:00
Al	2a69488f9b	[fix] for transliteration rules, allowing the parsing of set differences and arbitrarily nested character set expressions, using non-NUL byte for the empty transition. Adding resulting data file.	2015-05-08 17:14:26 -04:00
Al	10ebaf147a	[transliteration] literal ^ and $ escaped	2015-05-01 19:16:36 -04:00
Al	ff851a464c	[fix] escaping curly braces for regex compilation	2015-04-30 13:27:17 -04:00
Al	fa43abd8d9	[transliteration] For ruleset steps in transliteration, the name is just the step number, which can be appended to the trie as part of the key	2015-04-29 14:31:15 -04:00
Al	1c25238af7	[fix] string lengths on the various transliteration rules	2015-04-27 13:51:35 -04:00
Al	6ebea11640	[transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters	2015-04-26 19:47:54 -04:00
Al	be29874f13	[transliteration] Parser for CLDR transforms to generate (simple) C transform rules	2015-04-25 15:42:21 -04:00
Al	24e62b1c6c	[tokenization] Script to generate TR-29 ranges for re2c scanner	2015-04-14 15:50:50 -04:00
Al	5fa03587fb	[cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing	2015-04-14 15:49:24 -04:00
Al	6e9295154a	[fix] local dirs for cldr data	2015-04-14 15:46:15 -04:00
Al	744231c148	[fix] cldr supplemental uses local copy	2015-04-13 19:03:44 -04:00
Al	a8b9981c9b	[fix] vars	2015-04-13 19:03:14 -04:00
Al	d1267145f7	[fix] args to wget	2015-04-13 19:02:50 -04:00
Al	d771da7c78	[i18n] unicode scripts file downloaded and cached locally	2015-04-13 19:02:29 -04:00
Al	cc4d2d08eb	[cldr] Adding script to download latest cldr release instead of pulling from the repo	2015-04-13 01:03:15 -04:00
Al	acb575c84c	[fix] splitting out methods for unicode scripts	2015-04-12 15:21:23 -04:00
Al	d50d7d182e	[fix] geonames import script for admin 1 codes	2015-04-12 12:16:08 -04:00
Al	fdd0c489f3	[fix] refactoring unicode script fetching into more reusable functions	2015-04-09 02:18:13 -04:00
Al	e03c1f21a7	[unicode] generate C headers/data files from unicode.org scripts	2015-03-18 16:59:58 -04:00
Al	6c8e5b45a4	[fix] removing building alias (for OSM it means building category), fix to fetch script	2015-03-18 08:40:07 -04:00
Al	88554c1ef7	[i18n] adding CLDR languages script to this repo	2015-03-18 08:01:36 -04:00
Al	2cf909c01e	[utils] script utils	2015-03-17 18:39:08 -04:00
Al	aeac0fe8c0	[geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo.	2015-03-17 18:11:07 -04:00
Al	0437271c92	[geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets	2015-03-17 16:51:17 -04:00
Al	621b25c964	[geodata] script to fetch/transform OSM planet (needs about 100GB of disk free) training language models	2015-03-16 00:45:14 -04:00
Al	26c2823208	[fix] comma	2015-03-14 18:58:18 -04:00
Al	3e20b4f600	[fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream	2015-03-14 18:02:14 -04:00
Al	284af74ba4	[geodisambig] Python scripts to prep GeoNames records for trie insertion	2015-03-13 11:56:48 -04:00

37 Commits