libpostal

Author	SHA1	Message	Date
Al	79fd7a8ded	[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string	2015-04-05 16:33:14 -04:00
Al	5f3d74de18	[fix] contiguous string array	2015-04-03 11:22:50 -04:00
Al	fcaeebd656	[dictionaries] fixes to French dictionary	2015-04-01 19:02:38 -04:00
Al	c81aa72254	[utils] a few changes to contiguous string arrays	2015-04-01 19:02:11 -04:00
Al	fa59b63ab2	[fix] type name/import	2015-04-01 02:54:14 -04:00
Al	310acbed2c	[phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays	2015-04-01 02:52:57 -04:00
Al	1ac4438e39	[utils] More consistent naming in string_utils	2015-03-27 21:12:08 -04:00
Al	70831b5005	[dictionaries] French elisions	2015-03-27 21:03:55 -04:00
Al	127a61d492	[utils] adding pop method on the improved vectors	2015-03-27 21:00:03 -04:00
Al	3678d4a3ca	[gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN)	2015-03-27 20:59:21 -04:00
Al	4ccd1b1fe2	[fix] update feature arrays to use the new APIs	2015-03-27 20:57:42 -04:00
Al	6768936953	[utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors.	2015-03-27 20:57:03 -04:00
Al	70195fffd5	[utils] new methods on string_utils for better dynamic strings which retain the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct	2015-03-27 20:55:36 -04:00
Al	2d1c24a6e9	[tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types	2015-03-24 16:43:53 -04:00
Al	50187f28ce	[fix] .txt extension	2015-03-23 02:17:07 -04:00
Al	7ffe788913	[unicode] header	2015-03-18 17:25:53 -04:00
Al	d5a9041cd3	[unicode] Adding generated unicode script data	2015-03-18 17:01:03 -04:00
Al	e03c1f21a7	[unicode] generate C headers/data files from unicode.org scripts	2015-03-18 16:59:58 -04:00
Al	6c8e5b45a4	[fix] removing building alias (for OSM it means building category), fix to fetch script	2015-03-18 08:40:07 -04:00
Al	88554c1ef7	[i18n] adding CLDR languages script to this repo	2015-03-18 08:01:36 -04:00
Al	d2ceb5f418	[fix] removing struct definition from scanner.re for future generation of scanner.c	2015-03-17 19:46:40 -04:00
Al	2cf909c01e	[utils] script utils	2015-03-17 18:39:08 -04:00
Al	f794ef7222	[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation	2015-03-17 18:38:30 -04:00
Al	daf3f8706b	[utils] adding tab and comma constants to file_utils for parsing CSV/TSV files	2015-03-17 18:35:45 -04:00
Al	aeac0fe8c0	[geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo.	2015-03-17 18:11:07 -04:00
Al	0437271c92	[geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets	2015-03-17 16:51:17 -04:00
Al	f787851754	[unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained)	2015-03-17 12:20:08 -04:00
Al	621b25c964	[geodata] script to fetch/transform the OSM planet (needs about 100GB of free disk space) for training language models	2015-03-16 00:45:14 -04:00
Al	26c2823208	[fix] comma	2015-03-14 18:58:18 -04:00
Al	0df849b440	[features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models	2015-03-14 18:37:41 -04:00
Al	3e20b4f600	[fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream	2015-03-14 18:02:14 -04:00
Al	284af74ba4	[geodisambig] Python scripts to prep GeoNames records for trie insertion	2015-03-13 11:56:48 -04:00
Al	53aa9bccb1	[geodisambig] adding MurmurHash3, used by the Bloom filter	2015-03-11 17:47:57 -04:00
Al	cf613ee475	[geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for GeoNames disambiguation-related lookups and in-process deduping).	2015-03-11 17:47:15 -04:00
Al	eb391bf4d5	[dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values	2015-03-11 17:36:38 -04:00
Al	a446290829	[fix] IDEOGRAM class name	2015-03-11 17:33:53 -04:00
Al	a5f7c73374	[utils] is_relative_path	2015-03-11 17:31:08 -04:00
Al	5157a0fd8b	[utils] float and double arrays in collections.h	2015-03-11 17:30:26 -04:00
Al	94805fb1a7	[tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters	2015-03-11 17:29:37 -04:00
Al	1dc0b8e07b	[dictionaries] Catalan dictionaries	2015-03-08 17:57:08 -04:00
Al	fce693a6b3	[dictionaries] additions to Portuguese dictionaries	2015-03-08 17:56:38 -04:00
Al	642d3697d4	[dictionaries] additions to German dictionaries, including a separable prefix dictionary	2015-03-08 17:55:57 -04:00
Al	38ec03bf2b	[phrases] default constructor for a trie uses a default alphabet derived from Wikipedia character frequencies for convenience. In practice the alphabet size/ordering matters only for very small tries or specialized alphabets. Mostly just use trie_new()	2015-03-05 13:40:52 -05:00
Al	939c3af293	[dictionaries] gazetteers.h has the config for in-memory dictionaries' directory structure	2015-03-04 16:01:16 -05:00
Al	7985a93963	[mv] Dutch concatenated suffixes	2015-03-04 01:21:37 -05:00
Al	b4bddfb510	[project] Making a work-in-progress note in the README	2015-03-03 23:35:15 -05:00
Al	163d8b7143	[dictionaries] first/last names apply to all languages. English gazetteers may potentially be used as a backup for all countries (most countries with non-Latin scripts transliterate, some actually translate the street name, usually to English)	2015-03-03 23:31:43 -05:00
Al	6d9c6a6fe7	[utils] geohash	2015-03-03 18:51:49 -05:00
Al	31910bd7b0	[dictionaries] fix for Dutch concatenated suffixes	2015-03-03 18:51:11 -05:00
Al	d5c14ca068	[dictionaries] Portuguese dictionaries	2015-03-03 18:46:44 -05:00
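
Commit cf613ee475 describes a Bloom filter with 100% recall and bounded precision, used to skip disk seeks for keys that definitely do not exist. The following is a minimal sketch of that idea, not the code from the commit. The class name `BloomFilter`, its parameters, and the sizing formulas are assumptions made for illustration, and BLAKE2b from the Python standard library stands in for the MurmurHash3 added in 53aa9bccb1.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical API, not libpostal's)."""

    def __init__(self, capacity, error_rate=0.01):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        # Assumption: BLAKE2b replaces MurmurHash3 to keep the sketch stdlib-only.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


seen = BloomFilter(capacity=1000)
for name in ("london", "paris", "new york"):
    seen.add(name)
```

A key that was added always tests as present, so a negative answer can safely skip the disk lookup. A positive answer still has to be confirmed against the real store, because a small fraction of absent keys will also test as present.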

65 Commits