Commit Graph

47 Commits

Author SHA1 Message Date
Al
6c8e5b45a4 [fix] removing building alias (for OSM it means building category), fix to fetch script 2015-03-18 08:40:07 -04:00
Al
88554c1ef7 [i18n] adding CLDR languages script to this repo 2015-03-18 08:01:36 -04:00
Al
d2ceb5f418 [fix] removing struct definition from scanner.re for future generation of scanner.c 2015-03-17 19:46:40 -04:00
Al
2cf909c01e [utils] script utils 2015-03-17 18:39:08 -04:00
Al
f794ef7222 [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation 2015-03-17 18:38:30 -04:00
Al
daf3f8706b [utils] adding tab and comma constants to file_utils for parsing CSV/TSV files 2015-03-17 18:35:45 -04:00
Al
aeac0fe8c0 [geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo. 2015-03-17 18:11:07 -04:00
Al
0437271c92 [geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets 2015-03-17 16:51:17 -04:00
Al
f787851754 [unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained) 2015-03-17 12:20:08 -04:00
Al
621b25c964 [geodata] script to fetch/transform OSM planet (needs about 100GB of disk free) for training language models 2015-03-16 00:45:14 -04:00
Al
26c2823208 [fix] comma 2015-03-14 18:58:18 -04:00
Al
0df849b440 [features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models 2015-03-14 18:37:41 -04:00
Al
3e20b4f600 [fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream 2015-03-14 18:02:14 -04:00
Al
284af74ba4 [geodisambig] Python scripts to prep GeoNames records for trie insertion 2015-03-13 11:56:48 -04:00
Al
53aa9bccb1 [geodisambig] adding MurmurHash3, used by the Bloom filter 2015-03-11 17:47:57 -04:00
Al
cf613ee475 [geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for GeoNames disambiguation-related lookups and in-process deduping). 2015-03-11 17:47:15 -04:00
Al
eb391bf4d5 [dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values 2015-03-11 17:36:38 -04:00
Al
a446290829 [fix] IDEOGRAM class name 2015-03-11 17:33:53 -04:00
Al
a5f7c73374 [utils] is_relative_path 2015-03-11 17:31:08 -04:00
Al
5157a0fd8b [utils] float and double arrays in collections.h 2015-03-11 17:30:26 -04:00
Al
94805fb1a7 [tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters 2015-03-11 17:29:37 -04:00
Al
1dc0b8e07b [dictionaries] Catalan dictionaries 2015-03-08 17:57:08 -04:00
Al
fce693a6b3 [dictionaries] additions to Portuguese dictionaries 2015-03-08 17:56:38 -04:00
Al
642d3697d4 [dictionaries] additions to German dictionaries, including a separable prefix dictionary 2015-03-08 17:55:57 -04:00
Al
38ec03bf2b [phrases] default constructor for a trie uses a default alphabet derived from Wikipedia character frequencies for convenience. In practice the alphabet size/ordering matters only for very small tries or specialized alphabets. Mostly just use trie_new() 2015-03-05 13:40:52 -05:00
Al
939c3af293 [dictionaries] gazetteers.h has the config for in-memory dictionaries' directory structure 2015-03-04 16:01:16 -05:00
Al
7985a93963 [mv] Dutch concatenated suffixes 2015-03-04 01:21:37 -05:00
Al
b4bddfb510 [project] Making a work-in-progress note in the README 2015-03-03 23:35:15 -05:00
Al
163d8b7143 [dictionaries] first/last names apply to all languages. English gazetteers may potentially be used as a backup for all countries (most countries with non-Latin scripts transliterate, some actually translate the street name, usually to English) 2015-03-03 23:31:43 -05:00
Al
6d9c6a6fe7 [utils] geohash 2015-03-03 18:51:49 -05:00
Al
31910bd7b0 [dictionaries] fix for Dutch concatenated suffixes 2015-03-03 18:51:11 -05:00
Al
d5c14ca068 [dictionaries] Portuguese dictionaries 2015-03-03 18:46:44 -05:00
Al
fca161b2db [dictionaries] Dutch dictionaries 2015-03-03 18:46:26 -05:00
Al
b058e9e950 [dictionaries] Italian dictionaries 2015-03-03 18:46:11 -05:00
Al
837557ce97 [dictionaries] German dictionaries (including concatenated suffixes) 2015-03-03 18:45:42 -05:00
Al
c0c6ec5b85 [dictionaries] French dictionaries 2015-03-03 18:45:21 -05:00
Al
99816f55b1 [dictionaries] Spanish dictionaries 2015-03-03 18:45:04 -05:00
Al
ff55b2eace [dictionaries] English dictionaries 2015-03-03 18:44:36 -05:00
Al
5dd3896c4a [phrases] trie_search module for searching for millions of patterns in a trie simultaneously. Works for strings, token sequences, and can search for suffixes. 2015-03-03 13:51:01 -05:00
Al
10777ce973 [fix] debug logging only in trie.c 2015-03-03 13:28:43 -05:00
Al
585baab0a5 [phrases] optimized implementation of a double-array trie for storing millions of phrases compactly while being extremely quick to access. Supports utf-8, stores phrase tails in a contiguous character array separated by NUL bytes and stores offsets only so the chars at that offset can be treated as a regular C string and fed to things like strncmp. Also stores suffixes (primarily for languages like German, Dutch, etc. that concatenate street names e.g. Foobarstraße, Foobarweg) by prefixing the reversed string with the NUL byte and storing it backward in the trie, so we can search forward and backward with the same data structure. 2015-03-03 13:18:18 -05:00
Al
3ed5795cff [fix] fixing some formatting 2015-03-03 12:54:27 -05:00
Al
087328c321 [utils] logging 2015-03-03 12:38:10 -05:00
Al
09552906d3 [utils] util headers 2015-03-03 12:37:32 -05:00
Al
0689f936c9 [tokenization] scanner/tokenizer (generated with re2c) 2015-03-03 12:35:22 -05:00
Al
5216aba1b6 [utils] string utils, file utils, contiguous arrays of strings used for storing tokenized strings, klib for generic hashtables and vectors, antirez's sds for certain types of string building, utf8proc for iterating over utf-8 strings and unicode normalization 2015-03-03 12:33:13 -05:00
Al Barrentine
27269e18ca Initial commit 2015-03-02 19:21:31 -05:00
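
Note on cf613ee475 (Bloom filter): the commit message describes a probabilistic set membership test with 100% recall and bounded precision, used to skip disk seeks for keys that definitely do not exist. The sketch below illustrates that idea only; it is not the repository's C implementation. The class name, the sizing parameters, and the use of SHA-256 with double hashing are all assumptions made for illustration (the log says the real filter uses MurmurHash3, which is not in the Python standard library).

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical minimal Bloom filter (illustrative, not the repo's code)."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        # SHA-256 stands in for MurmurHash3 here (an assumption, see the note above).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep the stride odd and non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


if __name__ == "__main__":
    names = BloomFilter(capacity=1000)
    for name in ("Berlin", "Paris", "Lisboa"):
        names.add(name)
    # Only a positive answer would justify the (more expensive) on-disk lookup.
    print("Berlin" in names)
```

Because bits are only ever set and never cleared, a key that was added always tests positive, which is the "100% recall" in the commit message. The false positive rate is what error_rate bounds, provided no more than capacity keys are inserted.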