Commit Graph

  • 8bc77372ef [phrases] exposing trie_add_at_index and trie_get_from_index for more control in the transliteration tries Al 2015-04-26 22:24:02 -04:00
  • 6ebea11640 [transliteration] fixing transliteration rules, fixing escape characters, adding sizes to all the strings as they may have null characters Al 2015-04-26 19:45:06 -04:00
  • ff9b6735f8 [transliteration] Adding header + generated C data file for simplified transliteration rules Al 2015-04-25 15:44:36 -04:00
  • be29874f13 [transliteration] Parser for CLDR transforms to generate (simple) C transform rules Al 2015-04-25 15:42:21 -04:00
  • 1b33744956 [tokenization] Numeric tokens must end in number or letter Al 2015-04-22 14:55:18 -04:00
  • 9c0126a01c [utils] two set types in collections.h Al 2015-04-19 09:32:53 -04:00
  • 908e3dc03c [phrases] trie_search now only takes the original string and the token array. Fixed a bug where certain phrases were being found in string search but not in tokenized search Al 2015-04-19 09:32:20 -04:00
  • 606a669c01 [tokenization] breaking dashes or double hyphens break a word while other dashes don't Al 2015-04-17 19:14:42 -04:00
  • 6718182443 [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words Al 2015-04-17 15:20:31 -04:00
  • e21873635c [utils] Using token offsets to calculate lengths for contiguous string arrays, inlining a few functions Al 2015-04-15 20:16:58 -04:00
  • 24e62b1c6c [tokenization] Script to generate TR-29 ranges for re2c scanner Al 2015-04-14 15:50:36 -04:00
  • 5fa03587fb [cldr] simple Python scanner for creating dynamic scanners for CLDR rule parsing Al 2015-04-14 15:49:24 -04:00
  • efdcbc9eef [project] adding a Python .gitignore for scripts, Python lib, etc. Al 2015-04-14 15:48:43 -04:00
  • 6e9295154a [fix] local dirs for cldr data Al 2015-04-14 15:46:15 -04:00
  • 744231c148 [fix] cldr supplemental uses local copy Al 2015-04-13 19:03:44 -04:00
  • a8b9981c9b [fix] vars Al 2015-04-13 19:03:14 -04:00
  • d1267145f7 [fix] args to wget Al 2015-04-13 19:02:50 -04:00
  • d771da7c78 [i18n] unicode scripts file downloaded and cached locally Al 2015-04-13 19:01:46 -04:00
  • cc4d2d08eb [cldr] Adding script to download latest cldr release instead of pulling from the repo Al 2015-04-13 01:02:20 -04:00
  • e241c1dfc8 [rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks Al 2015-04-12 18:07:33 -04:00
  • 83813bb980 [geodisambig] Models for geonames with msgpack serialization/deserialization Al 2015-04-12 16:47:01 -04:00
  • acb575c84c [fix] splitting out methods for unicode scripts Al 2015-04-12 15:21:23 -04:00
  • 1f9da05dd5 [geodisambig] C msgpack serialization dependency Al 2015-04-12 15:13:53 -04:00
  • 0234754c20 [fix] warnings in string_utils Al 2015-04-12 12:16:32 -04:00
  • d50d7d182e [fix] geonames import script for admin 1 codes Al 2015-04-12 12:16:08 -04:00
  • 888baa86f3 [fix] English dictionaries Al 2015-04-12 12:15:47 -04:00
  • 3a7f18581e [utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header Al 2015-04-12 12:08:38 -04:00
  • fdd0c489f3 [fix] refactoring unicode script fetching into more reusable functions Al 2015-04-09 02:18:13 -04:00
  • 4729dfe178 [utils] string_[rl]strip => string_[rl]trim, removing warning about allocation Al 2015-04-06 02:19:12 -04:00
  • 53844067b1 [fix] better allocation sizes for tokenized strings Al 2015-04-05 22:02:31 -04:00
  • 198e51b8a3 [utils] more/better char_array methods Al 2015-04-05 22:01:46 -04:00
  • 79fd7a8ded [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string Al 2015-04-05 16:30:27 -04:00
  • 5f3d74de18 [fix] contiguous string array Al 2015-04-03 11:22:50 -04:00
  • fcaeebd656 [dictionaries] fixes to French dictionary Al 2015-04-01 19:02:38 -04:00
  • c81aa72254 [utils] a few changes to contiguous string arrays Al 2015-04-01 19:02:11 -04:00
  • fa59b63ab2 [fix] type name/import Al 2015-04-01 02:54:14 -04:00
  • 310acbed2c [phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays Al 2015-04-01 02:52:57 -04:00
  • 1ac4438e39 [utils] More consistent naming in string_utils Al 2015-03-27 21:12:08 -04:00
  • 70831b5005 [dictionaries] French elisions Al 2015-03-27 21:03:55 -04:00
  • 127a61d492 [utils] adding pop method on the improved vectors Al 2015-03-27 21:00:03 -04:00
  • 3678d4a3ca [gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN) Al 2015-03-27 20:59:21 -04:00
  • 4ccd1b1fe2 [fix] update feature arrays to use the new APIs Al 2015-03-27 20:57:42 -04:00
  • 6768936953 [utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors. Al 2015-03-27 20:57:03 -04:00
  • 70195fffd5 [utils] new methods on string_utils for better dynamic strings which retains the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct Al 2015-03-27 20:55:36 -04:00
  • 2d1c24a6e9 [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types Al 2015-03-24 16:43:53 -04:00
  • 50187f28ce [fix] .txt extension Al 2015-03-23 02:17:07 -04:00
  • 7ffe788913 [unicode] header Al 2015-03-18 17:25:53 -04:00
  • d5a9041cd3 [unicode] Adding generated unicode script data Al 2015-03-18 17:00:54 -04:00
  • e03c1f21a7 [unicode] generate C headers/data files from unicode.org scripts Al 2015-03-18 16:59:58 -04:00
  • 6c8e5b45a4 [fix] removing building alias (for OSM it means building category), fix to fetch script Al 2015-03-18 08:40:07 -04:00
  • 88554c1ef7 [i18n] adding CLDR languages script to this repo Al 2015-03-18 08:01:36 -04:00
  • d2ceb5f418 [fix] removing struct definition from scanner.re for future generation of scanner.c Al 2015-03-17 19:46:40 -04:00
  • 2cf909c01e [utils] script utils Al 2015-03-17 18:39:08 -04:00
  • f794ef7222 [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation Al 2015-03-17 18:38:30 -04:00
  • daf3f8706b [utils] adding tab and comma constants to file_utils for parsing CSV/TSV files Al 2015-03-17 18:35:45 -04:00
  • aeac0fe8c0 [geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo. Al 2015-03-17 18:10:48 -04:00
  • 0437271c92 [geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets Al 2015-03-17 12:21:19 -04:00
  • f787851754 [unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained) Al 2015-03-17 12:20:08 -04:00
  • 621b25c964 [geodata] script to fetch/transform OSM planet (needs about 100GB of disk free) for training language models Al 2015-03-15 13:19:03 -04:00
  • 26c2823208 [fix] comma Al 2015-03-14 18:58:18 -04:00
  • 0df849b440 [features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models Al 2015-03-14 18:37:41 -04:00
  • 3e20b4f600 [fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream Al 2015-03-14 18:02:14 -04:00
  • 284af74ba4 [geodisambig] Python scripts to prep GeoNames records for trie insertion Al 2015-03-13 11:56:48 -04:00
  • 53aa9bccb1 [geodisambig] adding MurmurHash3, used by the Bloom filter Al 2015-03-11 17:47:53 -04:00
  • cf613ee475 [geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for Geonames disambiguation-related lookups and in-process deduping). Al 2015-03-11 17:47:15 -04:00
  • eb391bf4d5 [dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values Al 2015-03-11 17:36:38 -04:00
  • a446290829 [fix] IDEOGRAM class name Al 2015-03-11 17:33:53 -04:00
  • a5f7c73374 [utils] is_relative_path Al 2015-03-11 17:31:08 -04:00
  • 5157a0fd8b [utils] float and double arrays in collections.h Al 2015-03-11 17:30:26 -04:00
  • 94805fb1a7 [tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters Al 2015-03-11 17:29:37 -04:00
  • 1dc0b8e07b [dictionaries] Catalan dictionaries Al 2015-03-08 17:57:08 -04:00
  • fce693a6b3 [dictionaries] additions to Portuguese dictionaries Al 2015-03-08 17:56:38 -04:00
  • 642d3697d4 [dictionaries] additions to German dictionaries, including a separable prefix dictionary Al 2015-03-08 17:55:57 -04:00
  • 38ec03bf2b [phrases] default constructor for a trie uses a default alphabet derived from Wikipedia character frequencies for convenience. In practice the alphabet size/ordering matters only for very small tries or specialized alphabets. Mostly just use trie_new() Al 2015-03-05 13:31:25 -05:00
  • 939c3af293 [dictionaries] gazetteers.h has the config for in-memory dictionaries' directory structure Al 2015-03-04 16:01:16 -05:00
  • 7985a93963 [mv] Dutch concatenated suffixes Al 2015-03-04 01:21:37 -05:00
  • b4bddfb510 [project] Making a work-in-progress note in the README Al 2015-03-03 23:35:15 -05:00
  • 163d8b7143 [dictionaries] first/last names apply to all languages. English gazetteers may potentially be used as a backup for all countries (most countries with non-Latin scripts transliterate, some actually translate the street name, usually to English) Al 2015-03-03 23:31:43 -05:00
  • 6d9c6a6fe7 [utils] geohash Al 2015-03-03 18:51:49 -05:00
  • 31910bd7b0 [dictionaries] fix for Dutch concatenated suffixes Al 2015-03-03 18:51:11 -05:00
  • d5c14ca068 [dictionaries] Portuguese dictionaries Al 2015-03-03 18:46:44 -05:00
  • fca161b2db [dictionaries] Dutch dictionaries Al 2015-03-03 18:46:26 -05:00
  • b058e9e950 [dictionaries] Italian dictionaries Al 2015-03-03 18:46:11 -05:00
  • 837557ce97 [dictionaries] German dictionaries (including concatenated suffixes) Al 2015-03-03 18:45:37 -05:00
  • c0c6ec5b85 [dictionaries] French dictionaries Al 2015-03-03 18:45:21 -05:00
  • 99816f55b1 [dictionaries] Spanish dictionaries Al 2015-03-03 18:45:04 -05:00
  • ff55b2eace [dictionaries] English dictionaries Al 2015-03-03 18:44:36 -05:00
  • 5dd3896c4a [phrases] trie_search module for searching for millions of patterns in a trie simultaneously. Works for strings, token sequences, and can search for suffixes. Al 2015-03-03 13:51:01 -05:00
  • 10777ce973 [fix] debug logging only in trie.c Al 2015-03-03 13:28:29 -05:00
  • 585baab0a5 [phrases] optimized implementation of a double-array trie for storing millions of phrases compactly while being extremely quick to access. Supports utf-8, stores phrase tails in a contiguous character array separated by NUL bytes and stores offsets only so the chars at that offset can be treated as a regular C string and fed to things like strncmp. Also stores suffixes (primarily for languages like German, Dutch, etc. that concatenate street names e.g. Foobarstraße, Foobarweg) by prefixing the reversed string with the NUL byte and storing it backward in the trie, so it can search forward and backward with the same data structure. Al 2015-03-03 13:18:18 -05:00
  • 3ed5795cff [fix] fixing some formatting Al 2015-03-03 12:44:52 -05:00
  • 087328c321 [utils] logging Al 2015-03-03 12:38:10 -05:00
  • 09552906d3 [utils] util headers Al 2015-03-03 12:37:32 -05:00
  • 0689f936c9 [tokenization] scanner/tokenizer (generated with re2c) Al 2015-03-03 12:35:22 -05:00
  • 5216aba1b6 [utils] string utils, file utils, contiguous arrays of strings used for storing tokenized strings, klib for generic hashtables and vectors, antirez's sds for certain types of string building, utf8proc for iterating over utf-8 strings and unicode normalization Al 2015-03-03 12:27:19 -05:00
  • 27269e18ca Initial commit Al Barrentine 2015-03-02 19:21:31 -05:00