Al
|
79fd7a8ded
|
[tokenization/trie] simpler URL regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search iterate instead of malloc'ing a new string
|
2015-04-05 16:33:14 -04:00 |
|
Al
|
5f3d74de18
|
[fix] contiguous string array
|
2015-04-03 11:22:50 -04:00 |
|
Al
|
fcaeebd656
|
[dictionaries] fixes to French dictionary
|
2015-04-01 19:02:38 -04:00 |
|
Al
|
c81aa72254
|
[utils] a few changes to contiguous string arrays
|
2015-04-01 19:02:11 -04:00 |
|
Al
|
fa59b63ab2
|
[fix] type name/import
|
2015-04-01 02:54:14 -04:00 |
|
Al
|
310acbed2c
|
[phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays
|
2015-04-01 02:52:57 -04:00 |
|
Al
|
1ac4438e39
|
[utils] More consistent naming in string_utils
|
2015-03-27 21:12:08 -04:00 |
|
Al
|
70831b5005
|
[dictionaries] French elisions
|
2015-03-27 21:03:55 -04:00 |
|
Al
|
127a61d492
|
[utils] adding pop method on the improved vectors
|
2015-03-27 21:00:03 -04:00 |
|
Al
|
3678d4a3ca
|
[gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN)
|
2015-03-27 20:59:21 -04:00 |
|
Al
|
4ccd1b1fe2
|
[fix] update feature arrays to use the new APIs
|
2015-03-27 20:57:42 -04:00 |
|
Al
|
6768936953
|
[utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors.
|
2015-03-27 20:57:03 -04:00 |
|
Al
|
70195fffd5
|
[utils] new methods on string_utils for better dynamic strings which retain the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct
|
2015-03-27 20:55:36 -04:00 |
|
Al
|
2d1c24a6e9
|
[tokenization] Adding URL, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types
|
2015-03-24 16:43:53 -04:00 |
|
Al
|
50187f28ce
|
[fix] .txt extension
|
2015-03-23 02:17:07 -04:00 |
|
Al
|
7ffe788913
|
[unicode] header
|
2015-03-18 17:25:53 -04:00 |
|
Al
|
d5a9041cd3
|
[unicode] Adding generated unicode script data
|
2015-03-18 17:01:03 -04:00 |
|
Al
|
e03c1f21a7
|
[unicode] generate C headers/data files from unicode.org scripts
|
2015-03-18 16:59:58 -04:00 |
|
Al
|
6c8e5b45a4
|
[fix] removing building alias (for OSM it means building category), fix to fetch script
|
2015-03-18 08:40:07 -04:00 |
|
Al
|
88554c1ef7
|
[i18n] adding CLDR languages script to this repo
|
2015-03-18 08:01:36 -04:00 |
|
Al
|
d2ceb5f418
|
[fix] removing struct definition from scanner.re for future generation of scanner.c
|
2015-03-17 19:46:40 -04:00 |
|
Al
|
2cf909c01e
|
[utils] script utils
|
2015-03-17 18:39:08 -04:00 |
|
Al
|
f794ef7222
|
[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation
|
2015-03-17 18:38:30 -04:00 |
|
Al
|
daf3f8706b
|
[utils] adding tab and comma constants to file_utils for parsing CSV/TSV files
|
2015-03-17 18:35:45 -04:00 |
|
Al
|
aeac0fe8c0
|
[geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo.
|
2015-03-17 18:11:07 -04:00 |
|
Al
|
0437271c92
|
[geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets
|
2015-03-17 16:51:17 -04:00 |
|
Al
|
f787851754
|
[unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained)
|
2015-03-17 12:20:08 -04:00 |
|
Al
|
621b25c964
|
[geodata] script to fetch/transform OSM planet (needs about 100GB of free disk space) for training language models
|
2015-03-16 00:45:14 -04:00 |
|
Al
|
26c2823208
|
[fix] comma
|
2015-03-14 18:58:18 -04:00 |
|
Al
|
0df849b440
|
[features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models
|
2015-03-14 18:37:41 -04:00 |
|
Al
|
3e20b4f600
|
[fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream
|
2015-03-14 18:02:14 -04:00 |
|
Al
|
284af74ba4
|
[geodisambig] Python scripts to prep GeoNames records for trie insertion
|
2015-03-13 11:56:48 -04:00 |
|
Al
|
53aa9bccb1
|
[geodisambig] adding MurmurHash3, used by the Bloom filter
|
2015-03-11 17:47:57 -04:00 |
|
Al
|
cf613ee475
|
[geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for GeoNames disambiguation-related lookups and in-process deduping).
|
2015-03-11 17:47:15 -04:00 |
|
Al
|
eb391bf4d5
|
[dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values
|
2015-03-11 17:36:38 -04:00 |
|
Al
|
a446290829
|
[fix] IDEOGRAM class name
|
2015-03-11 17:33:53 -04:00 |
|
Al
|
a5f7c73374
|
[utils] is_relative_path
|
2015-03-11 17:31:08 -04:00 |
|
Al
|
5157a0fd8b
|
[utils] float and double arrays in collections.h
|
2015-03-11 17:30:26 -04:00 |
|
Al
|
94805fb1a7
|
[tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters
|
2015-03-11 17:29:37 -04:00 |
|
Al
|
1dc0b8e07b
|
[dictionaries] Catalan dictionaries
|
2015-03-08 17:57:08 -04:00 |
|
Al
|
fce693a6b3
|
[dictionaries] additions to Portuguese dictionaries
|
2015-03-08 17:56:38 -04:00 |
|
Al
|
642d3697d4
|
[dictionaries] additions to German dictionaries, including a separable prefix dictionary
|
2015-03-08 17:55:57 -04:00 |
|
Al
|
38ec03bf2b
|
[phrases] default constructor for a trie uses a default alphabet derived from Wikipedia character frequencies for convenience. In practice the alphabet size/ordering matters only for very small tries or specialized alphabets. Mostly just use trie_new()
|
2015-03-05 13:40:52 -05:00 |
|
Al
|
939c3af293
|
[dictionaries] gazetteers.h has the config for in-memory dictionaries' directory structure
|
2015-03-04 16:01:16 -05:00 |
|
Al
|
7985a93963
|
[mv] Dutch concatenated suffixes
|
2015-03-04 01:21:37 -05:00 |
|
Al
|
b4bddfb510
|
[project] Making a work-in-progress note in the README
|
2015-03-03 23:35:15 -05:00 |
|
Al
|
163d8b7143
|
[dictionaries] first/last names apply to all languages. English gazetteers may potentially be used as a backup for all countries (most countries with non-Latin scripts transliterate, some actually translate the street name, usually to English)
|
2015-03-03 23:31:43 -05:00 |
|
Al
|
6d9c6a6fe7
|
[utils] geohash
|
2015-03-03 18:51:49 -05:00 |
|
Al
|
31910bd7b0
|
[dictionaries] fix for Dutch concatenated suffixes
|
2015-03-03 18:51:11 -05:00 |
|
Al
|
d5c14ca068
|
[dictionaries] Portuguese dictionaries
|
2015-03-03 18:46:44 -05:00 |
|