Al
|
cc4d2d08eb
|
[cldr] Adding script to download latest cldr release instead of pulling from the repo
|
2015-04-13 01:03:15 -04:00 |
|
Al
|
e241c1dfc8
|
[rm] Removing dependency on sds, char_array and cstring_array have similar benefits/functionality with fewer drawbacks
|
2015-04-12 18:07:33 -04:00 |
|
Al
|
83813bb980
|
[geodisambig] Models for geonames with msgpack serialization/deserialization
|
2015-04-12 16:47:01 -04:00 |
|
Al
|
acb575c84c
|
[fix] splitting out methods for unicode scripts
|
2015-04-12 15:21:23 -04:00 |
|
Al
|
1f9da05dd5
|
[geodisambig] C msgpack serialization dependency
|
2015-04-12 15:14:01 -04:00 |
|
Al
|
0234754c20
|
[fix] warnings in string_utils
|
2015-04-12 12:16:32 -04:00 |
|
Al
|
d50d7d182e
|
[fix] geonames import script for admin 1 codes
|
2015-04-12 12:16:08 -04:00 |
|
Al
|
888baa86f3
|
[fix] English dictionaries
|
2015-04-12 12:15:47 -04:00 |
|
Al
|
3a7f18581e
|
[utils] Adding min, max, argmin, argmax and log_sum_exp to generic vector math header
|
2015-04-12 12:11:04 -04:00 |
|
Al
|
fdd0c489f3
|
[fix] refactoring unicode script fetching into more reusable functions
|
2015-04-09 02:18:13 -04:00 |
|
Al
|
4729dfe178
|
[utils] string_[rl]strip => string_[rl]trim, removing warning about allocation
|
2015-04-06 02:19:19 -04:00 |
|
Al
|
53844067b1
|
[fix] better allocation sizes for tokenized strings
|
2015-04-05 22:02:31 -04:00 |
|
Al
|
198e51b8a3
|
[utils] more/better char_array methods
|
2015-04-05 22:01:46 -04:00 |
|
Al
|
79fd7a8ded
|
[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string
|
2015-04-05 16:33:14 -04:00 |
|
Al
|
5f3d74de18
|
[fix] contiguous string array
|
2015-04-03 11:22:50 -04:00 |
|
Al
|
fcaeebd656
|
[dictionaries] fixes to French dictionary
|
2015-04-01 19:02:38 -04:00 |
|
Al
|
c81aa72254
|
[utils] a few changes to contiguous string arrays
|
2015-04-01 19:02:11 -04:00 |
|
Al
|
fa59b63ab2
|
[fix] type name/import
|
2015-04-01 02:54:14 -04:00 |
|
Al
|
310acbed2c
|
[phrases] Adding prefix-only trie searches, primarily with Germanic languages in mind (spelled out numbers, concatenated prefixes). Making the prefix/suffix APIs for single tokens more consistent with trie searches over longer strings/token arrays
|
2015-04-01 02:52:57 -04:00 |
|
Al
|
1ac4438e39
|
[utils] More consistent naming in string_utils
|
2015-03-27 21:12:08 -04:00 |
|
Al
|
70831b5005
|
[dictionaries] French elisions
|
2015-03-27 21:03:55 -04:00 |
|
Al
|
127a61d492
|
[utils] adding pop method on the improved vectors
|
2015-03-27 21:00:03 -04:00 |
|
Al
|
3678d4a3ca
|
[gazetteers] string name doesn't need to be part of the gazetteer itself, adding two new dictionary types to the config for named people (e.g. JFK) and named organizations (e.g. UN)
|
2015-03-27 20:59:21 -04:00 |
|
Al
|
4ccd1b1fe2
|
[fix] update feature arrays to use the new APIs
|
2015-03-27 20:57:42 -04:00 |
|
Al
|
6768936953
|
[utils] Adding vector_math.h with some inline methods for vector operations (sum, dot product, arithmetic, etc.). Works with kvec dynamic vectors.
|
2015-03-27 20:57:03 -04:00 |
|
Al
|
70195fffd5
|
[utils] new methods on string_utils for better dynamic strings which retain the benefits of sds without having to worry about the pointer changing, renaming contiguous string array methods to something more succinct
|
2015-03-27 20:55:36 -04:00 |
|
Al
|
2d1c24a6e9
|
[tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types
|
2015-03-24 16:43:53 -04:00 |
|
Al
|
50187f28ce
|
[fix] .txt extension
|
2015-03-23 02:17:07 -04:00 |
|
Al
|
7ffe788913
|
[unicode] header
|
2015-03-18 17:25:53 -04:00 |
|
Al
|
d5a9041cd3
|
[unicode] Adding generated unicode script data
|
2015-03-18 17:01:03 -04:00 |
|
Al
|
e03c1f21a7
|
[unicode] generate C headers/data files from unicode.org scripts
|
2015-03-18 16:59:58 -04:00 |
|
Al
|
6c8e5b45a4
|
[fix] removing building alias (for OSM it means building category), fix to fetch script
|
2015-03-18 08:40:07 -04:00 |
|
Al
|
88554c1ef7
|
[i18n] adding CLDR languages script to this repo
|
2015-03-18 08:01:36 -04:00 |
|
Al
|
d2ceb5f418
|
[fix] removing struct definition from scanner.re for future generation of scanner.c
|
2015-03-17 19:46:40 -04:00 |
|
Al
|
2cf909c01e
|
[utils] script utils
|
2015-03-17 18:39:08 -04:00 |
|
Al
|
f794ef7222
|
[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation
|
2015-03-17 18:38:30 -04:00 |
|
Al
|
daf3f8706b
|
[utils] adding tab and comma constants to file_utils for parsing CSV/TSV files
|
2015-03-17 18:35:45 -04:00 |
|
Al
|
aeac0fe8c0
|
[geodata] Script to construct OSM training examples for building language dictionaries, disambiguating between abbreviations, classifying venues by type and formatting addresses for use in a sequence model with Lokku's address-formatting repo.
|
2015-03-17 18:11:07 -04:00 |
|
Al
|
0437271c92
|
[geodata] OSM planet fetch needs to convert ways/relations to nodes for all data sets
|
2015-03-17 16:51:17 -04:00 |
|
Al
|
f787851754
|
[unicode] Upgrading to JuliaLang's utf8proc (Unicode 7, maintained)
|
2015-03-17 12:20:08 -04:00 |
|
Al
|
621b25c964
|
[geodata] script to fetch/transform OSM planet (needs about 100GB of free disk space) for training language models
|
2015-03-16 00:45:14 -04:00 |
|
Al
|
26c2823208
|
[fix] comma
|
2015-03-14 18:58:18 -04:00 |
|
Al
|
0df849b440
|
[features] Feature array, a special case of contiguous string array for adding namespaced features in CRF-like sequence models
|
2015-03-14 18:37:41 -04:00 |
|
Al
|
3e20b4f600
|
[fix] Capturing GeoNames canonical and alternate names with a UNION ALL query, creating C headers with the field orderings for parsing the TSV file downstream
|
2015-03-14 18:02:14 -04:00 |
|
Al
|
284af74ba4
|
[geodisambig] Python scripts to prep GeoNames records for trie insertion
|
2015-03-13 11:56:48 -04:00 |
|
Al
|
53aa9bccb1
|
[geodisambig] adding MurmurHash3, used by the Bloom filter
|
2015-03-11 17:47:57 -04:00 |
|
Al
|
cf613ee475
|
[geodisambig] Bloom filter implementation for quick probabilistic set membership tests before hitting disk. 100% recall and bounded precision, saves disk seeks for keys that definitely do not exist (useful for GeoNames disambiguation-related lookups and in-process deduping).
|
2015-03-11 17:47:15 -04:00 |
|
Al
|
eb391bf4d5
|
[dictionaries] Making address_components bit set a 16 bit int so we can bit pack trie values
|
2015-03-11 17:36:38 -04:00 |
|
Al
|
a446290829
|
[fix] IDEOGRAM class name
|
2015-03-11 17:33:53 -04:00 |
|
Al
|
a5f7c73374
|
[utils] is_relative_path
|
2015-03-11 17:31:08 -04:00 |
|