Author | Commit | Message | Date
Al | 448ca6a61a | [merge] merging commit from v1.1 | 2017-10-12 01:41:04 -04:00
Al | b1e178b7b2 | [fix] is_numeric_token includes IDEOGRAPHIC_NUMBER | 2017-02-12 15:11:56 -05:00
Al | 83381e9d8a | [expand] Adding exception for a few types of special punctuation (ampersand, plus, pound sign) which should be left in the original string and separated by whitespace. Closes #84. Closes #85 | 2016-07-17 15:02:47 -04:00
Al | 33e9a05ebf | [tokenization] is_whitespace | 2016-01-05 16:40:35 -05:00
Al | 77ccd975c4 | [fix] #endif | 2015-12-28 17:03:12 -05:00
Al | e4dba2297d | [mv] Moving token type checking to header | 2015-12-28 01:17:33 -05:00
Al | aa39c45b87 | [tokenization] skipping control characters in tokenization, comes up in OSM surprisingly | 2015-10-04 18:25:50 -04:00
Al | 9b69d1f67a | [fix] Removing C++ checks from all but the main API functions | 2015-08-07 17:15:39 -04:00
Al | 3279b31b09 | [tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens | 2015-06-29 03:00:46 -04:00
Al | 77760f207c | [tokenization] Adding a Hangul syllable class in tokenization for syllables written out as Jamo | 2015-06-16 12:52:04 -04:00
Al | 2d1c24a6e9 | [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types | 2015-03-24 16:43:53 -04:00
Al | a446290829 | [fix] IDEOGRAM class name | 2015-03-11 17:33:53 -04:00
Al | 3ed5795cff | [fix] fixing some formatting | 2015-03-03 12:54:27 -05:00
Al | 5216aba1b6 | [utils] string utils, file utils, contiguous arrays of strings used for storing tokenized strings, klib for generic hashtables and vectors, antirez's sds for certain types of string building, utf8proc for iterating over utf-8 strings and unicode normalization | 2015-03-03 12:33:13 -05:00