| Al | 0eb9ef5bdf | [tokenization] Regenerating scanner.c | 2015-10-05 01:41:48 -04:00 |
| Al | 0aa6950b6c | [fix] abbreviations | 2015-10-02 23:48:21 -04:00 |
| Al | 01856dd36d | [fix] acronyms | 2015-10-01 00:24:04 -04:00 |
| Al | 562aeb497d | [tokenization] Regenerating scanner.c | 2015-09-30 11:32:38 -04:00 |
| Al | 856198a352 | [tokenization] Regenerated scanner.c | 2015-09-26 02:27:45 -04:00 |
| Al | f13e9fad90 | [tokenization] Regenerated scanner.c | 2015-09-23 00:33:27 -04:00 |
| Al | 71be52275d | [tokenization] Adding a version of tokenize which keeps whitespace tokens | 2015-07-21 17:26:20 -04:00 |
| Al | a8b2fb5b90 | [tokenization] Regenerating scanner file | 2015-07-14 18:16:24 -04:00 |
| Al | 3279b31b09 | [tokenization] Adding an acronym token type for things like U.N. so we can delete internal periods on those tokens | 2015-06-29 03:00:46 -04:00 |
| Al | 2b69c185fa | [tokenization] Adding a tokenizer method for appending to an existing tokens array (e.g. can stop/start tokenizing on a script change) | 2015-06-25 10:03:34 -04:00 |
| Al | 880d444881 | [tokenization] Re-generating scanner | 2015-06-16 12:52:37 -04:00 |
| Al | 1b33744956 | [tokenization] Numeric tokens must end in number or letter | 2015-04-22 14:55:18 -04:00 |
| Al | 606a669c01 | [tokenization] breaking dashes or double hyphens break a word while other dashes don't | 2015-04-17 19:14:42 -04:00 |
| Al | 6718182443 | [tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words | 2015-04-17 15:21:22 -04:00 |
| Al | 79fd7a8ded | [tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string | 2015-04-05 16:33:14 -04:00 |
| Al | 2d1c24a6e9 | [tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types | 2015-03-24 16:43:53 -04:00 |
| Al | f794ef7222 | [tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation | 2015-03-17 18:38:30 -04:00 |
| Al | a446290829 | [fix] IDEOGRAM class name | 2015-03-11 17:33:53 -04:00 |
| Al | 94805fb1a7 | [tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters | 2015-03-11 17:29:37 -04:00 |
| Al | 0689f936c9 | [tokenization] scanner/tokenizer (generated with re2c) | 2015-03-03 12:35:22 -05:00 |