libpostal

Author	SHA1	Message	Date
Al	1b33744956	[tokenization] Numeric tokens must end in number or letter	2015-04-22 14:55:18 -04:00
Al	606a669c01	[tokenization] breaking dashes or double hyphens break a word while other dashes don't	2015-04-17 19:14:42 -04:00
Al	6718182443	[tokenization] non-breaking dashes can be mid-word, em-dashes, etc. break words	2015-04-17 15:21:22 -04:00
Al	79fd7a8ded	[tokenization/trie] simpler url regex reduces the scanner file size, accounting for a few more variations in word tokens, making trie suffix search use iteration instead of malloc'ing a new string	2015-04-05 16:33:14 -04:00
Al	2d1c24a6e9	[tokenization] Adding url, email, US/international phone numbers, a separate type for ideographic numbers, more general quotes, paren types	2015-03-24 16:43:53 -04:00
Al	f794ef7222	[tokenization] Exposing some of the scanner's methods in header for use in the Python scanner so it can avoid the additional allocation	2015-03-17 18:38:30 -04:00
Al	a446290829	[fix] IDEOGRAM class name	2015-03-11 17:33:53 -04:00
Al	94805fb1a7	[tokenization] Better scanner support for ideographic languages (Chinese, Japanese, Korean, etc.) with an IDEOGRAM token class in the scanner so we know when we're dealing with those languages vs. other random characters	2015-03-11 17:29:37 -04:00
Al	0689f936c9	[tokenization] scanner/tokenizer (generated with re2c)	2015-03-03 12:35:22 -05:00