Al | 67130383ce | [fix] converting semicolons to commas in OSM house numbers and picking one at random | 2016-01-23 23:16:19 -05:00
Al | 1bb797f783 | [fix] spacing in phrases | 2016-01-23 21:59:49 -05:00
Al | 3a8c3dfcf6 | [fix] spacing in phrases at end of string | 2016-01-23 21:51:40 -05:00
Al | 78450bfad9 | [fix] Spaces in abbreviation | 2016-01-23 21:36:20 -05:00
Al | 308ceb5a5f | [fix] convert UTF8 slices back to unicode before using with the Python trie | 2016-01-23 20:20:23 -05:00
Al | 5eb6bb309b | [fix] Only adding whitespace back into tokenized strings during abbreviation if it existed in the original string | 2016-01-23 20:09:45 -05:00
Al | d61207e95a | [fix] var name | 2016-01-23 18:01:02 -05:00
Al | e44cba1d06 | [fix] geonames db not required in OSM training data | 2016-01-23 17:59:55 -05:00
Al | 4f03711e60 | [osm] Adding abbreviated training examples to ways language training data | 2016-01-23 14:10:47 -05:00
Al | c9fb4ee69d | [osm/formatting] Dropping state more often than not, except in the US and Canada where those fields are more commonly used | 2016-01-22 17:58:24 -05:00
Al | ea9bb3f2d5 | [fix] Abbreviation probabilities should only apply once, not once per dictionary. Also fixing issues where some of the abbreviations were doubled | 2016-01-22 15:48:21 -05:00
Al | f9f6558e06 | [fix] simple whitespace field splits for the limited format training data (used for language classification) | 2016-01-22 04:34:42 -05:00
Al | cd1db7b288 | [fix] Making sure rare components are dropped first, adding state and country back in | 2016-01-22 04:17:19 -05:00
Al | adc3a00264 | [fix] var name | 2016-01-22 04:10:16 -05:00
Al | 261beffa36 | [fix] Actually better to remove country and state from rare components and let them use the standard dropout probabilities | 2016-01-22 04:00:45 -05:00
Al | a6cc3d0114 | [fix] Adding state to the more frequently dropped components | 2016-01-22 03:56:38 -05:00
Al | bca3dae004 | [fix] state full name probabilities for limited vs. full formatted OSM training sets | 2016-01-22 03:54:20 -05:00
Al | d1cf253092 | [osm/formatting] Higher probability of dropout for rare components like counties, etc. | 2016-01-22 03:39:35 -05:00
Al | 9dd965a6fa | [fix] removing gazetteer configuration from disambiguation module | 2016-01-22 03:18:18 -05:00
Al | b22646ee30 | [mv] Moving gazetteers into their own module | 2016-01-22 03:15:56 -05:00
Al | 5a68e7aeef | [fix] import | 2016-01-22 03:00:43 -05:00
Al | 6ac72576bc | [osm/formatting] Randomly abbreviating street names and venue names using all the available libpostal dictionaries. Refactoring OSM formatting into separate methods which can be individually tested. Adding override for special phrases like UK | 2016-01-22 02:56:39 -05:00
Al | f4995d4f0f | [languages] Adding several different types of dictionaries for name expansion/abbreviation in OSM | 2016-01-22 00:51:32 -05:00
Al | 89aa039692 | [dictionaries] Adding some Italian month abbreviations | 2016-01-21 15:12:46 -05:00
Al | 26cbb1eb8d | [languages] Fixing multiple expansions in the same dictionary for Python trie, adding length for prefixes/suffixes | 2016-01-21 04:29:14 -05:00
Al | 0269d92e3d | [languages] Adding canonical string and dictionary type to Python trie, modifying disambiguate_languages accordingly, and adding lists of alternate forms | 2016-01-21 02:30:59 -05:00
Al | 2e15db06dd | [text] making normalize_string directly callable from Python geodata | 2016-01-21 02:07:46 -05:00
Al | 71e01e6133 | [fix] prefix/suffix phrase search in Python trie search | 2016-01-19 03:43:54 -05:00
Al | 39667b73a2 | [build] std=gnu99 in geodata build | 2016-01-19 03:23:56 -05:00
Al | 8b94a018e6 | [languages] encoding in language disambiguation | 2016-01-19 03:22:03 -05:00
Al | 3262d2ccd3 | [fix] arg count | 2016-01-19 03:16:14 -05:00
Al | 5d5d5713cc | [transliteration] Regenerating transliterator scripts | 2016-01-18 12:04:14 -05:00
Al | fe8f3158f6 | [fix] missing file in geodata | 2016-01-17 22:23:44 -05:00
Al | 5fd9dc7e2b | [scripts] relative dirs in setup.py for geodata | 2016-01-17 22:22:50 -05:00
Al | da62ff309e | [transliteration] Fixing Malayalam script | 2016-01-17 22:15:56 -05:00
Al | 5385cb71d6 | [languages] Adding English dictionaries to Indonesia | 2016-01-17 22:08:06 -05:00
Al | 8030b235e6 | [languages] Changing the definition in script languages so only languages that appear on street signs will be used | 2016-01-17 22:03:41 -05:00
Al | 0dfd8d6439 | [language_classification] Adding script feature for any non-Latin script. Even if the script doesn't directly identify the language, it can act as a modified intercept (all Han script addresses will share the Han feature, even if we haven't seen one of the > 80k Han characters) | 2016-01-17 21:37:45 -05:00
Al | b9a3230f65 | [language_classification] Removing the per-country classifier, text-based alone is doing close to 99% accuracy now | 2016-01-17 21:13:14 -05:00
Al | f808f74271 | [language_classification] Automatic hyperparameter optimization using either the cross-validation set or two distinct subsets of the training set | 2016-01-17 21:11:37 -05:00
Al | af5689ee52 | [fix] removing unused var | 2016-01-17 21:00:17 -05:00
Al | 7d727fc8f0 | [optimization] Using adapted learning rate in stochastic gradient descent (if lambda > 0) | 2016-01-17 20:59:47 -05:00
Al | 7b300639f1 | [fix] Trie prefix search tail comparison | 2016-01-17 20:56:37 -05:00
Al | 70dbfdd560 | [unicode] Regenerating unicode_script_data.c | 2016-01-17 20:53:44 -05:00
Al | de240d2b94 | [fix] tokenize_add_tokens respects specified length | 2016-01-17 20:51:47 -05:00
Al | 10cadc67d7 | [io] matrix_read using array I/O functions | 2016-01-17 20:40:18 -05:00
Al | baba826d21 | [io] Cutting down on system calls in trie_read | 2016-01-17 20:39:19 -05:00
Al | cba2acc21f | [io] Sparse matrix using array I/O methods | 2016-01-17 20:38:16 -05:00
Al | 46b35c5202 | [utils] Adding functions to read numeric arrays from files | 2016-01-17 20:36:57 -05:00
Al | 3d7dd8966e | [languages] Using unicode script in language disambiguation in addition to dictionaries. Eliminating dependency on address_normalizer | 2016-01-17 18:28:28 -05:00