726 Commits

Author SHA1 Message Date
Al
040a26a6f2 [fix] import 2015-09-03 13:54:23 -04:00
Al
7787427c58 [fix] typo 2015-09-03 13:53:18 -04:00
Al
23633e95dd [osm] Only adding country default language toponyms to training data 2015-09-03 13:44:41 -04:00
Al
11c01f64d2 [osm] OrderedDict of attrs in OSM training data 2015-09-03 11:11:18 -04:00
Al
27eb4e4aed [osm] Adding a toponym language training set using planet-borders.osm (all admin borders) 2015-09-03 10:19:11 -04:00
Al
db57855c95 [osm] Switching formatter repo to the OpenVenues fork, with fixes and several dozen new countries added 2015-09-03 10:06:54 -04:00
Al
a916668f28 [i18n] Local file for ISO 15924 2015-09-01 23:58:36 -04:00
Al
ee4d73c65d [math] sparse matrix I/O methods 2015-09-01 00:29:11 -04:00
Al
a8f6617294 [phrases] Adding num_keys attribute to trie 2015-08-31 21:41:34 -04:00
Al
aac5b37e76 [fix] Removing default dirent include 2015-08-31 21:38:29 -04:00
Al
bb50c7ea2c [math] Adding sigmoid and softmax functions 2015-08-31 21:04:21 -04:00
Al
a090a22bca [math] Adding compressed sparse row (CSR) format sparse matrix, designed for dynamic construction, just the methods needed for logistic regression for now i.e. no sparse dot products 2015-08-31 16:42:41 -04:00
Al
0f617454d3 [math] Dense matrices 2015-08-31 14:57:11 -04:00
Al
0ee72b8dfb [math] can only use memset for *_array_new_zeros 2015-08-31 14:44:43 -04:00
Al
c566eaecf1 [dictionaries] Rebuilding address expansion data and uploading new files to S3 2015-08-31 14:33:28 -04:00
Al
789150ae33 [math] Using regular C arrays instead of vectors for vector_math.h 2015-08-30 02:41:31 -04:00
Al
07b0bed602 [math] Only float vectors have *_array_log, *_array_exp, etc. 2015-08-26 17:58:07 -04:00
Al
a2ec8001b0 [osm] Removing postal code keys in formatted language training data 2015-08-24 14:08:36 -04:00
Al
8bbcb60aee [languages] Moving search_suffix and search_prefix into methods 2015-08-24 14:04:36 -04:00
Al
c68f56e61d [fix] paths 2015-08-24 12:58:27 -04:00
Al
d620cb6fc3 [fix] Calculating splits in Python rather than bash 2015-08-24 12:47:51 -04:00
Al
c754d275af [fix] str 2015-08-24 12:24:55 -04:00
Al
96cb289b79 [languages] Script to create language training/cross-validation/test data splits 2015-08-24 12:18:23 -04:00
Al
fa7b855ecb [languages] Earlier exit on finding ambiguous script spans 2015-08-24 03:07:57 -04:00
Al
90f333b16c [languages] Adding English non-default dictionaries to a number of countries where English can be found in OSM 2015-08-24 02:49:49 -04:00
Al
e1d336716c [languages] Non-default language canonicals, more test cases 2015-08-24 02:21:53 -04:00
Al
c1ce91abbf [languages] Better handling of non-default language canonicals in default language text 2015-08-24 01:26:17 -04:00
Al
96d7b990b5 [fix] .items() 2015-08-23 23:39:30 -04:00
Al
9f6f4feea1 [dictionaries/languages] Adding English gazetteers for Bahrain, pas abbreviation for paseo 2015-08-23 23:32:34 -04:00
Al
84e0982cbc [languages] Allow stopwords to help disambiguate if they can, otherwise ignore them 2015-08-23 23:04:17 -04:00
Al
d14be57e73 [dictionaries] Adding exit as an English street type 2015-08-23 22:51:22 -04:00
Al
7053c6b60b [fix] language disambiguation 2015-08-23 22:50:27 -04:00
Al
e26776a5e9 [dictionaries] Occitan stopwords for disambiguating from French 2015-08-23 16:35:46 -04:00
Al
f6d84531bc [languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling 2015-08-23 16:34:26 -04:00
Al
b8e4c19146 [mv] Moving the get regional/country languages logic out of language polygons 2015-08-23 14:25:33 -04:00
Al
43178747f8 [languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation 2015-08-23 04:28:44 -04:00
Al
d8763e9d6c [languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity 2015-08-23 03:42:24 -04:00
Al
9c176961ff [dictionaries] Norwegian street types from the suffix dictionary 2015-08-23 02:32:44 -04:00
Al
122a81b610 [languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib 2015-08-23 02:26:06 -04:00
Al
a419dad630 [languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries 2015-08-23 00:43:37 -04:00
Al
a7d9cc1782 [fix] No longer using abbreviations for default languages, can be stopwords, etc. 2015-08-22 23:34:15 -04:00
Al
0701bb6f08 [fix] import 2015-08-22 23:19:43 -04:00
Al
723058886a [languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals 2015-08-22 23:18:09 -04:00
Al
6231e17f2b [languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages 2015-08-22 20:26:39 -04:00
Al
bf829f7cb6 [polygons] Adding a main to generate language polygons 2015-08-22 17:45:04 -04:00
Al
5c15c4a99f [languages] Adding non-default Spanish and French gazetteers to the US, and giving the country of Jersey shared English/French defaults instead of just English 2015-08-22 15:21:04 -04:00
Al
e70c2453ee [fix] import 2015-08-22 15:04:30 -04:00
Al
3902715258 [osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases 2015-08-22 14:11:49 -04:00
Al
f6e521e3f3 [geonames] Adding covering index to geonames DB 2015-08-22 13:54:25 -04:00
Al
bd31dc99f2 [mv] csv_utils 2015-08-22 13:53:44 -04:00