Commit Graph

97 Commits

Author SHA1 Message Date
Al
465bcd46aa [fix] input file in OSM training data generator 2015-07-13 14:18:24 -04:00
Al
961606ac12 [fix] removing intermediate file in OSM fetch 2015-07-13 14:17:57 -04:00
Al
59bf23ae67 [osm] Planet admin bounds filter 2015-07-13 04:08:55 -04:00
Al
7c988fa717 [fix] imports 2015-07-13 01:50:42 -04:00
Al
e603bad9f3 [fix] adding admin_level to the allowed properties list for language polygons 2015-07-13 01:49:54 -04:00
Al
fcff210d77 [rtree] Language polygon index returns polygons from most specific admin level to least specific 2015-07-13 00:58:47 -04:00
Al
ec1e820268 [parsing] Changing to OpenCageData repo 2015-07-09 13:44:14 -04:00
Al
e64b6c3398 [geonames] NULL language and official language canonical should have the same sort value 2015-07-08 17:03:51 -04:00
Al
4a2be72350 [geonames] Adding language priorities for sorting (official language names, canonical names, abbreviations, historical) 2015-07-08 16:42:42 -04:00
Al
95a6845a85 [i18n] Adding regional languages as valid country languages 2015-07-08 14:54:00 -04:00
Al
ef1ecb97f7 [geonames] Adding geonames_id for countries in places/postal codes. For postal codes, sorting desc by country population (10013 is a postal code in Italy but will default to US with no other information) 2015-07-08 13:30:57 -04:00
Al
6cc677ac0b [geonames] Adding defaults to schema and another index on country code 2015-07-08 13:16:01 -04:00
Al
0c5e741bb6 [geonames] Adding LC_ALL environment variable for utf8 sorting 2015-07-06 00:39:23 -04:00
Al
acd5d07d17 [geonames] Storing NFD normalized names and sorting case-insensitive in order to group everything with the same normalized name together 2015-07-05 15:56:46 -04:00
Al
f825dcb939 [geonames] Fixing admin table DDL 2015-07-03 05:54:41 -04:00
Al
86b23ecca3 [fix] field name 2015-07-02 15:59:11 -04:00
Al
071d6bb392 [geodisambig] Adding presence of a Wikipedia link to the GeoNames output (an unqualified entry for the name in Wikipedia usually indicates a primary meaning). Ranking ambiguous entries for each term so that the top entry should be selected if no further information is available 2015-06-30 18:00:07 -04:00
Al
a580ed0b1b [transliteration] Adding numeric HTML escapes e.g. '&' 2015-06-29 15:02:34 -04:00
Al
8fb6a28e9c [fix] using empty string instead of NULL for script languages so we can use fixed length arrays 2015-06-23 15:20:09 -05:00
Al
b21c3a3a2f [transliteration] using different struct in script data header file 2015-06-22 22:06:16 -05:00
Al
c2b4744f55 [transliteration] Using a data file instead of a header for transliteration scripts 2015-06-21 05:37:56 -05:00
Al
b2e201f297 [fix] trailing comma 2015-06-20 15:14:41 -05:00
Al
d4087be40c [geonames] Pre-escaping tabs, no quoting in geonames/postal code TSVs 2015-06-20 11:54:47 -05:00
Al
ab1fb3669f [geonames] Only take alternative names that are != to the canonical name, sort by name, population desc, geonames_id 2015-06-19 15:47:50 -05:00
Al
84b9a6ff33 [transliteration] Adding Hangul-Latin and Jamo-Latin back into the mix with a restricted filter. Reversing all previous contexts by character group 2015-06-17 23:42:31 -04:00
Al
f04fad0e93 [i18n] Generating Hangul syllable classes 2015-06-16 12:50:48 -04:00
Al
cb2035867b [fix] osm geodata imports 2015-06-15 18:36:01 -04:00
Al
d2d25ead6f [utils] Adding unicode_csv module 2015-06-15 18:06:54 -04:00
Al
ccb64f7ac2 [polygons] Adding address_normalizer polygons package 2015-06-15 17:55:27 -04:00
Al
22fa81b33f [fix] __init__.py 2015-06-15 17:54:27 -04:00
Al
41dbd97bf2 [geodisambig] quattroshapes download can use default or specified location, unzips files 2015-06-15 17:54:08 -04:00
Al
037d4575ae [geodisambig] Modifying GeoNames TSV again. Using files again and sorting 2015-06-15 17:51:09 -04:00
Al
67bd9f1a31 [i18n] Adding languages.py 2015-06-15 17:48:47 -04:00
Al
073fe43698 [geodisambig] Adding quattroshapes download script 2015-06-15 17:46:11 -04:00
Al
73f37fe66b [fix] Moving default Geonames DB path to a shared module 2015-06-15 12:53:00 -04:00
Al
7a4fa7d443 [geodisambig] Canonical country names from CLDR, adding alpha-2 and alpha-3 surface forms, writing results to stdout or a file for streaming 2015-06-15 01:58:43 -04:00
Al
43e023077c [fix] Changing logging to stderr for the Geonames scripts 2015-06-14 15:38:57 -04:00
Al
fc735bb5c3 [numex] Adding a whole words only option on numex languages e.g. for Latin so we don't match an initial D with 500 2015-06-12 16:09:45 -04:00
Al
2d098fdab6 [numex] Adding ordinal_indicator rule type for CJK ordinals 2015-06-04 11:24:13 -04:00
Al
4c49f63caf [numex] Adding categories to numex for plurals, etc. Ordinal indicators support multiple variants (primer in Spanish can be written as 1er or 1r for instance) and longer suffixes e.g. for tracking 1=>1st but 11=>11th 2015-06-04 03:09:39 -04:00
Al
b2fe9d4db0 [transliteration] Adding uppercase umlauts and Scandinavian a-ring 2015-06-03 22:55:45 -04:00
Al
2ea21dfffb [fix] constants 2015-06-02 13:44:25 -04:00
Al
208366af98 [fix] removing stopwords index 2015-06-02 12:43:48 -04:00
Al
9d0d83bc14 [numex] adding stopword rules with the regular numex rules 2015-06-02 12:37:22 -04:00
Al
4ad978f22c [numex] Using the new representation for generated data 2015-06-02 12:28:07 -04:00
Al
2dc870b3da [numex] Python script to generate numex data 2015-06-02 10:15:02 -04:00
Al
6b3d434c31 [fix] removing unnecessary definition 2015-06-01 17:13:57 -04:00
Al
9c935c9cc7 [fix] Base data dir path 2015-06-01 17:13:06 -04:00
Al
6ac4ff6021 [transliteration] Adding reverse/bidirectional transforms e.g. for Katakana-Latin 2015-05-31 02:07:36 -04:00
Al
9547c93a38 [fix] InterIndic-Latin is an internal transliterator, but needed for most of the Indic languages. Also fixing the string lengths for HTML entity replacements 2015-05-29 19:47:49 -04:00