Commit Graph

4790 Commits

Author SHA1 Message Date
Al
39f59e7ecf [openaddresses] adding Mayenne, FR 2017-03-07 15:41:33 -05:00
Al
c2b516c761 [openaddresses] adding Hernando County, FL 2017-03-07 15:11:53 -05:00
Al
749bb4907e [openaddresses] adding city of Carlsbad, NM 2017-03-07 10:55:09 -05:00
Al
154fd42299 [openaddresses] adding city of Amarillo, TX 2017-03-07 10:53:52 -05:00
Al
95015990ab [parser] learning a sparser averaged perceptron model for the parser using the following method:
- store a vector of update counts for each feature in the model
- when the model updates after making a mistake, increment the update
  counters for the observed features in that example
- after the model is finished training, keep only the features that
  participated in a minimum number of updates

This method is described in greater detail in this paper from Yoav
Goldberg: https://www.cs.bgu.ac.il/~yoavg/publications/acl2011sparse.pdf

The authors there report a 4x size reduction at only a trivial cost in
terms of accuracy. So far the trials on libpostal indicate roughly the
same, though at lower training set sizes the accuracy cost is greater.

This method is more effective than simple feature pruning as feature
pruning methods are usually based on the frequency of the feature
in the training set, and infrequent features can still be important.
However, the perceptron's early iterations make many updates on
irrelevant features simply because the weights for the more relevant
features aren't tuned yet. The number of updates a feature participates
in can be seen as a measure of its relevance to classifying examples.

This commit introduces a --min-features option to address_parser_train
(default=5); the pruning can effectively be turned off by using
"--min-features 0" or "--min-features 1".
2017-03-06 22:28:33 -05:00
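The update-count pruning described in the commit above can be sketched roughly as follows. This is an illustrative Python toy, not libpostal's C implementation: the function name `train_sparse_perceptron`, the `min_updates` parameter, and the feature strings are all hypothetical, and weight averaging is left out for brevity, so this is a plain perceptron with the same pruning step.

```python
from collections import defaultdict


def train_sparse_perceptron(examples, classes, iterations=5, min_updates=5):
    """Toy multiclass perceptron with update-count pruning.

    examples: list of (features, label) pairs, features being a list of strings.
    Hypothetical sketch: no weight averaging, unlike the model in the commit.
    """
    weights = defaultdict(lambda: defaultdict(float))  # feature -> class -> weight
    update_counts = defaultdict(int)                   # feature -> number of updates

    def predict(features):
        scores = {c: 0.0 for c in classes}
        for f in features:
            # .get avoids creating empty entries for unseen features
            for c, w in weights.get(f, {}).items():
                scores[c] += w
        return max(classes, key=lambda c: scores[c])

    for _ in range(iterations):
        for features, truth in examples:
            guess = predict(features)
            if guess == truth:
                continue
            # The model made a mistake: update weights and increment the
            # update counter for every feature observed in this example.
            for f in features:
                weights[f][truth] += 1.0
                weights[f][guess] -= 1.0
                update_counts[f] += 1

    # After training, keep only features that took part in enough updates.
    pruned = {
        f: dict(w) for f, w in weights.items() if update_counts[f] >= min_updates
    }
    return pruned, dict(update_counts)
```

With `min_updates` set to 0 or 1, every feature that was ever updated survives, which mirrors how the commit describes turning the pruning off.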
Al
5c1c1ae0f2 [parser] moving tagger function pointer definition to a separate header so it can be used for other models 2017-03-06 21:42:06 -05:00
Al
cc58ec9db2 [parser] fix another valgrind error in parser training (adding to a cstring_array can trigger a realloc that moves its memory, invalidating string pointers obtained before the add); no longer using the dummy START tags, as the feature function can choose to add features for those cases 2017-03-06 21:39:14 -05:00
Al
754f22c79a [parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in feature incorporating previous tags 2017-03-06 20:32:50 -05:00
Al
839a13577d [parser] fixing affix-related valgrind errors in address parser features 2017-03-06 20:28:42 -05:00
Al
c3581557a1 [parser] counting classes instead of keeping a set 2017-03-06 20:05:01 -05:00
Al
a5283cb313 [fix] trie_new_from_hash 2017-03-06 15:57:42 -05:00
Al
23ed916f09 [openaddresses] adding Hattiesburg, MS 2017-03-06 15:45:23 -05:00
Al
90cb4d904d [openaddresses] adding Longueuil, QC, Canada 2017-03-06 15:43:51 -05:00
Al
5113a1bc32 [utils] tracking keys added in trie construction from hash 2017-03-06 15:28:26 -05:00
Al
dd4f3eb84c [parser] simpler feature names for the state-transition features 2017-03-06 15:25:10 -05:00
Al
39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori 2017-03-06 15:17:52 -05:00
Al
5f19e63cbe [parser] more logging in init 2017-03-06 15:11:39 -05:00
Al
4d2f77b3f3 [openaddresses] add city of Alexandria, LA 2017-03-06 14:30:25 -05:00
Al
bb922e4ce4 [parser] adding log message 2017-03-06 12:25:22 -05:00
Al
b97de96ab4 [parser] fixing chunked shuffle, making awk splitting work on Mac 2017-03-05 15:06:02 -05:00
Al
0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available 2017-03-05 14:50:47 -05:00
Al
d99f83b84a [openaddresses] add unit phrases in Cape Girardeau, MO 2017-03-05 04:00:41 -05:00
Al
d1bcced706 [openaddresses] adding some of the new Mississippi sources and city of Cape Girardeau, MO 2017-03-05 03:59:07 -05:00
Al
5d73aa1295 [fix] don't write formatted addresses in the ways-only data set unless the formatter returns non-None value 2017-03-05 03:50:00 -05:00
Al
b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough. 2017-03-05 02:15:11 -05:00
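The chunked shuffle described in the commit above (scatter lines across n files, shuffle each file, concatenate) might look something like the following. This is a hedged Python sketch rather than the C function the commit adds: `chunked_shuffle`, `num_chunks`, and `seed` are hypothetical names, and an in-memory shuffle of each chunk stands in for running `shuf` on it.

```python
import os
import random
import tempfile


def chunked_shuffle(input_path, output_path, num_chunks=4, seed=None):
    """Shuffle a large line-oriented file while holding only one chunk in memory."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunk_paths = [
            os.path.join(tmp_dir, f"chunk_{i}.txt") for i in range(num_chunks)
        ]
        chunks = [open(p, "w", encoding="utf-8") for p in chunk_paths]
        try:
            # Pass 1: write each input line to one randomly chosen chunk file.
            with open(input_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # keep a final unterminated line separate
                    rng.choice(chunks).write(line)
        finally:
            for f in chunks:
                f.close()

        # Pass 2: shuffle each chunk on its own and append it to the output.
        with open(output_path, "w", encoding="utf-8") as out:
            for path in chunk_paths:
                with open(path, encoding="utf-8") as chunk:
                    lines = chunk.readlines()
                rng.shuffle(lines)
                out.writelines(lines)
```

Peak memory is bounded by the largest chunk rather than the whole file, which is the point of the limited-memory variant; choosing `num_chunks` from a target chunk size (such as the 2GB limit mentioned in the commit) is left to the caller in this sketch.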
Al
ba4052c9ba [openaddresses] add Muskogee, OK 2017-03-03 14:57:36 -05:00
Al
2704708f47 [openaddresses] add Orange County, NY 2017-03-03 14:27:05 -05:00
Al
da62fb62ba [openaddresses] adding Polk County, NC 2017-03-03 13:45:58 -05:00
Al
ce21635b00 [openaddresses] adding city of Salina, KS 2017-03-03 13:45:25 -05:00
Al
b4437848c4 [fix] override_country_dir 2017-03-02 14:31:53 -05:00
Al
69351cad98 [openaddresses] add Tippecanoe County, IN 2017-03-02 13:36:22 -05:00
Al
6b8b6982aa [addresses] more classmethods 2017-03-02 04:23:09 -05:00
Al
f7c8a63093 [addresses] making most of the methods on AddressComponents classmethods if possible so they can be accessed easily for sources not using OSM polygon lookup, etc. 2017-03-01 15:51:56 -05:00
Al
702901608b [openaddresses_uk] adding OpenAddresses UK as a data set. No lat/lons but it does have addresses, cities and postcodes 2017-03-01 15:44:25 -05:00
Al
375f7b1684 [addresses] making postcode before {suburb,city} more likely in the UK for #39 2017-03-01 15:43:26 -05:00
Al
a5d8700df3 [openaddresses] use override_country_dir config option in OA address formatter 2017-03-01 13:52:07 -05:00
Al
0890c712e2 [openaddresses] adding override_country_dir and country codes for Puerto Rico and French dependencies 2017-03-01 13:48:04 -05:00
Al
c80b771f94 [openaddresses] add override_country_dir in Puerto Rico 2017-03-01 13:45:44 -05:00
Al
dbc5d6b866 [openaddresses] remove OSM boundaries from East Peoria 2017-03-01 13:45:12 -05:00
Al
0d4c08d536 [openaddresses] ignore unit containing Fl in DeKalb county 2017-03-01 02:54:37 -05:00
Al
45e71a21bb [openaddresses] adding Kalamaria, Thessaloniki, Greece 2017-03-01 01:11:46 -05:00
Al
26f5c403d3 [openaddresses] add Henry County, GA 2017-02-28 23:01:28 -05:00
Al
f6e9cbf8a0 [openaddresses] adding Gwinnett County, GA 2017-02-28 22:52:11 -05:00
Al
b9424c6c69 [openaddresses] adding Cobb County, GA 2017-02-28 22:49:23 -05:00
Al
357af3d465 [openaddresses] adding unit to Fayette County, GA and adding a field map for the cities + no OSM boundaries 2017-02-28 22:43:19 -05:00
Al
c71fe9afbf [openaddresses] adding DeKalb County, GA 2017-02-28 18:53:51 -05:00
Al
412dd65d87 [openaddresses] adding Fayette County, GA 2017-02-28 18:45:42 -05:00
Al
e3cff74908 [openaddresses] add Tillamook County, OR 2017-02-28 11:57:43 -05:00
Al
a7813dda16 [openaddresses] adding Clayton County, GA 2017-02-27 12:53:49 -05:00
Al
f507f2bb3e [addresses] fix for Colombian house number formatting if the second regex group is not found 2017-02-25 23:24:06 -05:00