Commit Graph

  • a63c182e96 [parser] right context affixes need to use pre-normalized words as well Al 2017-03-08 13:51:36 -05:00
  • ce9153d94d [parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), needed to use the same string in the feature function as in address_parser_context_fill. Affects some German suffixes like "str." where the final "." would be deleted in token normalization, but the suffix length would include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which means computing the wrong features at best and a segfault at worst Al 2017-03-07 17:30:53 -05:00
  • b6bf8da383 [utils] adding aligned malloc/free/realloc in vector.h and matrix.h, fixing bug in matrix_copy Al 2017-03-07 16:25:34 -05:00
  • 242b1364ae [parser] using new API in address_parser_test Al 2017-03-07 16:24:34 -05:00
  • 39f59e7ecf [openaddresses] adding Mayenne, FR Al 2017-03-07 15:41:33 -05:00
  • c2b516c761 [openaddresses] adding Hernando County, FL Al 2017-03-07 15:11:53 -05:00
  • 749bb4907e [openaddresses] adding city of Carlsbad, NM Al 2017-03-07 10:55:09 -05:00
  • 154fd42299 [openaddresses] adding city of Amarillo, TX Al 2017-03-07 10:53:52 -05:00
  • 95015990ab [parser] learning a sparser averaged perceptron model for the parser using the following method: Al 2017-03-06 21:56:10 -05:00
  • 5c1c1ae0f2 [parser] moving tagger function pointer definition to a separate header so it can be used for other models Al 2017-03-06 21:42:06 -05:00
  • cc58ec9db2 [parser] fix another valgrind error in parser training (cstring_array memory can get moved around when using string pointers obtained before adding to it, which can potentially cause a realloc), no longer using the dummy START tags as the feature function can choose to add features for those cases Al 2017-03-06 21:39:14 -05:00
  • 754f22c79a [parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in feature incorporating previous tags Al 2017-03-06 20:32:50 -05:00
  • 839a13577d [parser] fixing affix-related valgrind errors in address parser features Al 2017-03-06 20:28:42 -05:00
  • c3581557a1 [parser] counting classes instead of keeping a set Al 2017-03-06 20:05:01 -05:00
  • a5283cb313 [fix] trie_new_from_hash Al 2017-03-06 15:57:42 -05:00
  • 23ed916f09 [openaddresses] adding Hattiesburg, MS Al 2017-03-06 15:45:23 -05:00
  • 90cb4d904d [openaddresses] adding Longueuil, QC, Canada Al 2017-03-06 15:43:51 -05:00
  • 5113a1bc32 [utils] tracking keys added in trie construction from hash Al 2017-03-06 15:28:26 -05:00
  • dd4f3eb84c [parser] simpler feature names for the state-transition features Al 2017-03-06 15:25:10 -05:00
  • 39fa8ff1a5 [parser] counting num classes in address parser init for models where it is needed a priori Al 2017-03-06 15:17:52 -05:00
  • 5f19e63cbe [parser] more logging in init Al 2017-03-06 15:11:39 -05:00
  • 4d2f77b3f3 [openaddresses] add city of Alexandria, LA Al 2017-03-06 14:30:25 -05:00
  • bb922e4ce4 [parser] adding log message Al 2017-03-06 12:25:15 -05:00
  • b97de96ab4 [parser] fixing chunked shuffle, making awk splitting work on Mac Al 2017-03-05 15:05:59 -05:00
  • 0e49fc580a [parser] uint64_t chunk size, no warning if gshuf is available Al 2017-03-05 14:50:47 -05:00
  • d99f83b84a [openaddresses] add unit phrases in Cape Girardeau, MO Al 2017-03-05 04:00:41 -05:00
  • d1bcced706 [openaddresses] adding some of the new Mississippi sources and city of Cape Girardeau, MO Al 2017-03-05 03:59:07 -05:00
  • 5d73aa1295 [fix] don't write formatted addresses in the ways-only data set unless the formatter returns non-None value Al 2017-03-05 03:50:00 -05:00
  • b76b7b8527 [parser] adding chunked shuffle as a C function (writes each line to one of n random files, runs shuf on each file and concatenates the result). Adding a version which allows specifying a specific chunk size, and using a 2GB limit for address parser training. Allowing gshuf again for Mac as it seems the only problem there was not having enough memory when testing on a Mac laptop. The new limited-memory version should be fast enough. Al 2017-03-05 02:15:03 -05:00
  • ba4052c9ba [openaddresses] add Muskogee, OK Al 2017-03-03 14:57:36 -05:00
  • 2704708f47 [openaddresses] add Orange County, NY Al 2017-03-03 14:27:05 -05:00
  • da62fb62ba [openaddresses] adding Polk County, NC Al 2017-03-03 13:45:58 -05:00
  • ce21635b00 [openaddresses] adding city of Salina, KS Al 2017-03-03 13:45:25 -05:00
  • b4437848c4 [fix] override_country_dir Al 2017-03-02 14:31:53 -05:00
  • 69351cad98 [openaddresses] add Tippecanoe County, IN Al 2017-03-02 13:36:22 -05:00
  • 6b8b6982aa [addresses] more classmethods Al 2017-03-02 04:23:09 -05:00
  • f7c8a63093 [addresses] making most of the methods on AddressComponents classmethods if possible so they can be accessed easily for sources not using OSM polygon lookup, etc. Al 2017-03-01 15:51:56 -05:00
  • 702901608b [openaddresses_uk] adding OpenAddresses UK as a data set. No lat/lons but it does have addresses, cities and postcodes Al 2017-03-01 15:44:25 -05:00
  • 375f7b1684 [addresses] making postcode before {suburb,city} more likely in the UK for #39 Al 2017-03-01 15:43:26 -05:00
  • a5d8700df3 [openaddresses] use override_country_dir config option in OA address formatter Al 2017-03-01 13:52:07 -05:00
  • 0890c712e2 [openaddresses] adding override_country_dir and country codes for Puerto Rico and French dependencies Al 2017-03-01 13:48:04 -05:00
  • c80b771f94 [openaddresses] add override_country_dir in Puerto Rico Al 2017-03-01 13:45:44 -05:00
  • dbc5d6b866 [openaddresses] remove OSM boundaries from East Peoria Al 2017-03-01 13:45:12 -05:00
  • 0d4c08d536 [openaddresses] ignore unit containing Fl in DeKalb county Al 2017-03-01 02:54:37 -05:00
  • 45e71a21bb [openaddresses] adding Kalamaria, Thessaloniki, Greece Al 2017-03-01 01:11:46 -05:00
  • 26f5c403d3 [openaddresses] add Henry County, GA Al 2017-02-28 23:01:28 -05:00
  • f6e9cbf8a0 [openaddresses] adding Gwinnett County, GA Al 2017-02-28 22:52:11 -05:00
  • b9424c6c69 [openaddresses] adding Cobb County, GA Al 2017-02-28 22:49:23 -05:00
  • 357af3d465 [openaddresses] adding unit to Fayette County, GA and adding a field map for the cities + no OSM boundaries Al 2017-02-28 22:43:19 -05:00
  • c71fe9afbf [openaddresses] adding DeKalb County, GA Al 2017-02-28 18:53:51 -05:00
  • 412dd65d87 [openaddresses] adding Fayette County, GA Al 2017-02-28 18:45:42 -05:00
  • e3cff74908 [openaddresses] add Tillamook County, OR Al 2017-02-28 11:57:43 -05:00
  • a7813dda16 [openaddresses] adding Clayton County, GA Al 2017-02-27 12:53:49 -05:00
  • f507f2bb3e [addresses] fix for Colombian house number formatting if the second regex group is not found Al 2017-02-25 23:24:06 -05:00
  • 64d0783e73 [addresses] Chinese and Colombian house number regex changes Al 2017-02-25 23:19:12 -05:00
  • 7d699c52b8 [openaddresses] add Chinese name for Wuhan, OSM uses Chinese / English for the name Al 2017-02-25 22:27:55 -05:00
  • 68afed1658 [fix] typo Al 2017-02-25 17:52:20 -05:00
  • fdb07d7898 [openaddresses] add Laval, QC Al 2017-02-25 17:23:33 -05:00
  • c744edce12 [openaddresses] add Moore and Montgomery counties, TX Al 2017-02-25 14:21:36 -05:00
  • 49fe1db613 [openaddresses] adding Vernon County, MO Al 2017-02-24 16:31:31 -05:00
  • d4de170c94 [openaddresses] adding city of Monroe, MI Al 2017-02-24 13:57:57 -05:00
  • d0679294bf [openaddresses] adding positional args so OpenAddresses ingestion can be run only for specific countries, subdirs, or individual files. Al 2017-02-24 03:39:21 -05:00
  • e39d4d2f00 [parser] check for non-null prev/prev2 before creating tag-based features Al 2017-02-24 02:57:16 -05:00
  • 182d60b623 [fix] removing include Al 2017-02-23 22:45:03 -05:00
  • 6097eacfef [fix] ignore fields in Kauai containing \n Al 2017-02-23 16:34:34 -05:00
  • 033e8dbb58 [openaddresses] adding Kauai and some component additions for Maui Al 2017-02-23 16:26:50 -05:00
  • fa7446deb6 [fix] district field for Wuhan data set Al 2017-02-23 02:15:55 -05:00
  • f006bba345 [openaddresses] adding city of Medellín, Colombia Al 2017-02-22 19:01:26 -08:00
  • 2d59450a51 [openaddresses] adding new Oregon counties Al 2017-02-22 09:59:20 -08:00
  • 79c2429bba [addresses] strip phrases like "# 123" off of English street names if they follow a thoroughfare/post-directional phrase whose expansion does not contain highway/route Al 2017-02-22 09:38:45 -08:00
  • de05292b66 [openaddresses] Del Norte County, CA Al 2017-02-21 01:19:46 -08:00
  • 93768b7ba5 [openaddresses] Eaton County and Tecumseh, MI Al 2017-02-21 01:17:54 -08:00
  • 08c6831729 [openaddresses] LBC Al 2017-02-21 01:12:50 -08:00
  • a2fcac4909 [openaddresses] city of Flower Mound, TX Al 2017-02-21 01:09:06 -08:00
  • 1d705e80da [openaddresses] adding new BC district data sets Al 2017-02-21 01:07:47 -08:00
  • 6a079e86b3 [fix] using size_t instead of int in address_parser/address_parser_train Al 2017-02-20 19:22:13 -08:00
  • 8ea5405c20 [parser] using separate arrays for features requiring tag history and making the tagger responsible for those features so the feature function does not require passing in prev and prev2 explicitly (i.e. don't need to run the feature function multiple times if using global best-sequence prediction) Al 2017-02-19 14:21:32 -08:00
  • ae85e3c0a0 [openaddresses] adding Warren County, OH Al 2017-02-19 14:03:24 -08:00
  • 715520f681 [parser] using new zeros API in averaged_perceptron.c Al 2017-02-19 14:02:54 -08:00
  • 5444b722cb [addresses] do not exclude # from sampling in Spanish Al 2017-02-18 12:04:09 -08:00
  • f76faafd8c [openaddresses] adding a few house number phrases as well in Colombia Al 2017-02-18 12:03:02 -08:00
  • adfdc06d14 [addresses] using the number dictionary for abbreviations in house number phrases as well Al 2017-02-18 12:00:27 -08:00
  • 7cab675809 [openaddresses] adding random formatting to Colombian house numbers that match the {calle}-{building number} format Al 2017-02-18 11:28:14 -08:00
  • 146412f4f8 [openaddresses] adding country-specific validators and doing no validation on house numbers in Colombia Al 2017-02-18 11:04:02 -08:00
  • 0e10aa6f46 [openaddresses] adding OSM boundaries for Stearns County, MN Al 2017-02-18 10:18:09 -08:00
  • 5a31513092 [openaddresses] Adding city of Sioux Falls, SD Al 2017-02-18 10:13:56 -08:00
  • 64e62cac32 [openaddresses] adding Bogotá, Colombia Al 2017-02-18 10:13:31 -08:00
  • 4f128579d6 [openaddresses] adding Commerce City, CO and creating an alias for the simple unit regex for reuse Al 2017-02-17 14:07:00 -05:00
  • b88487f633 [utils] string_replace_char does single byte/character replacement, new string_replace to do full string replacement, again using char_array for safety, string_replace_with_array function for memory reuse Al 2017-02-17 13:58:51 -05:00
  • da856ea5c3 [parser] adding phrase features for category, unit, level, entrance, staircase, and po_box phrases from the libpostal dictionaries, excluding phrases which match the toponyms dictionary (e.g. US states that can also be found in street/venue names, useful for expansion but not here), if the current token is part of both an address dictionary phrase and a component phrase derived from the training data, use the longer of the two, or both if they are the same length Al 2017-02-17 03:00:48 -05:00
  • 5b616dfb57 [addresses] allowing neighborhood components to be passed in Al 2017-02-17 02:11:56 -05:00
  • e7d8577ad7 [openaddresses] add city of San Luis Obispo Al 2017-02-16 16:00:23 -05:00
  • d6281648dc [openaddresses] add Cumberland County, NC Al 2017-02-16 14:49:00 -05:00
  • 1631c25ad0 [openaddresses] add city of O'Fallon, IL Al 2017-02-16 14:48:40 -05:00
  • 4c4147f465 [openaddresses] add city of Scottsdale, AZ Al 2017-02-16 14:38:50 -05:00
  • df76cde1e7 [openaddresses] adding Pickens County, SC Al 2017-02-16 03:34:49 -05:00
  • c380b3e91b [parser] phrase search with address dictionaries should not use the language given at training time since it's not currently available at runtime (without pulling in the language classifier, which may be warranted at some point, especially if the model can be made smaller/sparser) Al 2017-02-15 22:32:30 -05:00
  • a3e51db32d [api] include some of the new components in default address_components for the libpostal expansion API Al 2017-02-15 22:29:22 -05:00
  • 32fb483e96 [gazetteers] adding ADDRESS_PO_BOX component Al 2017-02-15 22:23:28 -05:00
  • ba0ccc82a3 [fix] var name in address_parser_train Al 2017-02-15 22:22:33 -05:00
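
Commit b76b7b8527 describes the chunked shuffle as: write each line to one of n random files, shuffle each file, and concatenate the results. A minimal Python sketch of that idea follows. It is an illustration only, not the project's C implementation: the names chunked_shuffle, num_chunks, and rng are hypothetical, and each chunk is shuffled in memory here rather than by running shuf on the file.

```python
import os
import random
import tempfile


def chunked_shuffle(lines, num_chunks, rng):
    """Yield the input lines in shuffled order, one chunk in memory at a time.

    Hypothetical sketch: each line goes to a randomly chosen chunk file,
    then every chunk is shuffled on its own and the chunks are concatenated.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = [os.path.join(tmp_dir, "chunk_%d.txt" % i) for i in range(num_chunks)]
        handles = [open(p, "w", encoding="utf-8", newline="\n") for p in paths]
        try:
            # Step 1: scatter the lines across the chunk files at random.
            for line in lines:
                rng.choice(handles).write(line.rstrip("\n") + "\n")
        finally:
            for handle in handles:
                handle.close()

        # Step 2: shuffle each chunk independently and emit it.
        for path in paths:
            with open(path, encoding="utf-8", newline="\n") as f:
                chunk = [line.rstrip("\n") for line in f]
            rng.shuffle(chunk)  # stands in for running shuf on the file
            yield from chunk


if __name__ == "__main__":
    addresses = ["address %d" % i for i in range(10)]
    for line in chunked_shuffle(addresses, num_chunks=3, rng=random.Random()):
        print(line)
```

Because lines are assigned to chunks at random before each chunk is shuffled, only about one chunk's worth of lines needs to be held in memory at a time, which is the property the commit message relies on for its 2GB limit.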