Al
c6af5cc071
[parser] Adding country_region label to parser as a boundary component
2016-07-28 15:19:48 -04:00
Tom Davis
18c8e90eb3
Use xargs to start workers as soon as possible
2016-07-27 17:46:44 -04:00
Tom Davis
11abf6cb22
Use posix sh for systems without bash
2016-07-26 20:17:18 -04:00
Al Barrentine
65c4688f89
Merge pull request #97 from uberbaud/multipart_edgecase
Don't call `download_multipart` for 1 chunk
2016-07-24 00:03:51 -04:00
Travis
3f0eff228e
[auto][ci skip] Adding data files from Travis build #145
2016-07-23 22:28:32 +00:00
Tom Davis
2991ffd193
Don't call download_multipart for 1 chunk
Previously, where a file was larger than `$LARGE_FILE_SIZE` but smaller
than `$CHUNK_SIZE*2`, `download_multipart` would be called but would
only download one (1) chunk that was the whole file.
This fix keeps the same download performance as before but skips the
unnecessary chunk processing in that case.
2016-07-23 16:41:04 -04:00
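The edge case described in this commit can be sketched as follows. This is a minimal illustration in Python, not the project's shell script: the function name `plan_download`, the floor-division chunk count (with the last chunk absorbing the remainder), and the threshold values are all assumptions chosen to reproduce the behavior the message describes.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # assumed value, for illustration only
LARGE_FILE_SIZE = CHUNK_SIZE    # assumed threshold; the real script may differ


def plan_download(size, chunk_size=CHUNK_SIZE, large_file_size=LARGE_FILE_SIZE):
    """Return inclusive (start, end) byte ranges to fetch for `size` bytes.

    A single range means a plain download; several ranges mean multipart.
    """
    # Assumption: the chunk count is floor(size / chunk_size), so a file
    # smaller than 2 * chunk_size yields exactly one chunk.
    num_chunks = size // chunk_size
    if size <= large_file_size or num_chunks <= 1:
        # One chunk would be the whole file, so skip the multipart path.
        return [(0, size - 1)]

    ranges = []
    for i in range(num_chunks):
        start = i * chunk_size
        # The last chunk runs to the end of the file.
        end = size - 1 if i == num_chunks - 1 else start + chunk_size - 1
        ranges.append((start, end))
    return ranges
```

With a chunk size of 100 bytes, a 150-byte file is planned as one plain download, while a 250-byte file is split into two ranges.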
Tom Davis
24e0314e71
Remove call to seq which may not exist
2016-07-23 01:03:15 -04:00
Al
64f167f045
[tokenization] Re-generating scanner
2016-07-21 17:04:57 -04:00
Al
81b4a4a1cb
[tokenization] Hyphens, etc. between non-ASCII digits (e.g. Unicode full-width numbers) should be single tokens
2016-07-21 17:04:57 -04:00
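The tokenization rule in this commit can be illustrated with a small regular expression. This is only a sketch under assumptions: the project's scanner is generated code, and the pattern and `tokenize` helper below are hypothetical stand-ins that show the intended effect on full-width digits.

```python
import re

# In Python 3 str patterns, \d matches any Unicode decimal digit,
# which includes full-width digits such as U+FF11.
TOKEN = re.compile(
    r"\d+(?:-\d+)*"   # digit runs joined by hyphens form a single token
    r"|[^\W\d]+"      # runs of letters
    r"|\S"            # any other non-space character on its own
)


def tokenize(text):
    """Split text into tokens, keeping hyphenated digit runs together."""
    return TOKEN.findall(text)


# Full-width "12-34" followed by an ordinary word.
tokens = tokenize("\uff11\uff12-\uff13\uff14 Main")
```

Here `tokens` holds two entries: the hyphenated full-width number as one token, then `Main`.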
Al
be5fd79a48
[expansion] Prefix/suffix expansions by default can apply to ADDRESS_ANY but also inherit the types of any dictionary that lists their canonical form (so we can add suffixes without worrying about whether they're for streets or place names, etc.)
2016-07-21 17:04:57 -04:00
Al
8926293063
[parser/cli] Using NFC normalization on the output in the parser client (closes #30). Optional command-line arg for parser output dir, useful for spot-checking different experiments
2016-07-21 17:04:57 -04:00
Al
44908ff95a
[parser] No digit normalization in training data-derived parser phrases (for postcodes, etc.), phrases include the new island type, house number phrases if any are valid. Adjacent words are now full phrases if they are part of a multiword token like a city name. For hyphenated names like Carmel-by-the-Sea, adding a version to the phrase dictionary where the hyphens are replaced with spaces
2016-07-21 17:04:57 -04:00
Al
41ae742285
[fix] tokenized trie search when falling off the trie at the start of a valid phrase
2016-07-21 17:04:57 -04:00
Al
6e60b3bbda
[fix] semicolon in #define
2016-07-21 17:04:57 -04:00
Al
b5d4dd6f37
[tokenization] Including full-width numbers in numeric tokens
2016-07-21 17:04:57 -04:00
Al
dd7ef6fabf
[dictionaries] Making new component for near/nearby prepositions
2016-07-21 17:04:57 -04:00
Al
2454b98c6d
[tokenization] Reverting commit for tokenizing initial/final apostrophes as part of words as it may be more effective to handle during post-processing
2016-07-21 17:04:57 -04:00
Al
0a8f46bdc3
[parser] Using new geonames designations in parser features
2016-07-21 17:04:57 -04:00
Al
c383f8af88
[parser] Using NFC normalization for parser as well, @ sign not defined as separator since it may also be used in intersections
2016-07-21 17:04:57 -04:00
Al
c2ee5a45b3
[geodb] Adding separate bitset for geonames place types and using NFC normalization instead of NFD (requires retraining)
2016-07-21 17:04:57 -04:00
Al
6c39c663ff
[normalize] Adding NORMALIZE_STRING_COMPOSE for NFC unicode normalization
2016-07-21 17:04:57 -04:00
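For readers unfamiliar with the distinction, NFC composes a base character and its combining marks into a single code point where one exists, while NFD does the reverse. The sketch below uses Python's standard `unicodedata` module purely to illustrate the two forms; it does not show how `NORMALIZE_STRING_COMPOSE` is implemented, and the helper names `nfc` and `nfd` are hypothetical.

```python
import unicodedata


def nfc(text):
    """Compose characters (NFC): 'e' + combining acute becomes one code point."""
    return unicodedata.normalize("NFC", text)


def nfd(text):
    """Decompose characters (NFD): a precomposed letter becomes base + mark."""
    return unicodedata.normalize("NFD", text)


decomposed = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = nfc(decomposed)  # ends in the single code point U+00E9
```

The two strings render identically but differ in length (5 code points decomposed, 4 composed), which is why data built under one form generally needs to be rebuilt when switching to the other.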
Al
757c6147cb
[tokenization] Adding ability to tokenize 's Gravenhage
2016-07-21 17:04:57 -04:00
Al
2e8888e331
[fix] warnings/size_t in libpostal.c
2016-07-21 17:04:57 -04:00
Al
e800f21f06
[gazetteers] Adding new gazetteer types/address components
2016-07-21 17:04:57 -04:00
Al
e5e0cf3b92
[fix] loading transliteration module in address_parser_test.c as well
2016-07-21 17:04:57 -04:00
Al
b8d43dc601
[fix] cstring_array_split calls
2016-07-21 17:04:57 -04:00
Al
b19cd3f60a
[fix] brace
2016-07-21 17:04:57 -04:00
Al
994b2f18e4
[parser] Ignore multiple spaces in parser input post-normalization. If normalizing the string creates several distinct tokens (namely in vulgar fractions, e.g. ½ => 1/2), add all the sub-tokens with the same label as the parent
2016-07-21 17:04:57 -04:00
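The label propagation described in this commit can be sketched roughly as below. Everything here is an assumption for illustration: the small `FRACTIONS` table, the regex tokenizer, and the function names are hypothetical and much simpler than the parser's C implementation.

```python
import re

# Hypothetical normalization table; a real normalizer covers far more cases.
FRACTIONS = {"\u00bd": "1/2", "\u00bc": "1/4", "\u00be": "3/4"}


def normalize(token):
    """Replace vulgar fraction characters with their ASCII spelling."""
    return "".join(FRACTIONS.get(ch, ch) for ch in token)


def retokenize(text):
    """Split into word and punctuation tokens; whitespace runs are skipped,
    so multiple spaces never produce empty tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def expand_labeled_tokens(tokens, labels):
    """Give every sub-token created by normalization its parent's label."""
    out = []
    for token, label in zip(tokens, labels):
        for sub in retokenize(normalize(token)):
            out.append((sub, label))
    return out


labeled = expand_labeled_tokens(
    ["\u00bd", "Main", "St"], ["house_number", "road", "road"]
)
```

In this example the single token `½` becomes three sub-tokens (`1`, `/`, `2`), each carrying the parent's `house_number` label.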
Al
b664ab1cea
[utils] Adding cstring_array_split_ignore_consecutive
2016-07-21 17:04:57 -04:00
Al
8e90ee45d2
[fix] calls and NULL checks
2016-07-21 17:04:57 -04:00
Al
e3cffaf0d1
[fix] tokenized_string_t should copy its source string
2016-07-21 17:04:57 -04:00
Al
16501aba17
[fix] Need to load transliteration module for Latin-ASCII normalization
2016-07-21 17:04:57 -04:00
Al Barrentine
e02c6adc85
Merge pull request #91 from uberbaud/openbsd
Add support for OpenBSD
2016-07-20 19:47:18 -04:00
Tom Davis
c0366147e8
Add support for OpenBSD
2016-07-20 18:19:31 -04:00
Tom Davis
a8bb798ce0
Call libpostal_data in source path, not build path
This fix updates Makefile to find the actual libpostal_data file when
`configure` is called from another directory, which it uses as the build
directory.
2016-07-20 17:31:52 -04:00
Travis
a0f6e100f1
[auto][ci skip] Adding data files from Travis build #133
2016-07-17 19:13:46 +00:00
Al
12d50aac12
Merge branch 'master' of https://github.com/openvenues/libpostal
2016-07-17 15:03:52 -04:00
Al
83381e9d8a
[expand] Adding exception for a few types of special punctuation (ampersand, plus, pound sign) which should be left in the original string and separated by whitespace. Closes #84. Closes #85
2016-07-17 15:02:47 -04:00
Travis
2fb677ca73
[auto][ci skip] Adding data files from Travis build #132
2016-07-17 18:47:28 +00:00
David Farrell
a7a9708d2b
don't error on multiple setup_parser()
2016-07-17 11:25:03 -04:00
Al
d7996ed56c
[fix] setting garbage pointer to NULL on language_classifier_teardown (fixes #82)
2016-07-17 01:56:09 -04:00
Al
ce78064988
[fix] NULL checks
2016-07-15 13:23:23 -04:00
Al
2f5f226faa
[fix] Add original string to normalizations if all options were set to false
2016-07-15 13:23:23 -04:00
Al
e816b4f77e
[parser] Ignore language/country options explicitly in the parser. The purpose of these options is to be able to create language-specific/country-specific models at some point; they shouldn't be used in the global model
2016-07-06 14:56:46 -04:00
Al
58a5dbe7e0
[logging] Logging the value of LIBPOSTAL_DATA_DIR when a setup error occurs
2016-07-01 14:51:04 -04:00
Al
ad9dfb46bd
[build] Using a process pool with 64MB chunks (similar to aws cli) for S3 downloads. Setting the max concurrent requests to 10, also the default in aws cli.
2016-07-01 14:37:13 -04:00
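The download strategy in this commit (fixed-size chunks fetched by a bounded pool of workers) might look like the following in outline. This is a sketch under stated assumptions, not the build script itself: it uses a Python thread pool rather than a process pool, and `fetch_range` is a hypothetical callable standing in for whatever performs the ranged request.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunks, as stated in the commit message
MAX_CONCURRENT = 10             # max concurrent requests, as stated above


def byte_ranges(size, chunk_size):
    """Yield inclusive (start, end) ranges that cover `size` bytes."""
    for start in range(0, size, chunk_size):
        yield start, min(start + chunk_size, size) - 1


def download_chunked(size, fetch_range, chunk_size=CHUNK_SIZE,
                     max_workers=MAX_CONCURRENT):
    """Fetch all ranges with at most `max_workers` in flight and join them.

    `fetch_range(start, end)` is a hypothetical callable returning the bytes
    for one inclusive range.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in submission order, so parts join correctly
        # even when later chunks finish first.
        parts = pool.map(lambda r: fetch_range(*r), byte_ranges(size, chunk_size))
        return b"".join(parts)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP client and makes it easy to exercise with an in-memory byte string.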
Al
a9ba61585b
[fix] Adding set -e to data download script so it fails if any subcommands fail
2016-05-04 23:08:06 -04:00
Al
9819ebf949
[fix] always include expansions in the ambiguous expansion dictionary, no matter which component
2016-04-29 13:26:13 -04:00
Al
0bc3550c11
[expansion] Adding address_expansion_in_dictionary
2016-04-29 13:23:48 -04:00
Al
59e5fcd1b4
[fix] LC_ALL=C in data download script
2016-04-11 12:47:50 -04:00