diff --git a/README.md b/README.md
index 65404814..4ad4009c 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,154 @@
 # libpostal: international street address NLP
 
 [![Build Status](https://travis-ci.org/openvenues/libpostal.svg?branch=master)](https://travis-ci.org/openvenues/libpostal) [![License](https://img.shields.io/github/license/openvenues/libpostal.svg)](https://github.com/openvenues/libpostal/blob/master/LICENSE)
+[![OpenCollective](https://opencollective.com/libpostal/sponsors/badge.svg)](#sponsors)
+[![OpenCollective](https://opencollective.com/libpostal/backers/badge.svg)](#backers)
 
-:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
+🇧🇷 🇫🇮 🇳🇬 :jp: 🇽🇰 🇧🇩 🇵🇱 🇻🇳 🇧🇪 🇲🇦 🇺🇦 🇯🇲 :ru: 🇮🇳 🇱🇻 🇧🇴 :de: 🇸🇳 🇦🇲 :kr: 🇳🇴 🇲🇽 🇨🇿 🇹🇷 :es: 🇸🇸 🇪🇪 🇧🇭 🇳🇱 :cn: 🇵🇹 🇵🇷 :gb: 🇵🇸
 
-libpostal is a C library for parsing/normalizing street addresses around the world. This [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86) is a good overview of the research and thought process behind libpostal.
+libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. For a more comprehensive overview of the research, check out the [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86), but to sum up, the goal of this project is to understand location-based strings in every language, everywhere.
 
-Addresses and the geographic coordinates they represent are essential for any location-based application (map search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines, which are designed for document indexing. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing.
+🇷🇴 🇬🇭 🇦🇺 🇲🇾 🇭🇷 🇭🇹 :us: 🇿🇦 🇷🇸 🇨🇱 :it: 🇰🇪 🇨🇭 🇨🇺 🇸🇰 🇦🇴 🇩🇰 🇹🇿 🇦🇱 🇨🇴 🇮🇱 🇬🇹 :fr: 🇵🇭 🇦🇹 🇱🇨 🇮🇸 🇮🇩 🇦🇪 🇸🇰 🇹🇳 🇰🇭 🇦🇷 🇭🇰
 
-While libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
+Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
 
 The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal), [Ruby](https://github.com/openvenues/ruby_postal), [Go](https://github.com/openvenues/gopostal), [Java](https://github.com/openvenues/jpostal), [PHP](https://github.com/openvenues/php-postal), and [NodeJS](https://github.com/openvenues/node-postal) are officially supported and it's easy to write bindings in other languages.
 
+Sponsors
+------------
+
+If your company is using libpostal, consider asking your organization to sponsor the project and help fund our continued research into geo + NLP. Interpreting what humans mean when they refer to locations is far from a solved problem, and sponsorships help us pursue new frontiers in machine geospatial intelligence. As a sponsor, your company logo will appear prominently on the Github repo page along with a link to your site. [Sponsorship info](https://opencollective.com/libpostal#sponsor)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Backers
+------------
+
+Individual users can also help support open geo NLP research by making a monthly donation:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Examples of parsing
+-------------------
+
+libpostal implements the first statistical address parser that works well internationally,
+trained on ~50 million addresses in over 100 countries and as many
+languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
+address format templates at: https://github.com/OpenCageData/address-formatting
+to construct the training data, supplementing with containing polygons and
+perturbing the inputs in a number of ways to make the parser as robust as possible
+to messy real-world input.
+
+These example parse results are taken from the interactive address_parser program
+that builds with libpostal when you run ```make```. Note that the parser is robust to
+commas vs. no commas, casing, different permutations of components (if the input
+is e.g. just city or just city/postcode).
+
+![parser](https://cloud.githubusercontent.com/assets/238455/13209628/2c465b50-d8f4-11e5-8e70-915c6b6d207b.gif)
+
+The parser achieves very high accuracy on held-out data, currently 98.9%
+correct full parses (meaning a 1 in the numerator for getting *every* token
+in the address correct).
+
+Usage (parser)
+--------------
+
+Here's an example of the parser API using the Python bindings:
+
+```python
+
+from postal.parser import parse_address
+parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
+```
+
+And an example with the C API:
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <libpostal/libpostal.h>
+
+int main(int argc, char **argv) {
+    // Setup (only called once at the beginning of your program)
+    if (!libpostal_setup() || !libpostal_setup_parser()) {
+        exit(EXIT_FAILURE);
+    }
+
+    address_parser_options_t options = get_libpostal_address_parser_default_options();
+    address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
+
+    for (size_t i = 0; i < parsed->num_components; i++) {
+        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
+    }
+
+    // Free parse result
+    address_parser_response_destroy(parsed);
+
+    // Teardown (only called once at the end of your program)
+    libpostal_teardown();
+    libpostal_teardown_parser();
+}
+```
+
+
 Examples of normalization
 -------------------------
 
@@ -85,81 +222,24 @@ int main(int argc, char **argv) {
 }
 ```
 
-Examples of parsing
--------------------
-
-libpostal implements the first statistical address parser that works well internationally,
-trained on ~50 million addresses in over 100 countries and as many
-languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
-address format templates at: https://github.com/OpenCageData/address-formatting
-to construct the training data, supplementing with containing polygons and
-perturbing the inputs in a number of ways to make the parser as robust as possible
-to messy real-world input.
-
-These example parse results are taken from the interactive address_parser program
-that builds with libpostal when you run ```make```. Note that the parser is robust to
-commas vs. no commas, casing, different permutations of components (if the input
-is e.g. just city or just city/postcode).
-
-![parser](https://cloud.githubusercontent.com/assets/238455/13209628/2c465b50-d8f4-11e5-8e70-915c6b6d207b.gif)
-
-The parser achieves very high accuracy on held-out data, currently 98.9%
-correct full parses (meaning a 1 in the numerator for getting *every* token
-in the address correct).
-
-Usage (parser)
---------------
-
-Here's an example of the parser API using the Python bindings:
-
-```python
-
-from postal.parser import parse_address
-parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
-```
-
-And an example with the C API:
-
-```c
-#include <stdio.h>
-#include <stdlib.h>
-#include <libpostal/libpostal.h>
-
-int main(int argc, char **argv) {
-    // Setup (only called once at the beginning of your program)
-    if (!libpostal_setup() || !libpostal_setup_parser()) {
-        exit(EXIT_FAILURE);
-    }
-
-    address_parser_options_t options = get_libpostal_address_parser_default_options();
-    address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
-
-    for (size_t i = 0; i < parsed->num_components; i++) {
-        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
-    }
-
-    // Free parse result
-    address_parser_response_destroy(parsed);
-
-    // Teardown (only called once at the end of your program)
-    libpostal_teardown();
-    libpostal_teardown_parser();
-}
-```
-
 Installation
 ------------
 
 Before you install, make sure you have the following prerequisites:
 
-**On Linux (Ubuntu)**
+**On Ubuntu/Debian**
 ```
 sudo apt-get install curl libsnappy-dev autoconf automake libtool pkg-config
 ```
 
+**On CentOS/RHEL**
+```
+sudo yum install snappy snappy-devel autoconf automake libtool pkgconfig
+```
+
 **On Mac OSX**
 ```
-sudo brew install snappy autoconf automake libtool pkg-config
+brew install snappy autoconf automake libtool pkg-config
 ```
 
 Then to install the C library:
@@ -203,16 +283,25 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo
 - Java/JVM: [jpostal](https://github.com/openvenues/jpostal)
 - PHP: [php-postal](https://github.com/openvenues/php-postal)
 - NodeJS: [node-postal](https://github.com/openvenues/node-postal)
+- R: [poster](https://github.com/ironholds/poster)
 
 **Unofficial language bindings**
 
 - LuaJIT: [lua-resty-postal](https://github.com/bungle/lua-resty-postal)
-- R: [poster](https://github.com/ironholds/poster)
+- Perl: [Geo::libpostal](https://metacpan.org/pod/Geo::libpostal)
 
 **Database extensions**
 
 - PostgreSQL: [pgsql-postal](https://github.com/pramsey/pgsql-postal)
 
+**Unofficial REST API**
+
+- Libpostal REST: [libpostal REST](https://github.com/johnlonganecker/libpostal-rest)
+
+**Libpostal REST Docker**
+
+- Libpostal REST Docker [Libpostal REST Docker](https://github.com/johnlonganecker/libpostal-rest-docker)
+
 Command-line usage (expand)
 ---------------------------
 
diff --git a/bootstrap.sh b/bootstrap.sh
index 302de29c..3894e867 100755
--- a/bootstrap.sh
+++ b/bootstrap.sh
@@ -1,2 +1,2 @@
-#!/usr/bin/env bash
+#!/bin/sh
 autoreconf -fi --warning=no-portability
diff --git a/src/address_parser.c b/src/address_parser.c
index 4c4b90f2..7b7ea47d 100644
--- a/src/address_parser.c
+++ b/src/address_parser.c
@@ -1024,6 +1024,22 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
         uint32_array_push(context->separators, ADDRESS_SEPARATOR_NONE);
     }
 
+    // This parser was trained without knowing language/country.
+    // If at some point we build country-specific/language-specific
+    // parsers, these parameters could be used to select a model.
+    // The language parameter does technically control which dictionaries
+    // are searched at the street level. It's possible with e.g. a phrase
+    // like "de", which can be either the German country code or a stopword
+    // in Spanish, that even in the case where it's being used as a country code,
+    // it's possible that both the street-level and admin-level phrase features
+    // may be working together as a kind of intercept. Depriving the model
+    // of the street-level phrase features by passing in a known language
+    // may change the decision threshold so explicitly ignore these
+    // options until there's a use for them (country-specific or language-specific
+    // parser models).
+
+    language = NULL;
+    country = NULL;
     address_parser_context_fill(context, parser, tokenized_str, language, country);
 
     address_parser_response_t *response = NULL;
diff --git a/src/geodb.c b/src/geodb.c
index 11e4b12f..8b0dfecb 100644
--- a/src/geodb.c
+++ b/src/geodb.c
@@ -233,7 +233,7 @@ bool geodb_module_setup(char *dir) {
         return geodb_load(dir == NULL ? LIBPOSTAL_GEODB_DIR : dir);
     }
 
-    return false;
+    return true;
 }
 
 
diff --git a/src/libpostal_data b/src/libpostal_data
index 2a956bb9..0d73d132 100755
--- a/src/libpostal_data
+++ b/src/libpostal_data
@@ -1,4 +1,4 @@
-#!/usr/bin/env bash
+#!/bin/sh
 
 set -e
 
@@ -26,7 +26,7 @@ LIBPOSTAL_GEO_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_geo
 LIBPOSTAL_PARSER_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_parser
 LIBPOSTAL_LANG_CLASS_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_language_classifier
 
-BASIC_MODULE_DIRS=(address_expansions numex transliteration)
+BASIC_MODULE_DIRS="address_expansions numex transliteration"
 GEODB_MODULE_DIR=geodb
 PARSER_MODULE_DIR=address_parser
 LANGUAGE_CLASSIFIER_MODULE_DIR=language_classifier
@@ -36,41 +36,51 @@ export LC_ALL=C
 EPOCH_DATE="Jan 1 00:00:00 1970"
 
 MB=$((1024*1024))
-LARGE_FILE_SIZE=$((100*$MB))
+CHUNK_SIZE=$((64*$MB))
 
-NUM_WORKERS=5
+LARGE_FILE_SIZE=$((CHUNK_SIZE*2))
 
-function kill_background_processes {
+
+NUM_WORKERS=10
+
+kill_background_processes() {
     jobs -p | xargs kill;
     exit
 }
 
-trap kill_background_processes SIGINT
+trap kill_background_processes INT
 
-function download_multipart() {
+PART_MSG='echo "Downloading part $1: filename=$5, offset=$2, max=$3"'
+PART_CURL='curl $4 --silent -H"Range:bytes=$2-$3" --retry 3 --retry-delay 2 -o $5'
+DOWNLOAD_PART="$PART_MSG;$PART_CURL"
+
+
+download_multipart() {
     url=$1
     filename=$2
     size=$3
-    num_workers=$4
-
-    echo "Downloading multipart: $url, size=$size"
-    chunk_size=$((size/num_workers))
 
+    num_chunks=$((size/CHUNK_SIZE))
+    echo "Downloading multipart: $url, size=$size, num_chunks=$num_chunks"
     offset=0
-    for i in `seq 1 $((num_workers-1))`; do
+    i=0
+    while [ $i -lt $num_chunks ]; do
+        i=$((i+1))
         part_filename="$filename.$i"
-        echo "Downloading part $i: filename=$part_filename, offset=$offset, max=$((offset+chunk_size-1))"
-        curl $url --silent -H"Range:bytes=$offset-$((offset+chunk_size-1))" -o $part_filename &
-        offset=$((offset+chunk_size))
-    done;
-
-    echo "Downloading part $num_workers: filename=$filename.$num_workers, offset=$offset, max=$((size))"
-    curl --silent -H"Range:bytes=$offset-$size" $url -o "$filename.$num_workers" &
-    wait
+        if [ $i -lt $num_chunks ]; then
+            max=$((offset+CHUNK_SIZE-1));
+        else
+            max=$size;
+        fi;
+        printf "%s\0%s\0%s\0%s\0%s\0" "$i" "$offset" "$max" "$url" "$part_filename"
+        offset=$((offset+CHUNK_SIZE))
+    done | xargs -0 -n 5 -P $NUM_WORKERS sh -c "$DOWNLOAD_PART" --
 
     > $local_path
 
-    for i in `seq 1 $((num_workers))`; do
+    i=0
+    while [ $i -lt $num_chunks ]; do
+        i=$((i+1))
         part_filename="$filename.$i"
         cat $part_filename >> $local_path
         rm $part_filename
@@ -79,7 +89,7 @@ function download_multipart() {
 }
 
 
-function download_file() {
+download_file() {
     updated_path=$1
     data_dir=$2
     filename=$3
@@ -100,15 +110,15 @@ function download_file() {
     content_length=$(curl -I $url 2> /dev/null | awk '/^Content-Length:/ { print $2 }' | tr -d '[[:space:]]')
 
     if [ $content_length -ge $LARGE_FILE_SIZE ]; then
-        download_multipart $url $local_path $content_length $NUM_WORKERS
+        download_multipart $url $local_path $content_length
     else
-        curl $url -o $local_path
+        curl $url --retry 3 --retry-delay 2 -o $local_path
     fi
-
+
     if date -d "@$(date -ur . +%s)" >/dev/null 2>&1; then
         echo $(date -d "$(date -d "@$(date -ur $local_path +%s)") + 1 second") > $updated_path;
     elif stat -f %Sm . >/dev/null 2>&1; then
-        echo $(date -r $(stat -f %m $local_path) -v+1S) > $updated_path;
+        echo $(date -ur $(stat -f %m $local_path) -v+1S) > $updated_path;
     fi;
     tar -xvzf $local_path -C $data_dir;
     rm $local_path;
@@ -123,23 +133,23 @@ if [ $COMMAND = "download" ]; then
     if [ $FILE = "base" ] || [ $FILE = "all" ]; then
         download_file $LIBPOSTAL_DATA_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_DATA_FILE "data file"
     fi
-    if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
+    if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
         download_file $LIBPOSTAL_GEO_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_GEODB_FILE "geodb data file"
     fi
-    if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
+    if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
         download_file $LIBPOSTAL_PARSER_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_PARSER_FILE "parser data file"
     fi
-    if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
+    if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
         download_file $LIBPOSTAL_LANG_CLASS_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_LANG_CLASS_FILE "language classifier data file"
     fi
 elif [ $COMMAND = "upload" ]; then
     if [ $FILE = "base" ] || [ $FILE = "all" ]; then
-        tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE ${BASIC_MODULE_DIRS[*]}
+        tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $BASIC_MODULE_DIRS
         aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $LIBPOSTAL_S3_KEY
     fi
-
+
     if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
         tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $GEODB_MODULE_DIR
         aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $LIBPOSTAL_S3_KEY
diff --git a/src/normalize.c b/src/normalize.c
index 4eeda0f9..f0ee0a15 100644
--- a/src/normalize.c
+++ b/src/normalize.c
@@ -116,6 +116,8 @@ void add_latin_alternatives(string_tree_t *tree, char *str, size_t len, uint64_t
             }
             free(transliterated);
             transliterated = NULL;
+        } else {
+            string_tree_add_string(tree, str);
         }
 
         if (prev_string != NULL) {
diff --git a/src/sparkey/Makefile.am b/src/sparkey/Makefile.am
index 2fee3a2b..5f92673f 100644
--- a/src/sparkey/Makefile.am
+++ b/src/sparkey/Makefile.am
@@ -1,4 +1,5 @@
-CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic
+CFLAGS_CONF = @CFLAGS@
+CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic $(CFLAGS_CONF)
 noinst_LTLIBRARIES = libsparkey.la
 
 libsparkey_la_SOURCES = endiantools.h hashheader.h logheader.h \
@@ -7,4 +8,4 @@ logreader.c returncodes.c util.c buf.h hashalgorithms.h hashiter.h \
 sparkey.h util.h endiantools.c \
 hashheader.c hashreader.c logheader.c logwriter.c MurmurHash3.c \
 sparkey-internal.h
-libsparkey_la_LDFLAGS = -L/usr/local/lib
\ No newline at end of file
+libsparkey_la_LDFLAGS = -L/usr/local/lib
diff --git a/src/sparkey/endiantools.c b/src/sparkey/endiantools.c
index 17eee630..5b3567c2 100644
--- a/src/sparkey/endiantools.c
+++ b/src/sparkey/endiantools.c
@@ -14,13 +14,17 @@
  * the License.
  */
 #if defined(__linux)
-#include <byteswap.h>
+# include <byteswap.h>
 #elif defined(__APPLE__)
-#include <libkern/OSByteOrder.h>
-#define bswap_32 OSSwapInt32
-#define bswap_64 OSSwapInt64
+# include <libkern/OSByteOrder.h>
+# define bswap_32 OSSwapInt32
+# define bswap_64 OSSwapInt64
+#elif defined(__OpenBSD__)
+# include
+# define bswap_32 swap32
+# define bswap_64 swap64
 #else
-#error "no byteswap.h or libkern/OSByteOrder.h"
+# error "no byteswap.h or libkern/OSByteOrder.h"
 #endif
 
 #include
diff --git a/src/token_types.h b/src/token_types.h
index 23248767..80dfab68 100644
--- a/src/token_types.h
+++ b/src/token_types.h
@@ -69,6 +69,8 @@
 
 #define is_punctuation(type) ((type) >= PERIOD && (type) < OTHER)
 
+#define is_special_punctuation(type) ((type) == AMPERSAND || (type) == PLUS || (type) == POUND)
+
 #define is_special_token(type) ((type) == EMAIL || (type) == URL || (type) == US_PHONE || (type) == INTL_PHONE)
 
 #define is_whitespace(type) ((type) == WHITESPACE)
diff --git a/test/test_expand.c b/test/test_expand.c
index b057c42c..049803ff 100644
--- a/test/test_expand.c
+++ b/test/test_expand.c
@@ -84,6 +84,31 @@ TEST test_expansions_language_classifier(void) {
     PASS();
 }
 
+TEST test_expansions_no_options(void) {
+    normalize_options_t options = get_libpostal_default_options();
+    options.lowercase = false;
+    options.latin_ascii = false;
+    options.transliterate = false;
+    options.strip_accents = false;
+    options.decompose = false;
+    options.trim_string = false;
+    options.drop_parentheticals = false;
+    options.replace_numeric_hyphens = false;
+    options.delete_numeric_hyphens = false;
+    options.split_alpha_from_numeric = false;
+    options.replace_word_hyphens = false;
+    options.delete_word_hyphens = false;
+    options.delete_final_periods = false;
+    options.delete_acronym_periods = false;
+    options.drop_english_possessives = false;
+    options.delete_apostrophes = false;
+    options.expand_numex = false;
+    options.roman_numerals = false;
+
+    CHECK_CALL(test_expansion_contains_with_languages("120 E 96th St New York", "120 E 96th St New York", options, 0, NULL));
+    PASS();
+}
+
 
 
 SUITE(libpostal_expansion_tests) {
@@ -94,6 +119,7 @@ SUITE(libpostal_expansion_tests) {
 
     RUN_TEST(test_expansions);
     RUN_TEST(test_expansions_language_classifier);
+    RUN_TEST(test_expansions_no_options);
 
     libpostal_teardown();
     libpostal_teardown_language_classifier();
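
Note on the chunk arithmetic in the new `download_multipart` (`src/libpostal_data`): the sketch below restates it in C, for illustration only. The names `chunk_range_t`, `num_chunks` and `chunk_range` are hypothetical and are not part of libpostal; the constant mirrors `CHUNK_SIZE=$((64*$MB))` from the script. It assumes, as the `LARGE_FILE_SIZE` check in `download_file` guarantees, that the file is at least two chunks long.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors CHUNK_SIZE=$((64*$MB)) in src/libpostal_data. */
#define CHUNK_SIZE (64ULL * 1024ULL * 1024ULL)

/* Hypothetical helper type: one HTTP byte range, both ends as sent. */
typedef struct {
    uint64_t offset; /* first byte requested */
    uint64_t max;    /* last position requested */
} chunk_range_t;

/* Integer division, as in num_chunks=$((size/CHUNK_SIZE)). */
uint64_t num_chunks(uint64_t size) {
    return size / CHUNK_SIZE;
}

/* Range for the 1-based part i. Every part but the last covers exactly
 * CHUNK_SIZE bytes; the last part absorbs the remainder and, like the
 * script, uses size (not size - 1) as its upper bound. */
chunk_range_t chunk_range(uint64_t size, uint64_t i) {
    uint64_t n = num_chunks(size);
    assert(i >= 1 && i <= n);

    chunk_range_t r;
    r.offset = (i - 1) * CHUNK_SIZE;
    r.max = (i < n) ? r.offset + CHUNK_SIZE - 1 : size;
    return r;
}

/* Prints one line per part, in the same order the script emits them. */
void print_chunk_plan(uint64_t size) {
    uint64_t n = num_chunks(size);
    for (uint64_t i = 1; i <= n; i++) {
        chunk_range_t r = chunk_range(size, i);
        printf("part %llu: bytes=%llu-%llu\n",
               (unsigned long long)i,
               (unsigned long long)r.offset,
               (unsigned long long)r.max);
    }
}
```

Calling `print_chunk_plan(200ULL * 1024 * 1024)` gives three parts: `0-67108863`, `67108864-134217727` and `134217728-209715200`. The last range ends at `size`, one past the final byte; HTTP range semantics allow a server to treat an end position beyond the content as "through the end", so the request should still be satisfiable, though that is worth confirming against the actual host.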