[merge] merging in master changes

This commit is contained in:
Al
2016-12-21 15:40:25 -05:00
10 changed files with 260 additions and 110 deletions

227
README.md
View File

@@ -1,17 +1,154 @@
# libpostal: international street address NLP
[![Build Status](https://travis-ci.org/openvenues/libpostal.svg?branch=master)](https://travis-ci.org/openvenues/libpostal) [![License](https://img.shields.io/github/license/openvenues/libpostal.svg)](https://github.com/openvenues/libpostal/blob/master/LICENSE)
[![OpenCollective](https://opencollective.com/libpostal/sponsors/badge.svg)](#sponsors)
[![OpenCollective](https://opencollective.com/libpostal/backers/badge.svg)](#backers)
:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
<span>&#x1f1e7;&#x1f1f7;</span> <span>&#x1f1eb;&#x1f1ee;</span> <span>&#x1f1f3;&#x1f1ec;</span> :jp: <span>&#x1f1fd;&#x1f1f0; </span> <span>&#x1f1e7;&#x1f1e9; </span> <span>&#x1f1f5;&#x1f1f1; </span> <span>&#x1f1fb;&#x1f1f3; </span> <span>&#x1f1e7;&#x1f1ea; </span> <span>&#x1f1f2;&#x1f1e6; </span> <span>&#x1f1fa;&#x1f1e6; </span> <span>&#x1f1ef;&#x1f1f2; </span> :ru: <span>&#x1f1ee;&#x1f1f3; </span> <span>&#x1f1f1;&#x1f1fb; </span> <span>&#x1f1e7;&#x1f1f4; </span> :de: <span>&#x1f1f8;&#x1f1f3; </span> <span>&#x1f1e6;&#x1f1f2; </span> :kr: <span>&#x1f1f3;&#x1f1f4; </span> <span>&#x1f1f2;&#x1f1fd; </span> <span>&#x1f1e8;&#x1f1ff; </span> <span>&#x1f1f9;&#x1f1f7; </span> :es: <span>&#x1f1f8;&#x1f1f8; </span> <span>&#x1f1ea;&#x1f1ea; </span> <span>&#x1f1e7;&#x1f1ed; </span> <span>&#x1f1f3;&#x1f1f1; </span> :cn: <span>&#x1f1f5;&#x1f1f9; </span> <span>&#x1f1f5;&#x1f1f7; </span> :gb: <span>&#x1f1f5;&#x1f1f8; </span>
libpostal is a C library for parsing/normalizing street addresses around the world. This [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86) is a good overview of the research and thought process behind libpostal.
libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. For a more comprehensive overview of the research, check out the [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86), but to sum up, the goal of this project is to understand location-based strings in every language, everywhere.
Addresses and the geographic coordinates they represent are essential for any location-based application (map search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines, which are designed for document indexing. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing.
<span>&#x1f1f7;&#x1f1f4; </span> <span>&#x1f1ec;&#x1f1ed; </span> <span>&#x1f1e6;&#x1f1fa; </span> <span>&#x1f1f2;&#x1f1fe; </span> <span>&#x1f1ed;&#x1f1f7; </span> <span>&#x1f1ed;&#x1f1f9; </span> :us: <span>&#x1f1ff;&#x1f1e6; </span> <span>&#x1f1f7;&#x1f1f8; </span> <span>&#x1f1e8;&#x1f1f1; </span> :it: <span>&#x1f1f0;&#x1f1ea; <span>&#x1f1e8;&#x1f1ed; </span> <span>&#x1f1e8;&#x1f1fa; </span> <span>&#x1f1f8;&#x1f1f0; </span> <span>&#x1f1e6;&#x1f1f4; </span> <span>&#x1f1e9;&#x1f1f0; </span> <span>&#x1f1f9;&#x1f1ff; </span> <span>&#x1f1e6;&#x1f1f1; </span> <span>&#x1f1e8;&#x1f1f4; </span> <span>&#x1f1ee;&#x1f1f1; </span> <span>&#x1f1ec;&#x1f1f9; </span> :fr: <span>&#x1f1f5;&#x1f1ed; </span> <span>&#x1f1e6;&#x1f1f9; </span> <span>&#x1f1f1;&#x1f1e8; </span> <span>&#x1f1ee;&#x1f1f8; <span>&#x1f1ee;&#x1f1e9; </span> </span> <span>&#x1f1e6;&#x1f1ea; </span> </span> <span>&#x1f1f8;&#x1f1f0; </span> <span>&#x1f1f9;&#x1f1f3; </span> <span>&#x1f1f0;&#x1f1ed; </span> <span>&#x1f1e6;&#x1f1f7; </span> <span>&#x1f1ed;&#x1f1f0; </span>
While libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal), [Ruby](https://github.com/openvenues/ruby_postal), [Go](https://github.com/openvenues/gopostal), [Java](https://github.com/openvenues/jpostal), [PHP](https://github.com/openvenues/php-postal), and [NodeJS](https://github.com/openvenues/node-postal) are officially supported and it's easy to write bindings in other languages.
Sponsors
------------
If your company is using libpostal, consider asking your organization to sponsor the project and help fund our continued research into geo + NLP. Interpreting what humans mean when they refer to locations is far from a solved problem, and sponsorships help us pursue new frontiers in machine geospatial intelligence. As a sponsor, your company logo will appear prominently on the Github repo page along with a link to your site. [Sponsorship info](https://opencollective.com/libpostal#sponsor)
<a href="https://opencollective.com/libpostal/sponsor/0/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/0/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/1/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/1/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/2/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/2/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/3/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/3/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/4/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/4/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/5/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/5/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/6/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/6/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/7/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/7/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/8/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/8/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/9/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/9/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/10/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/10/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/11/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/11/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/12/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/12/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/13/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/13/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/14/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/14/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/15/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/15/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/16/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/16/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/17/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/17/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/18/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/18/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/19/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/19/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/20/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/20/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/21/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/21/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/22/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/22/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/23/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/23/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/24/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/24/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/25/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/25/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/26/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/26/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/27/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/27/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/28/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/28/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/sponsor/29/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/29/avatar.svg"></a>
Backers
------------
Individual users can also help support open geo NLP research by making a monthly donation:
<a href="https://opencollective.com/libpostal/backer/0/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/0/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/1/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/1/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/2/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/2/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/3/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/3/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/4/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/4/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/5/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/5/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/6/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/6/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/7/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/7/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/8/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/8/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/9/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/9/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/10/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/10/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/11/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/11/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/12/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/12/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/13/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/13/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/14/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/14/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/15/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/15/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/16/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/16/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/17/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/17/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/18/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/18/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/19/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/19/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/20/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/20/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/21/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/21/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/22/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/22/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/23/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/23/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/24/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/24/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/25/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/25/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/26/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/26/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/27/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/27/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/28/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/28/avatar.svg"></a>
<a href="https://opencollective.com/libpostal/backer/29/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/29/avatar.svg"></a>
Examples of parsing
-------------------
libpostal implements the first statistical address parser that works well internationally,
trained on ~50 million addresses in over 100 countries and as many
languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
address format templates at: https://github.com/OpenCageData/address-formatting
to construct the training data, supplementing with containing polygons and
perturbing the inputs in a number of ways to make the parser as robust as possible
to messy real-world input.
These example parse results are taken from the interactive address_parser program
that builds with libpostal when you run ```make```. Note that the parser is robust to
commas vs. no commas, casing, different permutations of components (if the input
is e.g. just city or just city/postcode).
![parser](https://cloud.githubusercontent.com/assets/238455/13209628/2c465b50-d8f4-11e5-8e70-915c6b6d207b.gif)
The parser achieves very high accuracy on held-out data, currently 98.9%
correct full parses (meaning a 1 in the numerator for getting *every* token
in the address correct).
Usage (parser)
--------------
Here's an example of the parser API using the Python bindings:
```python
from postal.parser import parse_address
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
```
And an example with the C API:
```c
#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>
int main(int argc, char **argv) {
// Setup (only called once at the beginning of your program)
if (!libpostal_setup() || !libpostal_setup_parser()) {
exit(EXIT_FAILURE);
}
address_parser_options_t options = get_libpostal_address_parser_default_options();
address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
for (size_t i = 0; i < parsed->num_components; i++) {
printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
}
// Free parse result
address_parser_response_destroy(parsed);
// Teardown (only called once at the end of your program)
libpostal_teardown();
libpostal_teardown_parser();
}
```
Examples of normalization
-------------------------
@@ -85,81 +222,24 @@ int main(int argc, char **argv) {
}
```
Examples of parsing
-------------------
libpostal implements the first statistical address parser that works well internationally,
trained on ~50 million addresses in over 100 countries and as many
languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
address format templates at: https://github.com/OpenCageData/address-formatting
to construct the training data, supplementing with containing polygons and
perturbing the inputs in a number of ways to make the parser as robust as possible
to messy real-world input.
These example parse results are taken from the interactive address_parser program
that builds with libpostal when you run ```make```. Note that the parser is robust to
commas vs. no commas, casing, different permutations of components (if the input
is e.g. just city or just city/postcode).
![parser](https://cloud.githubusercontent.com/assets/238455/13209628/2c465b50-d8f4-11e5-8e70-915c6b6d207b.gif)
The parser achieves very high accuracy on held-out data, currently 98.9%
correct full parses (meaning a 1 in the numerator for getting *every* token
in the address correct).
Usage (parser)
--------------
Here's an example of the parser API using the Python bindings:
```python
from postal.parser import parse_address
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
```
And an example with the C API:
```c
#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>
int main(int argc, char **argv) {
// Setup (only called once at the beginning of your program)
if (!libpostal_setup() || !libpostal_setup_parser()) {
exit(EXIT_FAILURE);
}
address_parser_options_t options = get_libpostal_address_parser_default_options();
address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
for (size_t i = 0; i < parsed->num_components; i++) {
printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
}
// Free parse result
address_parser_response_destroy(parsed);
// Teardown (only called once at the end of your program)
libpostal_teardown();
libpostal_teardown_parser();
}
```
Installation
------------
Before you install, make sure you have the following prerequisites:
**On Linux (Ubuntu)**
**On Ubuntu/Debian**
```
sudo apt-get install curl libsnappy-dev autoconf automake libtool pkg-config
```
**On CentOS/RHEL**
```
sudo yum install snappy snappy-devel autoconf automake libtool pkgconfig
```
**On Mac OSX**
```
sudo brew install snappy autoconf automake libtool pkg-config
brew install snappy autoconf automake libtool pkg-config
```
Then to install the C library:
@@ -203,16 +283,25 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo
- Java/JVM: [jpostal](https://github.com/openvenues/jpostal)
- PHP: [php-postal](https://github.com/openvenues/php-postal)
- NodeJS: [node-postal](https://github.com/openvenues/node-postal)
- R: [poster](https://github.com/ironholds/poster)
**Unofficial language bindings**
- LuaJIT: [lua-resty-postal](https://github.com/bungle/lua-resty-postal)
- R: [poster](https://github.com/ironholds/poster)
- Perl: [Geo::libpostal](https://metacpan.org/pod/Geo::libpostal)
**Database extensions**
- PostgreSQL: [pgsql-postal](https://github.com/pramsey/pgsql-postal)
**Unofficial REST API**
- Libpostal REST: [libpostal REST](https://github.com/johnlonganecker/libpostal-rest)
**Libpostal REST Docker**
- Libpostal REST Docker [Libpostal REST Docker](https://github.com/johnlonganecker/libpostal-rest-docker)
Command-line usage (expand)
---------------------------

View File

@@ -1,2 +1,2 @@
#!/usr/bin/env bash
#!/bin/sh
autoreconf -fi --warning=no-portability

View File

@@ -1024,6 +1024,22 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
uint32_array_push(context->separators, ADDRESS_SEPARATOR_NONE);
}
// This parser was trained without knowing language/country.
// If at some point we build country-specific/language-specific
// parsers, these parameters could be used to select a model.
// The language parameter does technically control which dictionaries
// are searched at the street level. It's possible with e.g. a phrase
// like "de", which can be either the German country code or a stopword
// in Spanish, that even in the case where it's being used as a country code,
// it's possible that both the street-level and admin-level phrase features
// may be working together as a kind of intercept. Depriving the model
// of the street-level phrase features by passing in a known language
// may change the decision threshold so explicitly ignore these
// options until there's a use for them (country-specific or language-specific
// parser models).
language = NULL;
country = NULL;
address_parser_context_fill(context, parser, tokenized_str, language, country);
address_parser_response_t *response = NULL;

View File

@@ -233,7 +233,7 @@ bool geodb_module_setup(char *dir) {
return geodb_load(dir == NULL ? LIBPOSTAL_GEODB_DIR : dir);
}
return false;
return true;
}

View File

@@ -1,4 +1,4 @@
#!/usr/bin/env bash
#!/bin/sh
set -e
@@ -26,7 +26,7 @@ LIBPOSTAL_GEO_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_geo
LIBPOSTAL_PARSER_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_parser
LIBPOSTAL_LANG_CLASS_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_language_classifier
BASIC_MODULE_DIRS=(address_expansions numex transliteration)
BASIC_MODULE_DIRS="address_expansions numex transliteration"
GEODB_MODULE_DIR=geodb
PARSER_MODULE_DIR=address_parser
LANGUAGE_CLASSIFIER_MODULE_DIR=language_classifier
@@ -36,41 +36,51 @@ export LC_ALL=C
EPOCH_DATE="Jan 1 00:00:00 1970"
MB=$((1024*1024))
LARGE_FILE_SIZE=$((100*$MB))
CHUNK_SIZE=$((64*$MB))
NUM_WORKERS=5
LARGE_FILE_SIZE=$((CHUNK_SIZE*2))
function kill_background_processes {
NUM_WORKERS=10
kill_background_processes() {
jobs -p | xargs kill;
exit
}
trap kill_background_processes SIGINT
trap kill_background_processes INT
function download_multipart() {
PART_MSG='echo "Downloading part $1: filename=$5, offset=$2, max=$3"'
PART_CURL='curl $4 --silent -H"Range:bytes=$2-$3" --retry 3 --retry-delay 2 -o $5'
DOWNLOAD_PART="$PART_MSG;$PART_CURL"
download_multipart() {
url=$1
filename=$2
size=$3
num_workers=$4
echo "Downloading multipart: $url, size=$size"
chunk_size=$((size/num_workers))
num_chunks=$((size/CHUNK_SIZE))
echo "Downloading multipart: $url, size=$size, num_chunks=$num_chunks"
offset=0
for i in `seq 1 $((num_workers-1))`; do
i=0
while [ $i -lt $num_chunks ]; do
i=$((i+1))
part_filename="$filename.$i"
echo "Downloading part $i: filename=$part_filename, offset=$offset, max=$((offset+chunk_size-1))"
curl $url --silent -H"Range:bytes=$offset-$((offset+chunk_size-1))" -o $part_filename &
offset=$((offset+chunk_size))
done;
echo "Downloading part $num_workers: filename=$filename.$num_workers, offset=$offset, max=$((size))"
curl --silent -H"Range:bytes=$offset-$size" $url -o "$filename.$num_workers" &
wait
if [ $i -lt $num_chunks ]; then
max=$((offset+CHUNK_SIZE-1));
else
max=$size;
fi;
printf "%s\0%s\0%s\0%s\0%s\0" "$i" "$offset" "$max" "$url" "$part_filename"
offset=$((offset+CHUNK_SIZE))
done | xargs -0 -n 5 -P $NUM_WORKERS sh -c "$DOWNLOAD_PART" --
> $local_path
for i in `seq 1 $((num_workers))`; do
i=0
while [ $i -lt $num_chunks ]; do
i=$((i+1))
part_filename="$filename.$i"
cat $part_filename >> $local_path
rm $part_filename
@@ -79,7 +89,7 @@ function download_multipart() {
}
function download_file() {
download_file() {
updated_path=$1
data_dir=$2
filename=$3
@@ -100,15 +110,15 @@ function download_file() {
content_length=$(curl -I $url 2> /dev/null | awk '/^Content-Length:/ { print $2 }' | tr -d '[[:space:]]')
if [ $content_length -ge $LARGE_FILE_SIZE ]; then
download_multipart $url $local_path $content_length $NUM_WORKERS
download_multipart $url $local_path $content_length
else
curl $url -o $local_path
curl $url --retry 3 --retry-delay 2 -o $local_path
fi
if date -ur . >/dev/null 2>&1; then
if date -d "@$(date -ur . +%s)" >/dev/null 2>&1; then
echo $(date -d "$(date -d "@$(date -ur $local_path +%s)") + 1 second") > $updated_path;
elif stat -f %Sm . >/dev/null 2>&1; then
echo $(date -r $(stat -f %m $local_path) -v+1S) > $updated_path;
echo $(date -ur $(stat -f %m $local_path) -v+1S) > $updated_path;
fi;
tar -xvzf $local_path -C $data_dir;
rm $local_path;
@@ -123,23 +133,23 @@ if [ $COMMAND = "download" ]; then
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
download_file $LIBPOSTAL_DATA_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_DATA_FILE "data file"
fi
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
download_file $LIBPOSTAL_GEO_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_GEODB_FILE "geodb data file"
fi
if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
download_file $LIBPOSTAL_PARSER_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_PARSER_FILE "parser data file"
fi
if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
download_file $LIBPOSTAL_LANG_CLASS_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_LANG_CLASS_FILE "language classifier data file"
fi
elif [ $COMMAND = "upload" ]; then
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE ${BASIC_MODULE_DIRS[*]}
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $BASIC_MODULE_DIRS
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $LIBPOSTAL_S3_KEY
fi
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $GEODB_MODULE_DIR
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $LIBPOSTAL_S3_KEY

View File

@@ -116,6 +116,8 @@ void add_latin_alternatives(string_tree_t *tree, char *str, size_t len, uint64_t
}
free(transliterated);
transliterated = NULL;
} else {
string_tree_add_string(tree, str);
}
if (prev_string != NULL) {

View File

@@ -1,4 +1,5 @@
CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic
CFLAGS_CONF = @CFLAGS@
CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic $(CFLAGS_CONF)
noinst_LTLIBRARIES = libsparkey.la
libsparkey_la_SOURCES = endiantools.h hashheader.h logheader.h \
@@ -7,4 +8,4 @@ logreader.c returncodes.c util.c buf.h hashalgorithms.h hashiter.h \
sparkey.h util.h endiantools.c \
hashheader.c hashreader.c logheader.c logwriter.c MurmurHash3.c \
sparkey-internal.h
libsparkey_la_LDFLAGS = -L/usr/local/lib
libsparkey_la_LDFLAGS = -L/usr/local/lib

View File

@@ -14,13 +14,17 @@
* the License.
*/
#if defined(__linux)
#include <byteswap.h>
# include <byteswap.h>
#elif defined(__APPLE__)
#include <libkern/OSByteOrder.h>
#define bswap_32 OSSwapInt32
#define bswap_64 OSSwapInt64
# include <libkern/OSByteOrder.h>
# define bswap_32 OSSwapInt32
# define bswap_64 OSSwapInt64
#elif defined(__OpenBSD__)
# include <endian.h>
# define bswap_32 swap32
# define bswap_64 swap64
#else
#error "no byteswap.h or libkern/OSByteOrder.h"
# error "no byteswap.h or libkern/OSByteOrder.h"
#endif
#include <stddef.h>

View File

@@ -69,6 +69,8 @@
#define is_punctuation(type) ((type) >= PERIOD && (type) < OTHER)
#define is_special_punctuation(type) ((type) == AMPERSAND || (type) == PLUS || (type) == POUND)
#define is_special_token(type) ((type) == EMAIL || (type) == URL || (type) == US_PHONE || (type) == INTL_PHONE)
#define is_whitespace(type) ((type) == WHITESPACE)

View File

@@ -84,6 +84,31 @@ TEST test_expansions_language_classifier(void) {
PASS();
}
TEST test_expansions_no_options(void) {
normalize_options_t options = get_libpostal_default_options();
options.lowercase = false;
options.latin_ascii = false;
options.transliterate = false;
options.strip_accents = false;
options.decompose = false;
options.trim_string = false;
options.drop_parentheticals = false;
options.replace_numeric_hyphens = false;
options.delete_numeric_hyphens = false;
options.split_alpha_from_numeric = false;
options.replace_word_hyphens = false;
options.delete_word_hyphens = false;
options.delete_final_periods = false;
options.delete_acronym_periods = false;
options.drop_english_possessives = false;
options.delete_apostrophes = false;
options.expand_numex = false;
options.roman_numerals = false;
CHECK_CALL(test_expansion_contains_with_languages("120 E 96th St New York", "120 E 96th St New York", options, 0, NULL));
PASS();
}
SUITE(libpostal_expansion_tests) {
@@ -94,6 +119,7 @@ SUITE(libpostal_expansion_tests) {
RUN_TEST(test_expansions);
RUN_TEST(test_expansions_language_classifier);
RUN_TEST(test_expansions_no_options);
libpostal_teardown();
libpostal_teardown_language_classifier();