[merge] merging in master changes
227
README.md
@@ -1,17 +1,154 @@
|
|||||||
# libpostal: international street address NLP
|
# libpostal: international street address NLP
|
||||||
|
|
||||||
[Build Status](https://travis-ci.org/openvenues/libpostal) [License](https://github.com/openvenues/libpostal/blob/master/LICENSE)
|
[Build Status](https://travis-ci.org/openvenues/libpostal) [License](https://github.com/openvenues/libpostal/blob/master/LICENSE)
|
||||||
|
[Sponsors](#sponsors)
|
||||||
|
[Backers](#backers)
|
||||||
|
|
||||||
:jp: :us: :gb: :ru: :fr: :kr: :it: :es: :cn: :de:
|
<span>🇧🇷</span> <span>🇫🇮</span> <span>🇳🇬</span> :jp: <span>🇽🇰 </span> <span>🇧🇩 </span> <span>🇵🇱 </span> <span>🇻🇳 </span> <span>🇧🇪 </span> <span>🇲🇦 </span> <span>🇺🇦 </span> <span>🇯🇲 </span> :ru: <span>🇮🇳 </span> <span>🇱🇻 </span> <span>🇧🇴 </span> :de: <span>🇸🇳 </span> <span>🇦🇲 </span> :kr: <span>🇳🇴 </span> <span>🇲🇽 </span> <span>🇨🇿 </span> <span>🇹🇷 </span> :es: <span>🇸🇸 </span> <span>🇪🇪 </span> <span>🇧🇭 </span> <span>🇳🇱 </span> :cn: <span>🇵🇹 </span> <span>🇵🇷 </span> :gb: <span>🇵🇸 </span>
|
||||||
|
|
||||||
libpostal is a C library for parsing/normalizing street addresses around the world. This [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86) is a good overview of the research and thought process behind libpostal.
|
libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. For a more comprehensive overview of the research, check out the [introductory blog post](https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86), but to sum up, the goal of this project is to understand location-based strings in every language, everywhere.
|
||||||
|
|
||||||
Addresses and the geographic coordinates they represent are essential for any location-based application (map search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines, which are designed for document indexing. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing.
|
<span>🇷🇴 </span> <span>🇬🇭 </span> <span>🇦🇺 </span> <span>🇲🇾 </span> <span>🇭🇷 </span> <span>🇭🇹 </span> :us: <span>🇿🇦 </span> <span>🇷🇸 </span> <span>🇨🇱 </span> :it: <span>🇰🇪 <span>🇨🇭 </span> <span>🇨🇺 </span> <span>🇸🇰 </span> <span>🇦🇴 </span> <span>🇩🇰 </span> <span>🇹🇿 </span> <span>🇦🇱 </span> <span>🇨🇴 </span> <span>🇮🇱 </span> <span>🇬🇹 </span> :fr: <span>🇵🇭 </span> <span>🇦🇹 </span> <span>🇱🇨 </span> <span>🇮🇸 <span>🇮🇩 </span> </span> <span>🇦🇪 </span> </span> <span>🇸🇰 </span> <span>🇹🇳 </span> <span>🇰🇭 </span> <span>🇦🇷 </span> <span>🇭🇰 </span>
|
||||||
|
|
||||||
While libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
|
Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.
|
||||||
|
|
||||||
The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal), [Ruby](https://github.com/openvenues/ruby_postal), [Go](https://github.com/openvenues/gopostal), [Java](https://github.com/openvenues/jpostal), [PHP](https://github.com/openvenues/php-postal), and [NodeJS](https://github.com/openvenues/node-postal) are officially supported and it's easy to write bindings in other languages.
|
The core library is written in pure C. Language bindings for [Python](https://github.com/openvenues/pypostal), [Ruby](https://github.com/openvenues/ruby_postal), [Go](https://github.com/openvenues/gopostal), [Java](https://github.com/openvenues/jpostal), [PHP](https://github.com/openvenues/php-postal), and [NodeJS](https://github.com/openvenues/node-postal) are officially supported and it's easy to write bindings in other languages.
|
||||||
|
|
||||||
|
Sponsors
|
||||||
|
------------
|
||||||
|
|
||||||
|
If your company is using libpostal, consider asking your organization to sponsor the project and help fund our continued research into geo + NLP. Interpreting what humans mean when they refer to locations is far from a solved problem, and sponsorships help us pursue new frontiers in machine geospatial intelligence. As a sponsor, your company logo will appear prominently on the Github repo page along with a link to your site. [Sponsorship info](https://opencollective.com/libpostal#sponsor)
|
||||||
|
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/0/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/0/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/1/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/1/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/2/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/2/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/3/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/3/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/4/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/4/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/5/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/5/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/6/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/6/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/7/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/7/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/8/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/8/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/9/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/9/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/10/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/10/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/11/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/11/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/12/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/12/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/13/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/13/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/14/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/14/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/15/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/15/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/16/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/16/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/17/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/17/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/18/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/18/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/19/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/19/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/20/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/20/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/21/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/21/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/22/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/22/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/23/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/23/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/24/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/24/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/25/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/25/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/26/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/26/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/27/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/27/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/28/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/28/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/sponsor/29/website" target="_blank"><img src="https://opencollective.com/libpostal/sponsor/29/avatar.svg"></a>
|
||||||
|
|
||||||
|
Backers
|
||||||
|
------------
|
||||||
|
|
||||||
|
Individual users can also help support open geo NLP research by making a monthly donation:
|
||||||
|
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/0/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/0/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/1/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/1/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/2/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/2/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/3/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/3/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/4/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/4/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/5/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/5/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/6/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/6/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/7/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/7/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/8/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/8/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/9/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/9/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/10/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/10/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/11/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/11/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/12/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/12/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/13/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/13/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/14/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/14/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/15/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/15/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/16/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/16/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/17/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/17/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/18/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/18/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/19/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/19/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/20/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/20/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/21/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/21/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/22/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/22/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/23/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/23/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/24/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/24/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/25/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/25/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/26/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/26/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/27/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/27/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/28/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/28/avatar.svg"></a>
|
||||||
|
<a href="https://opencollective.com/libpostal/backer/29/website" target="_blank"><img src="https://opencollective.com/libpostal/backer/29/avatar.svg"></a>
|
||||||
|
|
||||||
|
Examples of parsing
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
libpostal implements the first statistical address parser that works well internationally,
|
||||||
|
trained on ~50 million addresses in over 100 countries and as many
|
||||||
|
languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
|
||||||
|
address format templates at: https://github.com/OpenCageData/address-formatting
|
||||||
|
to construct the training data, supplementing with containing polygons and
|
||||||
|
perturbing the inputs in a number of ways to make the parser as robust as possible
|
||||||
|
to messy real-world input.
|
||||||
|
|
||||||
|
These example parse results are taken from the interactive address_parser program
|
||||||
|
that builds with libpostal when you run ```make```. Note that the parser is robust to
|
||||||
|
commas vs. no commas, casing, different permutations of components (if the input
|
||||||
|
is e.g. just city or just city/postcode).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
The parser achieves very high accuracy on held-out data, currently 98.9%
|
||||||
|
correct full parses (meaning a 1 in the numerator for getting *every* token
|
||||||
|
in the address correct).
|
||||||
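To make the "correct full parses" metric above concrete, here is a minimal sketch in Python. It assumes one label per token and one label list per address; the function name and the toy label sequences are made up for illustration and are not libpostal's evaluation code.

```python
def full_parse_accuracy(predicted, gold):
    """Fraction of addresses whose predicted labels match gold for every token."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of addresses")
    if not gold:
        return 0.0
    # An address contributes 1 to the numerator only if *all* of its labels match.
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return correct / len(gold)


# Toy data (illustrative only): one label per token, one list per address.
gold = [
    ["house_number", "road", "road", "city"],
    ["road", "house_number", "postcode", "city"],
]
predicted = [
    ["house_number", "road", "road", "city"],  # every token right: counts as 1
    ["road", "house_number", "city", "city"],  # one token wrong: counts as 0
]

print(full_parse_accuracy(predicted, gold))
```

Under this definition a single mislabeled token makes the whole address count as wrong, which makes it stricter than per-token accuracy.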
|
|
||||||
|
Usage (parser)
|
||||||
|
--------------
|
||||||
|
|
||||||
|
Here's an example of the parser API using the Python bindings:
|
||||||
|
|
||||||
|
```python
|
||||||
|
|
||||||
|
from postal.parser import parse_address
|
||||||
|
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
|
||||||
|
```
|
||||||
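A small follow-up sketch, assuming the Python bindings return the parse as a sequence of `(value, label)` pairs. The `sample` list is hand-written for illustration rather than captured parser output, and `components_by_label` is a hypothetical helper, not part of the bindings.

```python
from collections import defaultdict


def components_by_label(parsed):
    """Group (value, label) pairs by label, keeping the original order of values."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)


# Hand-written stand-in for a parse result (assumption, not real output).
sample = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("brooklyn", "city"),
    ("11216", "postcode"),
]

by_label = components_by_label(sample)
print(by_label["road"])
```

Each label maps to a list rather than a single string so that repeated labels are not silently overwritten.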
|
|
||||||
|
And an example with the C API:
|
||||||
|
|
||||||
|
```c
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <libpostal/libpostal.h>
|
||||||
|
|
||||||
|
int main(int argc, char **argv) {
|
||||||
|
// Setup (only called once at the beginning of your program)
|
||||||
|
if (!libpostal_setup() || !libpostal_setup_parser()) {
|
||||||
|
exit(EXIT_FAILURE);
|
||||||
|
}
|
||||||
|
|
||||||
|
address_parser_options_t options = get_libpostal_address_parser_default_options();
|
||||||
|
address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
|
||||||
|
|
||||||
|
for (size_t i = 0; i < parsed->num_components; i++) {
|
||||||
|
printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Free parse result
|
||||||
|
address_parser_response_destroy(parsed);
|
||||||
|
|
||||||
|
// Teardown (only called once at the end of your program)
|
||||||
|
libpostal_teardown();
|
||||||
|
libpostal_teardown_parser();
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
Examples of normalization
|
Examples of normalization
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
@@ -85,81 +222,24 @@ int main(int argc, char **argv) {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Examples of parsing
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
libpostal implements the first statistical address parser that works well internationally,
|
|
||||||
trained on ~50 million addresses in over 100 countries and as many
|
|
||||||
languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
|
|
||||||
address format templates at: https://github.com/OpenCageData/address-formatting
|
|
||||||
to construct the training data, supplementing with containing polygons and
|
|
||||||
perturbing the inputs in a number of ways to make the parser as robust as possible
|
|
||||||
to messy real-world input.
|
|
||||||
|
|
||||||
These example parse results are taken from the interactive address_parser program
|
|
||||||
that builds with libpostal when you run ```make```. Note that the parser is robust to
|
|
||||||
commas vs. no commas, casing, different permutations of components (if the input
|
|
||||||
is e.g. just city or just city/postcode).
|
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
The parser achieves very high accuracy on held-out data, currently 98.9%
|
|
||||||
correct full parses (meaning a 1 in the numerator for getting *every* token
|
|
||||||
in the address correct).
|
|
||||||
|
|
||||||
Usage (parser)
|
|
||||||
--------------
|
|
||||||
|
|
||||||
Here's an example of the parser API using the Python bindings:
|
|
||||||
|
|
||||||
```python
|
|
||||||
|
|
||||||
from postal.parser import parse_address
|
|
||||||
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')
|
|
||||||
```
|
|
||||||
|
|
||||||
And an example with the C API:
|
|
||||||
|
|
||||||
```c
|
|
||||||
#include <stdio.h>
|
|
||||||
#include <stdlib.h>
|
|
||||||
#include <libpostal/libpostal.h>
|
|
||||||
|
|
||||||
int main(int argc, char **argv) {
|
|
||||||
// Setup (only called once at the beginning of your program)
|
|
||||||
if (!libpostal_setup() || !libpostal_setup_parser()) {
|
|
||||||
exit(EXIT_FAILURE);
|
|
||||||
}
|
|
||||||
|
|
||||||
address_parser_options_t options = get_libpostal_address_parser_default_options();
|
|
||||||
address_parser_response_t *parsed = parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
|
|
||||||
|
|
||||||
for (size_t i = 0; i < parsed->num_components; i++) {
|
|
||||||
printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Free parse result
|
|
||||||
address_parser_response_destroy(parsed);
|
|
||||||
|
|
||||||
// Teardown (only called once at the end of your program)
|
|
||||||
libpostal_teardown();
|
|
||||||
libpostal_teardown_parser();
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Installation
|
Installation
|
||||||
------------
|
------------
|
||||||
|
|
||||||
Before you install, make sure you have the following prerequisites:
|
Before you install, make sure you have the following prerequisites:
|
||||||
|
|
||||||
**On Linux (Ubuntu)**
|
**On Ubuntu/Debian**
|
||||||
```
|
```
|
||||||
sudo apt-get install curl libsnappy-dev autoconf automake libtool pkg-config
|
sudo apt-get install curl libsnappy-dev autoconf automake libtool pkg-config
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**On CentOS/RHEL**
|
||||||
|
```
|
||||||
|
sudo yum install snappy snappy-devel autoconf automake libtool pkgconfig
|
||||||
|
```
|
||||||
|
|
||||||
**On Mac OSX**
|
**On Mac OSX**
|
||||||
```
|
```
|
||||||
sudo brew install snappy autoconf automake libtool pkg-config
|
brew install snappy autoconf automake libtool pkg-config
|
||||||
```
|
```
|
||||||
|
|
||||||
Then to install the C library:
|
Then to install the C library:
|
||||||
@@ -203,16 +283,25 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo
|
|||||||
- Java/JVM: [jpostal](https://github.com/openvenues/jpostal)
|
- Java/JVM: [jpostal](https://github.com/openvenues/jpostal)
|
||||||
- PHP: [php-postal](https://github.com/openvenues/php-postal)
|
- PHP: [php-postal](https://github.com/openvenues/php-postal)
|
||||||
- NodeJS: [node-postal](https://github.com/openvenues/node-postal)
|
- NodeJS: [node-postal](https://github.com/openvenues/node-postal)
|
||||||
|
- R: [poster](https://github.com/ironholds/poster)
|
||||||
|
|
||||||
**Unofficial language bindings**
|
**Unofficial language bindings**
|
||||||
|
|
||||||
- LuaJIT: [lua-resty-postal](https://github.com/bungle/lua-resty-postal)
|
- LuaJIT: [lua-resty-postal](https://github.com/bungle/lua-resty-postal)
|
||||||
- R: [poster](https://github.com/ironholds/poster)
|
- Perl: [Geo::libpostal](https://metacpan.org/pod/Geo::libpostal)
|
||||||
|
|
||||||
**Database extensions**
|
**Database extensions**
|
||||||
|
|
||||||
- PostgreSQL: [pgsql-postal](https://github.com/pramsey/pgsql-postal)
|
- PostgreSQL: [pgsql-postal](https://github.com/pramsey/pgsql-postal)
|
||||||
|
|
||||||
|
**Unofficial REST API**
|
||||||
|
|
||||||
|
- Libpostal REST: [libpostal REST](https://github.com/johnlonganecker/libpostal-rest)
|
||||||
|
|
||||||
|
**Libpostal REST Docker**
|
||||||
|
|
||||||
|
- Libpostal REST Docker [Libpostal REST Docker](https://github.com/johnlonganecker/libpostal-rest-docker)
|
||||||
|
|
||||||
Command-line usage (expand)
|
Command-line usage (expand)
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
|
|||||||
@@ -1,2 +1,2 @@
|
|||||||
#!/usr/bin/env bash
|
#!/bin/sh
|
||||||
autoreconf -fi --warning=no-portability
|
autoreconf -fi --warning=no-portability
|
||||||
|
|||||||
@@ -1024,6 +1024,22 @@ address_parser_response_t *address_parser_parse(char *address, char *language, c
|
|||||||
uint32_array_push(context->separators, ADDRESS_SEPARATOR_NONE);
|
uint32_array_push(context->separators, ADDRESS_SEPARATOR_NONE);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// This parser was trained without knowing language/country.
|
||||||
|
// If at some point we build country-specific/language-specific
|
||||||
|
// parsers, these parameters could be used to select a model.
|
||||||
|
// The language parameter does technically control which dictionaries
|
||||||
|
// are searched at the street level. It's possible with e.g. a phrase
|
||||||
|
// like "de", which can be either the German country code or a stopword
|
||||||
|
// in Spanish, that even in the case where it's being used as a country code,
|
||||||
|
// it's possible that both the street-level and admin-level phrase features
|
||||||
|
// may be working together as a kind of intercept. Depriving the model
|
||||||
|
// of the street-level phrase features by passing in a known language
|
||||||
|
// may change the decision threshold so explicitly ignore these
|
||||||
|
// options until there's a use for them (country-specific or language-specific
|
||||||
|
// parser models).
|
||||||
|
|
||||||
|
language = NULL;
|
||||||
|
country = NULL;
|
||||||
address_parser_context_fill(context, parser, tokenized_str, language, country);
|
address_parser_context_fill(context, parser, tokenized_str, language, country);
|
||||||
|
|
||||||
address_parser_response_t *response = NULL;
|
address_parser_response_t *response = NULL;
|
||||||
|
|||||||
@@ -233,7 +233,7 @@ bool geodb_module_setup(char *dir) {
|
|||||||
return geodb_load(dir == NULL ? LIBPOSTAL_GEODB_DIR : dir);
|
return geodb_load(dir == NULL ? LIBPOSTAL_GEODB_DIR : dir);
|
||||||
}
|
}
|
||||||
|
|
||||||
return false;
|
return true;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
#!/usr/bin/env bash
|
#!/bin/sh
|
||||||
|
|
||||||
set -e
|
set -e
|
||||||
|
|
||||||
@@ -26,7 +26,7 @@ LIBPOSTAL_GEO_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_geo
|
|||||||
LIBPOSTAL_PARSER_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_parser
|
LIBPOSTAL_PARSER_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_parser
|
||||||
LIBPOSTAL_LANG_CLASS_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_language_classifier
|
LIBPOSTAL_LANG_CLASS_UPDATED_PATH=$LIBPOSTAL_DATA_DIR/last_updated_language_classifier
|
||||||
|
|
||||||
BASIC_MODULE_DIRS=(address_expansions numex transliteration)
|
BASIC_MODULE_DIRS="address_expansions numex transliteration"
|
||||||
GEODB_MODULE_DIR=geodb
|
GEODB_MODULE_DIR=geodb
|
||||||
PARSER_MODULE_DIR=address_parser
|
PARSER_MODULE_DIR=address_parser
|
||||||
LANGUAGE_CLASSIFIER_MODULE_DIR=language_classifier
|
LANGUAGE_CLASSIFIER_MODULE_DIR=language_classifier
|
||||||
@@ -36,41 +36,51 @@ export LC_ALL=C
|
|||||||
EPOCH_DATE="Jan 1 00:00:00 1970"
|
EPOCH_DATE="Jan 1 00:00:00 1970"
|
||||||
|
|
||||||
MB=$((1024*1024))
|
MB=$((1024*1024))
|
||||||
LARGE_FILE_SIZE=$((100*$MB))
|
CHUNK_SIZE=$((64*$MB))
|
||||||
|
|
||||||
NUM_WORKERS=5
|
LARGE_FILE_SIZE=$((CHUNK_SIZE*2))
|
||||||
|
|
||||||
function kill_background_processes {
|
|
||||||
|
NUM_WORKERS=10
|
||||||
|
|
||||||
|
kill_background_processes() {
|
||||||
jobs -p | xargs kill;
|
jobs -p | xargs kill;
|
||||||
exit
|
exit
|
||||||
}
|
}
|
||||||
|
|
||||||
trap kill_background_processes SIGINT
|
trap kill_background_processes INT
|
||||||
|
|
||||||
function download_multipart() {
|
PART_MSG='echo "Downloading part $1: filename=$5, offset=$2, max=$3"'
|
||||||
|
PART_CURL='curl $4 --silent -H"Range:bytes=$2-$3" --retry 3 --retry-delay 2 -o $5'
|
||||||
|
DOWNLOAD_PART="$PART_MSG;$PART_CURL"
|
||||||
|
|
||||||
|
|
||||||
|
download_multipart() {
|
||||||
url=$1
|
url=$1
|
||||||
filename=$2
|
filename=$2
|
||||||
size=$3
|
size=$3
|
||||||
num_workers=$4
|
|
||||||
|
|
||||||
echo "Downloading multipart: $url, size=$size"
|
|
||||||
chunk_size=$((size/num_workers))
|
|
||||||
|
|
||||||
|
num_chunks=$((size/CHUNK_SIZE))
|
||||||
|
echo "Downloading multipart: $url, size=$size, num_chunks=$num_chunks"
|
||||||
offset=0
|
offset=0
|
||||||
for i in `seq 1 $((num_workers-1))`; do
|
i=0
|
||||||
|
while [ $i -lt $num_chunks ]; do
|
||||||
|
i=$((i+1))
|
||||||
part_filename="$filename.$i"
|
part_filename="$filename.$i"
|
||||||
echo "Downloading part $i: filename=$part_filename, offset=$offset, max=$((offset+chunk_size-1))"
|
if [ $i -lt $num_chunks ]; then
|
||||||
curl $url --silent -H"Range:bytes=$offset-$((offset+chunk_size-1))" -o $part_filename &
|
max=$((offset+CHUNK_SIZE-1));
|
||||||
offset=$((offset+chunk_size))
|
else
|
||||||
done;
|
max=$size;
|
||||||
|
fi;
|
||||||
echo "Downloading part $num_workers: filename=$filename.$num_workers, offset=$offset, max=$((size))"
|
printf "%s\0%s\0%s\0%s\0%s\0" "$i" "$offset" "$max" "$url" "$part_filename"
|
||||||
curl --silent -H"Range:bytes=$offset-$size" $url -o "$filename.$num_workers" &
|
offset=$((offset+CHUNK_SIZE))
|
||||||
wait
|
done | xargs -0 -n 5 -P $NUM_WORKERS sh -c "$DOWNLOAD_PART" --
|
||||||
|
|
||||||
> $local_path
|
> $local_path
|
||||||
|
|
||||||
for i in `seq 1 $((num_workers))`; do
|
i=0
|
||||||
|
while [ $i -lt $num_chunks ]; do
|
||||||
|
i=$((i+1))
|
||||||
part_filename="$filename.$i"
|
part_filename="$filename.$i"
|
||||||
cat $part_filename >> $local_path
|
cat $part_filename >> $local_path
|
||||||
rm $part_filename
|
rm $part_filename
|
||||||
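The range arithmetic in the new `download_multipart` loop can be checked in isolation. The following Python sketch mirrors it under the same constants (64 MB chunks, integer division for the chunk count, the last part running to `size`); the function name `chunk_ranges` is an assumption for illustration and is not part of the script.

```python
MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # same value as CHUNK_SIZE in the shell script


def chunk_ranges(size, chunk_size=CHUNK_SIZE):
    """Return (part_number, offset, max) tuples, one per HTTP Range request."""
    num_chunks = size // chunk_size  # integer division, as in $((size/CHUNK_SIZE))
    ranges = []
    offset = 0
    for i in range(1, num_chunks + 1):
        if i < num_chunks:
            last = offset + chunk_size - 1  # inclusive end of a full-size chunk
        else:
            last = size  # the final part absorbs the remainder of the file
        ranges.append((i, offset, last))
        offset += chunk_size
    return ranges


for part, start, end in chunk_ranges(200 * MB):
    print(f"part {part}: Range: bytes={start}-{end}")
```

Because the chunk count is rounded down, the final part can be larger than `CHUNK_SIZE`. A size below one chunk would yield no parts at all, but the script only takes the multipart path for files of at least `LARGE_FILE_SIZE`, which is two chunks.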
@@ -79,7 +89,7 @@ function download_multipart() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
function download_file() {
|
download_file() {
|
||||||
updated_path=$1
|
updated_path=$1
|
||||||
data_dir=$2
|
data_dir=$2
|
||||||
filename=$3
|
filename=$3
|
||||||
@@ -100,15 +110,15 @@ function download_file() {
|
|||||||
content_length=$(curl -I $url 2> /dev/null | awk '/^Content-Length:/ { print $2 }' | tr -d '[[:space:]]')
|
content_length=$(curl -I $url 2> /dev/null | awk '/^Content-Length:/ { print $2 }' | tr -d '[[:space:]]')
|
||||||
|
|
||||||
if [ $content_length -ge $LARGE_FILE_SIZE ]; then
|
if [ $content_length -ge $LARGE_FILE_SIZE ]; then
|
||||||
download_multipart $url $local_path $content_length $NUM_WORKERS
|
download_multipart $url $local_path $content_length
|
||||||
else
|
else
|
||||||
curl $url -o $local_path
|
curl $url --retry 3 --retry-delay 2 -o $local_path
|
||||||
fi
|
fi
|
||||||
|
|
||||||
if date -ur . >/dev/null 2>&1; then
|
if date -d "@$(date -ur . +%s)" >/dev/null 2>&1; then
|
||||||
echo $(date -d "$(date -d "@$(date -ur $local_path +%s)") + 1 second") > $updated_path;
|
echo $(date -d "$(date -d "@$(date -ur $local_path +%s)") + 1 second") > $updated_path;
|
||||||
elif stat -f %Sm . >/dev/null 2>&1; then
|
elif stat -f %Sm . >/dev/null 2>&1; then
|
||||||
echo $(date -r $(stat -f %m $local_path) -v+1S) > $updated_path;
|
echo $(date -ur $(stat -f %m $local_path) -v+1S) > $updated_path;
|
||||||
fi;
|
fi;
|
||||||
tar -xvzf $local_path -C $data_dir;
|
tar -xvzf $local_path -C $data_dir;
|
||||||
rm $local_path;
|
rm $local_path;
|
||||||
@@ -123,23 +133,23 @@ if [ $COMMAND = "download" ]; then
|
|||||||
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
|
||||||
download_file $LIBPOSTAL_DATA_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_DATA_FILE "data file"
|
download_file $LIBPOSTAL_DATA_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_DATA_FILE "data file"
|
||||||
fi
|
fi
|
||||||
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
|
||||||
download_file $LIBPOSTAL_GEO_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_GEODB_FILE "geodb data file"
|
download_file $LIBPOSTAL_GEO_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_GEODB_FILE "geodb data file"
|
||||||
fi
|
fi
|
||||||
if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "parser" ] || [ $FILE = "all" ]; then
|
||||||
download_file $LIBPOSTAL_PARSER_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_PARSER_FILE "parser data file"
|
download_file $LIBPOSTAL_PARSER_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_PARSER_FILE "parser data file"
|
||||||
fi
|
fi
|
||||||
if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "language_classifier" ] || [ $FILE = "all" ]; then
|
||||||
download_file $LIBPOSTAL_LANG_CLASS_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_LANG_CLASS_FILE "language classifier data file"
|
download_file $LIBPOSTAL_LANG_CLASS_UPDATED_PATH $LIBPOSTAL_DATA_DIR $LIBPOSTAL_LANG_CLASS_FILE "language classifier data file"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
elif [ $COMMAND = "upload" ]; then
|
elif [ $COMMAND = "upload" ]; then
|
||||||
|
|
||||||
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "base" ] || [ $FILE = "all" ]; then
|
||||||
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE ${BASIC_MODULE_DIRS[*]}
|
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $BASIC_MODULE_DIRS
|
||||||
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $LIBPOSTAL_S3_KEY
|
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_DATA_FILE $LIBPOSTAL_S3_KEY
|
||||||
fi
|
fi
|
||||||
|
|
||||||
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
|
if [ $FILE = "geodb" ] || [ $FILE = "all" ]; then
|
||||||
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $GEODB_MODULE_DIR
|
tar -C $LIBPOSTAL_DATA_DIR -cvzf $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $GEODB_MODULE_DIR
|
||||||
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $LIBPOSTAL_S3_KEY
|
aws s3 cp --acl=public-read $LIBPOSTAL_DATA_DIR/$LIBPOSTAL_GEODB_FILE $LIBPOSTAL_S3_KEY
|
||||||
|
|||||||
@@ -116,6 +116,8 @@ void add_latin_alternatives(string_tree_t *tree, char *str, size_t len, uint64_t
|
|||||||
}
|
}
|
||||||
free(transliterated);
|
free(transliterated);
|
||||||
transliterated = NULL;
|
transliterated = NULL;
|
||||||
|
} else {
|
||||||
|
string_tree_add_string(tree, str);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (prev_string != NULL) {
|
if (prev_string != NULL) {
|
||||||
|
|||||||
@@ -1,4 +1,5 @@
|
|||||||
CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic
|
CFLAGS_CONF = @CFLAGS@
|
||||||
|
CFLAGS = -I/usr/local/include -O2 -Wall -Wextra -Wfloat-equal -Wshadow -Wpointer-arith -Werror -pedantic $(CFLAGS_CONF)
|
||||||
|
|
||||||
noinst_LTLIBRARIES = libsparkey.la
|
noinst_LTLIBRARIES = libsparkey.la
|
||||||
libsparkey_la_SOURCES = endiantools.h hashheader.h logheader.h \
|
libsparkey_la_SOURCES = endiantools.h hashheader.h logheader.h \
|
||||||
@@ -7,4 +8,4 @@ logreader.c returncodes.c util.c buf.h hashalgorithms.h hashiter.h \
|
|||||||
sparkey.h util.h endiantools.c \
|
sparkey.h util.h endiantools.c \
|
||||||
hashheader.c hashreader.c logheader.c logwriter.c MurmurHash3.c \
|
hashheader.c hashreader.c logheader.c logwriter.c MurmurHash3.c \
|
||||||
sparkey-internal.h
|
sparkey-internal.h
|
||||||
libsparkey_la_LDFLAGS = -L/usr/local/lib
|
libsparkey_la_LDFLAGS = -L/usr/local/lib
|
||||||
|
|||||||
@@ -14,13 +14,17 @@
|
|||||||
* the License.
|
* the License.
|
||||||
*/
|
*/
|
||||||
#if defined(__linux)
|
#if defined(__linux)
|
||||||
#include <byteswap.h>
|
# include <byteswap.h>
|
||||||
#elif defined(__APPLE__)
|
#elif defined(__APPLE__)
|
||||||
#include <libkern/OSByteOrder.h>
|
# include <libkern/OSByteOrder.h>
|
||||||
#define bswap_32 OSSwapInt32
|
# define bswap_32 OSSwapInt32
|
||||||
#define bswap_64 OSSwapInt64
|
# define bswap_64 OSSwapInt64
|
||||||
|
#elif defined(__OpenBSD__)
|
||||||
|
# include <endian.h>
|
||||||
|
# define bswap_32 swap32
|
||||||
|
# define bswap_64 swap64
|
||||||
#else
|
#else
|
||||||
#error "no byteswap.h or libkern/OSByteOrder.h"
|
# error "no byteswap.h or libkern/OSByteOrder.h"
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#include <stddef.h>
|
#include <stddef.h>
|
||||||
|
|||||||
@@ -69,6 +69,8 @@
|
|||||||
|
|
||||||
#define is_punctuation(type) ((type) >= PERIOD && (type) < OTHER)
|
#define is_punctuation(type) ((type) >= PERIOD && (type) < OTHER)
|
||||||
|
|
||||||
|
#define is_special_punctuation(type) ((type) == AMPERSAND || (type) == PLUS || (type) == POUND)
|
||||||
|
|
||||||
#define is_special_token(type) ((type) == EMAIL || (type) == URL || (type) == US_PHONE || (type) == INTL_PHONE)
|
#define is_special_token(type) ((type) == EMAIL || (type) == URL || (type) == US_PHONE || (type) == INTL_PHONE)
|
||||||
|
|
||||||
#define is_whitespace(type) ((type) == WHITESPACE)
|
#define is_whitespace(type) ((type) == WHITESPACE)
|
||||||
|
|||||||
@@ -84,6 +84,31 @@ TEST test_expansions_language_classifier(void) {
|
|||||||
PASS();
|
PASS();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
TEST test_expansions_no_options(void) {
|
||||||
|
normalize_options_t options = get_libpostal_default_options();
|
||||||
|
options.lowercase = false;
|
||||||
|
options.latin_ascii = false;
|
||||||
|
options.transliterate = false;
|
||||||
|
options.strip_accents = false;
|
||||||
|
options.decompose = false;
|
||||||
|
options.trim_string = false;
|
||||||
|
options.drop_parentheticals = false;
|
||||||
|
options.replace_numeric_hyphens = false;
|
||||||
|
options.delete_numeric_hyphens = false;
|
||||||
|
options.split_alpha_from_numeric = false;
|
||||||
|
options.replace_word_hyphens = false;
|
||||||
|
options.delete_word_hyphens = false;
|
||||||
|
options.delete_final_periods = false;
|
||||||
|
options.delete_acronym_periods = false;
|
||||||
|
options.drop_english_possessives = false;
|
||||||
|
options.delete_apostrophes = false;
|
||||||
|
options.expand_numex = false;
|
||||||
|
options.roman_numerals = false;
|
||||||
|
|
||||||
|
CHECK_CALL(test_expansion_contains_with_languages("120 E 96th St New York", "120 E 96th St New York", options, 0, NULL));
|
||||||
|
PASS();
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
SUITE(libpostal_expansion_tests) {
|
SUITE(libpostal_expansion_tests) {
|
||||||
|
|
||||||
@@ -94,6 +119,7 @@ SUITE(libpostal_expansion_tests) {
|
|||||||
|
|
||||||
RUN_TEST(test_expansions);
|
RUN_TEST(test_expansions);
|
||||||
RUN_TEST(test_expansions_language_classifier);
|
RUN_TEST(test_expansions_language_classifier);
|
||||||
|
RUN_TEST(test_expansions_no_options);
|
||||||
|
|
||||||
libpostal_teardown();
|
libpostal_teardown();
|
||||||
libpostal_teardown_language_classifier();
|
libpostal_teardown_language_classifier();
|
||||||
|
|||||||