Merge branch 'master' into rust-bindings-mention
63
README.md
@@ -1,6 +1,6 @@
# libpostal: international street address NLP

[](https://travis-ci.org/openvenues/libpostal)
[](https://github.com/openvenues/libpostal/actions)
[](https://ci.appveyor.com/project/albarrentine/libpostal/branch/master)
[](https://github.com/openvenues/libpostal/blob/master/LICENSE)
[](#sponsors)
@@ -98,7 +98,7 @@ Before you install, make sure you have the following prerequisites:

**On Ubuntu/Debian**
```
sudo apt-get install curl autoconf automake libtool pkg-config
sudo apt-get install -y curl build-essential autoconf automake libtool pkg-config
```

**On CentOS/RHEL**
@@ -113,12 +113,26 @@ brew install curl autoconf automake libtool pkg-config

Then to install the C library:

If you're using an M1 Mac, add `--disable-sse2` to the `./configure` command. This will result in poorer performance but the build will succeed.

```
git clone https://github.com/openvenues/libpostal
cd libpostal

./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
./configure --datadir=[...some dir with a few GB of space where a "libpostal" directory exists or can be created/modified...]
make -j4

# For Intel/AMD processors and the default model
./configure --datadir=[...some dir with a few GB of space...]

# For Apple / ARM cpus and the default model
./configure --datadir=[...some dir with a few GB of space...] --disable-sse2

# For the improved Senzing model:
./configure --datadir=[...some dir with a few GB of space...] MODEL=senzing

make -j8
sudo make install

# On Linux it's probably a good idea to run
@@ -175,6 +189,24 @@ If you require a .lib import library to link this to your application. You can g
lib.exe /def:libpostal.def /out:libpostal.lib /machine:x64
```

Installation with an alternative data model
-------------------------------------------

An alternative data model is available for libpostal. It was created by Senzing Inc. to improve parsing of US, UK and Singapore addresses and handling of US rural route addresses.
To enable it, add `MODEL=senzing` to the `./configure` line during installation:
```
./configure --datadir=[...some dir with a few GB of space...] MODEL=senzing
```
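
As a rough illustration of how the `./configure` options in this README combine, here is a minimal Python sketch that assembles the argument list. The function name `configure_args`, the machine-name check and the idea of combining `--disable-sse2` with `MODEL=senzing` on ARM are assumptions made for illustration; only the three options themselves come from the README.

```python
import platform


def configure_args(datadir, machine=None, model=None):
    """Build a ./configure argument list (hypothetical helper, not part of libpostal)."""
    # Fall back to the local machine name when none is given.
    machine = (machine or platform.machine()).lower()
    args = ["./configure", f"--datadir={datadir}"]
    # Assumption: ARM machines report names such as "arm64" or "aarch64".
    if machine.startswith("arm") or machine == "aarch64":
        args.append("--disable-sse2")
    # Optional alternative model, e.g. "senzing".
    if model:
        args.append(f"MODEL={model}")
    return args


if __name__ == "__main__":
    # Prints the command line for the local machine with the Senzing model.
    print(" ".join(configure_args("/path/to/datadir", model="senzing")))
```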

The data for this model comes from [OpenAddresses](https://openaddresses.io/), [OpenStreetMap](https://www.openstreetmap.org/) and data generated by Senzing based on customer feedback (a few hundred records), for a total of about 1.2 billion records from over 230 countries, in 100+ languages. The OpenStreetMap and OpenAddresses data is good but not perfect, so the data set was cleaned by filtering out badly formed addresses, correcting misclassified address tokens and removing tokens that didn't belong in the addresses wherever these conditions were encountered.

Senzing created a data set of 12,950 addresses from 89 countries that it uses to test and verify the quality of its models. The data set was generated from random OSM addresses, at least 50 per country. Hard-to-parse addresses collected from the Senzing support team, customers and the libpostal GitHub page were added to this set. On this test set, the Senzing model produced 4.3% better parsing results than the default model.

The size of this model is about 2.2GB compared to 1.8GB for the default model, so keep that in mind if storage space is important.

Further information about this data model can be found at: https://github.com/Senzing/libpostal-data
If you run into any issues with this model, whether they have to do with parses, installation or any other problems, then please report them at https://github.com/Senzing/libpostal-data
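
Since the two models differ in size, it can help to check free space in the data directory before running `./configure`. The sketch below is only an illustration: the helper names, the 0.5 GB headroom and the use of decimal gigabytes are assumptions, and the sizes are the approximate figures quoted above.

```python
import shutil

# Approximate model sizes in GB, taken from the text above.
MODEL_SIZE_GB = {"default": 1.8, "senzing": 2.2}


def fits(free_bytes, model="default", headroom_gb=0.5):
    """Return True if free_bytes covers the model plus some headroom (hypothetical helper)."""
    needed_bytes = (MODEL_SIZE_GB[model] + headroom_gb) * 1e9
    return free_bytes >= needed_bytes


def datadir_has_room(datadir, model="default"):
    """Check an existing directory's free space with shutil.disk_usage."""
    return fits(shutil.disk_usage(datadir).free, model)


if __name__ == "__main__":
    # Checks the current directory; substitute the intended data directory.
    print(datadir_has_room(".", model="senzing"))
```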

Examples of parsing
-------------------
@@ -382,23 +414,19 @@ Libpostal is designed to be used by higher-level languages. If you don't see yo
- LuaJIT: [lua-resty-postal](https://github.com/bungle/lua-resty-postal)
- Perl: [Geo::libpostal](https://metacpan.org/pod/Geo::libpostal)
- Elixir: [Expostal](https://github.com/SweetIQ/expostal)
- Haskell: [haskell-postal](http://github.com/netom/haskell-postal)
- Rust: [rust-postal](https://github.com/pnordahl/rust-postal)
- Rust: [rustpostal](https://crates.io/crates/rustpostal)

**Database extensions**
**Unofficial database extensions**

- PostgreSQL: [pgsql-postal](https://github.com/pramsey/pgsql-postal)

**Unofficial REST API**
**Unofficial servers**

- Libpostal REST: [libpostal REST](https://github.com/johnlonganecker/libpostal-rest)

**Libpostal REST Docker**

- Libpostal REST Docker [Libpostal REST Docker](https://github.com/johnlonganecker/libpostal-rest-docker)

**Libpostal ZeroMQ Docker**

- Libpostal ZeroMQ Docker image: [pasupulaphani/libpostal-zeromq](https://hub.docker.com/r/pasupulaphani/libpostal-zeromq/) , Source: [Github](https://github.com/pasupulaphani/libpostal-docker)
- Libpostal REST Go Docker: [libpostal-rest-docker](https://github.com/johnlonganecker/libpostal-rest-docker)
- Libpostal REST FastAPI Docker: [libpostal-fastapi](https://github.com/alpha-affinity/libpostal-fastapi)
- Libpostal ZeroMQ Docker: [libpostal-zeromq](https://github.com/pasupulaphani/libpostal-docker)


Tests
@@ -473,7 +501,7 @@ optionally be separated so Rosenstraße and Rosen Straße are equivalent.
for a wide variety of countries and languages, not just US/English.
The model is trained on over 1 billion addresses and address-like strings, using the
templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
tagged traning examples for every inhabited country in the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/addresses/components.py)
tagged training examples for every inhabited country in the world. Many types of [normalizations](https://github.com/openvenues/libpostal/blob/master/scripts/geodata/addresses/components.py)
are performed to make the training data resemble real messy geocoder input as closely as possible.

- **Language classification**: multinomial logistic regression
@@ -495,7 +523,7 @@ language (IX => 9) which occur in the names of many monarchs, popes, etc.

- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
languages chracter by character instead of on whitespace.
languages character by character instead of on whitespace.

- **UTF8 normalization**: optionally decompose UTF8 to NFD normalization form,
strips accent marks e.g. à => a and/or applies Latin-ASCII transliteration.
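
The NFD-plus-accent-stripping step described in the last bullet can be sketched in a few lines of Python with the standard `unicodedata` module. This shows the general technique only, not libpostal's C implementation, and the function name `strip_accents` is made up for the example; Latin-ASCII transliteration is a separate step that is not covered here.

```python
import unicodedata


def strip_accents(text):
    """Decompose text to NFD, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks like U+0300 (combining grave).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("à"))  # prints "a"
```

Characters without a canonical decomposition, such as ß, pass through unchanged, which is why transliteration is listed as a separate option.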
@@ -519,6 +547,7 @@ Non-goals

- Verifying that a location is a valid address
- Actually geocoding addresses to a lat/lon (that requires a database/search index)
- Extracting addresses from free text

Raison d'être
-------------
@@ -624,7 +653,7 @@ libpostal is written in modern, legible, C99 and uses the following conventions:
- Confines almost all mallocs to *name*_new and all frees to *name*_destroy
- Efficient existing implementations for simple things like hashtables
- Generic containers (via [klib](https://github.com/attractivechaos/klib)) whenever possible
- Data structrues take advantage of sparsity as much as possible
- Data structures take advantage of sparsity as much as possible
- Efficient double-array trie implementation for most string dictionaries
- Cross-platform as much as possible, particularly for *nix