[readme] More informative README
# libpostal

Fast, international postal address normalization in C

*(ASCII-art libpostal logo)*

---------------------------------------------------------------------

**N.B.**: libpostal is not publicly released yet and the APIs may change. We
encourage folks to hold off on including it as a dependency for now.
Stay tuned...

libpostal is a fast, multilingual, all-i18n-everything NLP library for
normalizing and parsing physical addresses.

Addresses and the geographic coordinates they represent are essential for any
location-based application (map search, transportation, on-demand/delivery
services, check-ins, reviews). Yet even the simplest addresses are packed with
local conventions, abbreviations and context, making them difficult to
index/query effectively with traditional full-text search engines, which are
designed for document indexing. This library helps convert the free-form
addresses that humans use into clean normalized forms suitable for machine
comparison and full-text indexing.

libpostal is not itself a full geocoder, but should be a ubiquitous
preprocessing step before indexing/searching with free text geographic strings.
It is written in C for maximum portability and performance.

libpostal's raison d'être
-------------------------

libpostal was created as part of the [OpenVenues](https://github.com/openvenues/openvenues) project to solve
the problem of place deduping. In OpenVenues, we have a data set of millions of
places derived from terabytes of web pages from the [Common Crawl](http://commoncrawl.org/).
The Common Crawl is published every month, and so even merging the results of
two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents like web
pages, academic papers, etc. we've arrived at pretty decent approximate
similarity methods such as [MinHash](https://en.wikipedia.org/wiki/MinHash).

However, for physical addresses, the frequent use of conventional abbreviations
such as Road == Rd, California == CA, or New York City == NYC complicates
matters a bit. Even using a technique like MinHash, which is well suited for
approximate matches and is equivalent to the Jaccard similarity of two sets, we
have to work with very short texts and it's often the case that two equivalent
addresses, one abbreviated and one fully specified, will not match very closely
in terms of n-gram set overlap. In non-Latin scripts, say a Russian address and
its transliterated equivalent, it's conceivable that two addresses referring to
the same place may not match even a single character.
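
To make the n-gram overlap problem concrete, here is a small, self-contained
Python sketch (illustrative only, not part of libpostal) comparing character
trigram sets for two equivalent spellings of the Manhattan address used as the
motivating example below:

```
# Illustrative only -- not libpostal code. Shows how little n-gram overlap
# two equivalent addresses can have before normalization.

def trigrams(s):
    """Return the set of character trigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

abbreviated = "30 W 26th St Fl #7"
verbose = "30 West Twenty-sixth Street Floor Number 7"

print(jaccard(trigrams(abbreviated), trigrams(verbose)))
# Prints a low value (well under 0.5) even though both strings refer to
# the same physical location.
```
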
libpostal aims to create normalized geographic strings, parsed into components,
such that we can more effectively reason about how well two addresses
actually match.

As a motivating example, consider the following two equivalent ways to write a
particular Manhattan street address with varying conventions and degrees
of verbosity:

- 30 W 26th St Fl #7
- 30 West Twenty-sixth Street Floor Number 7

Obviously '30 W 26th St Fl #7' != '30 West Twenty-sixth Street Floor Number 7'
in a string comparison sense, but a human can grok that these two addresses
refer to the same physical location.

Isn't that geocoding?
---------------------

If the above sounds a lot like geocoding, that's because it's very similar,
only in the OpenVenues case, we do it without a UI or a user to select the
correct address in an autocomplete. It's server-side batch geocoding
(and you can too!).

Now, instead of giant Elasticsearch synonyms files and the like, geocoding can
look like this:

1. Run the addresses in your index through libpostal
2. Store the canonical strings
3. Run your user queries through libpostal and search with those strings
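
A minimal Python sketch of that pipeline, shelling out to the `./libpostal`
program built in `src/` (see "Command-line usage" below; the CLI and any
future library API may change before release, and expansion may eventually
return more than one candidate string per input):

```
# Sketch only, not an official API: normalize at index time and query time,
# then index/search the canonical strings instead of the raw input.
import subprocess

def normalize(address, language="en"):
    result = subprocess.run(
        ["./libpostal", address, language],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# 1. Run the addresses in your index through libpostal
# 2. Store the canonical strings (a set here, standing in for your index)
canonical_index = {normalize("30 West Twenty-sixth Street Floor Number 7")}

# 3. Run user queries through libpostal and search with those strings
query = normalize("30 W 26th St Fl #7")
print(query in canonical_index)  # True if both normalize to the same form
```
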
Features
--------

- **Abbreviation expansion**: e.g. expanding "rd" => "road" but for almost any
language. libpostal supports > 50 languages and it's easy to add new languages
or expand the current dictionaries. Ideographic languages (not separated by
whitespace e.g. Chinese) are supported, as are Germanic languages where
thoroughfare types are concatenated onto the end of the string, and may
optionally be separated so Rosenstraße and Rosen Straße are equivalent.

- **International address parsing (coming soon)**: sequence model which parses
"123 Main Street New York New York" into {"house_number": 123, "road":
"Main Street", "city": "New York", "region": "New York"}. Unlike the majority
of parsers out there, it works for a wide variety of countries and languages,
not just US/English. The model is trained on > 40M OSM addresses, using the
templates in the [OpenCage address formatting repo](https://github.com/OpenCageData/address-formatting) to construct formatted,
tagged training examples for most countries around the world.

- **Language classification (coming soon)**: multinomial logistic regression
trained on all of OpenStreetMap ways, addr:* tags, toponyms and formatted
addresses. Labels are derived using point-in-polygon tests in Quattroshapes
and official/regional languages for countries and admin 1 boundaries
respectively. So, for example, Spanish is the default language in Spain, but
in some regions e.g. Catalunya, Galicia, the Basque region, regional
languages are the default. Dictionary-based disambiguation is employed in
cases where the regional language is non-default e.g. Welsh, Breton, Occitan.

- **Numeric expression parsing** ("twenty first" => 21st,
"quatre-vingt-douze" => 92, using data provided in CLDR), supports > 30
languages. Handles languages with concatenated expressions e.g.
milleottocento => 1800. Optionally normalizes Roman numerals regardless of the
language (IX => 9), which occur in the names of many monarchs, popes, etc.

- **Geographic name aliasing**: New York, NYC and Nueva York alias to New York
City. Uses the crowd-sourced GeoNames (geonames.org) database, so alternate
names added by contributors can automatically improve libpostal.

- **Geographic disambiguation (coming soon)**: There are several equally
likely Springfields in the US (formally known as The Simpsons problem), and
some context like a state is required to disambiguate. There are also > 1200
distinct San Franciscos in the world but the term "San Francisco" almost always
refers to the one in California. Williamsburg can refer to a neighborhood in
Brooklyn or a city in Virginia. Geo disambiguation is a subset of Word Sense
Disambiguation, and attempts to resolve place names in a string to GeoNames
entities. This can be useful for city-level geocoding suitable for polygon/area
lookup. By default, if there is no other context, as in the San Francisco case,
the most populous entity will be selected.

- **Ambiguous token classification (coming soon)**: e.g. "dr" => "doctor" or
"drive" for an English address depending on the context. Multiclass logistic
regression trained on OSM addresses, where abbreviations are discouraged,
giving us many examples of fully qualified addresses on which to train.

- **Fast, accurate tokenization/lexing**: clocked at > 1M tokens / sec,
implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian
languages character by character instead of on whitespace.

- **UTF8 normalization**: optionally decompose UTF8 to NFD normalization form,
strip accent marks e.g. à => a, and/or apply Latin-ASCII transliteration
(see the short sketch after this list).

- **Transliteration**: e.g. улица => ulica or ulitsa. Uses all
[CLDR transforms](http://www.unicode.org/repos/cldr/trunk/common/transforms/), which is what ICU uses,
but libpostal doesn't require pulling in all of ICU (possibly conflicting with
your system's version). Note: some languages, particularly Hebrew, Arabic
and Thai may not include vowels and thus will not often match a transliteration
done by a human. It may be possible to implement statistical transliterators
for some of these languages.

- **Script detection**: Detects which script a given string uses (can be
multiple e.g. a free-form Hong Kong or Macau address may use both Han and
Latin scripts in the same address). In transliteration we can use all
applicable transliterators for a given Unicode script (Greek can for instance
be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
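
As a rough illustration of the UTF8 normalization bullet above, here is a
minimal Python sketch of NFD decomposition plus accent stripping (libpostal's
own implementation is in C and configurable; this is not its code path):

```
# Illustrative only: NFD-decompose, then drop combining marks to get the
# "à => a" behaviour described in the UTF8 normalization bullet.
import unicodedata

def strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Rue de l'Église"))  # Rue de l'Eglise
print(strip_accents("São Paulo"))        # Sao Paulo
```
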
Non-goals
---------

- Verifying that a location is a valid address
- Street-level geocoding

Examples of expansion
---------------------

Like many problems in information extraction and NLP, address normalization
may sound trivial initially, but in fact can be quite complicated in real
natural language texts. Here are some examples of the kinds of address-specific
challenges libpostal can handle:

| Input                              | Output                                 |
| ---------------------------------- | -------------------------------------- |
| One-hundred twenty E 96th St       | 120 east 96th street                   |
| C/ Ocho, P.I. 4                    | calle 8, polígono industrial 4         |
| V XX Settembre, 20                 | via 20 settembre, 20                   |
| Quatre vingt douze Rue de l'Église | 92 rue de l' église                    |
| ул Каретный Ряд, д 4, строение 7   | улица каретныи ряд, дом 4, строение 7  |
| ул Каретный Ряд, д 4, строение 7   | ulica karetnyj rad, dom 4, stroenie 7  |
| Marktstrasse 14                    | markt straße 14                        |

For further reading and some less intuitive examples of addresses, see
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".

Why C (you crazy person)?
-------------------------

libpostal is written in C for three reasons (in order of importance):

1. **Portability/ubiquity**: libpostal targets higher-level languages that
people actually use day-to-day: Python, Go, Ruby, NodeJS, etc. The beauty of C
is that just about any programming language can bind to it and C compilers are
everywhere, so pick your favorite, write a binding, and you can use libpostal
directly in your application without having to stand up a separate server. We
support Mac/Linux (Windows is not a priority but we're happy to accept patches),
have a standard autotools build and an endianness-agnostic file format for the
data files. The Python bindings are maintained as part of this repo since
they're needed to construct the training data.

2. **Memory-efficiency**: libpostal is designed to run in a MapReduce setting
where we may be limited to < 1GB of RAM per process depending on the machine
configuration. As much as possible libpostal uses contiguous arrays, tries
(built on contiguous arrays), bloom filters and compressed sparse matrices to
keep memory usage low. It's conceivable that libpostal could even be used on
a mobile device, although that's not an explicit goal of the project.

3. **Performance**: this is last on the list for a reason. Most of the
optimizations in libpostal are for memory usage rather than performance.
libpostal is quite fast given the amount of work it does. It can process
10-30k addresses / second in a single thread/process on the platforms we've
tested (that means processing every address in OSM planet in a little over
an hour). Check out the simple benchmark program to test on your environment
and various types of input. In the MapReduce setting, per-core performance
isn't as important because everything's being done in parallel, but there are
some streaming ingestion applications at Mapzen where this needs to
run in-process.

Design philosophy
-----------------

libpostal is written in modern, legible C99.

- Keep it object-oriented(-ish)
- Confine almost all mallocs to *name*_new and all frees to *name*_destroy
- Don't write custom hashtables, sorting algorithms or other undergrad CS stuff
- Use generic containers from klib where possible
- Take advantage of sparsity in all data structures
- Use char_array (inspired by [sds](https://github.com/antirez/sds)) when possible instead of C strings
- Thoroughly test for memory leaks before pushing
- Keep it reasonably cross-platform compatible, particularly for *nix

Language dictionaries
---------------------

It's easy to add new languages/synonyms to libpostal by modifying a few text
files. The format of each dictionary file roughly resembles a
Lucene/Elasticsearch synonyms file:

```
drive|dr
street|st|str
road|rd
```

The leftmost string is treated as the canonical/normalized version. Synonyms,
if any, are appended to the right, delimited by the pipe character.
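
For illustration, here is a tiny Python sketch (not libpostal's actual loader,
which is written in C) of how a line in this format maps to a canonical form
and its synonyms:

```
# Illustrative only: parse synonyms-style dictionary lines, where the
# leftmost entry is canonical and the rest are synonyms/abbreviations.
def parse_dictionary_line(line):
    canonical, *synonyms = line.strip().split("|")
    return canonical, synonyms

expansions = {}
for line in ["drive|dr", "street|st|str", "road|rd"]:
    canonical, synonyms = parse_dictionary_line(line)
    for synonym in synonyms:
        expansions[synonym] = canonical

print(expansions["st"])  # street
```
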
The supported languages can be found in the [resources/dictionaries](https://github.com/openvenues/libpostal/tree/master/resources/dictionaries) directory.

Each language can define one or more dictionaries (sometimes called "gazetteers" in NLP) to help with address parsing and abbreviation normalization. The dictionary types are:

- **academic_degrees.txt**: for post-nominal strings like "M.D.", "Ph.D.", etc.
- **ambiguous_expansions.txt**: e.g. "E" could be expanded to "East" or could
be "E Street", so when the string is encountered, it can either be left alone
or expanded
- **building_types.txt**: strings indicating a building/house
- **company_types.txt**: company suffixes like "Inc" or "GmbH"
- **concatenated_prefixes_separable.txt**: things like "Hinter..." which can
be written either concatenated or as separate tokens
- **concatenated_suffixes_inseparable.txt**: things like "...bg." => "...burg"
where the suffix cannot be separated from the main token, but either has an
abbreviated equivalent or simply can help identify the token in parsing as,
say, part of a street name
- **directionals.txt**: strings indicating directions (cardinal and
lower/central/upper, etc.)
- **level_types.txt**: strings indicating a particular floor
- **no_number.txt**: strings like "no fixed address"
- **nulls.txt**: strings meaning "not applicable"
- **personal_suffixes.txt**: post-nominal suffixes, usually generational
like Jr/Sr
- **personal_titles.txt**: civilian, royal and military titles
- **place_names.txt**: strings found in names of places e.g. "theatre",
"aquarium", "restaurant". See [Nominatim Special Phrases](http://wiki.openstreetmap.org/wiki/Nominatim/Special_Phrases)
- **post_office.txt**: strings like "p.o. box"
- **qualifiers.txt**: strings like "township"
- **stopwords.txt**: mostly prepositions and articles, very common words
which may be ignored in some contexts
- **street_types.txt**: words like "street", "road", "drive" which indicate
a thoroughfare, and their respective abbreviations
- **synonyms.txt**: any miscellaneous synonyms/abbreviations e.g. "bros"
expands to "brothers", etc. These have no special meaning and will essentially
just be treated as string replacements
- **toponyms.txt**: abbreviations relating to toponyms like regions, places,
etc. Note: GeoNames covers most of these. In most cases it's better to leave
these alone
- **unit_types.txt**: strings indicating an apartment or unit number

Most of the dictionaries have been derived with the following process:

1. Tokenize all the streets in OSM for a particular language
2. Count the words
3. Optionally use frequent item set mining to get frequent phrases
4. Run the most frequent words/phrases through Google Translate
5. Add the ones that mean "street" to dictionaries
6. Research thoroughfare types in a given country
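
A rough Python sketch of steps 1-3 (illustrative only: it uses naive whitespace
tokenization rather than libpostal's TR-29 tokenizer, and a small hypothetical
list of street names instead of an OSM extract):

```
# Illustrative sketch of steps 1-3: tokenize street names for one language
# and count the most frequent tokens, which are then reviewed by hand.
from collections import Counter

# Hypothetical sample; in practice this would be streamed from an OSM extract.
street_names = [
    "Rue de la Paix",
    "Rue du Faubourg Saint-Honoré",
    "Avenue des Champs-Élysées",
    "Avenue de la République",
]

token_counts = Counter(
    token.lower()
    for name in street_names
    for token in name.split()  # libpostal itself uses a TR-29 tokenizer
)

print(token_counts.most_common(5))
# Frequent tokens like "rue", "avenue", "de" float to the top; the most
# common words/phrases are what get translated and reviewed.
```
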
In the future it might be beneficial to move the dictionaries to a wiki
so they can be crowdsourced by native speakers regardless of whether or not
they use git.

Installation
------------

For C users or those writing bindings (if you've written a language
binding, please let us know!):

```
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make
sudo make install
```

libpostal needs to download some data files from S3. This is done automatically
when you run make. Mapzen maintains an S3 bucket containing said data files,
but they can also be built manually.

To install via Python, just use:

```
pip install git+https://github.com/openvenues/libpostal.git
```

Command-line usage
------------------

After building libpostal:

```
cd src/

./libpostal "12 Three-hundred and forty-fifth ave, ste. no 678" en
# 12 345th avenue, suite number 678
```

Currently libpostal requires two input strings: the address text and a language
code (ISO 639-1).

Todos
-----

1. Finish debugging/fully train address parser and publish model
2. Port language classification from Python, train and publish model
3. Python bindings and documentation
4. Publish tests (currently not on Github) and set up continuous integration
5. Hosted documentation