[docs] README updates, better explanations of normalization and parsing

2015-12-16 02:19:10 -05:00
parent 3e44910664
commit 59cc6d3417
1 changed files with 63 additions and 38 deletions
--- a/README.md
+++ b/README.md
@@ -38,35 +38,57 @@ It is written in C for maximum portability and performance.
 Examples of normalization
 -------------------------

-Like many problems in information extraction and NLP, address normalization
-may sound trivial initially, but in fact can be quite complicated in real
-natural language texts. Here are some examples of the kinds of address-specific
-challenges libpostal can handle:
+Address normalization may sound trivial initially, especially when thinking
+only about the US (if that's where you happen to reside), but it only takes
+a few examples to realize how complicated natural language addresses are
+internationally. Here's a short list of some less straightforward normalizations
+in various languages. The left/right columns in this table are equivalent
+strings under libpostal, the left column being user input and the right column
+being the indexed (normalized) string.

 | Input                               | Output (may be multiple in libpostal) |
 | ----------------------------------- |---------------------------------------|
 | One-hundred twenty E 96th St        | 120 east 96th street                  |
-| C/ Ocho, P.I. 4                     | calle 8, polígono industrial 4        |
-| V XX Settembre, 20                  | via 20 settembre, 20                  |
+| C/ Ocho, P.I. 4                     | calle 8 polígono industrial 4         |
+| V XX Settembre, 20                  | via 20 settembre 20                   |
 | Quatre vingt douze R. de l'Église   | 92 rue de l' église                   |
-| ул Каретный Ряд, д 4, строение 7    | улица каретныи ряд, дом 4, строение 7 |
-| ул Каретный Ряд, д 4, строение 7    | ulica karetnyj rad, dom 4, stroenie 7 |
+| ул Каретный Ряд, д 4, строение 7    | улица каретныи ряд дом 4 строение 7   |
+| ул Каретный Ряд, д 4, строение 7    | ulica karetnyj rad dom 4 stroenie 7   |
 | Marktstrasse 14                     | markt straße 14                       |

-For further reading and some less intuitive examples of addresses, see
-"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
+libpostal currently supports these types of normalization in *over 60 languages*,
+and you can add more (without having to write any C!).

+Now, instead of trying to bake address-specific conventions into traditional
+document search engines like Elasticsearch using giant synonyms files, scripting,
+custom analyzers, tokenizers, and the like, geocoding can be as simple as:
+
+1. Run the addresses in your index through libpostal's expand_address
+2. Store the normalized string(s) in your favorite search engine, DB, 
+   hashtable, etc.
+3. Run your user queries or fresh imports through libpostal and search
+   the existing database using those strings
+
+In this way, libpostal can perform fuzzy address matching in constant time
+relative to the size of the data set when the normalized strings are used
+as hash keys.
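
The three steps above can be sketched as follows. This is a minimal illustration and not libpostal's API: `toy_expand` is a hypothetical stand-in for `expand_address` that only lowercases, strips punctuation, and expands a few hard-coded abbreviations. What it shows is the shape of the workflow: normalized strings become hash keys, so a query costs one normalization plus one dictionary lookup.

```python
import re

# Hypothetical abbreviation table; libpostal's real dictionaries are far larger.
ABBREVIATIONS = {"st": "street", "e": "east", "ave": "avenue"}


def toy_expand(address):
    """Hypothetical stand-in for expand_address: returns one normalized string."""
    tokens = re.findall(r"\w+", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def build_index(records):
    """Steps 1 and 2: normalize each stored address and use it as a hash key."""
    index = {}
    for record_id, address in records:
        index.setdefault(toy_expand(address), []).append(record_id)
    return index


def lookup(index, query):
    """Step 3: normalize the query the same way and do a single dict lookup."""
    return index.get(toy_expand(query), [])


index = build_index([(1, "120 E 96th St"), (2, "781 Franklin Ave")])
print(lookup(index, "120 east 96th street"))  # matches record 1
print(lookup(index, "781 FRANKLIN AVE."))     # matches record 2
```

Since the table above notes that the output may be multiple strings per input, a real index would presumably store every returned string as a key rather than just one.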
+
+For further reading and some bizarre address edge-cases, see:
+[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/).

 Examples of parsing
 -------------------

-libpostal's address parser is trained on ~50M addresses (everything in OSM),
-using the address format templates in https://github.com/OpenCageData/address-formatting
-and perturbing the inputs in a number of ways to make the parser as robust
-as possible to messy real-world input.
+libpostal implements the first truly international statistical address parser,
+trained on ~50 million addresses in over 100 countries speaking over 60
+languages. We use OpenStreetMap (anything with an addr:* tag) and the OpenCage
+address format templates at: https://github.com/OpenCageData/address-formatting
+to construct the training data, supplementing with containing polygons and
+perturbing the inputs in a number of ways to make the parser as robust as possible
+to messy real-world input.

-These examples are taken from the interactive address_parser program that builds
-with libpostal on make.
+These example parses are taken from the interactive address_parser program
+that builds with libpostal on make. Note that the parser doesn't care about commas
+vs. no commas, casing, or different permutations of components (including when
+components are left out, e.g. just city or just city/postcode).

 ```
 > 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA 
@@ -153,6 +175,10 @@ Result:
 }
 ```

+The parser achieves very high accuracy on held-out data, currently 98.9%
+correct full parses (a parse counts toward the numerator only when *every*
+token in the address is labeled correctly).
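
To make the metric concrete, here is a small sketch of full-parse accuracy next to per-token accuracy. The label sequences are made up for illustration and are not real parser output; the function names are hypothetical as well.

```python
def full_parse_accuracy(predicted, gold):
    """Fraction of addresses whose predicted labels match gold on every token."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def token_accuracy(predicted, gold):
    """Fraction of individual tokens labeled correctly, for comparison."""
    pairs = [(p, g) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    return sum(1 for p, g in pairs if p == g) / len(pairs)


# Made-up label sequences for two addresses (not real parser output).
gold = [["house_number", "road", "road", "city"], ["road", "house_number"]]
pred = [["house_number", "road", "city", "city"], ["road", "house_number"]]

print(full_parse_accuracy(pred, gold))  # one of two addresses is fully correct
print(token_accuracy(pred, gold))       # five of six tokens are correct
```

A single mislabeled token zeroes out the whole address under the full-parse measure, which is why it is the stricter of the two numbers.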
+
 Installation
 ------------

@@ -228,10 +254,9 @@ After building libpostal:
 cd src/

 ./address_parser
-
 ```

-address_parser is an interactive shell, just type addresses and libpostal will
+address_parser is an interactive shell. Just type addresses and libpostal will
 parse them and print the result.

 Data files
@@ -242,7 +267,6 @@ representations of the data structures necessary to perform expansion. For addre
 parsing, since model training takes about a day, we publish the fully trained model 
 to S3 and will update it automatically as new addresses get added to OSM.

-
 Data files are automatically downloaded when you run make. To check for and download
 any new data files, run:

@@ -375,15 +399,10 @@ So it's not a geocoder?

 If the above sounds a lot like geocoding, that's because it is in a way,
 only in the OpenVenues case, we do it without a UI or a user to select the
-correct address in an autocomplete. libpostal does server-side batch geocoding
-(and you can too!)
-
-Now, instead of fiddling with giant Elasticsearch synonyms files, scripting,
-analyzers, tokenizers, and the like, geocoding can look like this:
-
-1. Run the addresses in your index through libpostal
-2. Store the canonical strings
-3. Run your user queries through libpostal and search with those strings
+correct address in an autocomplete. Given a database of source addresses
+such as OpenAddresses or OpenStreetMap (or all of the above), libpostal
+can be used to implement things like address deduping and server-side
+batch geocoding in settings like MapReduce.

 Why C?
 ------
@@ -542,20 +561,26 @@ There are four primary ways the address parser can be improved even further
 1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be
   incorporated automatically into the parser next time it's trained.
 2. If the address parser isn't working well for a particular country, language
-   or style of address, chances are that the template can be added at:
-   https://github.com/OpenCageData/address-formatting. This repo helps us
-   format OSM addresses and create the training data used by the address parser.
+   or style of address, chances are that some name variations or places are
+   being missed/mislabeled during training data creation. Sometimes the fix is to
+   add more countries at: https://github.com/OpenCageData/address-formatting,
+   and in many other cases there are relatively simple tweaks we can make
+   when creating the training data that will ensure the model is trained to
+   handle your use case without you having to do any manual data entry.
+   If you see a pattern of obviously bad address parses, post an issue to
+   Github and we'll tr
 3. We currently don't have training data for things like flat numbers.
   The tags are fairly uncommon in OSM and the address-formatting templates
   don't use floor, level, apartment/flat number, etc. This would be a slightly
-   more involved effort, but would be happy to discuss.
-4. Moving to a CRF may improve parser performance on certain kinds of input
-   since the score is the argmax over the entire sequence not just the token.
-   This may slow down training significantly. 
+   more involved effort, but we would like to begin a discussion around it.
+4. We use a greedy averaged perceptron for the parser model. Viterbi inference
+   using a linear-chain CRF may improve parser performance on certain classes
+   of input since the prediction is the argmax over the entire label sequence,
+   not just the current token. This may slow down training significantly.
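
The difference described in item 4 can be seen on a toy example. The sketch below is not libpostal's model: the labels, emission scores, and transition scores are invented, and no training is shown. It only contrasts greedy left-to-right decoding with Viterbi decoding over the same scores.

```python
LABELS = ["road", "city"]


def greedy_decode(emissions, transitions):
    """Pick the best label at each token given only the previous choice."""
    path, prev = [], None
    for scores in emissions:
        best = max(
            LABELS,
            key=lambda lab: scores[lab]
            + (transitions[(prev, lab)] if prev is not None else 0.0),
        )
        path.append(best)
        prev = best
    return path


def viterbi_decode(emissions, transitions):
    """Return the label sequence with the highest total score."""
    # best[lab] = (score of the best path ending in lab, that path)
    best = {lab: (emissions[0][lab], [lab]) for lab in LABELS}
    for scores in emissions[1:]:
        new_best = {}
        for lab in LABELS:
            prev_lab = max(LABELS, key=lambda p: best[p][0] + transitions[(p, lab)])
            prev_score, prev_path = best[prev_lab]
            total = prev_score + transitions[(prev_lab, lab)] + scores[lab]
            new_best[lab] = (total, prev_path + [lab])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Invented scores: "road" looks slightly better for the first token in
# isolation, but the city -> city transition is strongly favored.
emissions = [{"road": 1.0, "city": 0.9}, {"road": 0.0, "city": 0.0}]
transitions = {
    ("road", "road"): 0.1, ("road", "city"): 0.0,
    ("city", "road"): 0.0, ("city", "city"): 2.0,
}

print(greedy_decode(emissions, transitions))   # commits to "road" early
print(viterbi_decode(emissions, transitions))  # finds the better full sequence
```

Greedy decoding commits to the locally best first label and cannot revise it, while Viterbi keeps the best path per label at each step and so recovers the higher-scoring sequence.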

 Todos
 -----

-1. Port language classification from Python, train and publish model
-2. Publish tests (currently not on Github) and set up continuous integration
-3. Hosted documentation
+- [ ] Port language classification from Python, train and publish model
+- [ ] Publish tests (currently not on Github) and set up continuous integration
+- [ ] Hosted documentation