[readme] Readme fixes and additions

Al
2015-09-26 23:32:19 -04:00
parent 5b829cd5a7
commit a3214b7914


@@ -185,8 +185,8 @@ challenges libpostal can handle:
For further reading and some less intuitive examples of addresses, see For further reading and some less intuitive examples of addresses, see
"[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)". "[Falsehoods Programmers Believe About Addresses](https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/)".
Why C (i.e. are you crazy)? Why C?
--------------------------- ------
libpostal is written in C for three reasons (in order of importance): libpostal is written in C for three reasons (in order of importance):
@@ -218,22 +218,35 @@ isn't as important because everything's being done in parallel, but there are
some streaming ingestion applications at Mapzen where this needs to some streaming ingestion applications at Mapzen where this needs to
run in-process. run in-process.
Design philosophy C codebase
----------------- ----------
libpostal is written in modern, legible, C99. libpostal is written in modern, legible, C99 and uses the following conventions:
- Keep it roughly object-oriented, as allowed by C - Roughly object-oriented, as much as allowed by C
- Confine almost all mallocs to *name*_new and all frees to *name*_destroy - Almost no pointer-based data structures, arrays all the way down
- Don't write custom hashtables, sorting algorithms, other undergrad CS stuff - Uses dynamic character arrays (inspired by [sds](https://github.com/antirez/sds)) for safer string handling
- Use generic containers from klib where possible - Confines almost all mallocs to *name*_new and all frees to *name*_destroy
- Take advantage of sparsity in all data structures - Efficient existing implementations for simple things like hashtables
- Use char_array (inspired by [sds](https://github.com/antirez/sds)) when possible instead of C strings. - Generic containers (via [klib](https://github.com/attractivechaos/klib)) whenever possible
- Throughly test for memory leaks before pushing - Data structures take advantage of sparsity as much as possible
- Keep it reasonably cross-platform compatible, particularly for *nix - Efficient double-array trie implementation for most string dictionaries
- Tries to stay cross-platform as much as possible, particularly for *nix
Python codebase
---------------
There are actually two Python packages in libpostal.
1. **geodata**: generates C files and data sets used in the C build
2. **pypostal**: Python bindings for libpostal
geodata is simply a confederation of scripts which share some common code.
Said scripts shouldn't be needed for most users unless you're rebuilding data
files for the C lib.
Language dictionaries Language dictionaries
---------------------- ---------------------
It's easy to add new languages/synonyms to libpostal by modifying a few text It's easy to add new languages/synonyms to libpostal by modifying a few text
files. The format of each dictionary file roughly resembles a files. The format of each dictionary file roughly resembles a
@@ -291,9 +304,9 @@ In most cases better to leave these alone
Most of the dictionaries have been derived with the following process: Most of the dictionaries have been derived with the following process:
1. Tokenize all the streets in OSM for language x 1. Tokenize every street name in OSM for language x
2. Count the most common N tokens 2. Count the most common N tokens
3. Optionally use frequent item set mining to get frequent phrases 3. Optionally use frequent item set techniques to extract phrases
4. Run the most frequent words/phrases through Google Translate 4. Run the most frequent words/phrases through Google Translate
5. Add the ones that mean "street" to dictionaries 5. Add the ones that mean "street" to dictionaries
6. Augment by researching addresses in countries speaking language x 6. Augment by researching addresses in countries speaking language x
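
Steps 1 and 2 of the process in the hunk above (tokenize every street name, then count the most common N tokens) can be sketched roughly as follows. This is a minimal illustration, not the actual geodata scripts: the sample street names, the regex-based `tokenize` helper, and the `most_common_tokens` function are all hypothetical stand-ins, and a real run would read street names from an OSM extract and likely use libpostal's own tokenizer.

```python
import re
from collections import Counter

# Hypothetical sample; the real input would be street names pulled from OSM
# for a single language.
STREET_NAMES = [
    "Calle de Alcalá",
    "Calle Mayor",
    "Avenida de América",
    "Calle del Prado",
    "Avenida de la Constitución",
]


def tokenize(name):
    """Lowercase a street name and split it into word tokens.

    Assumption: a simple Unicode-aware regex split stands in for a real tokenizer.
    """
    return re.findall(r"\w+", name.lower())


def most_common_tokens(names, n):
    """Return the n most frequent tokens across all names as (token, count) pairs."""
    counts = Counter()
    for name in names:
        counts.update(tokenize(name))
    return counts.most_common(n)


if __name__ == "__main__":
    for token, count in most_common_tokens(STREET_NAMES, 3):
        print(token, count)
```

The frequent tokens this surfaces ("calle" and "avenida" in the sample) are the candidates that steps 4 and 5 would then translate and, if they mean "street", add to a dictionary. Function words such as "de" also rank highly, which is one reason the later manual steps are needed.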
@@ -305,7 +318,7 @@ they use git.
Installation Installation
------------ ------------
For C users or those writing bindings (if you've written a languag For C users or those writing bindings (if you've written a language
binding, please let us know!): binding, please let us know!):
``` ```