author Jehan <jehan@girinstud.io> 2020-04-29 16:12:54 +0200
committer Jehan <jehan@girinstud.io> 2020-04-29 16:20:00 +0200
commit c8a3572cca834d687b478522385530645a261d40
tree 7a017f560c411f48a4204fe8853fe4509bdab3ad /README.md
parent 472a906844ef0428a2e9367294db68ed343242f6
Issue #17: update README.
Replace the old link to the science paper with one on the www-archive.mozilla.org website. Remove the original source link, as I can't find any archived version of it (even on archive.org, only the folder structure is saved, not the files themselves, so it's useless). Also add some history, which is probably a nice touch. Add a link to crossroad to help people who want to cross-compile uchardet. Finally, add the R binding by Artem Klevtsov and QtAV, as reported.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md | 41
1 files changed, 36 insertions, 5 deletions
diff --git a/README.md b/README.md
index a2713ae..bf09091 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,6 @@
uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation.
-The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
-
-Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
-
## Supported Languages/Encodings
* International (Unicode)
@@ -194,7 +190,8 @@ to use MinGW-w64 instead of MinGW, in particular to build both 32 and
64-bit DLL libraries).
Note also that it is very easily cross-buildable (for instance from a
-GNU/Linux machine).
+GNU/Linux machine; [crossroad](https://pypi.org/project/crossroad/) may
+help, this is what we use in our CI).
### Build from source
@@ -254,8 +251,41 @@ Options:
See [uchardet.h](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/src/uchardet.h)
+## History
+
+As said in introduction, this was initially a project of Mozilla to
+allow better detection of page encodings, and it used to be part of
+Firefox. If not mistaken, this is not the case anymore (probably because
+nowadays most websites better announce their encoding, and also UTF-8 is
+much more widely spread).
+
+Techniques used by universalchardet are described at https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
+
+It is to be noted that a lot has changed since the original code, yet
+the base concept is still around, basing detection not just on encoding
+rules, but importantly on analysis of character statistics in languages.
+
+Original code by Mozilla does not seem to be found anymore anywhere, but
+it's probably not too far from the initial commit of this repository.
+
+Mozilla code was extracted and packaged into a standalone library under
+the name `uchardet` by BYVoid in 2011, in a personal repository.
+Starting 2015, I (i.e. Jehan) started contributing, "standardized"
+the output to be iconv-compatible, added various encoding/language
+support and streamlined generation of sources for new support of
+encoding/languages by using texts from Wikipedia as statistics source on
+languages through Python scripts. Then I soon became co-maintainer.
+In 2016, `uchardet` became a freedesktop project.
+
## Related Projects
+Some of these are bindings of `uchardet`, others are forks of the same
+initial code, which has diverged over time, others are native port in
+other languages.
+This list is not exhaustive and only meant as point of interest. We
+don't follow the status for these projects.
+
+ * [R-uchardet](https://cran.r-project.org/package=uchardet) R binding on CRAN
* [python-chardet](https://github.com/chardet/chardet) Python port
* [ruby-rchardet](http://rubyforge.org/projects/chardet/) Ruby port
* [juniversalchardet](http://code.google.com/p/juniversalchardet/) Java port of universalchardet
@@ -272,6 +302,7 @@ See [uchardet.h](https://gitlab.freedesktop.org/uchardet/uchardet/-/blob/master/
* [Tepl](https://wiki.gnome.org/Projects/Tepl)
* [Nextcloud IOS app](https://github.com/nextcloud/ios)
* [Codelite](https://codelite.org)
+* [QtAV](https://www.qtav.org/)
* …
## Licenses