:: libTextCat 2.0 ::
What is it?
Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, "N-Gram-Based
Text Categorization" [1]. It was primarily developed for language
guessing, a task on which it is known to perform with near-perfect
accuracy.
The central idea of the Cavnar & Trenkle technique is to calculate a
"fingerprint" of a document with an unknown category, and compare this
with the fingerprints of a number of documents of which the categories
are known. The categories of the closest matches are output as the
classification. A fingerprint is a list of the most frequent n-grams
occurring in a document, ordered by frequency. Fingerprints are
compared with a simple out-of-place metric. See the article for more
details.
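To make the comparison concrete, here is a sketch of the out-of-place
metric in C. The names and the penalty constant are illustrative and
not part of the library's API:

#include <string.h>

#define OUT_OF_PLACE 400   /* penalty for an n-gram absent from the other list */

/* Rank of an n-gram in a frequency-ordered fingerprint, or -1 if absent. */
static int rank_of( const char *ngram, const char **fp, int n )
{
    int i;
    for ( i = 0; i < n; i++ )
        if ( strcmp( ngram, fp[i] ) == 0 )
            return i;
    return -1;
}

/* Out-of-place distance: for each n-gram in the unknown document's
 * fingerprint, add how far its rank is displaced in the category
 * fingerprint. The category with the smallest total distance wins. */
int fp_distance( const char **unknown, int nu,
                 const char **category, int nc )
{
    int i, r, dist = 0;
    for ( i = 0; i < nu; i++ ) {
        r = rank_of( unknown[i], category, nc );
        dist += ( r < 0 ) ? OUT_OF_PLACE : ( r > i ? r - i : i - r );
    }
    return dist;
}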
Considerable effort went into making this implementation fast and
efficient. The language guesser processes over 100 documents/second on
a simple PC, which makes it practical for many uses. It was developed
for use in our webcrawler and search engine software, in which it
handles millions of documents a day.
Download
The library is released under the BSD License, which basically states
that you can do anything you like with it as long as you mention us
and make it clear that this library is covered by the BSD License. It
also exempts us from any liability, should this library eat your hard
disc, kill your cat or classify your attorney's e-mails as spam.
The current version is 2.0.
As yet, there is no development version.
The distribution contains a configuration file for Gertjan van Noord's
language models, which are distributed under the GNU General Public
License. Please note that this license does not allow you to
distribute them as part of a closed-source software package. In time,
we will provide you with language models under a less restrictive
license.
Installation
As of now, we have no autoconf script. A "make all" in the src/ dir
should do the trick. If that doesn't work, you'll have to dive into
the mess we call our Makefile.
The library is known to compile flawlessly on GNU/Linux for x86. If
you manage to get it working on other systems, please drop us a note,
and we'll proudly add you to this page.
Quickstart: language guesser
Assuming that you have successfully compiled the library, you still
need some language models to start guessing languages. If you don't
feel like creating them yourself (cf. Creating your own
fingerprints below), you can use the excellent collection of over 70
language models provided in Gertjan van Noord's "TextCat" package. We
have provided a configuration file for these models in the langclass
directory. Hack it at will.
* Download the textcat package at
http://odur.let.rug.nl/~vannoord/TextCat/
* mkdir ~/text_cat
* cd ~/text_cat
* tar xzf ~/your_download_dir/text_cat.tgz
* ~/libtextcat/src/testtextcat ~/libtextcat/langclass/conf.txt
Paste some text onto the command line, and watch it get classified.
Using the API
Classifying the language of a textbuffer can be as easy as:
#include "textcat.h"
...
void *h = textcat_Init( "conf.txt" );
...
printf( "Language: %s\n", textcat_Classify(h, buffer, 400);
...
textcat_Done(h);
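For reference, here is a self-contained sketch along the same lines,
reading a sample from standard input. It assumes a conf.txt in the
working directory, that textcat_Init() returns NULL on failure, and
that a few hundred bytes suffice (see Performance tuning below):

#include <stdio.h>
#include "textcat.h"

int main( void )
{
    char buffer[4096];
    size_t size;
    void *h;

    h = textcat_Init( "conf.txt" );   /* assuming NULL on failure */
    if ( h == NULL ) {
        fprintf( stderr, "unable to load conf.txt\n" );
        return 1;
    }

    size = fread( buffer, 1, sizeof(buffer), stdin );

    /* a few hundred bytes is plenty for language guessing */
    printf( "Language: %s\n",
            textcat_Classify( h, buffer, size < 400 ? size : 400 ) );

    textcat_Done( h );
    return 0;
}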
Creating your own fingerprints
The createfp program allows you to easily create your own document
fingerprints. Just feed it an example document on standard input, and
store the standard output:
% createfp < mydocument.txt > myfingerprint.txt
Put the names of your fingerprint files in a configuration file, add
an id for each, and you're ready to classify.
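For illustration, such a file might look like this (the file names
and ids here are made up; see langclass/conf.txt in the distribution
for a real example):

# fingerprint file      category id
myfingerprint.txt       mycategory
dutch.lm                dutch
english.lm              english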
Performance tuning
This library was made with efficiency in mind. There are a couple of
parameters you may wish to tweak if you intend to use it for tasks
other than language guessing.
The most important one is buffer size. For reliable language guessing
the classifier needs no more than a few hundred bytes, so don't feed
it 100KB of text unless you are creating a fingerprint.
If you insist on feeding the classifier lots of text, try fiddling
with TABLEPOW, which determines the size of the hash table that is
used to store the n-grams. Making it too small will result in many
hash table collisions; making it too large will cause wild memory
behaviour. Both are bad for performance.
Putting the most probable models at the top of the list in your config
file improves performance, because a good match found early sets a
tight cutoff that lets unlikely candidates be discarded more quickly.
Since the speed of the classifier is roughly linear with respect to
the number of models, you should consider how many models you really
need. In the case of language guessing: do you really want to
recognize every language ever invented?
References
[1] The document that started it all can be downloaded at John M.
Trenkle's site: N-Gram-Based Text Categorization
http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz
[2] The Perl implementation by Gertjan van Noord (code + language
models): downloadable from his website
http://odur.let.rug.nl/~vannoord/TextCat/
Contact
Praise and flames may be directed at us through
libtextcat AT wise-guys.nl. If there is enough interest, we'll whip up
a mailing list. The current project maintainer is Frank Scheelen.
© 2003 WiseGuys Internet B.V.