Age | Commit message | Author | Files | Lines |
|
Apparently Freedesktop CI won't run in GitLab namespaces that belong to
neither an official project nor a known developer. In particular, this
makes CI fail on merge requests from such passing-by contributors!
Adding these small rules is supposed to allow such jobs to run anyway.
See: https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/540
|
|
|
|
My previous commit did not handle all cases, though it did take care of
the buffer overflow triggered by the provided byte sequence. I believe
it was still possible to craft special input sequences too long for
codePointBuffer.
This additional commit handles these other cases by processing the
input in manageable sub-strings.
|
|
… a heap allocated buffer.
Before starting to process a multi-byte sequence, we should make sure
that our buffer is not nearly full with single-byte data. If so, process
said data first.
|
|
This fixes this bug reported by ASAN:
> ==42862==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x619000000080
> #0 0x7f1dc1fa2017 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160
> #1 0x7f1dc1e8b132 in nsSBCSGroupProber::~nsSBCSGroupProber() /home/jehan/dev/src/uchardet/src/nsSBCSGroupProber.cpp:257
|
|
|
|
|
|
- Avoid trailing whitespace.
- Print which tool and version were used for the generation (to help
with future debugging in case of discrepancies between versions or
implementations).
|
|
For charsets UTF-8, GEORGIAN-ACADEMY and GEORGIAN-PS. The 2 GEORGIAN-*
sets were generated thanks to the new create-table.py script.
Test text comes from the 'ვირზაზუნა' page of Wikipedia in Georgian.
|
|
I wanted to add new tables for which I could find no listing anywhere,
even though iconv has support for them (not core Python though):
GEORGIAN-ACADEMY and GEORGIAN-PS.
I could find info on these in libiconv source (./lib/georgian_academy.h
and ./lib/georgian_ps.h), though rather than trying to read these, I
thought I should just do it the other way around: get back a table from
the return value of the iconv API (or Python decode() when relevant).
So this script is able to generate tables in the format used under
script/charsets/, from either Python decode() or iconv. It will be very
useful!
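The "other way around" approach can be sketched as follows: decode each of the 256 byte values and record the resulting code point. This is only a minimal illustration using Python's decode(); the name `build_table` is hypothetical, the file format used under script/charsets/ is not reproduced, and the iconv path is left out.

```python
def build_table(charset):
    """Return a 256-entry list mapping each byte value to a Unicode code
    point, or None where the charset leaves that byte undefined."""
    table = []
    for value in range(256):
        try:
            decoded = bytes([value]).decode(charset)
        except UnicodeDecodeError:
            # The codec has no mapping for this byte.
            table.append(None)
            continue
        # Keep only clean one-byte to one-character mappings.
        table.append(ord(decoded) if len(decoded) == 1 else None)
    return table
```

With Python's own cp1252 codec, for example, byte 0x80 maps to the '€' code point while a few bytes such as 0x81 stay undefined.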
|
|
|
|
For UTF-8, ISO-8859-1 and WINDOWS-1252 support.
The test for UTF-8 and ISO-8859-1 is taken from 'Marmota' page on
Wikipedia in Catalan. The test for WINDOWS-1252 is taken from the
'Unió_Europea' page. ISO-8859-1 and WINDOWS-1252 being very similar,
regarding most letters (in particular the ones used in Catalan), I
differentiated the test with a text containing the '€' symbol, which is
on an unused spot in ISO-8859-1.
|
|
Rather than using a huge frequency table through some state machine code
that I don't even understand, I noticed that the Big5 encoding is from
the start organized in frequent and non-frequent characters tables (per
Wikipedia page on Big5). This makes it very easy to count characters by
just counting which class each character is in.
Making a few tests with random Chinese text converted to Big5, it seems
to work pretty well (and fix the test which got broken with previous
commit), and it doesn't slow down detection in any significant way
either.
This may be the next step towards also improving the detection of the
various other multi-byte encodings, which still use generated coding
state machines that mostly elude me.
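The class-counting idea for Big5 can be sketched like this. It is not the uchardet implementation (which is C++), only an illustration assuming the two-level layout given on the Wikipedia Big5 page (0xA440 to 0xC67E for frequently used characters, 0xC940 to 0xF9D5 for less frequently used ones). All names are hypothetical and trail-byte validity is not checked.

```python
# Assumed code ranges, taken from the Wikipedia description of Big5.
FREQUENT_RANGE = (0xA440, 0xC67E)
RARE_RANGE = (0xC940, 0xF9D5)


def classify_big5(lead, trail):
    """Classify one two-byte Big5 code as frequent, rare or other."""
    code = (lead << 8) | trail
    if FREQUENT_RANGE[0] <= code <= FREQUENT_RANGE[1]:
        return "frequent"
    if RARE_RANGE[0] <= code <= RARE_RANGE[1]:
        return "rare"
    return "other"


def count_classes(data):
    """Count how many two-byte characters of each class appear in data."""
    counts = {"frequent": 0, "rare": 0, "other": 0}
    i = 0
    while i < len(data):
        if data[i] < 0x80 or i + 1 >= len(data):
            i += 1  # ASCII byte or truncated pair: skip it
            continue
        counts[classify_big5(data[i], data[i + 1])] += 1
        i += 2
    return counts
```

A text where most two-byte characters fall in the frequent class would then score well as Big5, without any per-character frequency table.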
|
|
It actually breaks "zh:big5" so I'm going to hold off a bit. Adding more
language and charset support is slowly starting to show the limitations
of our legacy multi-byte charset support, since I haven't really
touched it since the original Mozilla implementation.
It might be time to start reviewing these parts of the code.
The test file contents come from the 'Μαρμότα' page on Wikipedia in Greek
(though since 2 letters are missing in this encoding, despite its
popularity for Greek, I had to be careful in choosing pieces of text
without such letters).
|
|
Probably broken in commit db836fa (I replaced a bunch of print() calls
with sys.stderr.write()).
|
|
It will make it easier to follow any dependency change as it is pretty
much a standard file in Python projects. Of course, it's not a
dependency of uchardet itself, only of the generation script (so for
developers only), which is why I put it inside the script/ folder.
|
|
Right now, each time we add new language or new charset support, there
are too many pieces of code we must remember to edit. The script
script/BuildLangModel.py will now take care of the main parts: listing
the sequence models, listing the generic language models and computing
the numbers for each listing.
Furthermore the script will now end with a TODO list of the parts which
are still to be done manually (2 functions to edit and a CMakeLists).
Finally the script now accepts a list of languages to edit instead of
having to be run once per language. It also accepts 2 special codes:
"none", which retrains none of the languages and only re-generates the
generated listings; and "all", which retrains all models (useful in
particular when we change the model formats or usage and want to
regenerate everything).
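The handling of the language list and of the 2 special codes could look roughly like the sketch below. This is not the actual argument handling of BuildLangModel.py; `select_languages` and its parameters are hypothetical names used for illustration only.

```python
def select_languages(requested, available):
    """Resolve the requested codes into the list of languages to retrain.

    "none" retrains nothing (the listings are still regenerated by the
    caller), "all" retrains every available language.
    """
    if requested == ["none"]:
        return []
    if requested == ["all"]:
        return sorted(available)
    unknown = [lang for lang in requested if lang not in available]
    if unknown:
        raise ValueError("unknown languages: " + ", ".join(unknown))
    return list(requested)
```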
|
|
|
|
For UTF-8, ISO-8859-5 and WINDOWS-1251.
Test files' contents come from page 'Мрмот' on Wikipedia in Serbian.
|
|
For UTF-8, ISO-8859-5, WINDOWS-1251 and IBM855 encodings.
Test files' contents come from page 'Хибернација' on Wikipedia in
Macedonian.
|
|
This fixes the broken Russian test in Windows-1251 which once again gets
a much better score with Russian. Also this adds UTF-8 support.
Same as Bulgarian, I wonder why I had not regenerated this earlier.
The new UTF-8 test comes from the 'Сурки' page of Wikipedia in Russian.
Note that now this broke the test zh:gb18030 (the score for KOI8-R / ru
(0.766388) beats GB18030 / zh (0.700000)). I think I'll have to look a
bit closer at our GB18030 dedicated prober.
|
|
UTF-8 and Windows-1251 support for now.
This actually breaks ru:windows-1251 test but same as Bulgarian, I never
generated Russian models with my scripts, so the models we currently use
are quite outdated. It will obviously be a lot better once we have new
Russian models.
The test file contents come from the 'Бабак' page on Wikipedia in
Ukrainian.
|
|
Support for UTF-8, Windows-1251 and ISO-8859-5.
The test contents come from the 'Суркі' page on Wikipedia in Belarusian.
|
|
Not sure why we had Bulgarian support but haven't recently updated
it (i.e. never with the model generation script, or so it seems),
especially with generic language models, which make UTF-8/Bulgarian
support possible. Maybe I tested it some time ago and it was
getting bad results? Anyway now with all the recent updates on the
confidence computation, I get very good detection scores.
So adding support for UTF-8/Bulgarian and rebuilding other models too.
Also adding a test for ISO-8859-5/Bulgarian (we already had support, but
no test files).
The 2 new test files are text from the 'Мармоти' page on Wikipedia in
Bulgarian.
|
|
It could happen when our character set table is wrong, but it
could also happen when iconv has a bug with incomplete charset
tables. For instance, I was trying to implement IBM880 for #29, but
iconv was missing a few codepoints: it seems to think that
0x45 (є), 0x55 (ў) and 0x74 (Ў) are meant to be illegal in IBM880 (and
possibly others), but the information we have seems to say they are
valid.
And Python does not support this character set at all.
This test will help discover the issue earlier (rather than breaking
a few lines later because `iconv` failed and returned an empty string,
making ord() fail with a TypeError exception).
See: https://gitlab.freedesktop.org/uchardet/uchardet/-/issues/29#note_1691847
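The failure mode is easy to reproduce: ord("") raises TypeError because it expects exactly one character. A minimal sketch of an early check is shown below; the helper name `code_point_for` and the error message are hypothetical, not the script's actual code.

```python
def code_point_for(byte_value, decoded):
    """Return the code point for a decoded byte, failing early and clearly
    when the converter returned nothing (or more than one character)."""
    if len(decoded) != 1:
        raise ValueError(
            "byte 0x%02X has no single-character mapping" % byte_value
        )
    return ord(decoded)
```

An empty conversion result is thus reported with the offending byte value, instead of surfacing later as an unexplained TypeError.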
|
|
This is the same text, taken from this Wikipedia page, which was today's
page of honor on Wikipedia in Hebrew:
https://he.wikipedia.org/wiki/שתי מסכתות על ממשל מדיני
I put it in 2 variants, since IBM862 can be used in logical and visual
variants. The visual variant is just about inverting orders of letters
(per lines, while lines stay in proper order), so that's what I did.
Though note that the English title quoted in the text should likely not
have been reversed, but it doesn't matter too much since these letters
are outside the Hebrew alphabet anyway and would trigger a bad sequence
score, whatever their order. So I didn't bother fixing these.
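The transformation used to produce the visual variant (letters reversed within each line, lines kept in their original order) can be sketched in a couple of lines. The function name is hypothetical, and like the test file itself this naive version also reverses non-Hebrew runs such as the quoted English title.

```python
def to_visual_order(text):
    """Reverse the characters of each line while keeping line order."""
    return "\n".join(line[::-1] for line in text.split("\n"))
```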
|
|
Added in both visual and logical order since Wikipedia says:
> Hebrew text encoded using code page 862 was usually stored in visual
> order; nevertheless, a few DOS applications, notably a word processor
> named EinsteinWriter, stored Hebrew in logical order.
I am not using the nsHebrewProber wrapper (nameProber) for this new
support, because I am really unsure this is of any use. Our statistical
code based on letter and sequence usage should be more than enough to
detect both variants of Hebrew encoding already, and my testing shows
that so far (with pretty outstanding scores on actual Hebrew tests while
all the other probers return bad scores). This will have to be studied a
bit more later and maybe the whole nsHebrewProber might be deleted, even
for the Windows-1255 charset.
I'm also cleaning a bit nsSBCSGroupProber::nsSBCSGroupProber() code by
incrementing a single index, instead of maintaining the indexes by hand
(otherwise each time we add probers in the middle, to keep them
logically gathered by languages, we have to manually increment dozens of
following probers).
|
|
While the expected charset name is still the first part of the test file
name (up to the first period character), the test name is everything but
the last part (up to the last period character). This allows having
several test files for a single charset.
In particular, I want at least 2 test files for Hebrew when it has
visual and logical variants. So I could call these "ibm862.visual.txt"
and "ibm862.logical.txt", which both expect IBM862 as the result charset,
but the test names will be "he:ibm862.visual" and "he:ibm862.logical"
respectively. Without this change, the test names would collide and
CMake would refuse these.
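The naming rule can be illustrated with a short sketch. The real logic lives in the CMake test setup; this Python version with the hypothetical name `parse_test_file` only shows how the two names are derived from a file name.

```python
def parse_test_file(lang, filename):
    """Return (expected charset, test name) for a test file.

    The charset is everything before the first period; the test name is
    everything before the last period, prefixed with the language code.
    """
    charset = filename.split(".", 1)[0]
    name = filename.rsplit(".", 1)[0]
    return charset, "{}:{}".format(lang, name)
```

Both "ibm862.visual.txt" and "ibm862.logical.txt" yield the charset "ibm862", but their test names differ.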
|
|
… and rebuild of models.
The scores are really not bad now, 0.896026 for Norwegian and 0.877947
for Danish. It looks like the last confidence computation changes I made
are really bearing fruit!
|
|
|
|
|
|
I had this test file locally for some time now, but it was always
failing, and recognized as other languages until now. Thanks to the
recent confidence improvements with new frequent/rare ratios, it is
finally detected as English by uchardet!
|
|
|
|
… introduced in previous commit.
|
|
In addition to the "frequent characters" concept, we add 2
sub-categories, which are the "very frequent characters" and "rare
characters". The former are usually just a few characters which are used
most of the time (like 3 or 4 characters used 40% of the time!), whereas
the latter are often a dozen or more characters which together are used
only a few percent of the time.
We use this additional concept to help distinguish very similar
languages, or languages whose frequent characters are a subset of
the ones from another language (typically English, whose alphabet is a
subset of many other European languages).
mTypicalPositiveRatio is removed, as it was barely of any use anyway
(it was 0.99-something for nearly all languages!). Instead we
get these 2 new ratios: veryFreqRatio and lowFreqRatio, and of course
the associated order counts to know which characters are in these sets.
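As an illustration of what the two ratios measure, here is a minimal sketch computing them from character counts. How the sizes of the two sets are chosen is not described in this message, so they are plain parameters here; the function and parameter names are hypothetical and this is not the model generation code.

```python
def frequency_ratios(counts, very_freq_count, low_freq_count):
    """Return (very frequent ratio, low frequency ratio).

    counts maps each frequent character to its number of occurrences.
    The very_freq_count most used characters form the "very frequent"
    set, the low_freq_count least used ones form the "rare" set.
    """
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0, 0.0
    very = sum(ordered[:very_freq_count])
    rare_start = max(0, len(ordered) - low_freq_count)
    rare = sum(ordered[rare_start:])
    return very / total, rare / total
```

For instance, with counts of 40, 30, 20, 6 and 4 and both set sizes equal to 2, the two most used characters cover 70% of the text and the two least used cover 10%.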
|
|
… language data left.
|
|
The previous model was quite obviously wrong: all letters had the same
probability, even non-ASCII ones! Anyway this new model does make unit
tests a tiny bit better though the English detection is still weak (I
have more concepts which I want to experiment with to get this better).
|
|
|
|
This was badly named as this function does not return candidates, but
the number of candidates (to be actually used in other API calls).
|
|
It is currently recognized as Danish/UTF-8 with a 0.958 score, though
Norwegian/UTF-8 is indeed the second candidate with 0.911 (the third
candidate is far behind, Swedish/UTF-8 with 0.815). Before wasting time
tweaking models, there are more basic conceptual changes that I want to
implement first (it might be enough to change the results!). So let's
skip this test for now.
|
|
We were experiencing segmentation fault when processing long texts
because we were ending up trying to access out-of-range data (from
codePointBuffer). Verify when this will happen and process data to reset
the index before adding more code points.
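The fix amounts to checking for room before writing and processing the pending data when the buffer is full. The sketch below shows the idea in Python; the real code is C++, and apart from the name codePointBuffer mentioned above, the class, its capacity and the callback are hypothetical.

```python
class CodePointBuffer:
    """Fixed-size buffer that hands its content to a callback when full."""

    def __init__(self, capacity, process):
        self.buffer = [0] * capacity
        self.index = 0
        self.process = process  # called with the pending code points

    def flush(self):
        """Process the pending code points and reset the index."""
        if self.index > 0:
            self.process(self.buffer[:self.index])
            self.index = 0

    def add(self, code_point):
        # Never write past the end: process pending data first.
        if self.index >= len(self.buffer):
            self.flush()
        self.buffer[self.index] = code_point
        self.index += 1
```

A long input is therefore processed in several buffer-sized chunks rather than overrunning the buffer.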
|
|
Now that it has IBM865 support on the main branch and that I rebased,
this feature branch for the new API got broken too.
|
|
As I just rebased my branch about new language detection API, I needed
to re-generate Norwegian language models. Unfortunately it doesn't
detect UTF-8 Norwegian text, though not far off (it detects it as second
candidate with high 91% confidence; beaten by Danish UTF-8 with 94%
confidence unfortunately!).
Note that I also update the alphabet list for Norwegian as there were
too many letters in there (according to Wikipedia at least), so even
when training a model, we had some missing characters in the training
set.
|
|
|
|
|
|
Adding `auto_suggest=False` to the wikipedia.page() call because this
auto-suggest is completely broken, searching "mar ot" instead of
"marmot" or "ground hug" instead of "Groundhog" (this one is extra funny
but not so useful!). I actually wonder why it even needs to suggest
anything when the Wikipedia pages do actually exist! Anyway the script
BuildLangModel.py was very broken because of this, now it's better.
See: https://github.com/goldsmith/Wikipedia/issues/295
Also printing the error message when we discard a page, which helps
debugging.
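A minimal sketch of the call pattern described above follows. The page function is passed in as a parameter so the snippet does not depend on the network; `fetch_page` and the message format are hypothetical, and only the `auto_suggest=False` argument comes from this commit.

```python
import sys


def fetch_page(title, page_func):
    """Fetch a page by exact title, printing the error when discarding it."""
    try:
        # auto_suggest=False asks for the exact title instead of letting
        # the library replace it with a (possibly wrong) suggestion.
        return page_func(title, auto_suggest=False)
    except Exception as error:
        sys.stderr.write("Discarding page '{}': {}\n".format(title, error))
        return None
```

With the wikipedia package installed, this would be called as `fetch_page("Marmot", wikipedia.page)`.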
|
|
Adding the found confidence, but also the confidence matched by the
expected (lang, charset) couple, and its candidate order, if it even
matched.
|
|
This provides some more info in Testing/Temporary/LastTest.log to
debug detection issues.
|
|
English detection is still quite crappy so I don't add a unit test yet.
Though I believe the detection being bad is mostly because of too much
shortcutting we are doing to go "fast". I should probably review this
whole part of the logic as well.
|
|
Some languages are not meant to have multibyte characters. For instance,
English would typically have none. Yet you can still have UTF-8 English
text (with a few special characters, or foreign words…). So anyway let's
make it less of a deal breaker.
To be even fairer, the whole logic is biased of course and I believe
that eventually we should get rid of these lines of code dropping
confidence based on a number of characters. This is a ridiculous rule
(we base our whole logic on language statistics and suddenly we add some
weird rule with a completely random number). But for now, I'll keep this
as-is until we make the whole library even more robust.
|