Add pt-BR thesaurus collateral information

Change-Id: I85d008af2a584ce10459407ce9e99fe01f354276
Reviewed-on: https://gerrit.libreoffice.org/c/dictionaries/+/124768
Tested-by: Olivier Hallot <olivier.hallot@libreoffice.org>
Reviewed-by: Olivier Hallot <olivier.hallot@libreoffice.org>
author: Olivier Hallot <olivier.hallot@libreoffice.org> 2021-11-05 13:12:11 -0300
committer: Olivier Hallot <olivier.hallot@libreoffice.org> 2021-11-05 17:14:43 +0100
commit: 12c4f7058e7b2d2861ed870ca4c7b46fe3e1f63a (patch)
tree: 50937df3717cc7cf1674200805b04f74294ed204
parent: 60f4a1dd6b3e8ea15f65488b02487ea95fd1f8e6 (diff)
3 files changed, 222 insertions, 0 deletions
diff --git a/pt_BR/license-thes.readme b/pt_BR/license-thes.readme
new file mode 100644
index 0000000..b6bf70a
--- /dev/null
+++ b/pt_BR/license-thes.readme
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2003 Kevin B. Hendricks, Stratford, Ontario, Canada
+ * And Contributors.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * 3. All modifications to the source code must be clearly marked as
+ *    such.  Binary redistributions based on modified source code
+ *    must be clearly marked as modified versions in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY KEVIN B. HENDRICKS AND CONTRIBUTORS 
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL 
+ * KEVIN B. HENDRICKS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
+ * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ */
diff --git a/pt_BR/th_gen_idx.pl b/pt_BR/th_gen_idx.pl
new file mode 100755
index 0000000..9bdab33
--- /dev/null
+++ b/pt_BR/th_gen_idx.pl
@@ -0,0 +1,59 @@
+#!/usr/bin/perl
+
+# perl program to take a thesaurus structured text data file
+# and create the proper sorted index file (.idx)
+#
+# typically invoked as follows:
+# cat th_en_US_new.dat | ./th_gen_idx.pl > th_en_US_new.idx
+#
+
+sub by_entry {
+    my ($aent, $aoff) = split('\|',$a);
+    my ($bent, $boff) = split('\|',$b);
+    $aent cmp $bent;
+}
+
+# main routine
+my $ne = 0;       # number of entries in index
+my @tindex=();    # the index itself
+my $foffset = 0;  # file position offset into thesaurus
+my $rec="";       # current string and related pieces
+my $rl=0;         # misc string length
+my $entry="";     # current word being processed
+my $nm=0;         # number of meanings for the current word
+my $meaning="";   # current meaning and synonyms
+my $p;            # misc uses
+my $encoding;     # encoding used by text file
+
+# top line of thesaurus provides encoding
+$encoding=<STDIN>;
+$foffset = $foffset + length($encoding);
+chomp($encoding);
+
+# read thesaurus line by line
+# first line of every block is an entry and meaning count
+while ($rec=<STDIN>){
+    $rl = length($rec);
+    chomp($rec);
+    ($entry, $nm) = split('\|',$rec);
+    $p = 0;
+    while ($p < $nm) {
+        $meaning=<STDIN>;
+        $rl = $rl + length($meaning);
+        chomp($meaning);
+        $p++;
+    }
+    push(@tindex,"$entry|$foffset");
+    $ne++;
+    $foffset = $foffset + $rl;
+}
+
+# now we have all of the information
+# so sort it and then output the encoding, count and index data
+@tindex = sort by_entry @tindex;
+print STDOUT "$encoding\n";
+print STDOUT "$ne\n";
+foreach $one (@tindex) {
+    print STDOUT "$one\n";
+}
+
diff --git a/pt_BR/thes_data_layout_pt_BR.txt b/pt_BR/thes_data_layout_pt_BR.txt
new file mode 100644
index 0000000..93cfec1
--- /dev/null
+++ b/pt_BR/thes_data_layout_pt_BR.txt
@@ -0,0 +1,129 @@
+Description of the structure of the data needed by MyThes
+---------------------------------------------------------
+
+MyThes is very simple. Almost all of the "smarts" are really
+in the thesaurus data file itself.
+
+The format of this file is the following:
+
+- no binary data
+
+- the line ending is a newline '\n' and not a carriage return/line feed pair
+
+- Line 1 is a character string that describes the encoding used for the file. It is up to the calling program to convert
+to and from this encoding, if needed.
+
+     ISO8859-1 is used by the th_en_US_new.dat file.
+
+     Strings currently recognized by OpenOffice.org are:
+
+     ISO8859-1
+     ISO8859-2
+     ISO8859-3
+     ISO8859-4
+     ISO8859-5
+     ISO8859-6
+     ISO8859-7
+     ISO8859-8
+     ISO8859-9
+     ISO8859-10
+     KOI8-R
+     CP-1251
+     ISO8859-14
+     ISCII-DEVANAGARI
+     UTF8
+
+
+- All of the remaining lines of the file follow this structure:
+
+entry|num_mean
+pos|syn1_mean|syn2|...
+.
+.
+.
+pos|syn1_mean|syn2|...
+
+
+Where:
+
+   entry - all-lowercase version of the word or phrase being described
+   num_mean - number of meanings for this entry
+
+   There is one meaning per line, and each meaning is composed of:
+
+   pos - part of speech or other meaning-specific description
+   syn1_mean - synonym 1, also used to describe the meaning itself
+   syn2 - synonym 2 for that meaning, etc.
+
+
+To make this even clearer, here is the actual data for the
+entry "simples".
+
+simples|9
+(adj)|simples|elementar|final|supersimplificado|simplista|simplex|simplificado|não analisável|não decomposto|não complicado|não sofisticado|fácil|simples|não subdividido
+(adj)|elementar|simples|não problemático|fácil
+(adj)|nua|mera|simples
+(adj)|infantil|olhos arregalados|olhos orvalhados|ingênuo|naif
+(adj)|estúpido|estúpido|simplório|retardado
+(adj)|simples|não subdividido|sem lóbulo|suave
+(adj)|simples
+(substantivo)|erva|planta herbácea
+(substantivo)|simplório|pessoa|indivíduo|alguém|alguém|mortal|humano|alma
+
+
+
+It says that "simples" has 9 different meanings and that each
+meaning has its part of speech and at least 1 synonym,
+with any others following on the same line.
+
+
+
+Once you have created your own structured text file, you can use
+the Perl program "th_gen_idx.pl", which can be found in this
+directory, to create an index file that the MyThes code uses to
+seek into your data file.
+
+The correct way to run the Perl program is as follows:
+
+cat th_en_US_new.dat | ./th_gen_idx.pl > th_en_US_new.idx
+
+
+
+Then, if you run "head" on the resulting index file, you should see
+the following:
+
+ISO8859-1
+142689
+'capô|10
+Gravenhage de|88
+'tween|173
+'tween decks|196
+.22|231
+.22 calibre|319
+.22 calibre|365
+Calibre 38|411
+Calibre 38|457
+Calibre .45|503
+Calibre .45|549
+0|595
+1|666
+1 crônicas|6283
+1 esdras|6336
+
+
+Line 1 is the same encoding string taken from the
+structured thesaurus data file.
+
+Line 2 is a count of the total number of entries
+in your thesaurus.
+
+All of the remaining lines are of the form
+
+entry|byte_offset_into_data_file_where_entry_is_found
+
+
+That is all there is to it.
+
+
+Kevin
+kevin.hendricks@sympatico.ca
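
The data and index layouts described in the patch above can be exercised with a short consumer-side sketch. This is not the MyThes implementation (which this commit does not include); it is a minimal Python illustration, assuming that the encoding string on line 1 is also a valid Python codec name (true for ISO8859-1 and UTF8, not guaranteed for the other listed strings) and that index entries are sorted by raw bytes, as the `cmp` comparison in th_gen_idx.pl does without locale settings. The names `build_index`, `load_index`, `lookup`, and `DATA` are hypothetical helpers made up for this example.

```python
import bisect
import io


def build_index(dat_bytes):
    """Mirror th_gen_idx.pl: record the byte offset of each entry line."""
    stream = io.BytesIO(dat_bytes)
    encoding_line = stream.readline()
    offset = len(encoding_line)          # first entry starts after line 1
    pairs = []
    while True:
        rec = stream.readline()
        if not rec:
            break
        length = len(rec)
        entry, _, num_mean = rec.rstrip(b"\n").partition(b"|")
        for _ in range(int(num_mean)):   # skip this entry's meaning lines
            length += len(stream.readline())
        pairs.append((entry, offset))
        offset += length
    pairs.sort(key=lambda pair: pair[0])  # byte-wise sort by entry
    lines = [encoding_line.rstrip(b"\n"), str(len(pairs)).encode("ascii")]
    lines += [entry + b"|" + str(off).encode("ascii") for entry, off in pairs]
    return b"\n".join(lines) + b"\n"


def load_index(idx_file):
    """Read an .idx stream: encoding line, entry count, 'entry|offset' lines."""
    encoding = idx_file.readline().rstrip(b"\n").decode("ascii")
    count = int(idx_file.readline())
    entries, offsets = [], []
    for _ in range(count):
        entry, _, offset = idx_file.readline().rstrip(b"\n").partition(b"|")
        entries.append(entry)
        offsets.append(int(offset))
    return encoding, entries, offsets


def lookup(dat_file, index, word):
    """Binary-search the index, seek into the data file, parse the meanings."""
    encoding, entries, offsets = index
    key = word.encode(encoding)
    i = bisect.bisect_left(entries, key)
    if i == len(entries) or entries[i] != key:
        return None
    dat_file.seek(offsets[i])
    _, _, num_mean = dat_file.readline().rstrip(b"\n").partition(b"|")
    meanings = []
    for _ in range(int(num_mean)):
        line = dat_file.readline().rstrip(b"\n").decode(encoding)
        pos, *synonyms = line.split("|")
        meanings.append((pos, synonyms))
    return meanings


# Tiny made-up thesaurus in the layout described above (not real pt-BR data).
DATA = (
    "UTF8\n"
    "simples|2\n"
    "(adj)|elementar|fácil\n"
    "(substantivo)|erva|planta herbácea\n"
    "fácil|1\n"
    "(adj)|simples|elementar\n"
).encode("utf-8")

if __name__ == "__main__":
    index = load_index(io.BytesIO(build_index(DATA)))
    print(lookup(io.BytesIO(DATA), index, "simples"))
```

Because the offsets are byte counts, both files are opened in binary mode here and decoded only after seeking; opening the data file in text mode could make the offsets wrong for multi-byte encodings such as UTF8.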