mirror of
https://github.com/Dushistov/sdcv.git
synced 2025-12-16 18:01:54 +00:00
0.4.2 release
This commit is contained in:
352
doc/DICTFILE_FORMAT
Normal file
352
doc/DICTFILE_FORMAT
Normal file
@@ -0,0 +1,352 @@
|
||||
Format for StarDict dictionary files
|
||||
------------------------------------
|
||||
|
||||
StarDict homepage: http://stardict.sourceforge.net
|
||||
|
||||
{0}. Number and Byte-order Conventions
|
||||
When you record the numbers that identify sizes, offsets, etc., you
|
||||
should use 32-bit numbers, such as you might represent with a glong.
|
||||
|
||||
In order to make StarDict work on different platforms, these numbers
|
||||
must be in network byte order. You can ensure the correct byte order
|
||||
by using the g_htonl() function when creating dictionary files.
|
||||
Conversely, you should use g_ntohl() when reading dictionary files.
|
||||
|
||||
Strings should be encoded in UTF-8.
|
||||
|
||||
|
||||
{1}. Files
|
||||
Every dictionary consists of three files:
|
||||
(1). somedict.ifo
|
||||
(2). somedict.idx or somedict.idx.gz
|
||||
(3). somedict.dict or somedict.dict.dz
|
||||
|
||||
You can use gzip -9 to compress the .idx file. If the .idx file are not
|
||||
compressed, the loading can be fast and save memory when using, compress it
|
||||
will make the .idx file load into memory and make the quering fast when using.
|
||||
|
||||
You can use dictzip to compress the .dict file.
|
||||
"dictzip" uses the same compression algorithm and file format as does gzip,
|
||||
but provides a table that can be used to randomly access compressed blocks
|
||||
in the file. The use of 50-64kB blocks for compression typically degrades
|
||||
compression by less than 10%, while maintaining acceptable random access
|
||||
capabilities for all data in the file. As an added benefit, files
|
||||
compressed with dictzip can be decompressed with gunzip.
|
||||
For more information about dictzip, refer to DICT project, please see:
|
||||
http://www.dict.org
|
||||
|
||||
Stardict will search for the .ifo file, then open the .idx or
|
||||
.idx.gz file and the .dict.dz or .dict file which is in the same directory and
|
||||
has the same base name.
|
||||
|
||||
|
||||
|
||||
{2}. The ".ifo" file's format.
|
||||
The .ifo file has the following format:
|
||||
|
||||
StarDict's dict ifo file
|
||||
version=2.4.2
|
||||
[options]
|
||||
|
||||
Note that the current "version" string must be "2.4.2". If it's not,
|
||||
then StarDict will refuse to read the file.
|
||||
|
||||
[options]
|
||||
---------
|
||||
In the example above, [options] expands to any of the following lines
|
||||
specifying information about the dictionary. Each option is a keyword
|
||||
followed by an equal sign, then the value of that option, then a
|
||||
newline. The options may be appear in any order.
|
||||
|
||||
Note that the dictionary must have at least a bookname, a wordcount and a
|
||||
idxfilesize, or the load will fail. All other information is optional. All
|
||||
strings should be encoded in UTF-8.
|
||||
|
||||
Available options:
|
||||
|
||||
bookname= // required
|
||||
wordcount= // required
|
||||
idxfilesize= // required
|
||||
author=
|
||||
email=
|
||||
website=
|
||||
description=
|
||||
date=
|
||||
sametypesequence= // very important.
|
||||
|
||||
|
||||
wordcount is the count of word entries in .idx file, it must be right.
|
||||
|
||||
idxfilesize is the size(in bytes) of the .idx file, even the .idx is compressed
|
||||
to a .idx.gz file, this entry must record the original .idx file's size, and it
|
||||
must be right too. The .gz file don't contain its original size information,
|
||||
but knowing the original size can speed up the extraction to memory, as you
|
||||
don't need to call realloc() for many times.
|
||||
|
||||
|
||||
The "sametypesequence" option is described in further detail below.
|
||||
|
||||
***
|
||||
sametypesequence
|
||||
|
||||
You should first familiarize yourself with the .dict file format
|
||||
described in the next section so that you can understand what effect
|
||||
this option has on the .dict file.
|
||||
|
||||
If the sametypesequence option is set, it tells StarDict that each
|
||||
word's data in the .dict file will have the same sequence of datatypes.
|
||||
In this case, we expect a .dict file that's been optimized in two
|
||||
ways: the type identifiers should be omitted, and the size marker for
|
||||
the last data entry of each word should be omitted.
|
||||
|
||||
Let's consider some concrete examples of the sametypesequence option.
|
||||
|
||||
Suppose that a dictionary records many .wav files, and so sets:
|
||||
sametypesequence=W
|
||||
In this case, each word's entry in the .dict file consists solely of a
|
||||
wav file. In the .dict file, you would leave out the 'W' character
|
||||
before each entry, and you would also omit the 32-bit integer at the
|
||||
front of each .wav entry that would normally give the entry's length.
|
||||
You can do this since the length is known from the information in the
|
||||
idx file.
|
||||
|
||||
As another example, suppose a dictionary contains phonetic information
|
||||
and a meaning for each word. The sametypesequence option for this
|
||||
dictionary would be:
|
||||
sametypesequence=tm
|
||||
Once again, you can omit the 't' and 'm' characters before each data
|
||||
entry in the .dict file. In addition, you should omit the terminating
|
||||
'\0' for the 'm' entry for each word in the .dict file, as the length
|
||||
of the meaning string can be inferred from the length of the phonetic
|
||||
string (still indicated by a terminating '\0') and the length of the
|
||||
entire word entry (listed in the .idx file).
|
||||
|
||||
So for cases where the last data entry for each word normally requires
|
||||
a terminating '\0' character, you should omit this character in the
|
||||
dict file. And for cases where the last data entry for each word
|
||||
normally requires an initial 32-bit number giving the length of the
|
||||
field (such as WAV and PNG entries), you must omit this number in the
|
||||
dictionary.
|
||||
|
||||
Every dictionary should try to use the sametypesequence feature to
|
||||
save disk space.
|
||||
***
|
||||
|
||||
|
||||
{3}. The ".idx" file's format.
|
||||
The .idx file is just a word list.
|
||||
|
||||
The word list is a sorted list of word entries.
|
||||
|
||||
Each entry in the word list contains three fields, one after the other:
|
||||
word_str; // a utf-8 string terminated by '\0'.
|
||||
word_data_offset; // word data's offset in .dict file
|
||||
word_data_size; // word data's total size in .dict file
|
||||
|
||||
word_str gives the string representing this word. It's the string
|
||||
that is "looked up" by the StarDict.
|
||||
|
||||
word_data_offset and word_data_size should both be 32-bit numbers in
|
||||
network byte order.
|
||||
|
||||
No two entries should have the same "word_str". In other words,
|
||||
(strcmp(s1, s2) != 0).
|
||||
|
||||
The length of "word_str" should be less than 256. In other words,
|
||||
(strlen(word) < 256).
|
||||
|
||||
The word list must be sorted by calling stardict_strcmp() on the "word_str"
|
||||
fields. If the word list order is wrong, StarDict will fail to function
|
||||
correctly!
|
||||
|
||||
============
|
||||
gint stardict_strcmp(const gchar *s1, const gchar *s2)
|
||||
{
|
||||
gint a;
|
||||
a = g_ascii_strcasecmp(s1, s2);
|
||||
if (a == 0)
|
||||
return strcmp(s1, s2);
|
||||
else
|
||||
return a;
|
||||
}
|
||||
============
|
||||
g_ascii_strcasecmp() is a glib function:
|
||||
Unlike the BSD strcasecmp() function, this only recognizes standard
|
||||
ASCII letters and ignores the locale, treating all non-ASCII characters
|
||||
as if they are not letters.
|
||||
|
||||
stardict_strcmp() works fine with English characters, but the other
|
||||
locale characters' sorting is not so good. There should be a _strcmp
|
||||
function which handles the utf-8 string sorting better. If you know
|
||||
one, email me :)
|
||||
|
||||
g_utf8_collate()? This is a locale-dependent funcition. So if you look
|
||||
up Chinese characters while in the Chinese locale, it works fine. But
|
||||
if you are in some other locale then the lookup will fail, as the
|
||||
order is not the same as in the Chinese locale (which was used when
|
||||
creating the dictionary).
|
||||
|
||||
g_utf8_to_ucs4() then do comparing? This sounds like a good solution, but..
|
||||
|
||||
The complete solution can be found in "Unicode Technical Standard #10: Unicode
|
||||
Collation Algorithm", http://www.unicode.org/reports/tr10/
|
||||
|
||||
I hope glib will provide a locale-independent g_utf8_collate() soon.
|
||||
http://bugzilla.gnome.org/show_bug.cgi?id=112798
|
||||
|
||||
|
||||
|
||||
{4}. The ".dict" file's format.
|
||||
The .dict file is a pure data sequence, as the offset and size of each
|
||||
word is recorded in the corresponding .idx file.
|
||||
|
||||
If the "sametypesequence" option is not used in the .ifo file, then
|
||||
the .dict file has fields in the following order:
|
||||
==============
|
||||
word_1_data_1_type; // a single char identifying the data type
|
||||
word_1_data_1_data; // the data
|
||||
word_1_data_2_type;
|
||||
word_1_data_2_data;
|
||||
...... // the number of data entries for each word is determined by
|
||||
// word_data_size in .idx file
|
||||
word_2_data_1_type;
|
||||
word_2_data_1_data;
|
||||
......
|
||||
==============
|
||||
It's important to note that each field in each word indicates its
|
||||
own length, as described below. The number of possible fields per
|
||||
word is also not fixed, and is determined by simply reading data until
|
||||
you've read word_data_size bytes for that word.
|
||||
|
||||
|
||||
Suppose the "sametypesequence" option is used in the .idx file, and
|
||||
the option is set like this:
|
||||
sametypesequence=tm
|
||||
Then the .dict file will look like this:
|
||||
==============
|
||||
word_1_data_1_data
|
||||
word_1_data_2_data
|
||||
word_2_data_1_data
|
||||
word_2_data_2_data
|
||||
......
|
||||
==============
|
||||
The first data entry for each word will have a terminating '\0', but
|
||||
the second entry will not have a terminating '\0'. The omissions of
|
||||
the type chars and of the last field's size information are the
|
||||
optimizations required by the "sametypesequence" option described
|
||||
above.
|
||||
|
||||
|
||||
Type identifiers
|
||||
----------------
|
||||
Here are the single-character type identifiers that may be used with
|
||||
the "sametypesequence" option in the .idx file, or may appear in the
|
||||
dict file itself if the "sametypesequence" option is not used.
|
||||
|
||||
Lower-case characters signify that a field's size is determined by a
|
||||
terminating '\0', while upper-case characters indicate that the data
|
||||
begins with a 32-bit integer that gives the length of the data field.
|
||||
|
||||
'm'
|
||||
Word's pure text meaning.
|
||||
The data should be a utf-8 string ending with '\0'.
|
||||
|
||||
'l'
|
||||
Word's pure text meaning.
|
||||
The data is NOT a utf-8 string, but is instead a string in locale
|
||||
encoding, ending with '\0'. Sometimes using this type will save disk
|
||||
space, but its use is discouraged.
|
||||
|
||||
'g'
|
||||
A utf-8 string which is marked up with the Pango text markup language.
|
||||
For more information about this markup language, See the "Pango
|
||||
Reference Manual."
|
||||
You might have it installed locally at:
|
||||
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
|
||||
|
||||
't'
|
||||
English phonetic string.
|
||||
The data should be a utf-8 string ending with '\0'.
|
||||
|
||||
Here are some utf-8 phonetic characters:
|
||||
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑ
|
||||
æɑɒʌәєŋvθðʃʒːɡˏˊˋ
|
||||
|
||||
'y'
|
||||
Chinese YinBiao.
|
||||
The data should be a utf-8 string ending with '\0'.
|
||||
|
||||
|
||||
'W'
|
||||
wav file.
|
||||
The data begins with a network byte-ordered glong to identify the wav
|
||||
file's size, immediately followed by the file's content.
|
||||
|
||||
'P'
|
||||
png file.
|
||||
The data begins with a network byte-ordered glong to identify the png
|
||||
file's size, immediately followed by the file's content.
|
||||
|
||||
'X'
|
||||
this type identifier is reserved for experimental extensions.
|
||||
|
||||
|
||||
{5}. Tree Dictionary
|
||||
The tree dictionary support is used for information viewing, etc.
|
||||
|
||||
A tree dictionary contains three file: sometreedict.ifo, sometreedict.tdx.gz
|
||||
and sometreedict.dict.dz.
|
||||
|
||||
It is better to compress the .tdx file, as it is always load into memory.
|
||||
|
||||
The .ifo file has the following format:
|
||||
|
||||
StarDict's treedict ifo file
|
||||
version=2.4.2
|
||||
[options]
|
||||
|
||||
Available options:
|
||||
|
||||
bookname= // required
|
||||
tdxfilesize= // required
|
||||
wordcount=
|
||||
author=
|
||||
email=
|
||||
website=
|
||||
description=
|
||||
date=
|
||||
sametypesequence=
|
||||
|
||||
wordcount is only used for info view in the dict manage dialog, so it is not
|
||||
important in tree dictionary.
|
||||
|
||||
The .tdx file is just the word list.
|
||||
-----------
|
||||
The word list is a tree list of word entries.
|
||||
|
||||
Each entry in the word list contains four fields, one after the other:
|
||||
word_str; // a utf-8 string terminated by '\0'.
|
||||
word_data_offset; // word data's offset in .dict file
|
||||
word_data_size; // word data's total size in .dict file. it can be 0.
|
||||
word_subentry_count; //have many sub word this entry has, 0 means none.
|
||||
|
||||
Subentry is immidiately followed by its parent entry. This make the order is
|
||||
just as when a tree list with all its nodes extended, then sort from top to
|
||||
bottom.
|
||||
|
||||
The .dict file's format is the same as the normal dictionary.
|
||||
|
||||
|
||||
|
||||
{6}. More information.
|
||||
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and
|
||||
"src/tools/*.cpp" for more information.
|
||||
|
||||
If you have any questions, email me. :)
|
||||
|
||||
Thanks to Will Robinson <wsr23@stanford.edu> for cleaning up this file's
|
||||
English.
|
||||
|
||||
Hu Zheng <huzheng_001@163.com>
|
||||
http://forlinux.yeah.net
|
||||
2003.11.11
|
||||
Reference in New Issue
Block a user