lookup: return all matching entries found during lookup

Previously, we would just return the first entry we found that matched
the requested word. This caused issues with dictionaries that have many
entries which can be found using the same search string. In these
cases, the user got an arbitrary entry returned to them rather than the
full set.

While this may seem strange, this is incredibly commonplace in Japanese
and likely several other languages. In Japanese:

 * When written using kanji, the same string of characters can refer to
   more than one word, each with a completely different meaning.
   Examples include 潜る (くぐる、もぐる) and 辛い (からい、つらい).

 * When written in kana, the same string of characters can also refer to
   more than one word, each written using completely different kanji
   and with a completely different meaning. Examples include きく
   (聞く、効く、菊) and たつ (立つ、建つ、絶つ).

In both cases, these are different words in every sense of the word, and
each has its own headword in the dictionary. Thus, to be fully useful
for such dictionaries, sdcv needs to be able to return every matching
word in the dictionary.

The solution is conceptually simple -- return a set containing the
indices rather than just a single index. Since every list we search is
sorted (to allow binary searching), once we find one match we can just
walk backwards and forwards from the match point to find the entire
block of matching terms and add them to the set; the walk is linear in
the size of that block. A std::set is used so that we don't return
duplicate results needlessly.
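
The walk described above can be sketched roughly as follows. This is
only an illustration, not the code from this commit: find_all_matches is
a hypothetical helper, plain std::string comparison stands in for
whatever comparison sdcv uses, and long stands in for glong.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not sdcv's actual code): return the indices of every
// entry in `sorted` that equals `key`. `sorted` must be in ascending order.
std::set<long> find_all_matches(const std::vector<std::string> &sorted,
                                const std::string &key)
{
    std::set<long> idxs;
    const long size = static_cast<long>(sorted.size());
    long lo = 0, hi = size - 1;

    // Ordinary binary search for any one matching entry.
    while (lo <= hi) {
        const long mid = lo + (hi - lo) / 2;
        const int cmp = key.compare(sorted[mid]);
        if (cmp < 0) {
            hi = mid - 1;
        } else if (cmp > 0) {
            lo = mid + 1;
        } else {
            // Found one match: walk backwards, then forwards, to cover the
            // whole contiguous block of equal entries.
            for (long i = mid; i >= 0 && sorted[i] == key; --i)
                idxs.insert(i);
            for (long i = mid + 1; i < size && sorted[i] == key; ++i)
                idxs.insert(i);
            break;
        }
    }
    return idxs; // empty if nothing matched
}
```

Because the list is sorted, equal entries are always adjacent, so the
two loops stop as soon as they leave the block.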

In practice the solution was a bit more complicated, because .oft cache
files require a bit more fiddling, and because the ->lookup methods are
also used by some callers to find the next entry if no entry was found.
But on the whole it's not too drastic of a change from the previous
setup.
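
To illustrate the "next entry" behaviour, here is a minimal sketch of a
lookup that fills a set on success and otherwise reports where the query
would sort. It is an assumption-laden stand-in, not sdcv's
implementation: WordList is hypothetical, std::equal_range replaces the
manual walk, and the meaning given to next_idx (index of the first entry
sorting after the query) is a guess at the intended semantics.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an index class; not sdcv's actual code.
struct WordList {
    std::vector<std::string> words; // must be sorted ascending

    // Fills `idxs` with every index whose entry equals `str`.
    // `next_idx` is set to the index of the first entry sorting after `str`
    // (assumed semantics), so callers can continue from there on a miss.
    bool lookup(const std::string &str, std::set<long> &idxs, long &next_idx) const
    {
        const auto range = std::equal_range(words.begin(), words.end(), str);
        next_idx = static_cast<long>(range.second - words.begin());
        for (auto it = range.first; it != range.second; ++it)
            idxs.insert(static_cast<long>(it - words.begin()));
        return range.first != range.second;
    }

    // Convenience overload for callers that do not need `next_idx`.
    bool lookup(const std::string &str, std::set<long> &idxs) const
    {
        long unused_next_idx;
        return lookup(str, idxs, unused_next_idx);
    }
};
```

Keeping a two-argument overload that discards next_idx means callers
that only want the matches do not have to declare a dummy variable.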

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Aleksa Sarai
2021-10-18 17:00:59 +11:00
committed by Evgeniy Dushistov
parent 3d15ce3b07
commit 6d385221d0
3 changed files with 180 additions and 119 deletions

@@ -6,6 +6,7 @@
 #include <list>
 #include <map>
 #include <memory>
+#include <set>
 #include <string>
 #include <vector>
@@ -96,7 +97,11 @@ public:
     virtual const gchar *get_key(glong idx) = 0;
     virtual void get_data(glong idx) = 0;
     virtual const gchar *get_key_and_data(glong idx) = 0;
-    virtual bool lookup(const char *str, glong &idx) = 0;
+    virtual bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx) = 0;
+    virtual bool lookup(const char *str, std::set<glong> &idxs) {
+        glong unused_next_idx;
+        return lookup(str, idxs, unused_next_idx);
+    };
 };
 class SynFile
@@ -105,7 +110,8 @@ public:
     SynFile() {}
     ~SynFile() {}
     bool load(const std::string &url, gulong wc);
-    bool lookup(const char *str, glong &idx);
+    bool lookup(const char *str, std::set<glong> &idxs, glong &next_idx);
+    bool lookup(const char *str, std::set<glong> &idxs);
     const gchar *get_key(glong idx) { return synlist[idx]; }
 private:
@@ -137,7 +143,11 @@ public:
         *offset = idx_file->wordentry_offset;
         *size = idx_file->wordentry_size;
     }
-    bool Lookup(const char *str, glong &idx);
+    bool Lookup(const char *str, std::set<glong> &idxs, glong &next_idx);
+    bool Lookup(const char *str, std::set<glong> &idxs) {
+        glong unused_next_idx;
+        return Lookup(str, idxs, unused_next_idx);
+    }
     bool LookupWithRule(GPatternSpec *pspec, glong *aIndex, int iBuffLen);
@@ -188,12 +198,12 @@ public:
     const gchar *poGetCurrentWord(glong *iCurrent);
     const gchar *poGetNextWord(const gchar *word, glong *iCurrent);
     const gchar *poGetPreWord(glong *iCurrent);
-    bool LookupWord(const gchar *sWord, glong &iWordIndex, int iLib)
+    bool LookupWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib)
     {
-        return oLib[iLib]->Lookup(sWord, iWordIndex);
+        return oLib[iLib]->Lookup(sWord, iWordIndices);
     }
-    bool LookupSimilarWord(const gchar *sWord, glong &iWordIndex, int iLib);
-    bool SimpleLookupWord(const gchar *sWord, glong &iWordIndex, int iLib);
+    bool LookupSimilarWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib);
+    bool SimpleLookupWord(const gchar *sWord, std::set<glong> &iWordIndices, int iLib);
     bool LookupWithFuzzy(const gchar *sWord, gchar *reslist[], gint reslist_size);
     gint LookupWithRule(const gchar *sWord, gchar *reslist[]);