Tropy's support for Chinese search is poor

bingling00 · October 21, 2022, 1:57pm

Tropy’s support for Chinese search is poor. Chinese rarely uses spaces, so Tropy can only search the first few Chinese words.

inukshuk · October 21, 2022, 8:47pm

Yes, unfortunately the tokenizer used for the full-text index requires explicit word boundaries such as spaces and punctuation. I hope we’ll be able to improve this and use SQLite’s ICU tokenizer in a future version. I would also like to add a way to do contains/suffix matching on the full-text index, which would at least allow you to use a query like *男.
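To illustrate why only the beginning of a Chinese passage is findable, here is a minimal sketch in Python. It is not Tropy's actual code: `tokenize` and `prefix_search` are hypothetical stand-ins for a tokenizer that splits at spaces and punctuation and for prefix matching against the resulting tokens.

```python
import re

def tokenize(text):
    # Hypothetical stand-in for a boundary-based tokenizer: a token is a
    # maximal run of word characters, split at spaces and punctuation.
    return re.findall(r"\w+", text)

def prefix_search(tokens, query):
    # Prefix matching: the query must match the start of a token.
    return [t for t in tokens if t.startswith(query)]

english = tokenize("a man and a woman")
chinese = tokenize("一个男人和一个女人")

# The English sentence yields five tokens; the Chinese sentence has no
# spaces or punctuation, so the whole sentence becomes a single token.
print(english)
print(chinese)

# Only the start of that single token can be matched.
print(prefix_search(chinese, "一个"))  # finds the sentence
print(prefix_search(chinese, "男"))    # finds nothing
```

Under these assumptions, a query for 男 fails even though the character is in the text, because it never appears at the start of a token.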

Furthermore, the current quick find functionality queries only the full-text indices. We’re going to add more advanced search functionality in one of the upcoming releases which will allow you to query values directly. While this is much slower, it will also allow you to search for phrases regardless of word boundaries.
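As a rough sketch of what querying values directly could look like, the example below runs a substring match with SQLite's `LIKE` operator through Python's `sqlite3` module. The `notes` table and the `contains` helper are made up for illustration and do not reflect Tropy's schema or its planned search feature.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table; Tropy's real schema is different.
con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, text TEXT)")
con.executemany(
    "INSERT INTO notes (text) VALUES (?)",
    [("一个男人和一个女人",), ("a man and a woman",)],
)

def contains(con, phrase):
    # Substring match on the stored value itself. This scans every row
    # instead of using an index, so it is slower, but it does not depend
    # on word boundaries. Note that % and _ in the phrase act as wildcards.
    pattern = "%" + phrase + "%"
    rows = con.execute(
        "SELECT text FROM notes WHERE text LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]

hits = contains(con, "男人")
print(hits)
```

The phrase 男人 sits in the middle of the sentence, yet the row is found because the match runs against the raw value and not against tokens.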

So I’m afraid there is no quick fix right away, but the situation should improve with the advanced search feature.

bingling00 · October 22, 2022, 12:40am

Thanks 🙏 That would be nice.

hency · August 9, 2023, 10:40am

I would second this request to improve the searchability of Chinese characters. It’s disheartening to find out that the Chinese texts I spent hours transcribing don’t turn up in search because they lack spaces.

inukshuk · August 9, 2023, 10:51am

We agree. This is still on our TODO list (in combination with some other SQLite-related changes). The Signal Desktop app, which uses similar technology, recently solved this by adding a custom tokenizer for full-text search. We’re planning to use their extension as well, but I can’t give you a timeline for when we’ll get to it.
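For a sense of what a CJK-aware tokenizer changes, here is a simplified Python sketch. It is not Signal's extension and makes no claim about how that extension segments text; it only shows one common approach, where each Chinese character becomes its own token. The `cjk_tokenize` name is hypothetical, and the pattern covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Either a single CJK ideograph, or a run of word characters that are
# not CJK ideographs (assumption: basic block U+4E00..U+9FFF only).
TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")

def cjk_tokenize(text):
    # Each ideograph is emitted as its own token, so any character in
    # the text can be matched, not just the first ones.
    return TOKEN.findall(text)

tokens = cjk_tokenize("Tropy一个男人")
print(tokens)
```

With tokens like these, a search for 男 can match directly, at the cost of a larger index and less precise matching for multi-character words.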