3 changes: 3 additions & 0 deletions .gitmodules
@@ -221,3 +221,6 @@
[submodule "contrib/NuRaft"]
path = contrib/NuRaft
url = https://github.com/ClickHouse-Extras/NuRaft.git
[submodule "contrib/LucenePlusPlus"]
path = contrib/LucenePlusPlus
url = https://github.com/cloudnativecube/LucenePlusPlus.git
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -490,6 +490,7 @@ include (cmake/find/rapidjson.cmake)
include (cmake/find/fastops.cmake)
include (cmake/find/odbc.cmake)
include (cmake/find/rocksdb.cmake)
include (cmake/find/luceneplusplus.cmake)
include (cmake/find/libpqxx.cmake)
include (cmake/find/nuraft.cmake)

65 changes: 0 additions & 65 deletions README.md
@@ -13,68 +13,3 @@ ClickHouse® is an open-source column-oriented database management system that a
* [Code Browser](https://clickhouse.tech/codebrowser/html_report/ClickHouse/index.html) with syntax highlight and navigation.
* [Contacts](https://clickhouse.tech/#contacts) can help to get your questions answered if there are any.
* You can also [fill this form](https://clickhouse.tech/#meet) to meet Yandex ClickHouse team in person.


## Neoway Research

This branch is part of a research project in which we implemented a proof of concept for full-text search using [ClickHouse](https://github.com/ClickHouse/ClickHouse) and [Tantivy](https://github.com/tantivy-search/tantivy).

Tantivy is a full text search engine library written in Rust.

The implementation consists of creating a Tantivy storage engine and a `tantivy` SQL function.
Because this is just a test, we decided to hard-code three column names so that we don't have to build all the logic behind dynamic column names with different types. The columns are `primary_id`, `secondary_id` and `body`. We can then create the table with the following query:

```sql
CREATE TABLE fulltext_table
(
primary_id UInt64,
secondary_id UInt64,
body String
)
ENGINE = Tantivy('/var/lib/clickhouse/tantivy/fulltext_table')
-- The Tantivy engine takes as its parameter the path where the data is saved.
```

The [Storage Engine](https://github.com/NeowayLabs/ClickHouse/blob/fulltext-21.3/src/Storages/StorageTantivy.cpp) has to be able to receive data from INSERT queries and index it into Tantivy. For SELECT queries we need to push the full-text WHERE clause down to Tantivy and build a ClickHouse column from the result.
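Indexing therefore happens through ordinary INSERT statements. A minimal sketch, with made-up sample rows:

```sql
-- Hypothetical sample rows; each inserted row is handed to the Tantivy engine for indexing.
INSERT INTO fulltext_table (primary_id, secondary_id, body) VALUES
    (1, 100, 'first document text ...'),
    (2, 200, 'second document text ...')
```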

Because the full-text search query needs to be sent to Tantivy, we created a SQL function named `tantivy`, so the syntax for queries is the following:
```sql
SELECT primary_id
FROM fulltext_table
WHERE tantivy('full text query here')
```
The `tantivy` SQL function doesn't return anything and has no logic inside. Its only purpose is to validate the input and generate the `ASTSelectQuery`.
Inside the storage engine we take the AST parameters and push the query down to the Rust implementation in the folder [contrib/tantivysearch](https://github.com/NeowayLabs/ClickHouse/tree/fulltext-21.3/contrib/tantivysearch).

When data is indexed in Tantivy it needs to be committed. That is too expensive to do on every insert, so we decided to trigger it when the table is optimized:
```sql
OPTIMIZE TABLE fulltext_table FINAL
```
After the optimization the data is available for queries.

## Results
We inserted 39 million unique texts with an average length of 4895 characters. Our testing machine is an n2d-standard-16 on Google Cloud: 16 CPUs, 62.8 GB of memory, and 2 local 375 GB SSDs in RAID 0.

In our implementation we were not interested in retrieving the actual text from the search result, so we chose to return only the ID columns. It would be easy to return the text, but for our use case we just want statistics on the data, for example answering how many rows match the phrase 'covid 19' (see the sketch below). Such a query runs at about the speed Tantivy would run on its own, plus a small overhead to copy the result into a ClickHouse column. For the majority of searches we got results in milliseconds. Queries using the OR operator and matching almost all of the texts were slower and could take more than 1 second.
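As an illustration, such a statistics query could be written like this, following the syntax described above (a sketch, not taken from the repository):

```sql
-- Counts how many indexed rows match the full-text query.
SELECT count()
FROM fulltext_table
WHERE tantivy('covid 19')
```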

Another use case is a table with dozens of columns that is related to our fulltext_table by an ID. In that case we would have a query like this:
```sql
SELECT *
FROM a_very_big_table
WHERE
-- many_filters_here
AND primary_id IN (
SELECT primary_id
FROM fulltext_table
WHERE tantivy('full text query here')
)
```
We also wanted to run many different queries, all with the same text filter, in parallel. Instead of running the same Tantivy query several times at once and producing the same result, we implemented a concurrent bounded cache with a configurable TTL that performs a single computation for multiple parallel queries on the same input and delivers the shared result to all of them once done. Those queries turned out to be fast, which makes this solution very promising.


## Alternatives
Other alternatives are to use [data skipping indexes](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) or to implement something akin to an [inverted index](https://hannes.muehleisen.org/SIGIR2014-column-stores-ir-prototyping.pdf) directly in SQL.

Data skipping indexes require a lot of parameter tuning, and it is very tricky to make them work with the SQL functions (an illustrative index definition is sketched below). Even with all that tuning we got very poor performance.
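For reference, a token-based skipping index on the text column could be declared as below. This is an illustrative sketch only: the bloom filter parameters and granularity are arbitrary and are exactly the kind of values that need tuning.

```sql
-- Illustrative tokenbf_v1 skipping index over the text column.
-- The bloom filter size, number of hash functions and granularity are arbitrary.
CREATE TABLE fulltext_skipping
(
    primary_id UInt64,
    secondary_id UInt64,
    body String,
    INDEX body_tokens body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY primary_id

-- Queries then filter with token functions such as hasToken(body, 'covid')
-- so that whole granules can be skipped.
```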

An inverted index is an interesting solution, but it is very complex to implement and requires an external tokenizer and large, complicated queries to search the data (a rough sketch follows). Its performance is better than data skipping indexes but still too slow for a real scenario.
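The rough idea of such an SQL-level inverted index, under the assumption of an external tokenizer populating a term table, would be:

```sql
-- Hypothetical term table filled by an external tokenizer.
CREATE TABLE inverted_index
(
    term String,
    primary_id UInt64
)
ENGINE = MergeTree
ORDER BY term
```

Searching then becomes a set operation over the query terms, for example `SELECT primary_id FROM inverted_index WHERE term IN ('covid', '19')`, which has to be combined with grouping or joins to express AND and phrase semantics; that is where the big complicated queries come from.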
49 changes: 49 additions & 0 deletions cmake/find/luceneplusplus.cmake
@@ -0,0 +1,49 @@
option(ENABLE_LUCENE "Enable LUCENE" ${ENABLE_LIBRARIES})

if (NOT ENABLE_LUCENE)
if (USE_INTERNAL_LUCENE_LIBRARY)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't use internal lucene library with ENABLE_LUCENE=OFF")
endif()
return()
endif()

option(USE_INTERNAL_LUCENE_LIBRARY "Set to FALSE to use system LUCENE library instead of bundled" ${NOT_UNBUNDLED})

if (NOT EXISTS "${ClickHouse_SOURCE_DIR}/contrib/LucenePlusPlus/CMakeLists.txt")
if (USE_INTERNAL_LUCENE_LIBRARY)
message (WARNING "submodule contrib is missing. to fix try run: \n git submodule update --init --recursive")
message(${RECONFIGURE_MESSAGE_LEVEL} "cannot find internal lucene")
endif()
set (MISSING_INTERNAL_LUCENE 1)
endif ()

if (NOT USE_INTERNAL_LUCENE_LIBRARY)
find_library (LUCENE_LIBRARY lucene++)
find_path (LUCENE_INCLUDE_DIR NAMES lucene++/LuceneHeaders.h PATHS ${LUCENE_INCLUDE_PATHS})
if (NOT LUCENE_LIBRARY OR NOT LUCENE_INCLUDE_DIR)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't find system lucene library")
endif()

if (NOT ZLIB_LIBRARY)
include(cmake/find/zlib.cmake)
endif()

if(ZLIB_LIBRARY)
list (APPEND LUCENE_LIBRARY ${ZLIB_LIBRARY})
else()
message (${RECONFIGURE_MESSAGE_LEVEL}
"Can't find system lucene: zlib=${ZLIB_LIBRARY} ;")
endif()
endif ()

if(LUCENE_LIBRARY AND LUCENE_INCLUDE_DIR)
set(USE_LUCENE 1)
elseif (NOT MISSING_INTERNAL_LUCENE)
set (USE_INTERNAL_LUCENE_LIBRARY 1)

set (LUCENE_INCLUDE_DIR "${ClickHouse_SOURCE_DIR}/contrib/LucenePlusPlus/include")
set (LUCENE_LIBRARY "lucene++")
set (USE_LUCENE 1)
endif ()

message (STATUS "Using LUCENE=${USE_LUCENE}: ${LUCENE_INCLUDE_DIR} : ${LUCENE_LIBRARY}")
2 changes: 1 addition & 1 deletion contrib/CMakeLists.txt
@@ -309,7 +309,7 @@ if (USE_INTERNAL_ROCKSDB_LIBRARY)
add_subdirectory(rocksdb-cmake)
endif()

add_subdirectory(tantivysearch-cmake)
add_subdirectory(LucenePlusPlus)

if (USE_LIBPQXX)
add_subdirectory (libpq-cmake)
1 change: 1 addition & 0 deletions contrib/LucenePlusPlus
Submodule LucenePlusPlus added at 460945
2 changes: 1 addition & 1 deletion contrib/boost
Submodule boost updated from ee24fa to eede62
37 changes: 37 additions & 0 deletions contrib/boost-cmake/CMakeLists.txt
@@ -13,6 +13,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
regex
context
coroutine
date_time
thread
)

if(Boost_INCLUDE_DIR AND Boost_FILESYSTEM_LIBRARY AND Boost_FILESYSTEM_LIBRARY AND
@@ -32,6 +34,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
add_library (_boost_system INTERFACE)
add_library (_boost_context INTERFACE)
add_library (_boost_coroutine INTERFACE)
add_library (_boost_date_time INTERFACE)
add_library (_boost_thread INTERFACE)

target_link_libraries (_boost_filesystem INTERFACE ${Boost_FILESYSTEM_LIBRARY})
target_link_libraries (_boost_iostreams INTERFACE ${Boost_IOSTREAMS_LIBRARY})
@@ -40,6 +44,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
target_link_libraries (_boost_system INTERFACE ${Boost_SYSTEM_LIBRARY})
target_link_libraries (_boost_context INTERFACE ${Boost_CONTEXT_LIBRARY})
target_link_libraries (_boost_coroutine INTERFACE ${Boost_COROUTINE_LIBRARY})
target_link_libraries (_boost_date_time INTERFACE ${Boost_DATE_TIME_LIBRARY})
target_link_libraries (_boost_thread INTERFACE ${Boost_THREAD_LIBRARY})

add_library (boost::filesystem ALIAS _boost_filesystem)
add_library (boost::iostreams ALIAS _boost_iostreams)
@@ -48,6 +54,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
add_library (boost::system ALIAS _boost_system)
add_library (boost::context ALIAS _boost_context)
add_library (boost::coroutine ALIAS _boost_coroutine)
add_library (boost::date_time ALIAS _boost_date_time)
add_library (boost::thread ALIAS _boost_thread)
else()
set(EXTERNAL_BOOST_FOUND 0)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't find system boost")
@@ -220,4 +228,33 @@ if (NOT EXTERNAL_BOOST_FOUND)
add_library (boost::coroutine ALIAS _boost_coroutine)
target_include_directories (_boost_coroutine PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_coroutine PRIVATE _boost_context)

# date_time

set (SRCS_DATE_TIME
${LIBRARY_DIR}/libs/date_time/src/gregorian/date_generators.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_month.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_names.hpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_weekday.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/gregorian_types.cpp
${LIBRARY_DIR}/libs/date_time/src/posix_time/posix_time_types.cpp
)
add_library (_boost_date_time ${SRCS_DATE_TIME})
add_library (boost::date_time ALIAS _boost_date_time)
target_include_directories (_boost_date_time PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_date_time PRIVATE _boost_context)

# thread

set (SRCS_THREAD
${LIBRARY_DIR}/libs/thread/src/pthread/once.cpp
${LIBRARY_DIR}/libs/thread/src/pthread/once_atomic.cpp
${LIBRARY_DIR}/libs/thread/src/pthread/thread.cpp
${LIBRARY_DIR}/libs/thread/src/future.cpp
${LIBRARY_DIR}/libs/thread/src/tss_null.cpp
)
add_library (_boost_thread ${SRCS_THREAD})
add_library (boost::thread ALIAS _boost_thread)
target_include_directories (_boost_thread PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_thread PRIVATE _boost_context _boost_date_time)
endif ()
15 changes: 0 additions & 15 deletions contrib/tantivysearch-cmake/CMakeLists.txt

This file was deleted.

1 change: 0 additions & 1 deletion contrib/tantivysearch/.gitignore

This file was deleted.
