3 changes: 3 additions & 0 deletions .gitmodules
@@ -221,3 +221,6 @@
[submodule "contrib/NuRaft"]
path = contrib/NuRaft
url = https://github.com/ClickHouse-Extras/NuRaft.git
[submodule "contrib/LucenePlusPlus"]
path = contrib/LucenePlusPlus
url = https://github.com/cloudnativecube/LucenePlusPlus.git
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -490,6 +490,7 @@ include (cmake/find/rapidjson.cmake)
include (cmake/find/fastops.cmake)
include (cmake/find/odbc.cmake)
include (cmake/find/rocksdb.cmake)
include (cmake/find/luceneplusplus.cmake)
include (cmake/find/libpqxx.cmake)
include (cmake/find/nuraft.cmake)

65 changes: 0 additions & 65 deletions README.md
@@ -13,68 +13,3 @@ ClickHouse® is an open-source column-oriented database management system that a
* [Code Browser](https://clickhouse.tech/codebrowser/html_report/ClickHouse/index.html) with syntax highlight and navigation.
* [Contacts](https://clickhouse.tech/#contacts) can help to get your questions answered if there are any.
* You can also [fill this form](https://clickhouse.tech/#meet) to meet Yandex ClickHouse team in person.


## Neoway Research

This branch is part of a research project in which we implemented a proof of concept for full-text search using [ClickHouse](https://github.com/ClickHouse/ClickHouse) and [Tantivy](https://github.com/tantivy-search/tantivy).

Tantivy is a full text search engine library written in Rust.

The implementation consists of creating a Tantivy storage engine and a `tantivy` SQL function.
Because this is just a test, we decided to hard-code three column names so that we don't have to build all the logic behind dynamic column names with different types. The columns are `primary_id`, `secondary_id` and `body`. We can then create the table with the following query:

```sql
CREATE TABLE fulltext_table
(
primary_id UInt64,
secondary_id UInt64,
body String
)
ENGINE = Tantivy('/var/lib/clickhouse/tantivy/fulltext_table')
-- The Tantivy engine takes as its parameter the path where the data is saved.
```

The [Storage Engine](https://github.com/NeowayLabs/ClickHouse/blob/fulltext-21.3/src/Storages/StorageTantivy.cpp) has to be able to receive data from INSERT queries and index it into Tantivy. For SELECT queries we need to push the full-text WHERE clause down to Tantivy and build a ClickHouse column from the result.
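Indexing therefore happens through ordinary INSERT statements. A minimal sketch, with made-up sample rows:

```sql
-- Hypothetical sample rows; each inserted row is handed to the Tantivy engine for indexing.
INSERT INTO fulltext_table (primary_id, secondary_id, body) VALUES
    (1, 100, 'first document text ...'),
    (2, 200, 'second document text ...')
```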

Because the full-text search query needs to be sent to Tantivy, we created a SQL function named `tantivy`, so the syntax for queries is the following:
```sql
SELECT primary_id
FROM fulltext_table
WHERE tantivy('full text query here')
```
The `tantivy` SQL function doesn't return anything and has no logic inside. Its only purpose is to validate the input and generate the `ASTSelectQuery`.
Inside the storage engine we take the AST parameters and push the query down to the Rust implementation in the folder [contrib/tantivysearch](https://github.com/NeowayLabs/ClickHouse/tree/fulltext-21.3/contrib/tantivysearch).

When data is indexed in Tantivy it needs to be committed. That is too expensive to do on every insert, so we decided to trigger it when the table is optimized:
```sql
OPTIMIZE TABLE fulltext_table FINAL
```
After the optimization the data is available for queries.

## Results
We inserted 39 million unique texts with an average length of 4895 characters. Our testing machine is an n2d-standard-16 on Google Cloud: 16 CPUs, 62.8 GB of memory, and 2 local 375 GB SSDs in RAID 0.

In our implementation we were not interested in retrieving the actual text from the search result, so we chose to return only the ID columns. It would be easy to return the text, but for our use case we just want statistics on the data, for example answering how many rows match the phrase 'covid 19' (see the sketch below). Such a query runs at about the speed Tantivy would run on its own, plus a small overhead to copy the result into a ClickHouse column. For the majority of searches we got results in milliseconds. Queries using the OR operator and matching almost all of the texts were slower and could take more than 1 second.
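As an illustration, such a statistics query could be written like this, following the syntax described above (a sketch, not taken from the repository):

```sql
-- Counts how many indexed rows match the full-text query.
SELECT count()
FROM fulltext_table
WHERE tantivy('covid 19')
```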

Another use case is a table with dozens of columns that is related to our fulltext_table by an ID. In that case we would have a query like this:
```sql
SELECT *
FROM a_very_big_table
WHERE
-- many_filters_here
AND primary_id IN (
SELECT primary_id
FROM fulltext_table
WHERE tantivy('full text query here')
)
```
We also wanted to run many different queries, all with the same text filter, in parallel. Instead of running the same Tantivy query several times at once and producing the same result, we implemented a concurrent bounded cache with a configurable TTL that performs a single computation for multiple parallel queries on the same input and delivers the shared result to all of them once done. Those queries turned out to be fast, which makes this solution very promising.


## Alternatives
Other alternatives are to use [data skipping indexes](https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) or to implement something akin to an [inverted index](https://hannes.muehleisen.org/SIGIR2014-column-stores-ir-prototyping.pdf) directly in SQL.

Data skipping indexes require a lot of parameter tuning, and it is very tricky to make them work with the SQL functions (an illustrative index definition is sketched below). Even with all that tuning we got very poor performance.
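For reference, a token-based skipping index on the text column could be declared as below. This is an illustrative sketch only: the bloom filter parameters and granularity are arbitrary and are exactly the kind of values that need tuning.

```sql
-- Illustrative tokenbf_v1 skipping index over the text column.
-- The bloom filter size, number of hash functions and granularity are arbitrary.
CREATE TABLE fulltext_skipping
(
    primary_id UInt64,
    secondary_id UInt64,
    body String,
    INDEX body_tokens body TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY primary_id

-- Queries then filter with token functions such as hasToken(body, 'covid')
-- so that whole granules can be skipped.
```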

An inverted index is an interesting solution, but it is very complex to implement and requires an external tokenizer and large, complicated queries to search the data (a rough sketch follows). Its performance is better than data skipping indexes but still too slow for a real scenario.
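The rough idea of such an SQL-level inverted index, under the assumption of an external tokenizer populating a term table, would be:

```sql
-- Hypothetical term table filled by an external tokenizer.
CREATE TABLE inverted_index
(
    term String,
    primary_id UInt64
)
ENGINE = MergeTree
ORDER BY term
```

Searching then becomes a set operation over the query terms, for example `SELECT primary_id FROM inverted_index WHERE term IN ('covid', '19')`, which has to be combined with grouping or joins to express AND and phrase semantics; that is where the big complicated queries come from.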
49 changes: 49 additions & 0 deletions cmake/find/luceneplusplus.cmake
@@ -0,0 +1,49 @@
option(ENABLE_LUCENE "Enable LUCENE" ${ENABLE_LIBRARIES})

if (NOT ENABLE_LUCENE)
if (USE_INTERNAL_LUCENE_LIBRARY)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't use internal lucene library with ENABLE_LUCENE=OFF")
endif()
return()
endif()

option(USE_INTERNAL_LUCENE_LIBRARY "Set to FALSE to use system LUCENE library instead of bundled" ${NOT_UNBUNDLED})

if (NOT EXISTS "${ClickHouse_SOURCE_DIR}/contrib/LucenePlusPlus/CMakeLists.txt")
if (USE_INTERNAL_LUCENE_LIBRARY)
message (WARNING "submodule contrib is missing. to fix try run: \n git submodule update --init --recursive")
message(${RECONFIGURE_MESSAGE_LEVEL} "cannot find internal lucene")
endif()
set (MISSING_INTERNAL_LUCENE 1)
endif ()

if (NOT USE_INTERNAL_LUCENE_LIBRARY)
find_library (LUCENE_LIBRARY lucene++)
find_path (LUCENE_INCLUDE_DIR NAMES lucene++/LuceneHeaders.h PATHS ${LUCENE_INCLUDE_PATHS})
if (NOT LUCENE_LIBRARY OR NOT LUCENE_INCLUDE_DIR)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't find system lucene library")
endif()

if (NOT ZLIB_LIBRARY)
include(cmake/find/zlib.cmake)
endif()

if(ZLIB_LIBRARY)
list (APPEND LUCENE_LIBRARY ${ZLIB_LIBRARY})
else()
message (${RECONFIGURE_MESSAGE_LEVEL}
"Can't find system lucene: zlib=${ZLIB_LIBRARY} ;")
endif()
endif ()

if(LUCENE_LIBRARY AND LUCENE_INCLUDE_DIR)
set(USE_LUCENE 1)
elseif (NOT MISSING_INTERNAL_LUCENE)
set (USE_INTERNAL_LUCENE_LIBRARY 1)

set (LUCENE_INCLUDE_DIR "${ClickHouse_SOURCE_DIR}/contrib/LucenePlusPlus/include")
set (LUCENE_LIBRARY "lucene++")
set (USE_LUCENE 1)
endif ()

message (STATUS "Using LUCENE=${USE_LUCENE}: ${LUCENE_INCLUDE_DIR} : ${LUCENE_LIBRARY}")
2 changes: 1 addition & 1 deletion contrib/CMakeLists.txt
@@ -309,7 +309,7 @@ if (USE_INTERNAL_ROCKSDB_LIBRARY)
add_subdirectory(rocksdb-cmake)
endif()

add_subdirectory(tantivysearch-cmake)
add_subdirectory(LucenePlusPlus)

if (USE_LIBPQXX)
add_subdirectory (libpq-cmake)
1 change: 1 addition & 0 deletions contrib/LucenePlusPlus
Submodule LucenePlusPlus added at 460945
2 changes: 1 addition & 1 deletion contrib/boost
Submodule boost updated from ee24fa to eede62
37 changes: 37 additions & 0 deletions contrib/boost-cmake/CMakeLists.txt
@@ -13,6 +13,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
regex
context
coroutine
date_time
thread
)

if(Boost_INCLUDE_DIR AND Boost_FILESYSTEM_LIBRARY AND Boost_FILESYSTEM_LIBRARY AND
@@ -32,6 +34,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
add_library (_boost_system INTERFACE)
add_library (_boost_context INTERFACE)
add_library (_boost_coroutine INTERFACE)
add_library (_boost_date_time INTERFACE)
add_library (_boost_thread INTERFACE)

target_link_libraries (_boost_filesystem INTERFACE ${Boost_FILESYSTEM_LIBRARY})
target_link_libraries (_boost_iostreams INTERFACE ${Boost_IOSTREAMS_LIBRARY})
@@ -40,6 +44,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
target_link_libraries (_boost_system INTERFACE ${Boost_SYSTEM_LIBRARY})
target_link_libraries (_boost_context INTERFACE ${Boost_CONTEXT_LIBRARY})
target_link_libraries (_boost_coroutine INTERFACE ${Boost_COROUTINE_LIBRARY})
target_link_libraries (_boost_date_time INTERFACE ${Boost_DATE_TIME_LIBRARY})
target_link_libraries (_boost_thread INTERFACE ${Boost_THREAD_LIBRARY})

add_library (boost::filesystem ALIAS _boost_filesystem)
add_library (boost::iostreams ALIAS _boost_iostreams)
@@ -48,6 +54,8 @@ if (NOT USE_INTERNAL_BOOST_LIBRARY)
add_library (boost::system ALIAS _boost_system)
add_library (boost::context ALIAS _boost_context)
add_library (boost::coroutine ALIAS _boost_coroutine)
add_library (boost::date_time ALIAS _boost_date_time)
add_library (boost::thread ALIAS _boost_thread)
else()
set(EXTERNAL_BOOST_FOUND 0)
message (${RECONFIGURE_MESSAGE_LEVEL} "Can't find system boost")
@@ -220,4 +228,33 @@ if (NOT EXTERNAL_BOOST_FOUND)
add_library (boost::coroutine ALIAS _boost_coroutine)
target_include_directories (_boost_coroutine PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_coroutine PRIVATE _boost_context)

# date_time

set (SRCS_DATE_TIME
${LIBRARY_DIR}/libs/date_time/src/gregorian/date_generators.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_month.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_names.hpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/greg_weekday.cpp
${LIBRARY_DIR}/libs/date_time/src/gregorian/gregorian_types.cpp
${LIBRARY_DIR}/libs/date_time/src/posix_time/posix_time_types.cpp
)
add_library (_boost_date_time ${SRCS_DATE_TIME})
add_library (boost::date_time ALIAS _boost_date_time)
target_include_directories (_boost_date_time PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_date_time PRIVATE _boost_context)

# thread

set (SRCS_THREAD
${LIBRARY_DIR}/libs/thread/src/pthread/once.cpp
${LIBRARY_DIR}/libs/thread/src/pthread/once_atomic.cpp
${LIBRARY_DIR}/libs/thread/src/pthread/thread.cpp
${LIBRARY_DIR}/libs/thread/src/future.cpp
${LIBRARY_DIR}/libs/thread/src/tss_null.cpp
)
add_library (_boost_thread ${SRCS_THREAD})
add_library (boost::thread ALIAS _boost_thread)
target_include_directories (_boost_thread PRIVATE ${LIBRARY_DIR})
target_link_libraries(_boost_thread PRIVATE _boost_context _boost_date_time)
endif ()
15 changes: 0 additions & 15 deletions contrib/tantivysearch-cmake/CMakeLists.txt

This file was deleted.

1 change: 0 additions & 1 deletion contrib/tantivysearch/.gitignore

This file was deleted.
