index_search_monorepo
- About the Project
- Built With
- Phases
- Project Set Up
- Content Structure
- Design
- Functionality
- Usage
- Tests
- Hosting
- Experiments
- Resources
This repository is a monorepo for all the Python code generated as part of the HathiTrust Index Search project. It contains multiple subprojects, each with its own functionality and purpose.
For example, the ht_search project is responsible for searching documents in Solr, while the ht_indexer
project is responsible for indexing data into the Solr full-text index. Other projects handle
monitoring and tracking of the indexing process.
The monorepo structure allows for better organization and management of shared code and dependencies. It also makes the code easier to maintain, supports collaborative development, and scales to new projects and features.
Python development tools
- Poetry: Dependency management and packaging tool for Python
- Pytest: Testing framework
- Mypy: Static type checker
- Ruff: Multi-purpose tool that combines a linter (including docstring checks) and a formatter for Python code
The project is divided into several phases, each focusing on different aspects of the indexing and searching process.
- Phase 1: Create a monorepo merging the `ht_search` and `ht_indexer` projects. The script here was used to preserve the commit history of both projects.
- Phase 1.1: Create a Docker image for the monorepo, which includes all the necessary dependencies and configurations.
- Phase 1.2: Structure the monorepo to include shared libraries and projects, making it easier to manage dependencies and code reuse.
- Phase 1.3: Structure `ht_indexer` to ensure all the features work as expected with the new monorepo structure.
- Phase 2: Set up a CI/CD pipeline to automate testing and deployment for the `ht_indexer` project.
- Phase 3: Structure the `ht_search` project to ensure all the features work as expected with the new monorepo structure.
- Phase 4: Set up a CI/CD pipeline to automate testing and deployment for the `ht_search` project.
- Phase 5: Repeat the process for other projects in the monorepo, ensuring that each project is properly structured and tested.
- Phase 6: Install the monorepo in editable mode, allowing for real-time updates during development.
  - To define a dependency in editable mode, add `develop = true` to its entry, e.g. `common-lib = {path = "../../libs/common_lib", develop = true}`
  - This approach is useful when developing with the Docker image, as it allows you to edit the code in the monorepo and see the changes reflected in the Docker container without having to rebuild the image every time.
  - I don't know how to do this yet, but I will figure it out.
All the applications and libraries run in Docker containers based on the python:3.11.0a7-slim-buster image. Their dependencies are managed with Poetry.
We use a Makefile and a Dockerfile to manage the environment setup and build the image, reproducing in the Docker image
the same paths used locally.
The deployment process is done through a Docker image, as shown below:
build:
	# Copy project files
	cp -r src $(TMP_DIR)/src
	cp Dockerfile .dockerignore $(TMP_DIR)
	cp pyproject.toml poetry.lock README.md $(TMP_DIR)
	# Copy the shared package written in pyproject.toml
	cp -r ../../libs/common_lib/ht_utils $(TMP_DIR)/ht_utils
	cp -r ../../libs/common_lib/ht_search $(TMP_DIR)/ht_search
	docker build -t $(IMAGE_NAME) $(TMP_DIR)
	rm -rf $(TMP_DIR)
In the Makefile, a temporary directory is created with the project and its dependencies to build the local package.
Then, in the Dockerfile, we copy the project files, maintaining the same structure as in the temporary directory. We
use a multistage Dockerfile to build the image: the first stage installs the dependencies, and the second
stage copies the code into the image.
Steps to add a dependency:
In the Dockerfile,
- Copy the dependency on the base stage of the Docker image.
COPY ./ht_search /libs/ht_search/
- Copy the dependency on the final stage of the Docker image.
COPY --chown=${UID}:${GID} ht_search/ ht_search/
- In the pyproject.toml file, add the dependency to the [tool.poetry.dependencies] section.
ht-search = {path = "../../libs/ht_search"}
The design of the index_search_monorepo is structured to ensure uniformity between projects and to avoid duplicated
code.
- Monorepo Structure
The monorepo is organized into two main directories: `libs` and `app`.
  - `libs`: Contains shared libraries and utilities that are reused across multiple projects.
  - `app`: Contains independent projects, each with its own functionality and dependencies.
- Shared Libraries
Shared libraries are placed in the `libs` directory (e.g., `common_lib` and `ht_search`). Each shared library has its own `pyproject.toml` file for dependency management. These libraries are installed in non-editable mode, which does not allow real-time updates during development. Right now, a multistage Dockerfile builds the image in two stages: the first stage installs the dependencies, and the second stage copies the code into the image. Because the second stage copies the code into `site-packages`, changes made in the monorepo are not reflected in the container until the image is rebuilt.
For simplicity, I decided to install dependencies using relative paths. Managing the package paths both locally
and in the Docker image has been a challenge. Because the relative paths became problematic, I followed the process described
here and
use a Makefile and a Dockerfile to create the image, reproducing in the Docker image the same paths used locally.
- Independent Projects
Each project in the `app` directory is self-contained, with its own `pyproject.toml`, `Dockerfile`, `src` and `tests` directories. Projects can depend on shared libraries in the `libs` directory using relative paths.
To add a dependency, you must:
  - Add the dependency to the `pyproject.toml` file of the project.
  - Update the `Dockerfile` to copy the dependency into the Docker image.
  - Install the dependency in non-editable mode using Poetry.
  - Run tests to ensure everything works as expected.
- Environment Set Up
The monorepo uses `Makefile` and `Dockerfile` to define clear steps for setting up the development environment and building Docker images. Docker images are used for deployment, ensuring consistency across environments.
- Testing
Each project and shared library includes a `tests` directory for unit tests. `pytest` is used as the testing framework, and tests can be run individually for each project or across the entire monorepo.
- CI/CD Integration
The monorepo is designed to support CI/CD pipelines for automated testing and deployment. Each project can have its own pipeline configuration, ensuring independent development and deployment.
- Versioning and Compatibility
By using relative paths for dependencies, all projects share a single version of shared libraries, ensuring compatibility. Breaking changes in shared libraries are addressed across all dependent projects in a single pull request.
- Scalability
The modular design allows for the easy addition of new projects or shared libraries without disrupting the existing structure. The use of Docker ensures that new projects can be deployed independently.
How could we use git diff to detect the changes in the monorepo and decide which services to deploy?
Next step: create a script that runs the git diff command to identify the changed paths.
If there are changes in the libs directory, we need to deploy all the services because all of them depend
on the shared libraries.
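One possible shape for that script is sketched below. It is only an illustration under stated assumptions: the base reference `origin/main`, the function names, and the service list are hypothetical, and the rule "any change under `libs` redeploys everything" is the one described above. The path-mapping logic is kept separate from the `git` call so it can be exercised without a checkout.

```python
from __future__ import annotations

import subprocess
from pathlib import PurePosixPath


def services_to_deploy(changed_paths: list[str], all_services: list[str]) -> set[str]:
    """Map changed file paths to the services that need redeploying."""
    affected: set[str] = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue
        if parts[0] == "libs":
            # Every service depends on the shared libraries.
            return set(all_services)
        # app/<service>/<file...> marks that single service as changed.
        if parts[0] == "app" and len(parts) > 2 and parts[1] in all_services:
            affected.add(parts[1])
    return affected


def changed_paths_since(base_ref: str = "origin/main") -> list[str]:
    """Return paths changed between base_ref and HEAD (needs a git checkout)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


# In-memory example; the service names follow the app directory layout.
example_changes = ["app/ht_indexer/src/main.py", "README.md"]
print(services_to_deploy(example_changes, ["ht_indexer", "ht_searcher"]))
```

In a pipeline, `services_to_deploy(changed_paths_since(), [...])` would give the set of services to build; changes outside `app` and `libs` (such as the README) trigger nothing in this sketch.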
- Clone the repo
git clone https://github.com/hathitrust/index_search_monorepo.git
- Set up a development environment with poetry
In your workdir,
    * `poetry init` # Sets up your local environment and repository details
    * `poetry env use python` # Selects the interpreter; Poetry prints the virtual environment directory it uses
    * `source ~/index_search_monorepo-TUsF9qpC-py3.11/bin/activate` # Activate the virtual environment on a Mac
    * `C:\Users\user_name\AppData\Local\pypoetry\Cache\virtualenvs\index_search_monorepo-d4ARlKJT-py3.12\Scripts\Activate.ps1` # Activate the virtual environment on Windows
    * **Note**:
        On a Mac, Poetry stores its files under the home directory, e.g. /Users/user_name/Library/Caches/pypoetry/.
        On Windows, Poetry stores its files under the home directory, e.g. C:\Users\user_name\AppData\Local\pypoetry\
    * `poetry env use python3.12` # Create a new virtual environment using poetry
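The virtual environment directory name contains a generated suffix, so the activation paths in the steps above differ on every machine. As a hedged sketch (the helper names are hypothetical), the path can be requested from Poetry with `poetry env info --path` instead of being typed by hand, and the activation script derived from it:

```python
import subprocess
from pathlib import Path


def activation_script(venv_dir: str, windows: bool = False) -> Path:
    """Return the activation script inside a virtual environment directory."""
    venv = Path(venv_dir)
    if windows:
        return venv / "Scripts" / "Activate.ps1"
    return venv / "bin" / "activate"


def poetry_env_path(project_dir: str = ".") -> str:
    """Ask Poetry where the project's virtual environment lives.

    Requires Poetry on PATH and an existing environment for the project.
    """
    result = subprocess.run(
        ["poetry", "env", "info", "--path"],
        cwd=project_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, `activation_script(poetry_env_path())` gives the file to `source` on a Mac; pass `windows=True` for the PowerShell script.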
To use the monorepo, follow these steps:
- Clone the repository:
git clone https://github.com/hathitrust/index_search_monorepo.git
- Navigate to the project directory:
cd index_search_monorepo
- Create the Docker image:
cd app/ht_indexer
make build
- Run the Docker container:
cd app/ht_indexer
make run
- Run the tests:
cd app/ht_indexer
make test
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Squash your commits (`git rebase -i HEAD~n`, where n is the number of commits you want to squash)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
index_search_monorepo
├── README.md
├── Makefile
├── .gitignore
├── libs
├── common_lib
│ ├── ht_utils
│ │ ├── pyproject.toml
│ │ ├── __init__.py
│ │ ├── src
│ │ ├── tests
│ ├── ht_search
│ ├── pyproject.toml
│ ├── Dockerfile
│ ├── src
│ ├── ht_search
│ ├── solr_dataset
│ ├── indexing_data.sh
│ │ ├── tests
├── app
│ ├── ht_indexer
│ ├── pyproject.toml
│ ├── Dockerfile
│ ├── Makefile
│ ├── src
│ ├── ht_indexer_monitoring
│ ├── ht_indexer_tracktable.py
│ ├── tests
│ ├── ht_searcher
│ ├── pyproject.toml
│ ├── Dockerfile
│ ├── tests
│ ├── src
│ ├── ht_searcher
To install or update the dependencies of a project in the monorepo, run the following Poetry commands in that project's directory:
cd app/ht_indexer
poetry install
poetry run pytest
poetry update
- Follow these steps to run Ruff and Mypy to check for code style, linting, and typing issues:
On the monorepo root directory, run the following commands:
    `poetry env use python` # Selects the interpreter; Poetry prints the virtual environment directory it uses
    `source ~/index_search_monorepo-TUsF9qpC-py3.11/bin/activate` # Activate the virtual environment on a Mac
    `ruff check . --fix` # To check and fix the code style and linting issues
    `mypy .` # To check the type hints and static typing issues
- Use the command `. $env_name/bin/activate` to activate the virtual environment inside the container, where $env_name is the name of the virtual environment created by poetry.
- Enter the Docker container:
docker compose exec full_text_searcher /bin/bash
- Run the scripts:
docker compose exec full_text_searcher python ht_full_text_search/export_all_results.py --env dev --query '"good"'
Recommendation: Use brew to install python and pyenv to manage the python versions.
- Install python
  - You can read this blog to install python the right way on a Mac: https://opensource.com/article/19/5/python-3-default-mac
- Install poetry:
  - **Good blog to understand and use poetry**: https://blog.networktocode.com/post/upgrade-your-python-project-with-poetry/
  - Poetry docs: https://python-poetry.org/docs/dependency-specification/
  - **How to manage Python projects with Poetry**: https://www.infoworld.com/article/3527850/how-to-manage-python-projects-with-poetry.html
- Useful poetry commands (Find more information about commands here)
  - Inside the application folder: See the virtual environment used by the application
  poetry env use python
  - Activate the virtual environment:
  source ~/ht-indexer-GQmvgxw4-py3.11/bin/activate
  On a Mac, poetry creates its files under the home directory, e.g. /Users/user_name/Library/Caches/pypoetry/.
To upgrade the python version, you can use the following steps:
Local environment:
# Check the current python version
python --version
# Upgrade python to the latest version
brew install python@3.12
# Check the python version again
python --version
Update the Poetry project's Python version. You must update the pyproject.toml file of all the applications.
[tool.poetry.dependencies]
python = "^3.12"
Then, go into the folder of each application and run the following command to update the dependencies:
poetry env use python3.12
# Activate the virtual environment
source ~/index_search_monorepo-TUsF9qpC-py3.12/bin/activate
poetry lock
poetry install
Run tests to ensure everything works as expected:
python -m pytest
Docker environment:
Update the Dockerfile to use the new python version.
FROM python:3.12-slim-buster
Update the Poetry project's Python version. You must update the pyproject.toml file of all the applications.
[tool.poetry.dependencies]
python = "^3.12"
Then, go into the folder of each application and run the following command to update the dependencies:
poetry env use python3.12
# Activate the virtual environment
source ~/index_search_monorepo-TUsF9qpC-py3.12/bin/activate
poetry lock
poetry install
Run the make build command to build the Docker image with the new python version.
To run the tests, you can use the following command in each project directory:
make test
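To run the tests of several projects in one go, the per-directory command above can be wrapped in a small loop. This is a minimal sketch under assumptions: the helper names are hypothetical, it assumes each project to be tested has its own Makefile with a `test` target directly under `app`, and it simply skips projects without a Makefile.

```python
import subprocess
from pathlib import Path


def projects_with_makefile(app_dir: str) -> list[Path]:
    """Return the project directories under app_dir that contain a Makefile."""
    return sorted(makefile.parent for makefile in Path(app_dir).glob("*/Makefile"))


def run_all_tests(app_dir: str = "app") -> dict[str, bool]:
    """Run `make test` in every project and record whether it succeeded."""
    results = {}
    for project in projects_with_makefile(app_dir):
        completed = subprocess.run(["make", "test"], cwd=project)
        results[project.name] = completed.returncode == 0
    return results
```

`run_all_tests()` is meant to be called from the monorepo root; a project maps to `False` when its `make test` exits with a non-zero status.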