This project is a part of the topic "A probabilistic approach towards handling data quality problems and imperfect data integration tasks" of the Data Science track.
This repository implements DuoSQL, a high-level query language and translation system for DuBio, along with an experiment pipeline for evaluation.
- main.py: Core implementation of the DuoSQL-to-DuBio SQL compiler algorithm.
- test_duosql_postgresql.py: Provides a method to test DuoSQL queries by translating them and sending them to PostgreSQL for evaluation. The retrieved result is displayed.
- high_level_tests.py:
  - Contains the high-level DuoSQL queries used during the experiments.
  - Contains translation tests, from high-level DuoSQL code to automatically generated DuBio SQL.
- experiment_runner.py: Generates Excel files for each experiment's test type and selected queries. There are 4 criteria: code lines, characters, level of complexity, and probabilistic constructs.
- experiment_visualizer.ipynb: Generates visualizations (bar charts and summary tables) based on the Excel files.
- experiment_results/: Contains output Excel files from the experiment_runner and subfolders with generated diagrams for each experiment from the experiment_visualizer.
- sql/:
  - Contains the DuBio SQL code for BDD-based aggregation functions.
  - Includes table definitions and sample data for the experiments.
- translations/: Holds manually and automatically generated DuBio SQL queries for the experiments.
- performance_evaluation/:
  - automatic_queries.py: Contains the Automatic queries with a modifiable LIMIT clause used in the benchmark.
  - manual_queries.py: Contains the Manual queries with a modifiable LIMIT clause used in the benchmark.
  - data_generator.py: Generates data and the SQL code to insert it, and saves it to sql/performance_insert.sql. It does not send anything to PostgreSQL; the generated SQL must be executed manually and with caution.
  - performance_benchmark.ipynb: Contains the benchmark code that sends queries to PostgreSQL and measures and records their execution times.
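The benchmark idea described for performance_evaluation/ (run a query at several LIMIT values and record how long each run takes) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `QUERY_TEMPLATE`, `time_query`, `benchmark`, and the `run_query` callable are hypothetical names, and the real queries in automatic_queries.py and manual_queries.py are not reproduced here.

```python
import statistics
import time
from typing import Callable, Iterable

# Hypothetical query template with a modifiable LIMIT clause (assumption:
# the real benchmark queries are more complex DuBio SQL).
QUERY_TEMPLATE = "SELECT * FROM some_table LIMIT {limit}"


def time_query(run_query: Callable[[str], object], sql: str, repeats: int = 3) -> float:
    """Return the median wall-clock time in seconds over `repeats` executions."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)  # e.g. a function that executes the SQL on PostgreSQL
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def benchmark(
    run_query: Callable[[str], object],
    limits: Iterable[int],
    repeats: int = 3,
) -> dict[int, float]:
    """Map each LIMIT value to the median execution time of the query."""
    return {
        limit: time_query(run_query, QUERY_TEMPLATE.format(limit=limit), repeats)
        for limit in limits
    }


if __name__ == "__main__":
    # Stand-in executor so the sketch runs without a database connection.
    executed = []
    for limit, seconds in benchmark(executed.append, [10, 100, 1000]).items():
        print(f"LIMIT {limit}: {seconds:.6f} s")
```

Passing the executor as a callable keeps the timing logic independent of the database driver; taking the median over a few repeats is one simple way to dampen outliers such as a cold cache on the first run.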
- Python 3.10+
- A terminal or command prompt
Before running any part of the application, install the required packages.
python -m venv .venv # Create virtual environment
# Activate the venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt # Install Python dependencies
- Copy .env.example to .env in the main project directory.
- Fill in the database credentials in place of the placeholders:
USER=<POSTGRES_USER>
PASSWORD=<POSTGRES_PASSWORD>
HOST=<POSTGRES_HOST>
DATABASE=<POSTGRES_DB>
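To illustrate how the four values in .env might be consumed, here is a minimal standard-library sketch that parses KEY=VALUE lines and assembles a PostgreSQL connection URI. How the project itself reads .env is not shown in this README, so `read_env_file` and `build_dsn` are hypothetical helpers, and the sketch assumes one KEY=VALUE pair per line.

```python
from pathlib import Path
from urllib.parse import quote


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_dsn(cfg: dict[str, str]) -> str:
    """Build a postgresql:// URI from the USER, PASSWORD, HOST, DATABASE keys."""
    user = quote(cfg["USER"], safe="")
    password = quote(cfg["PASSWORD"], safe="")  # percent-encode special characters
    database = quote(cfg["DATABASE"], safe="")
    return f"postgresql://{user}:{password}@{cfg['HOST']}/{database}"


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        config = read_env_file(str(env_path))
        # Print only non-secret fields to avoid leaking the password.
        print(f"Connecting as {config['USER']} to {config['HOST']}/{config['DATABASE']}")
```

Reading the file directly, rather than relying on process environment variables, sidesteps a possible clash: on macOS and Linux the shell usually already defines USER as the login name, which may differ from the PostgreSQL user.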