Changes from all commits
27 commits
9f5273b
#1558 WIP: csv readers
alexfurmenkov Feb 25, 2026
c791d7c
#1558 csv metadata reader and tables filtering
alexfurmenkov Feb 26, 2026
1490237
#1558 example files
alexfurmenkov Feb 26, 2026
83d2704
#1558 moved csv metadata reader
alexfurmenkov Feb 26, 2026
4cbb541
Merge branch 'main' into 1558-csv-data-reader
alexfurmenkov Feb 26, 2026
8098eb4
#1558 unit tests for dataset filtering
alexfurmenkov Feb 27, 2026
514b147
#1558 unit tests for dataset readers
alexfurmenkov Feb 27, 2026
155a236
#1558 regression and changes in metadata reader logic to preserve data
alexfurmenkov Feb 27, 2026
b3e8974
Merge branch 'main' into 1558-csv-data-reader
alexfurmenkov Mar 2, 2026
c351222
#1558 added envvar options for [-r, -er, -lr, -ss, -v, -s, -l, -dxp, …
alexfurmenkov Mar 5, 2026
21b0d3e
#1558 added envvar option for -ct and -dv parameters.
alexfurmenkov Mar 5, 2026
695e1da
Merge branch 'main' into 1558-csv-data-reader
RamilCDISC Mar 5, 2026
0fc3bb1
Merge branch 'main' into 1558-csv-data-reader
RamilCDISC Mar 5, 2026
d38fe58
#1558 error handling while reading CSV
alexfurmenkov Mar 6, 2026
43a083f
#1654 InvalidCSVFormat renamed to InvalidCSVFile
alexfurmenkov Mar 11, 2026
eb1cca9
#1558 added dotenv load from dataset path or path to datasets
alexfurmenkov Mar 14, 2026
cbca55e
Merge branch 'main' into 1558-csv-data-reader
alexfurmenkov Mar 14, 2026
6dbad2e
#1558 returned -s and -v as required
alexfurmenkov Mar 16, 2026
3ce61b5
#1558 PR fixes
alexfurmenkov Mar 20, 2026
b2ee46a
#1558 fixed csv tests
alexfurmenkov Mar 23, 2026
51b7fef
Merge branch 'refs/heads/main' into 1558-csv-data-reader
alexfurmenkov Mar 23, 2026
33c2bc4
#1558 -dp csv handling improved - errors on multiple tables.csv, fixe…
alexfurmenkov Mar 23, 2026
f3567fa
#1558 added cli arguments for tables.csv and variables.csv, .env paths.
alexfurmenkov Mar 24, 2026
14afc1b
#1558 added kwargs to dataset metadata readers. fixed README.md
alexfurmenkov Mar 24, 2026
2126df6
Merge branch 'refs/heads/main' into 1558-csv-data-reader
alexfurmenkov Mar 24, 2026
4876aef
#1558 added positional arguments to test data
alexfurmenkov Mar 24, 2026
d863b8a
Merge branch 'refs/heads/main' into 1558-csv-data-reader
alexfurmenkov Mar 25, 2026
42 changes: 27 additions & 15 deletions README.md
@@ -137,20 +137,23 @@ This will show the list of validation options.
```
-ca, --cache TEXT Relative path to cache files containing pre
loaded metadata and rules
-ps, --pool-size INTEGER Number of parallel processes for validation
-d, --data TEXT Path to directory containing data files
-ps, --pool-size INTEGER Number of parallel processes for validation
-dep, --dotenv-path Path to the .env file used to set environment variables.
-d, --data TEXT Path to directory containing data files.
DATA_DIR environment variable can be used to pass value.
-dp, --dataset-path TEXT Absolute path to dataset file. Can be specified multiple times.
-dxp, --define-xml-path TEXT Path to Define-XML
DATASET_PATH environment variable can be used to pass values separated by ':' on Unix and ';' for Windows.
-dxp, --define-xml-path TEXT Path to Define-XML. DEFINE environment variable can be used to pass value.
-l, --log-level [info|debug|error|critical|disabled|warn]
Sets log level for engine logs, logs are
disabled by default
-rt, --report-template TEXT File path of report template to use for
excel output
-s, --standard TEXT CDISC standard to validate against
-s, --standard TEXT CDISC standard to validate against. STANDARD environment variable can be used to pass value.
[required]
-v, --version TEXT Standard version to validate against
-v, --version TEXT Standard version to validate against. VERSION environment variable can be used to pass value.
[required]
-ss, --substandard TEXT Substandard to validate against
-ss, --substandard TEXT Substandard to validate against. SUBSTANDARD environment variable can be used to pass value.
"SDTM", "SEND", "ADaM", or "CDASH"
[required for TIG]
-uc, --use-case TEXT Use Case for TIG Validation
@@ -161,7 +164,8 @@ This will show the list of validation options.
against, can provide more than one
NOTE: if a defineXML is provided, if it is version 2.1
engine will use the CT laid out in the define. If it is
version 2.0, -ct is expected to specify the CT package
version 2.0, -ct is expected to specify the CT package.
CONTROLLED_TERMINOLOGY_PACKAGE environment variable can be used to pass values separated by ':' on Unix and ';' for Windows.
-o, --output TEXT Report output file destination and name. Path will be
relative to the validation execution directory
and should end in the desired output filename
@@ -204,27 +208,30 @@ This will show the list of validation options.
if both .env and -me <limit> are specified, the larger value will be used. If either sets the per_dataset_flag to true, it will be true.
If limit is set to 0, no maximum will be enforced.
No maximum is the default behavior.
-dv, --define-version TEXT Define-XML version used for validation
-dv, --define-version TEXT Define-XML version used for validation. DEFINE_VERSION environment variable can be used to pass value.
-dxp, --define-xml-path Path to define-xml file.
-vx, --validate-xml Enable XML validation (default 'y' to enable, otherwise disable).
--whodrug TEXT Path to directory with WHODrug dictionary
files
--meddra TEXT Path to directory with MedDRA dictionary
files
--loinc TEXT Path to directory with LOINC dictionary
--loinc TEXT Path to directory with LOINC dictionary
files
--medrt TEXT Path to directory with MEDRT dictionary
--medrt TEXT Path to directory with MEDRT dictionary
files
--unii TEXT Path to directory with UNII dictionary
--unii TEXT Path to directory with UNII dictionary
files
--snomed-version TEXT Version of snomed to use. (ex. 2024-09-01)
--snomed-url TEXT Base url of snomed api to use. (ex. https://snowstorm.snomedtools.org/snowstorm/snomed-ct)
--snomed-edition TEXT Edition of snomed to use. (ex. SNOMEDCT-US)
--snomed-version TEXT Version of snomed to use. (ex. 2024-09-01)
--snomed-url TEXT Base url of snomed api to use. (ex. https://snowstorm.snomedtools.org/snowstorm/snomed-ct)
--snomed-edition TEXT Edition of snomed to use. (ex. SNOMEDCT-US)
-r, --rules TEXT Specify rule core ID ex. CORE-000001. Can be specified multiple times.
RULES environment variable can be used to pass values separated by ':' on Unix and ';' for Windows.
-er, --exclude-rules TEXT Specify rule core ID to exclude, ex. CORE-000001. Can be specified multiple times.
EXCLUDE_RULES environment variable can be used to pass values separated by ':' on Unix and ';' for Windows.
-lr, --local-rules TEXT Specify relative path to directory or file containing
local rule yml and/or json rule files.
-cs, --custom-standard Adding this flag tells engine to use a custom standard specified with -s and -v
LOCAL_RULES environment variable can be used to pass values separated by ':' on Unix and ';' for Windows.
-cs, --custom-standard Adding this flag tells engine to use a custom standard specified with -s and -v
that has been uploaded to the cache using update-cache
-cse, --custom-standard-encoding TEXT
Explicitly specify the file encoding to use
@@ -243,6 +250,11 @@ This will show the list of validation options.
-jcf, --jsonata-custom-functions Pair containing a variable name and a Path to directory containing a set of custom JSONata functions. Can be specified multiple times
-e, --encoding TEXT File encoding for reading datasets. If not specified, defaults to utf-8. Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc.
-ft, --filetype TEXT File extension to filter datasets. Has higher priority than --dataset-path parameter.
-vcp, --variables-csv-path Path to variables.csv. Used when multiple dataset paths are provided and refer to different folders.
Not required if variables.txt exists in all -dp directories.
VARIABLES_CSV environment variable can be used to pass value.
-tcp, --tables-csv-path Path to tables.csv. Required when multiple dataset paths are provided and refer to different folders.
                                  TABLES_CSV environment variable can be used to pass value.
--help Show this message and exit.
```
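The environment-variable fallbacks listed in the help text above can be collected in a single .env file and passed with the -dep flag. A minimal sketch — the file name and all values here are hypothetical; only the variable names and separator rules come from the help text:

```
# Hypothetical .env passed via the -dep flag.
# Multi-value variables are separated by ':' on Unix and ';' on Windows.
STANDARD=sdtmig
VERSION=3-4
DATA_DIR=/data/study01/datasets
RULES=CORE-000001:CORE-000002
EXCLUDE_RULES=CORE-000100
CONTROLLED_TERMINOLOGY_PACKAGE=sdtmct-2023-12-15
```

Where a CLI flag and its environment variable are both set, the flag takes effect as documented per option above.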

1 change: 1 addition & 0 deletions cdisc_rules_engine/enums/dataformat_types.py
@@ -8,3 +8,4 @@ class DataFormatTypes(BaseEnum):
    USDM = "USDM"
    XLSX = "XLSX"
    XPT = "XPT"
    CSV = "CSV"
5 changes: 5 additions & 0 deletions cdisc_rules_engine/exceptions/custom_exceptions.py
@@ -82,6 +82,11 @@ class CTPackageNotFoundError(EngineError):
    description = "Controlled terminology package(s) not found"


class InvalidCSVFile(EngineError):
    code = 400
    description = "CSV data is malformed."


class NumberOfAttemptsExceeded(EngineError):
    pass

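The new InvalidCSVFile error is used to surface decode failures to the user with a hint about the -e flag. A minimal, self-contained sketch of that pattern — the class bodies and the helper function are simplified stand-ins, not the engine's actual API:

```python
class EngineError(Exception):
    code = 500


class InvalidCSVFile(EngineError):
    code = 400
    description = "CSV data is malformed."


def decode_csv_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    # Re-raise low-level decode failures as the engine's domain error,
    # preserving the hint about the -e CLI flag.
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, UnicodeError) as e:
        raise InvalidCSVFile(
            f"Failed to decode with {encoding} encoding: {e}. "
            "Please specify the correct encoding using the -e flag."
        )
```

Callers can then catch one exception type regardless of which low-level decoding error occurred.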
8 changes: 6 additions & 2 deletions cdisc_rules_engine/interfaces/data_reader_interface.py
@@ -11,8 +11,8 @@ def __init__(
        self, dataset_implementation=PandasDataset, encoding: str = DEFAULT_ENCODING
    ):
        """
        :param dataset_implementation DatasetInterface: The dataset type to return.
        :param encoding str: The encoding to use when reading files. Defaults to DEFAULT_ENCODING (e.g. utf-8).
        :param DatasetInterface dataset_implementation: The dataset type to return.
        :param str encoding: The encoding to use when reading files. Defaults to DEFAULT_ENCODING (e.g. utf-8).
        """
        self.dataset_implementation = dataset_implementation
        self.encoding = encoding
@@ -26,3 +26,7 @@ def read(self, data):

    def from_file(self, file_path):
        raise NotImplementedError

    def to_parquet(self, file_path) -> tuple[int, str]:
        """Returns number of rows and path to the parquet file"""
        raise NotImplementedError
2 changes: 2 additions & 0 deletions cdisc_rules_engine/models/validation_args.py
@@ -29,5 +29,7 @@
        "max_report_rows",
        "max_errors_per_rule",
        "encoding",
        "variables_csv_path",
        "tables_csv_path",
    ],
)
183 changes: 183 additions & 0 deletions cdisc_rules_engine/services/csv_metadata_reader.py
@@ -0,0 +1,183 @@
import logging
from datetime import datetime
from pathlib import Path
from typing import Optional

import pandas as pd

from cdisc_rules_engine.constants import DEFAULT_ENCODING


class DatasetCSVMetadataReader:
    def __init__(
        self,
        file_path: str,
        file_name: str,
        encoding: str = DEFAULT_ENCODING,
        variables_csv_path: Optional[str] = None,
        tables_csv_path: Optional[str] = None,
        **kwargs,
    ):
        self.file_path = file_path
        self.file_name = file_name
        self.encoding = encoding
        self.variables_csv_path = (
            Path(variables_csv_path)
            if variables_csv_path
            else Path(self.file_path).parent / "variables.csv"
        )
        self.tables_csv_path = (
            Path(tables_csv_path)
            if tables_csv_path
            else Path(self.file_path).parent / "tables.csv"
        )

    def read(self) -> dict:
        dataset_name = Path(self.file_name).stem.lower()

        if not self.variables_csv_path.exists():
            logger = logging.getLogger("validator")
            logger.info("No variables file found for %s", dataset_name)
            variables_meta = {}
        else:
            variables_meta = self.__get_variable_metadata(
                dataset_name, self.variables_csv_path
            )

        metadata = {
            "dataset_name": dataset_name.upper(),
            "dataset_modification_date": datetime.fromtimestamp(
                Path(self.file_path).stat().st_mtime
            ).isoformat(),
            "adam_info": {
                "categorization_scheme": {},
                "w_indexes": {},
                "period": {},
                "selection_algorithm": {},
            },
        }
        metadata.update(variables_meta)
        metadata.update(self.__data_meta())
        metadata.update(self.__dataset_label())
        return metadata

    def __get_variable_metadata(
        self, dataset_name: str, variables_file_path: Path
    ) -> dict:
        logger = logging.getLogger("validator")
        try:
            meta_df = pd.read_csv(variables_file_path, encoding=self.encoding)
        except (UnicodeDecodeError, UnicodeError) as e:
            logger.error(
                f"Could not decode CSV file {variables_file_path} with {self.encoding} encoding: {e}. "
                f"Please specify the correct encoding using the -e flag."
            )
            return {}
        except Exception as e:
            logger.error("Error reading CSV file %s. %s", variables_file_path, e)
            return {}

        meta_df["dataset"] = meta_df["dataset"].apply(
            lambda x: Path(str(x)).stem.lower()
        )

        dataset_meta_df = meta_df[meta_df["dataset"] == dataset_name]

        if dataset_meta_df.empty:
            logger.info("No dataset metadata found for %s", dataset_name)
            return {}

        variable_names = dataset_meta_df["variable"].tolist()
        variable_labels = dataset_meta_df["label"].tolist()

        variable_name_to_label_map = dict(zip(variable_names, variable_labels))
        variable_name_to_data_type_map = dict(
            zip(variable_names, dataset_meta_df["type"])
        )
        variable_name_to_size_map = {
            var: (int(length) if pd.notna(length) else None)
            for var, length in zip(variable_names, dataset_meta_df["length"])
        }
        return {
            "variable_names": variable_names,
            "variable_labels": variable_labels,
            "variable_formats": [""] * len(variable_names),
            "variable_name_to_label_map": variable_name_to_label_map,
            "variable_name_to_data_type_map": variable_name_to_data_type_map,
            "variable_name_to_size_map": variable_name_to_size_map,
            "number_of_variables": len(variable_names),
        }

    def __dataset_label(self) -> dict:
        logger = logging.getLogger("validator")

        if not self.tables_csv_path.exists():
            return {}

        try:
            tables_df = pd.read_csv(self.tables_csv_path, encoding=self.encoding)
        except (UnicodeDecodeError, UnicodeError) as e:
            logger.error(
                f"\n Error reading CSV from: {self.tables_csv_path}"
                f"\n Failed to decode with {self.encoding} encoding: {e}"
                f"\n Please specify the correct encoding using the -e flag."
            )
            return {}
        except Exception as e:
            logger.error("Error reading CSV file %s. %s", self.tables_csv_path, e)
            return {}

        if "Filename" not in tables_df.columns or "Label" not in tables_df.columns:
            return {}

        tables_df["dataset"] = tables_df["Filename"].apply(
            lambda x: Path(str(x)).stem.lower()
        )

        current_dataset = Path(self.file_name).stem.lower()
        match = tables_df[tables_df["dataset"] == current_dataset]

        if match.empty:
            return {}

        return {"dataset_label": str(match.iloc[0]["Label"])}

    def __data_meta(self):
        logger = logging.getLogger("validator")
        result = {
            "dataset_length": 0,
            "first_record": {},
        }
        try:
            first_row_df = pd.read_csv(self.file_path, encoding=self.encoding, nrows=1)
        except (UnicodeDecodeError, UnicodeError) as e:
            logger.error(
                f"\n Error reading CSV from: {self.file_path}"
                f"\n Failed to decode with {self.encoding} encoding: {e}"
                f"\n Please specify the correct encoding using the -e flag."
            )
            return result
        except Exception as e:
            logger.error("Error reading CSV file %s. %s", self.file_path, e)
            return result

        if not first_row_df.empty:
            result["first_record"] = (
                first_row_df.iloc[0].fillna("").astype(str).to_dict()
            )

        try:
            with open(self.file_path, encoding=self.encoding) as f:
                result["dataset_length"] = max(
                    sum(1 for _ in f) - 1, 0
                )  # subtract header
        except (UnicodeDecodeError, UnicodeError) as e:
            logger.error(
                f"\n Error reading CSV from: {self.file_path}"
                f"\n Failed to decode with {self.encoding} encoding: {e}"
                f"\n Please specify the correct encoding using the -e flag."
            )
        except Exception as e:
            logger.error("Error reading CSV file %s. %s", self.file_path, e)

        return result
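The dataset_length computation in __data_meta above (physical line count minus one header row, floored at zero) and the first_record extraction can both be sketched with the standard library alone; the function names here are illustrative, not part of the engine:

```python
import csv
import io


def count_data_rows(csv_text: str) -> int:
    # Line count minus the header line, floored at zero so an
    # empty file never produces a negative length.
    lines = sum(1 for _ in io.StringIO(csv_text))
    return max(lines - 1, 0)


def first_record(csv_text: str) -> dict:
    # First data row as a column-to-string mapping; an empty dict
    # when the file has a header but no data rows.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        return {k: ("" if v is None else str(v)) for k, v in row.items()}
    return {}
```

The flooring matters: a zero-byte file has no header line at all, so without `max(..., 0)` the count would come out as -1.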
54 changes: 54 additions & 0 deletions cdisc_rules_engine/services/data_readers/csv_reader.py
@@ -0,0 +1,54 @@
import tempfile

import pandas as pd

from cdisc_rules_engine.exceptions.custom_exceptions import InvalidCSVFile
from cdisc_rules_engine.interfaces import DataReaderInterface


class CSVReader(DataReaderInterface):
    def read(self, data):
        """
        Function for reading data from a specific file type and returning a
        pandas dataframe of the data.
        """
        raise NotImplementedError

    def from_file(self, file_path):
        try:
            with open(file_path, "r", encoding=self.encoding) as fp:
                data = pd.read_csv(fp, sep=",", header=0, index_col=False)
                return data
        except (UnicodeDecodeError, UnicodeError) as e:
            raise InvalidCSVFile(
                f"\n Error reading CSV from: {file_path}"
                f"\n Failed to decode with {self.encoding} encoding: {e}"
                f"\n Please specify the correct encoding using the -e flag."
            )
        except Exception as e:
            raise InvalidCSVFile(
                f"\n Error reading CSV from: {file_path}"
                f"\n {type(e).__name__}: {e}"
            )

    def to_parquet(self, file_path: str) -> tuple[int, str]:
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".parquet")

        dataset = pd.read_csv(file_path, chunksize=20000, encoding=self.encoding)

        created = False
        num_rows = 0

        for chunk in dataset:
            num_rows += len(chunk)

            if not created:
                chunk.to_parquet(temp_file.name, engine="fastparquet")
                created = True
            else:
                chunk.to_parquet(temp_file.name, engine="fastparquet", append=True)

        if not created:
            empty_df = pd.read_csv(file_path, nrows=0, encoding=self.encoding)
            empty_df.to_parquet(temp_file.name, engine="fastparquet")

        return num_rows, temp_file.name
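to_parquet above streams the CSV in 20,000-row chunks so large files never sit in memory at once, and falls back to writing a header-only output when the CSV has no data rows. The same shape, sketched with the standard library standing in for pandas/fastparquet (all names here are hypothetical):

```python
import csv
import io
import itertools
import os
import tempfile


def convert_in_chunks(csv_text: str, chunk_size: int = 2) -> tuple[int, str]:
    """Returns (row count, output path). The output file is always
    created, even for a header-only CSV, so callers never receive
    a path to a nonexistent file."""
    out = tempfile.NamedTemporaryFile(
        mode="w", delete=False, suffix=".txt", newline=""
    )
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    writer = csv.writer(out)
    writer.writerow(header)

    num_rows = 0
    while True:
        # Pull at most chunk_size rows at a time, mirroring chunksize=20000.
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            break
        num_rows += len(chunk)
        writer.writerows(chunk)

    out.close()
    return num_rows, out.name
```

Writing the header up front is what makes the empty-input case safe: there is no "first chunk" branch that can be skipped entirely.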
Collaborator:

> This function always returns a file path, but the file is only written when the CSV is not empty. An empty CSV would return a path to a file that was never created, which may cause downstream errors.

Collaborator (author):

> Added empty parquet file creation for the case when the CSV is empty.
> Or should we raise a ValueError in this scenario? Should I also fix it in the XPT reader?
@@ -4,6 +4,7 @@
    DataReaderInterface,
    FactoryInterface,
)
from cdisc_rules_engine.services.data_readers.csv_reader import CSVReader
from cdisc_rules_engine.services.data_readers.xpt_reader import XPTReader
from cdisc_rules_engine.services.data_readers.dataset_json_reader import (
    DatasetJSONReader,
@@ -19,12 +20,13 @@


class DataReaderFactory(FactoryInterface):
    _reader_map = {
    _reader_map: dict[str, Type[DataReaderInterface]] = {
        DataFormatTypes.XPT.value: XPTReader,
        DataFormatTypes.PARQUET.value: ParquetReader,
        DataFormatTypes.JSON.value: DatasetJSONReader,
        DataFormatTypes.NDJSON.value: DatasetNDJSONReader,
        DataFormatTypes.USDM.value: JSONReader,
        DataFormatTypes.CSV.value: CSVReader,
    }

    def __init__(
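DataReaderFactory keys _reader_map by the DataFormatTypes value, so registering CSVReader is a one-line change. A reduced sketch of that registry-dispatch pattern, with stand-in classes rather than the engine's readers:

```python
# Stand-in reader classes; the engine's readers share DataReaderInterface.
class XPTReader: ...
class CSVReader: ...


_reader_map: dict[str, type] = {
    "XPT": XPTReader,
    "CSV": CSVReader,
}


def get_reader(file_path: str):
    # Dispatch on the upper-cased file extension; unknown formats
    # fail loudly instead of silently returning nothing.
    ext = file_path.rsplit(".", 1)[-1].upper()
    try:
        return _reader_map[ext]()
    except KeyError:
        raise ValueError(f"No reader registered for {ext!r} files")
```

Keeping the mapping as class data (as the factory does) means new formats like CSV only touch the dict, not the dispatch logic.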