Conversation
Why these changes are being introduced: * A LibGuides class is need to omit the contributor field which contained irrelevant names How this addresses that need: * Add LibGuides subclass of SpringshareOaiDc class * Update LibGuides.get_contributors method to return None * Remove LibGuides references in SpringshareOaiDC class * Shift LibGuides tests to new unit test module * Remove ignore from Makefile safety command Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-239
* Replace libguides fixtures with cool-repo to prevent future issues if LibGuides class is modified * Add transform class field to _test_config fixture * Update fixtures calls in unit tests to new name
Why these changes are being introduced: * Certain LibGuides need to be excluded based on a list. How this addresses that need: * Add boto3 and boto3-stubs to Pipfile * Add mock_s3 fixture * Add exclusion_list_path click option * Update Config.libguides to call LibGuides class * Add load_exclusion_list function to helpers.py and corresponding unit test * Add exclusion_list_path and exclusion_list attributes to Transformer class and update load method signature * Add record_is_excluded method to Transformer class * Update Transformer.transform method with record_is_excluded check and associated exception handling and logging * Add LibGuides.record_is_excluded method Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-238
jonavellecuerdo
left a comment
There was a problem hiding this comment.
@ehanson8 This is looking great! I have one recommended change and will wait for your update related to this comment.
ghukill
left a comment
There was a problem hiding this comment.
Overall, looking great. As noted in a comment, I think it was really smart to keep the exclusion logic focused to a single CSV, single column, no headers, etc.; put the complexity in the transformer class to interpret that data as needed.
Requesting changes for small stuff, mostly just implementation details of the approach which I overall definitely support.
transmogrifier/helpers.py
Outdated
| """ | ||
| Load a CSV file from path (S3 or local filesystem) and return values as a list. | ||
|
|
||
| CSV file has no headers and contains identifiers to exclude, one per line. | ||
|
|
||
| Args: | ||
| exclusion_list_path: Path to exclusion list file (s3://bucket/key or local path). | ||
| """ |
There was a problem hiding this comment.
I'd like to register my hesistance to only supporting a simple, flat, single column of data for the exclusion.... with the important qualifier that I do think it's the simplest and most straight-forward way to support the current need of:
- skipping
libguidesby either URL or identifier - skipping
mitlibwebsitevia URL
If we want to explore more complex skipping behavior in the future, can always revisit, but this is a smart and precise way to support the need now.
There was a problem hiding this comment.
Noted, and we can deal with a more complicated exclusion scenario whenever it presents itself
* Split mock_s3 fixture into a base fixture and a more targeted fixture and update calls in unit tests * Shift load_exclusion_list to Transformer method as well as corresponding unit test * Update type hinting for exclusion_list_path * Add exclusion_list property to handle logic for list loading * Update LibGuides.record_is_deleted method to handle updated Transformer.exclusion_list property * Add explicit return None to LibGuides.get_contributors method
ghukill
left a comment
There was a problem hiding this comment.
Nice work on this! Looks like a great solution to our immediate problems.
Approving in advance, but we may want an additional commit that updates the transform command in the README: https://github.com/MITLibraries/transmogrifier?tab=readme-ov-file#transform. We're probably only 50/50 on updating this in CLIs, and open to pushback if this format is helpful or not, but it's what we have here.
Full approval though.
| @pytest.fixture(scope="session") | ||
| def mock_s3(): | ||
| with mock_aws(): | ||
| s3 = boto3.client("s3", region_name="us-east-1") | ||
| s3.create_bucket(Bucket="test-bucket") | ||
| yield s3 | ||
|
|
||
|
|
||
| @pytest.fixture(scope="session") | ||
| def mock_s3_exclusion_list(mock_s3): | ||
| mock_s3.put_object( | ||
| Bucket="test-bucket", | ||
| Key="libguides/config/libguides-exclusions.csv", | ||
| Body="https://libguides.mit.edu/excluded1\n" | ||
| "https://libguides.mit.edu/excluded2\n", | ||
| ) |
There was a problem hiding this comment.
:chef's kiss:
Thanks for going this route! Perhaps we won't need the extensibility this provides, but feels like a good pattern.
Honestly, this all reminds me how easy mocking S3 can be... and realizing there are other instances we could (should) have mocked S3 (ignoring this pattern piece even).
| @property | ||
| def exclusion_list(self) -> list[str] | None: | ||
| if self.exclusion_list_path and self._exclusion_list is None: | ||
| self._exclusion_list = self.load_exclusion_list() | ||
| return self._exclusion_list | ||
|
|
||
| def load_exclusion_list(self) -> list[str]: | ||
| """ | ||
| Load a CSV file from path (S3 or local filesystem) and return values as a list. | ||
|
|
||
| CSV file has no headers and contains identifiers to exclude, one per line. | ||
|
|
||
| Args: | ||
| exclusion_list_path: Path to exclusion list file (s3://bucket/key or local | ||
| path). | ||
| """ | ||
| with smart_open.open(self.exclusion_list_path, "r") as exclusion_list: | ||
| rows = exclusion_list.readlines() | ||
| exclusion_list = [row.strip() for row in rows if row.strip()] | ||
| logger.info( | ||
| f"Loaded exclusion list from {self.exclusion_list_path} with " | ||
| f"{len(exclusion_list)} entries" | ||
| ) | ||
| logger.debug(exclusion_list) | ||
| return exclusion_list |
* Add exclusion-list-path to transform's option list
Purpose and background context
This PR addresses both USE-238 and USE-239. A follow-up ticket will be created to update timdex-pipeline-lambdas to include the
exclusion-list-pathCLI option for the appropriate sources.How can a reviewer manually see the effects of these changes?
Dev1credentialsNote that no records were skipped:
Note that a record has been skipped based on the exclusion list (which only contained 1 item):
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES
What are the relevant tickets?
Code review