Add Deltalake query support #1023

tsmathis · 2025-10-23T15:42:27Z

should just work™️

tsmathis · 2025-10-23T15:56:25Z

some notes:

1. depends on Compatibility with emmet-core 0.86.0rc1 #1021
2. ~~could supersede exclude gnome for full downloads if needed #974~~

this will need to be addressed for the progress bar to be accurate in the case where has_gnome_access=False:

Lines 573 to 579 in aee0f8c

    
    # TODO: Update tasks (+ others?) resource to have emmet-api BatchIdQuery operator
    #   -> need to modify BatchIdQuery operator to handle root level
    #      batch_id, not only builder_meta.batch_id
    # if not has_gnome_access:
    #     num_docs_needed = self.count(
    #         {"batch_id_neq_any": SETTINGS.ACCESS_CONTROLLED_BATCH_IDS}
    #     )

the count can be retrieved from s3, but the COUNT(*) ... WHERE NOT IN ... is slow

wasn't sure how to emit messages to the user, warnings might not be the best choice:

api/mp_api/client/core/client.py

Lines 532 to 535 in aee0f8c

    
    warnings.warn(
        f"Dataset for {suffix} already exists at {target_path}, delete or move existing dataset "
        "or re-run search query with MPRester(force_renew=True)",
        MPLocalDatasetWarning,

api/mp_api/client/core/client.py

Lines 631 to 636 in aee0f8c

    
    warnings.warn(
        f"Dataset for {suffix} written to {target_path}. It is recommended to optimize "
        "the table according to your usage patterns prior to running intensive workloads, "
        "see: https://delta-io.github.io/delta-rs/delta-lake-best-practices/#optimizing-table-layout",
        MPLocalDatasetWarning,
    )
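One point in favour of warnings is that a user can filter them by category. A minimal sketch of that behaviour, assuming MPLocalDatasetWarning is a plain `Warning` subclass; the class below is a stand-in rather than the client's definition, and `notify_dataset_exists` is a hypothetical helper used only for illustration:

```python
import warnings


class MPLocalDatasetWarning(Warning):
    """Stand-in for the client's category (assumed to subclass Warning)."""


def notify_dataset_exists(suffix: str, target_path: str) -> None:
    # Hypothetical helper that mirrors the shape of the message in client.py.
    warnings.warn(
        f"Dataset for {suffix} already exists at {target_path}",
        MPLocalDatasetWarning,
        stacklevel=2,
    )


with warnings.catch_warnings(record=True) as caught:
    # Show every warning so the first call is recorded.
    warnings.simplefilter("always")
    notify_dataset_exists("tasks", "/tmp/mp_datasets/tasks")

    # A user who finds the messages noisy can silence only this category.
    warnings.simplefilter("ignore", MPLocalDatasetWarning)
    notify_dataset_exists("tasks", "/tmp/mp_datasets/tasks")

n_recorded = len(caught)
```

Only the first call is recorded here; the second is dropped by the category filter. Whether that is preferable to logging or printing is still the open question above.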

On the fence about whether MPDataset should inherit the user's choice of use_document_model or default to False; it's extra overhead when True

api/mp_api/client/core/client.py

Line 642 in aee0f8c

    use_document_model=self.use_document_model,

Re: document models, I wasn't sure whether making an MPDataDoc model was the right route, so the emmet model is just passed through for now.
@esoteric-ephemera, is this how coercing user input to AlphaIDs should go? Do you want to do something different?

api/mp_api/client/routes/materials/tasks.py

Line 34 in aee0f8c

as_alpha = str(AlphaID(task_id, padlen=8)).split("-")[-1]

Is MPAPIClientSettings the right place for these? Not sure if the user has the ability to adjust these if needed:

api/mp_api/client/core/settings.py

Lines 90 to 102 in aee0f8c

    
           LOCAL_DATASET_CACHE: str = Field( 
        
               os.path.expanduser("~") + "/mp_datasets", 
        
               description="Target directory for downloading full datasets", 
        
           ) 
        
           DATASET_FLUSH_THRESHOLD: int = Field( 
        
               100000, 
        
               description="Threshold number of rows to accumulate in memory before flushing dataset to disk", 
        
           ) 
        
           ACCESS_CONTROLLED_BATCH_IDS: list[str] = Field( 
        
               ["gnome_r2scan_statics"], description="Batch ids with access restrictions" 
        
           )

tsmathis · 2025-10-23T15:58:45Z

ah and based on the failing test for trajectories, I assumed returning the pymatgen object was correct, should the dict be returned? @esoteric-ephemera

api/mp_api/client/routes/materials/tasks.py

Line 57 in aee0f8c

return RelaxTrajectory(**traj_data[0]).to_pmg()

esoteric-ephemera · 2025-10-23T16:43:12Z

@tsmathis think the API was set up to return the jsanitized trajectory info:
https://github.com/materialsproject/emmet/blob/3447c5af4746d539f1f4faf26b97715cb119c85d/emmet-api/emmet/api/routes/materials/tasks/query_operators.py#L73

Either way yeah I guess it returned the as_dict but we don't need to keep with that paradigm

For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

"a"*(x._padlen-len(x._identifier)) + x._identifier

or

if (alpha := AlphaID(task_id, padlen=8))._separator:
    padded = str(alpha).rsplit(alpha._separator)[-1]
else:
    padded = str(alpha)
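For reference, the intended result of both variants can be sketched with plain string handling. This does not use AlphaID at all: `padded_identifier` is a hypothetical helper, it assumes "-" as the separator, and it does not convert legacy integer IDs.

```python
def padded_identifier(task_id: str, padlen: int = 8, separator: str = "-") -> str:
    """Return the identifier part of ``task_id``, left-padded with "a".

    Hypothetical helper for illustration only; it is not part of emmet-core
    and only handles identifiers that are already alphabetical.
    """
    # Drop an optional prefix such as "mp-"; a bare identifier passes through.
    identifier = task_id.rsplit(separator, 1)[-1]
    # Pad on the left with "a", as in the first expression above.
    return identifier.rjust(padlen, "a")


with_prefix = padded_identifier("mp-aaaaaaft")
without_prefix = padded_identifier("aaaaaaft")
short_form = padded_identifier("mp-ft")
```

All three calls give "aaaaaaft", which is the value the tasks query would receive in either input style.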

tsmathis · 2025-10-23T17:39:39Z

> For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

either way on this works for me, just want to make sure I stick to the intended usage (edit: or that we're at least consistent across the client)

> Either way yeah I guess it returned the as_dict but we don't need to keep with that paradigm

Was going to say we could stick to whatever the frontend was expecting, but looking now the frontend doesn't even use the tasks.get_trajectory(...) function so it will need to be rewritten either way. The frontend does end up making a dataframe from the trajectory dict, so maybe just returning the dict will be best

codecov-commenter · 2025-10-23T17:52:28Z

Codecov Report

❌ Patch coverage is 42.20183% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.92%. Comparing base (5ecebec) to head (33b787f).

Files with missing lines        Patch %   Lines
mp_api/client/core/client.py    16.66%    45 Missing ⚠️
mp_api/client/core/utils.py     55.00%    18 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   66.85%   65.92%   -0.94%     
==========================================
  Files          50       50              
  Lines        2767     2870     +103     
==========================================
+ Hits         1850     1892      +42     
- Misses        917      978      +61


tschaume

Very nice! Looking forward to rolling this out 😄

tschaume · 2025-10-23T17:48:26Z

mp_api/client/core/client.py

+                has_gnome_access = bool(
+                    self._submit_requests(
+                        url=urljoin(
+                            "https://api.materialsproject.org/", "materials/summary/"


use self.endpoint

should it be self.base_endpoint? I think this tripped me up when I tried using self.endpoint originally

api/mp_api/client/core/client.py

Lines 134 to 135 in 254c7d0

    if self.suffix:
        self.endpoint = urljoin(self.endpoint, self.suffix)

for the tasks rester -> self.endpoint caused the urljoin here to yield something like {base_url}/materials/tasks/materials/summary

right, self.base_endpoint should work.
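The nesting described in this thread can be reproduced with `urljoin` alone. A small sketch, assuming the tasks suffix is "materials/tasks/" with a trailing slash (the exact suffix value is an assumption, not taken from the client code):

```python
from urllib.parse import urljoin

base_endpoint = "https://api.materialsproject.org/"
suffix = "materials/tasks/"  # assumed value of self.suffix for the tasks rester

# What a tasks rester ends up with after urljoin(self.endpoint, self.suffix).
endpoint = urljoin(base_endpoint, suffix)

# Joining a relative path onto the suffixed endpoint nests it under the suffix.
wrong = urljoin(endpoint, "materials/summary/")

# Joining onto the base endpoint gives the intended URL.
right = urljoin(base_endpoint, "materials/summary/")
```

Here `wrong` ends in materials/tasks/materials/summary/, matching the URL reported above, while `right` points at materials/summary/ directly.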

tschaume · 2025-10-23T17:53:31Z

mp_api/client/core/client.py

+                            _flush(accumulator, group)
+                            group += 1
+                            size = 0
+                            accumulator = []


accumulator.clear() for better memory management?
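One thing to check before switching: `accumulator.clear()` empties the same list object that was just passed to `_flush`, whereas `accumulator = []` leaves the flushed list untouched. The sketch below is a simplified stand-in for the loop in the diff (`write_in_groups` and its `flush` callback are hypothetical, and it counts rows with `len` instead of a separate `size` counter). It does not measure memory; it only shows the aliasing difference.

```python
def write_in_groups(rows, threshold, flush):
    """Accumulate rows and hand them to ``flush`` once ``threshold`` is reached."""
    accumulator = []
    group = 0
    for row in rows:
        accumulator.append(row)
        if len(accumulator) >= threshold:
            flush(accumulator, group)
            group += 1
            # In-place reset: safe only if flush() no longer needs the list.
            accumulator.clear()
    if accumulator:
        # Flush the final partial group.
        flush(accumulator, group)


kept_by_reference = []
kept_by_copy = []

# A flush that keeps a reference sees its data change after clear().
write_in_groups(range(5), 2, lambda rows, group: kept_by_reference.append(rows))

# A flush that copies (or writes synchronously) is unaffected.
write_in_groups(range(5), 2, lambda rows, group: kept_by_copy.append(list(rows)))
```

If `_flush` writes to disk synchronously and keeps nothing, `clear()` behaves the same as rebinding; the difference only matters when the flushed list outlives the call.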

tschaume · 2025-10-23T17:55:14Z

mp_api/client/core/settings.py

+        description="Threshold number of rows to accumulate in memory before flushing dataset to disk",
+    )
+
+    ACCESS_CONTROLLED_BATCH_IDS: list[str] = Field(


settings is the right place for the first two. Access-controlled batch ids should probably be hardcoded and change with client releases.

akin to one of these?

api/mp_api/client/mprester.py

Lines 475 to 488 in 254c7d0

    def get_database_version(self):
        """The Materials Project database is periodically updated and has a
        database version associated with it. When the database is updated,
        consolidated data (information about "a material") may and does
        change, while calculation data about a specific calculation task
        remains unchanged and available for querying via its task_id.

        The database version is set as a date in the format YYYY_MM_DD,
        where "_DD" may be optional. An additional numerical suffix
        might be added if multiple releases happen on the same day.

        Returns: database version as a string
        """
        return get(url=self.endpoint + "heartbeat").json()["db_version"]

Re: the use of self.endpoint? If so, then yes :)

Ah no, I mean for the access controlled batch ids.
Should those be added to the heartbeat so they aren't defined in the client code/settings? And then the client can just call get_access_controlled_batch_ids()

Yes, that's a good idea.

Feel free to start a PR to add it to the heartbeat_meta here.

tschaume and others added 3 commits October 22, 2025 13:12

exclude gnome for full downloads if needed

9d2048e

query s3 for trajectories

505ddfe

add deltalake query support

aee0f8c

linting + mistaken sed replace on 'where'

d5a25b1

tsmathis force-pushed the deltalake branch from 8c59af4 to d5a25b1 Compare October 23, 2025 16:04

tschaume mentioned this pull request Oct 23, 2025

exclude gnome for full downloads if needed #974

Closed

tsmathis added 3 commits October 23, 2025 10:40

return trajectory as pmg dict

2de051d

update trajectory test

7d0b8b7

correct docstrs

7195adf

tschaume reviewed Oct 23, 2025

View reviewed changes

tsmathis mentioned this pull request Oct 23, 2025

add BatchIDQuery to tasks_resource materialsproject/emmet#1330

Merged

Merge branch 'main' into deltalake

33b787f

