Conversation
Why these changes are being introduced: Originally, we found excluding the newish fulltext field from the _source mapping in Opensearch was beneficial. This would keep the fulltext from getting returned during casual Opensearch queries and/or when an explicit list of fields to return was not used. What this exclusion does NOT do -- and this is important to highlight -- is reduce storage in any meaningful way. At this time, the way in which we store fulltext in Opensearch has the same storage footprint if we exclude from _source or not. That exclusion DID have a negative side effect however. We later implemented a TIM command that would update records in Opensearch with embeddings, using opensearch-py built-in, bulk update methods. It would appear, though not well documented, that because fulltext was excluded from the _source fields returned, it was removing that field value when the document was updated. Conceptually, it's like Opensearch extracts the document as represented by _source, interleaves updates, and then reinserts. How this addresses that need: Because the `fulltext` exclusion field from _source really had no positive effects, but was found to have negative ones, electing to remove that exclusion. This will result in fulltext getting included in the default _source response. It is worth noting that in the TIMDEX API -- the largest usage of Opensearch -- we do explicitly state columns for returning, so it's believed we are not paying the penalty of transferring the fulltext over the wire each request. This will be confirmed and addressed outside the scope of this work. Side effects of this change: * fulltext is returned by default as part of _source document * bulk updating of documents does not remove the fulltext field Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-403
Pull Request Test Coverage Report for Build 21908100903Details
💛 - Coveralls |
There was a problem hiding this comment.
Blocking until we assess the effects of this on TIMDEX API. I suspect this will require changes to both OpenSearch mappings and TIMDEX API to allow us to stop relying on _source in TIMDEX API. That is probably a good thing, but it is complex and needs a plan before we introduce this change.
JPrevost
left a comment
There was a problem hiding this comment.
My concerns have been addressed in various conversations.
This change makes sense. We don't want this in production tier until TIMDEX API moves to excluding fulltext and embeddings fields from source, but we have a ticket and intention to land that this week so hopefully keeping this out of production is a very short request on our end.
|
|
||
| def get_source_from_index(index_name: str) -> str: | ||
| return index_name.split("-")[0] | ||
| return index_name.split("-", maxsplit=1)[0] |
There was a problem hiding this comment.
My review is explicitly not covering this change. It should be reviewed by the python team.
There was a problem hiding this comment.
Glad you called this out! This almost prompted a philosophical musing in the engineering channel.
This update was prompted by failed linting rule, ruff rule PLC0207, which I'm assuming threw because we updated ruff. I considered just ignoring it, as we knew the logic was sound, but it feels like a good habit to form.
I was musing that linting rules often expose new parts of libraries and languages I was not aware of, and I'm grateful for that constant, integrated learning. And, occassionally annoyed by it. This felt like the former 😄.
There was a problem hiding this comment.
Gotcha, this is a good reminder of why I tend to avoid introducing dep updates alongside functional changes 😄
It looks like a nonsensical change in the context of the intentionally introduced change, it probably will never make a meaningful difference in our environment, and is basically at the whims of a script that happened to update. There is absolutely nothing wrong with the change, but it is noise when introduced along with functional code changes.
Purpose and background context
This PR reintroduces the
fulltextfield back into the_sourceresponse from the Opensearch API.It was determined that when bulk updating documents in Opensearch -- using the
opensearch-pySDK and all recommended approaches -- it relies on the_sourceobject. As we interleaved new embedding data during this bulk update, becausefulltextwas formerly excluded from_source, it was quietly dropping that field when writing the documents back to Opensearch.As noted in the commit, excluding
fulltextfrom the_sourceobject does not reduce storage costs; it merely excludes it from getting transferred over-the-wire in responses if_sourceis requested or not explicitly ignored.My proposal is that while we can potentially still optimize how
fulltextis stored and/or returned in responses, by reintroducing it now to_sourcewe solve the bulk update issue where it would get dropped and we simplify debugging of full-text in TIMDEX records in the short-term, which are under heavy churn at the moment.UPDATE: After a big more digging, got confirmation that
_sourceis required for bulk update operations per Elasticsearch documentation, which is unfortunately often more helpful than Opensearch docs:But even then, it doesn't really highlight why, that values are pulled from
_source, and so an exclusion would effectively wipe that field on update.This Github "bug" ticket also mirrors our scenario perfectly: opensearch-project/k-NN#1694.
DOUBLE UPDATE:
It turns out this is very clearly laid out in the Opensearch docs:
It is not mentioned in the bulk API docs, which is what we're implicitly using by using the bulk indexing methods from the
opensearch-pySDK.Looking at the Opensearch docs related to
_sourcespecifically, there is another warning:ANALYSIS POST UPDATES:
If memory serves, we looked into derived source when we started exploring both fulltext and adding embeddings. This may be a valuable undertaking as we're starting to stress the edges of a single, simple document.
As noted above, the proposal is to move ahead with this current change to re-add
fulltextto_sourceto ensure that development around fulltext and embeddings can continue unimpeded. But it may be worth some cycles at some point to explore building a dervied source.Noting that derivived source appears to be only available in Opensearch 3.x.
How can a reviewer manually see the effects of these changes?
The easiest way would likely be to run
reindex-sourcefor a source, noting thatfulltextis present at the end. Short of doing that yourself, here is the effect:fulltextfieldThis change does result in
fulltextbeing visibly present in the Opensearch records now inside_source. This can be fairly easily seen by following these instructions to start a tunnel to Dev1, then opening the dashboards, and running this query:Example response:
{ "_index": "mitlibwebsite-2026-02-10t16-27-42", "_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313", "_score": 8.873661, "_source": { "fulltext": "2022: In the media DECEMBER 2022 Machine learning and the arts: A creative continuum MIT News CAST Visiting Artist Andreas Refsgaard engages the MIT community in the ethics and play of creative coding. NOVEMBER 2022 Celebrating open data MIT News New prize program recognizes MIT researchers who make data openly accessible and reusable. Uncovering the rich connections between South Asia and MIT MIT News Showcased in a new exhibit, student research explores the long history of South Asians at the Institute. JULY 2022 MIT Libraries staff honored with 2022 Infinite Mile Awards MIT News Seventeen staffers lauded for providing outstanding service, supporting their colleagues, and exemplifying the Libraries’ values. JUNE 2022 Novelist Elizabeth Acevedo’s work reflects the rich stories, traditions, and cultures of the Caribbean diaspora MIT News MIT Reads event moderated by Nailah J. Smith ’22 delights MIT audience. APRIL 2022 In the history lab, delving into the South Asian experience at MIT MIT News Students in 21H.S04 explore stories of students and faculty from South Asia via oral histories and the Institute Archives/Distinctive Collections. MARCH 2022 MIT Highlights Distinctive Collections Through “A Lab of One’s Own” Video Game Library Journal Distinctive Collections has upped the creativity factor with an immersive video game that allows players to discover archival materials telling the stories of women from MIT’s history. FEBRUARY 2022 MIT Hayden Memorial Library by Kennedy & Violich Architecture Architectural Record Kennedy & Violich bring a new vision to the Hayden Library at MIT JANUARY 2022 Immersive video game explores the history of women at MIT MIT News “A Lab of One’s Own” invites players to engage with archival materials in a virtual environment. How European Royals Once Shared Their Most Important Secrets The New York Times Recent research highlights the use of letterlocking techniques by Queen Elizabeth, Catherine de’ Medici and Mary Queen of Scots.", "title": "2022: In the media | News", "source_link": "https://libraries.mit.edu/news/in-the-media/2022-the-media/", "timdex_record_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313" } }Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES:
fulltextis now present in Opensearch API responses by default, as part of_sourceif not filtered to specific fieldsWhat are the relevant tickets?
Code review