Skip to content

USE 403 - include fulltext source#379

Merged
ghukill merged 3 commits intomainfrom
USE-403-include-fulltext-source
Feb 11, 2026
Merged

USE 403 - include fulltext source#379
ghukill merged 3 commits intomainfrom
USE-403-include-fulltext-source

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Feb 10, 2026

Purpose and background context

This PR reintroduces the fulltext field back into the _source response from the Opensearch API.

It was determined that when bulk updating documents in Opensearch -- using the opensearch-py SDK and all recommended approaches -- it relies on the _source object. As we interleaved new embedding data during this bulk update, because fulltext was formerly excluded from _source, it was quietly dropping that field when writing the documents back to Opensearch.

As noted in the commit, excluding fulltext from the _source object does not reduce storage costs; it merely excludes it from getting transferred over-the-wire in responses if _source is requested or not explicitly ignored.

My proposal is that while we can potentially still optimize how fulltext is stored and/or returned in responses, by reintroducing it now to _source we solve the bulk update issue where it would get dropped and we simplify debugging of full-text in TIMDEX records in the short-term, which are under heavy churn at the moment.

UPDATE: After a big more digging, got confirmation that _source is required for bulk update operations per Elasticsearch documentation, which is unfortunately often more helpful than Opensearch docs:

"The _source field must be enabled to use this API."

But even then, it doesn't really highlight why, that values are pulled from _source, and so an exclusion would effectively wipe that field on update.

This Github "bug" ticket also mirrors our scenario perfectly: opensearch-project/k-NN#1694.

DOUBLE UPDATE:

It turns out this is very clearly laid out in the Opensearch docs:

Screenshot 2026-02-10 at 10 44 23 AM

It is not mentioned in the bulk API docs, which is what we're implicitly using by using the bulk indexing methods from the opensearch-py SDK.

Looking at the Opensearch docs related to _source specifically, there is another warning:

Screenshot 2026-02-10 at 10 49 09 AM

ANALYSIS POST UPDATES:

If memory serves, we looked into derived source when we started exploring both fulltext and adding embeddings. This may be a valuable undertaking as we're starting to stress the edges of a single, simple document.

As noted above, the proposal is to move ahead with this current change to re-add fulltext to _source to ensure that development around fulltext and embeddings can continue unimpeded. But it may be worth some cycles at some point to explore building a dervied source.

Noting that derivived source appears to be only available in Opensearch 3.x.

How can a reviewer manually see the effects of these changes?

The easiest way would likely be to run reindex-source for a source, noting that fulltext is present at the end. Short of doing that yourself, here is the effect:

  1. documents are indexed
  2. the embeddings update pass fires
  3. when all is complete, documents still have their fulltext field

This change does result in fulltext being visibly present in the Opensearch records now inside _source. This can be fairly easily seen by following these instructions to start a tunnel to Dev1, then opening the dashboards, and running this query:

GET /mitlibwebsite/_search
{
  "query": {
    "query_string": {
      "query": "Violich AND Architecture"
    }
  },
  "_source": ["timdex_record_id", "title", "source_link", "fulltext"]
}

Example response:

{
        "_index": "mitlibwebsite-2026-02-10t16-27-42",
        "_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313",
        "_score": 8.873661,
        "_source": {
          "fulltext": "2022: In the media DECEMBER 2022 Machine learning and the arts: A creative continuum MIT News CAST Visiting Artist Andreas Refsgaard engages the MIT community in the ethics and play of creative coding. NOVEMBER 2022 Celebrating open data MIT News New prize program recognizes MIT researchers who make data openly accessible and reusable. Uncovering the rich connections between South Asia and MIT MIT News Showcased in a new exhibit, student research explores the long history of South Asians at the Institute. JULY 2022 MIT Libraries staff honored with 2022 Infinite Mile Awards MIT News Seventeen staffers lauded for providing outstanding service, supporting their colleagues, and exemplifying the Libraries’ values. JUNE 2022 Novelist Elizabeth Acevedo’s work reflects the rich stories, traditions, and cultures of the Caribbean diaspora MIT News MIT Reads event moderated by Nailah J. Smith ’22 delights MIT audience. APRIL 2022 In the history lab, delving into the South Asian experience at MIT MIT News Students in 21H.S04 explore stories of students and faculty from South Asia via oral histories and the Institute Archives/Distinctive Collections. MARCH 2022 MIT Highlights Distinctive Collections Through “A Lab of One’s Own” Video Game Library Journal Distinctive Collections has upped the creativity factor with an immersive video game that allows players to discover archival materials telling the stories of women from MIT’s history. FEBRUARY 2022 MIT Hayden Memorial Library by Kennedy & Violich Architecture Architectural Record Kennedy & Violich bring a new vision to the Hayden Library at MIT JANUARY 2022 Immersive video game explores the history of women at MIT MIT News “A Lab of One’s Own” invites players to engage with archival materials in a virtual environment. How European Royals Once Shared Their Most Important Secrets The New York Times Recent research highlights the use of letterlocking techniques by Queen Elizabeth, Catherine de’ Medici and Mary Queen of Scots.",
          "title": "2022: In the media | News",
          "source_link": "https://libraries.mit.edu/news/in-the-media/2022-the-media/",
          "timdex_record_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313"
        }
      }

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: fulltext is now present in Opensearch API responses by default, as part of _source if not filtered to specific fields

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

Originally, we found excluding the newish fulltext field from the _source mapping
in Opensearch was beneficial.  This would keep the fulltext from getting returned
during casual Opensearch queries and/or when an explicit list of fields to return
was not used.

What this exclusion does NOT do -- and this is important to highlight -- is reduce
storage in any meaningful way.  At this time, the way in which we store fulltext
in Opensearch has the same storage footprint if we exclude from _source or not.

That exclusion DID have a negative side effect however.  We later implemented a
TIM command that would update records in Opensearch with embeddings, using
opensearch-py built-in, bulk update methods.

It would appear, though not well documented, that because fulltext was excluded
from the _source fields returned, it was removing that field value when the
document was updated.  Conceptually, it's like Opensearch extracts the document
as represented by _source, interleaves updates, and then reinserts.

How this addresses that need:

Because the `fulltext` exclusion field from _source really had no positive effects,
but was found to have negative ones, electing to remove that exclusion.  This will
result in fulltext getting included in the default _source response.

It is worth noting that in the TIMDEX API -- the largest usage of Opensearch -- we
do explicitly state columns for returning, so it's believed we are not paying the
penalty of transferring the fulltext over the wire each request.  This will be
confirmed and addressed outside the scope of this work.

Side effects of this change:
* fulltext is returned by default as part of _source document
* bulk updating of documents does not remove the fulltext field

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-403
@coveralls
Copy link

coveralls commented Feb 10, 2026

Pull Request Test Coverage Report for Build 21908100903

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 95.6%

Totals Coverage Status
Change from base Build 21527292422: 0.0%
Covered Lines: 478
Relevant Lines: 500

💛 - Coveralls

@ghukill ghukill requested a review from a team February 10, 2026 14:52
@ghukill ghukill marked this pull request as ready for review February 10, 2026 15:00
@ghukill ghukill requested a review from a team as a code owner February 10, 2026 15:00
Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking until we assess the effects of this on TIMDEX API. I suspect this will require changes to both OpenSearch mappings and TIMDEX API to allow us to stop relying on _source in TIMDEX API. That is probably a good thing, but it is complex and needs a plan before we introduce this change.

Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns have been addressed in various conversations.

This change makes sense. We don't want this in production tier until TIMDEX API moves to excluding fulltext and embeddings fields from source, but we have a ticket and intention to land that this week so hopefully keeping this out of production is a very short request on our end.


def get_source_from_index(index_name: str) -> str:
return index_name.split("-")[0]
return index_name.split("-", maxsplit=1)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review is explicitly not covering this change. It should be reviewed by the python team.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad you called this out! This almost prompted a philosophical musing in the engineering channel.

This update was prompted by failed linting rule, ruff rule PLC0207, which I'm assuming threw because we updated ruff. I considered just ignoring it, as we knew the logic was sound, but it feels like a good habit to form.

I was musing that linting rules often expose new parts of libraries and languages I was not aware of, and I'm grateful for that constant, integrated learning. And, occassionally annoyed by it. This felt like the former 😄.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, this is a good reminder of why I tend to avoid introducing dep updates alongside functional changes 😄

It looks like a nonsensical change in the context of the intentionally introduced change, it probably will never make a meaningful difference in our environment, and is basically at the whims of a script that happened to update. There is absolutely nothing wrong with the change, but it is noise when introduced along with functional code changes.

@ehanson8 ehanson8 self-assigned this Feb 11, 2026
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@ghukill ghukill merged commit 69959c0 into main Feb 11, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants