USE 403 - include fulltext source by ghukill · Pull Request #379 · MITLibraries/timdex-index-manager

ghukill · 2026-02-10T14:36:03Z

Purpose and background context

This PR reintroduces the fulltext field back into the _source response from the Opensearch API.

It was determined that when bulk updating documents in Opensearch -- using the opensearch-py SDK and all recommended approaches -- it relies on the _source object. As we interleaved new embedding data during this bulk update, because fulltext was formerly excluded from _source, it was quietly dropping that field when writing the documents back to Opensearch.

As noted in the commit, excluding fulltext from the _source object does not reduce storage costs; it merely excludes it from getting transferred over-the-wire in responses if _source is requested or not explicitly ignored.

My proposal is that while we can potentially still optimize how fulltext is stored and/or returned in responses, by reintroducing it now to _source we solve the bulk update issue where it would get dropped and we simplify debugging of full-text in TIMDEX records in the short-term, which are under heavy churn at the moment.

UPDATE: After a big more digging, got confirmation that _source is required for bulk update operations per Elasticsearch documentation, which is unfortunately often more helpful than Opensearch docs:

"The _source field must be enabled to use this API."

But even then, it doesn't really highlight why, that values are pulled from _source, and so an exclusion would effectively wipe that field on update.

This Github "bug" ticket also mirrors our scenario perfectly: opensearch-project/k-NN#1694.

DOUBLE UPDATE:

It turns out this is very clearly laid out in the Opensearch docs:

It is not mentioned in the bulk API docs, which is what we're implicitly using by using the bulk indexing methods from the opensearch-py SDK.

Looking at the Opensearch docs related to _source specifically, there is another warning:

ANALYSIS POST UPDATES:

If memory serves, we looked into derived source when we started exploring both fulltext and adding embeddings. This may be a valuable undertaking as we're starting to stress the edges of a single, simple document.

As noted above, the proposal is to move ahead with this current change to re-add fulltext to _source to ensure that development around fulltext and embeddings can continue unimpeded. But it may be worth some cycles at some point to explore building a dervied source.

Noting that derivived source appears to be only available in Opensearch 3.x.

How can a reviewer manually see the effects of these changes?

The easiest way would likely be to run reindex-source for a source, noting that fulltext is present at the end. Short of doing that yourself, here is the effect:

documents are indexed
the embeddings update pass fires
when all is complete, documents still have their fulltext field

This change does result in fulltext being visibly present in the Opensearch records now inside _source. This can be fairly easily seen by following these instructions to start a tunnel to Dev1, then opening the dashboards, and running this query:

GET /mitlibwebsite/_search
{
  "query": {
    "query_string": {
      "query": "Violich AND Architecture"
    }
  },
  "_source": ["timdex_record_id", "title", "source_link", "fulltext"]
}

Example response:

{
        "_index": "mitlibwebsite-2026-02-10t16-27-42",
        "_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313",
        "_score": 8.873661,
        "_source": {
          "fulltext": "2022: In the media DECEMBER 2022 Machine learning and the arts: A creative continuum MIT News CAST Visiting Artist Andreas Refsgaard engages the MIT community in the ethics and play of creative coding. NOVEMBER 2022 Celebrating open data MIT News New prize program recognizes MIT researchers who make data openly accessible and reusable. Uncovering the rich connections between South Asia and MIT MIT News Showcased in a new exhibit, student research explores the long history of South Asians at the Institute. JULY 2022 MIT Libraries staff honored with 2022 Infinite Mile Awards MIT News Seventeen staffers lauded for providing outstanding service, supporting their colleagues, and exemplifying the Libraries’ values. JUNE 2022 Novelist Elizabeth Acevedo’s work reflects the rich stories, traditions, and cultures of the Caribbean diaspora MIT News MIT Reads event moderated by Nailah J. Smith ’22 delights MIT audience. APRIL 2022 In the history lab, delving into the South Asian experience at MIT MIT News Students in 21H.S04 explore stories of students and faculty from South Asia via oral histories and the Institute Archives/Distinctive Collections. MARCH 2022 MIT Highlights Distinctive Collections Through “A Lab of One’s Own” Video Game Library Journal Distinctive Collections has upped the creativity factor with an immersive video game that allows players to discover archival materials telling the stories of women from MIT’s history. FEBRUARY 2022 MIT Hayden Memorial Library by Kennedy & Violich Architecture Architectural Record Kennedy & Violich bring a new vision to the Hayden Library at MIT JANUARY 2022 Immersive video game explores the history of women at MIT MIT News “A Lab of One’s Own” invites players to engage with archival materials in a virtual environment. How European Royals Once Shared Their Most Important Secrets The New York Times Recent research highlights the use of letterlocking techniques by Queen Elizabeth, Catherine de’ Medici and Mary Queen of Scots.",
          "title": "2022: In the media | News",
          "source_link": "https://libraries.mit.edu/news/in-the-media/2022-the-media/",
          "timdex_record_id": "mitlibwebsite:32b8dd4e2459ae004d3e6de5c99bd313"
        }
      }

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: fulltext is now present in Opensearch API responses by default, as part of _source if not filtered to specific fields

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/USE-403

Code review

Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced: Originally, we found excluding the newish fulltext field from the _source mapping in Opensearch was beneficial. This would keep the fulltext from getting returned during casual Opensearch queries and/or when an explicit list of fields to return was not used. What this exclusion does NOT do -- and this is important to highlight -- is reduce storage in any meaningful way. At this time, the way in which we store fulltext in Opensearch has the same storage footprint if we exclude from _source or not. That exclusion DID have a negative side effect however. We later implemented a TIM command that would update records in Opensearch with embeddings, using opensearch-py built-in, bulk update methods. It would appear, though not well documented, that because fulltext was excluded from the _source fields returned, it was removing that field value when the document was updated. Conceptually, it's like Opensearch extracts the document as represented by _source, interleaves updates, and then reinserts. How this addresses that need: Because the `fulltext` exclusion field from _source really had no positive effects, but was found to have negative ones, electing to remove that exclusion. This will result in fulltext getting included in the default _source response. It is worth noting that in the TIMDEX API -- the largest usage of Opensearch -- we do explicitly state columns for returning, so it's believed we are not paying the penalty of transferring the fulltext over the wire each request. This will be confirmed and addressed outside the scope of this work. Side effects of this change: * fulltext is returned by default as part of _source document * bulk updating of documents does not remove the fulltext field Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/USE-403

coveralls · 2026-02-10T14:38:44Z

Pull Request Test Coverage Report for Build 21908100903

Details

1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 95.6%

Totals
Change from base Build 21527292422:	0.0%
Covered Lines:	478
Relevant Lines:	500

💛 - Coveralls

JPrevost

Blocking until we assess the effects of this on TIMDEX API. I suspect this will require changes to both OpenSearch mappings and TIMDEX API to allow us to stop relying on _source in TIMDEX API. That is probably a good thing, but it is complex and needs a plan before we introduce this change.

JPrevost

My concerns have been addressed in various conversations.

This change makes sense. We don't want this in production tier until TIMDEX API moves to excluding fulltext and embeddings fields from source, but we have a ticket and intention to land that this week so hopefully keeping this out of production is a very short request on our end.

JPrevost · 2026-02-10T21:20:54Z

tim/helpers.py


 def get_source_from_index(index_name: str) -> str:
-    return index_name.split("-")[0]
+    return index_name.split("-", maxsplit=1)[0]


My review is explicitly not covering this change. It should be reviewed by the python team.

Glad you called this out! This almost prompted a philosophical musing in the engineering channel.

This update was prompted by failed linting rule, ruff rule PLC0207, which I'm assuming threw because we updated ruff. I considered just ignoring it, as we knew the logic was sound, but it feels like a good habit to form.

I was musing that linting rules often expose new parts of libraries and languages I was not aware of, and I'm grateful for that constant, integrated learning. And, occassionally annoyed by it. This felt like the former 😄.

Gotcha, this is a good reminder of why I tend to avoid introducing dep updates alongside functional changes 😄

It looks like a nonsensical change in the context of the intentionally introduced change, it probably will never make a meaningful difference in our environment, and is basically at the whims of a script that happened to update. There is absolutely nothing wrong with the change, but it is noise when introduced along with functional code changes.

ehanson8

Looks good!

ghukill added 2 commits February 10, 2026 09:13

Update dependencies

dc605a1

ghukill requested a review from a team February 10, 2026 14:52

ghukill marked this pull request as ready for review February 10, 2026 15:00

ghukill requested a review from a team as a code owner February 10, 2026 15:00

JPrevost requested changes Feb 10, 2026

View reviewed changes

JPrevost approved these changes Feb 10, 2026

View reviewed changes

Update CODEOWNERS default owner

d0b6d08

ehanson8 self-assigned this Feb 11, 2026

ehanson8 approved these changes Feb 11, 2026

View reviewed changes

ghukill merged commit 69959c0 into main Feb 11, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

USE 403 - include fulltext source#379

USE 403 - include fulltext source#379
ghukill merged 3 commits intomainfrom
USE-403-include-fulltext-source

ghukill commented Feb 10, 2026 •

edited

Loading

Uh oh!

coveralls commented Feb 10, 2026 •

edited

Loading

Uh oh!

JPrevost left a comment •

edited

Loading

Uh oh!

JPrevost left a comment

Uh oh!

JPrevost Feb 10, 2026

Uh oh!

ghukill Feb 10, 2026

Uh oh!

JPrevost Feb 10, 2026

Uh oh!

ehanson8 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ghukill commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Code review

Uh oh!

coveralls commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 21908100903

Details

💛 - Coveralls

Uh oh!

JPrevost left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JPrevost left a comment

Choose a reason for hiding this comment

Uh oh!

JPrevost Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

ghukill Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

JPrevost Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ghukill commented Feb 10, 2026 •

edited

Loading

coveralls commented Feb 10, 2026 •

edited

Loading

JPrevost left a comment •

edited

Loading