Integrate with CLA? #79

@sciepsilon

It might be of great benefit to integrate LingView with an archive, such as the Survey of California and Other Indian Languages / California Language Archive (https://cla.berkeley.edu/). But it's not easy. These great benefits come with great technical and design challenges.

Website structure and user experience

Current LingView website structure

The current version of LingView is intended to be the whole website. It includes a table of contents, a separate tab for the About page, and a page for each ELAN or FLEx file together with its metadata and media. There are several types of files that it can't display, including text files, Word documents, images, and PDFs. All of these characteristics can be changed, if we want to, with a few weeks of engineering work.

Option A: two monoliths

No changes to LingView website structure. LingView website is separate from the current CLA website. If a CLA item is an ELAN or FLEx file, its page within CLA contains a link to its page within LingView, alongside other metadata about the item. Neither website needs to change its structure.

  • Typical workflow: user browses CLA website, finds an item of interest, reads its description in CLA, then clicks the link to its LingView page. On the item's LingView page, they watch parts of the audio or video in sync with the transcription, or they simply read the transcript and other annotations. Then they can navigate to LingView's table of contents to see other CLA materials that have LingView pages.
  • May have one LingView website for all LingView-relevant CLA materials
  • Or, may split the LingView-relevant materials among multiple LingView websites, each with its own table of contents and its own search/concordance page. If the size of the JSON files is a problem, this approach solves it, and each LingView site will also load faster because it is smaller.

Option B: many mini LingViews

One LingView page per item. Get rid of LingView's table of contents and its search/concordance feature. If a CLA item is an ELAN or FLEx file, its page within CLA contains a link to its page within LingView, alongside other metadata about the item.

This option will take a week or so of engineering work, but the ability to build a "mini LingView", for a single ELAN or FLEx file with media, would be quite nice. It would make LingView much more versatile and allow it to be incorporated flexibly into other websites, beyond just CLA. We on the LingView project will need to figure out how we're going to support two forms of LingView (monolith and mini) side-by-side.

Option C (bad idea): merge LingView and CLA sites

Bad idea, but for completeness here it is: Replace the CLA website with a LingView site or a hybrid. Re-tool LingView's search features (which are currently quite rudimentary) so that users can adequately find CLA items. Input all document metadata into LingView's metadata format. Use Box to display as-yet-unsupported formats, such as text files and PDFs, within LingView. This would require lots of programming work and would probably degrade the user experience on CLA, so I don't recommend it.

Update 2021-02-16: Option D: One LingView per collection

Here's what we envision. This is a variant of Option A:

When a Berkeley researcher adds materials to CLA, the researcher will be asked if they also want to create a LingView page. If they say yes, they will be asked to curate a subset of their transcriptions. This curated set should contain only ELAN and FLEx files, with accompanying audio or video if available, and should not contain private materials. Then the researcher will work with CLA to create a LingView page, hosted on CLA servers. CLA will include a link to the LingView page on each item in the curated collection. Since the collection has its own LingView page, users can view the LingView table of contents to see other curated transcriptions in the same collection.

Limitations:

  • We'll need to figure out how to handle Box audio and video. Displaying the audio/video alongside its transcription will take a week or two of engineering work, and getting synchronized scrolling to work will take several additional weeks. Synchronized scrolling might be impossible with Box, in which case we can investigate other options, such as using the copy of the video that's stored on Amazon. That copy costs money for each download, but might be worth the cost.
  • Requires the CLA server admin to help with setting up each LingView site. Therefore, requires the CLA server admin to be sufficiently enthusiastic.
  • At least in the short term, we would need to restrict this option to Berkeley researchers only to keep the admin's and servers' workloads manageable.
  • Transcriptions on CLA will only be viewable online if they're in a LingView curated collection. Users of CLA will need to download the FLEx or ELAN program, then download the FLEx or ELAN file they're interested in, in order to view non-LingView-curated transcriptions.

Existing technology, or, "Where to run?"

The CLA website includes this code in its source:

<script src="./js/jquery-3.3.1.js"></script>
<script src="./js/select2.js"></script>
<script src="./js/cla-select2.js"></script>
<script src="./js/edwin.js"></script>
<script type="application/javascript">
  configure_select2('https://xfhpwly456.execute-api.us-east-1.amazonaws.com/api/');
</script>

This leads me to believe that it's running on AWS (the execute-api hostname is the form used by Amazon API Gateway), built in JavaScript, using a mixture of publicly available libraries (jQuery) and custom js scripts (cla-select2.js). It's likely that the server admins are either part of CLA, or affiliated with Berkeley more generally, and that they have pretty comprehensive control over the server. They can probably run server-side js such as LingView's preprocessing script on the same server(s), although they may be reluctant to do so, lest LingView go haywire and interfere with important website things. For safety, it may be better to run the preprocessing script in a development environment, and then copy the resulting files to the production server. This is what we used to do with the LingView site that was hosted on Brown's Center for Digital Scholarship server. The CLA admins may already have a setup like this.
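The AWS inference rests on the URL passed to configure_select2. A minimal sketch of that check, assuming only that default Amazon API Gateway endpoints have hostnames of the form `<api-id>.execute-api.<region>.amazonaws.com`; `describeApiGatewayEndpoint` is a hypothetical helper, not part of CLA's or LingView's code:

```javascript
// Hypothetical helper, written for illustration only.
// Checks whether a URL looks like a default Amazon API Gateway endpoint and,
// if so, extracts the API id and AWS region from the hostname.
function describeApiGatewayEndpoint(urlString) {
  const { hostname } = new URL(urlString);
  const match = /^([a-z0-9]+)\.execute-api\.([a-z0-9-]+)\.amazonaws\.com$/.exec(hostname);
  if (match === null) {
    return null; // not an API Gateway-style hostname
  }
  return { apiId: match[1], region: match[2] };
}

// The URL below is the one that appears in the CLA page source.
const endpoint = describeApiGatewayEndpoint(
  'https://xfhpwly456.execute-api.us-east-1.amazonaws.com/api/'
);
console.log(endpoint); // { apiId: 'xfhpwly456', region: 'us-east-1' }
```

This only shows that the search API is served through AWS. It says nothing about where the site's pages or the archived files themselves are hosted.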

If CLA doesn't already have a server where they're willing and able to run LingView's server-side js, they can host their LingView site on GitHub. Limitations: With a GitHub free account, all information in the repository is public (except for GitHub Secrets, which are only viable for unstructured data like passwords), and storage space is also limited. Workflow runs on a free account are limited, but should be adequate. If GitHub hosting isn't viable, CLA can get an adequate cloud server for $10-100/month.

File storage

We should estimate the storage needs for:

  • audio/video files
  • ELAN and FLEx files
  • json and bundle.js files produced by LingView
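One way to get a first estimate is to total file sizes per category from a directory listing. The sketch below assumes a flat list of `{ name, bytes }` records; the extension-to-category table, the function names, and the example sizes are all made up for illustration and are not LingView code:

```javascript
// Illustrative mapping from file extension to the three categories above.
// The extension list is an assumption; adjust it to the actual collection.
const CATEGORY_BY_EXTENSION = {
  '.mp3': 'media', '.wav': 'media', '.mp4': 'media',
  '.eaf': 'source',      // ELAN
  '.flextext': 'source', // FLEx interlinear export (assumed extension)
  '.json': 'generated', '.js': 'generated',
};

// Returns the lowercased extension including the dot, or '' if there is none.
function extensionOf(name) {
  const dot = name.lastIndexOf('.');
  return dot === -1 ? '' : name.slice(dot).toLowerCase();
}

// Sums bytes per category; unknown extensions are counted under 'other'.
function estimateStorage(files) {
  const totals = { media: 0, source: 0, generated: 0, other: 0 };
  for (const { name, bytes } of files) {
    const category = CATEGORY_BY_EXTENSION[extensionOf(name)] || 'other';
    totals[category] += bytes;
  }
  return totals;
}

// Made-up sizes, purely to show the shape of the result.
const exampleFiles = [
  { name: 'story1.mp4', bytes: 150000000 },
  { name: 'story1.wav', bytes: 40000000 },
  { name: 'story1.eaf', bytes: 300000 },
  { name: 'story2.flextext', bytes: 120000 },
  { name: 'database.json', bytes: 900000 },
  { name: 'bundle.js', bytes: 2000000 },
  { name: 'notes.pdf', bytes: 5000 },
];
const totals = estimateStorage(exampleFiles);
console.log(totals);
```

Keeping media separate from the other categories matters because media is the part most likely to be hosted remotely.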

Files can be stored on the same computer that runs LingView, but often it's cheaper or more feasible to store large files elsewhere instead. This is called hosting those files remotely.

LingView already handles remotely stored audio and video files, but only if they are hosted on a plain file server or (for video) on YouTube. It doesn't currently handle remotely stored ELAN, FLEx, or json files.

Files on Box

Box is a cloud storage service, similar to Dropbox. It's sometimes used for linguistic materials, including for ELAN files and videos. CLA uses it for all of their digital materials. We can get LingView to display videos that are hosted on Box, but this will take a few weeks of work, and some of LingView's text-sync and link-to-timestamp features might be impossible with Box videos; I'm not sure.

Specifically, LingView's text-sync and link-to-timestamp features are possible with Box IF all of the following conditions are met:

  • When a Box video is embedded in a webpage, js scripts on the page can play, pause, and jump to arbitrary times in the Box video.
  • When a Box video is embedded in a webpage, js scripts on the page can know the current time in the video. Box might allow this by sending playerTimeChanged events that a script can listen for, or there might be a currentPlayerTime property or a getPlayerTime() function that can be called by the script.
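Taken together, the two conditions amount to a small player interface: something that can seek and something that can report the current time. The sketch below shows how text-sync and link-to-timestamp could sit on top of such an interface. Everything in it is hypothetical: `seek`, `getCurrentTime`, `FakePlayer`, and the `{ start, end }` annotation shape are assumptions for illustration, and whether Box exposes anything equivalent is exactly the open question.

```javascript
// Binary search for the annotation whose time span contains `seconds`.
// Assumes annotations are sorted by start time and do not overlap.
// Returns the index, or -1 if `seconds` falls in a gap.
function findActiveIndex(annotations, seconds) {
  let lo = 0;
  let hi = annotations.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const span = annotations[mid];
    if (seconds < span.start) {
      hi = mid - 1;
    } else if (seconds >= span.end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}

// In-memory stand-in for an embedded player. A real adapter would wrap
// whatever controls the embedded Box player offers, if any.
class FakePlayer {
  constructor() { this.time = 0; }
  seek(seconds) { this.time = seconds; }         // condition 1: jump to a time
  getCurrentTime() { return this.time; }         // condition 2: read the time
}

// Link-to-timestamp: jump the player to the start of a given annotation.
function jumpToAnnotation(player, annotations, index) {
  player.seek(annotations[index].start);
}

// Text-sync: which annotation should be highlighted right now?
function currentAnnotation(player, annotations) {
  return findActiveIndex(annotations, player.getCurrentTime());
}

const annotations = [
  { start: 0, end: 2.5 },
  { start: 2.5, end: 4 },
  { start: 6, end: 9 },
];
const player = new FakePlayer();
jumpToAnnotation(player, annotations, 2);
console.log(currentAnnotation(player, annotations)); // 2
```

If only the first condition holds, link-to-timestamp still works but text-sync does not, since the page cannot learn where playback currently is.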

I'm pretty sure we can get LingView to work with ELAN and FLEx files that are hosted on Box, even without downloading them to the server where LingView's server-side js is being run. This is probably just a few days of work.

It may even be possible to store LingView's json and bundle.js files on Box. We can add this feature if it's possible to programmatically create/upload files to Box. This is probably about a week of work.

Detecting updates

From time to time, the source files displayed by a LingView site may change. For example, there may be additional translations added in an ELAN file, or a video file might be renamed, or else converted to a different format. LingView must be rebuilt (npm run quick-build) in order to display the changes. Until it is rebuilt, it will either display the old version, or if audio/video file names have changed, it will fail to display the file.

The normal recommended workflow is that whoever updates the ELAN file, video, etc. should then immediately run the rebuild command.

For LingView sites hosted on GitHub, the administrator may choose to enable GitHub Actions, which will automatically rebuild the site whenever any file in the repository is changed. However, if there are video files hosted outside of GitHub (as for the CofanALDP repo), a manual rebuild is still needed when a video file's name is changed.

For CLA, a different approach may be appropriate. For example, LingView could automatically rebuild every hour to ensure the site is never more than one hour out of date. This could be implemented with a minor code change. It may also be possible to detect when source files are updated and rebuild immediately, although this is a more difficult engineering challenge and the strategy would depend on how the source files are stored.
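A periodic check could avoid needless rebuilds by comparing a snapshot of the source files against the previous one. The sketch below is one possible shape for that, not existing LingView code: a snapshot is assumed to be a plain object mapping file path to a modification stamp (a timestamp or content hash), and `changedFiles`, `rebuildIfNeeded`, and the `runBuild` callback are hypothetical names.

```javascript
// Returns the sorted list of paths that were added, modified, or removed
// between two snapshots of the form { path: stamp }.
function changedFiles(previous, current) {
  const changed = [];
  for (const [path, stamp] of Object.entries(current)) {
    if (previous[path] !== stamp) {
      changed.push(path); // new or modified
    }
  }
  for (const path of Object.keys(previous)) {
    if (!(path in current)) {
      changed.push(path); // deleted or renamed away
    }
  }
  return changed.sort();
}

// Calls runBuild only when something changed; returns the changed paths.
function rebuildIfNeeded(previous, current, runBuild) {
  const changed = changedFiles(previous, current);
  if (changed.length > 0) {
    runBuild(changed);
  }
  return changed;
}

// Example with made-up paths and stamps: one file edited, one video renamed.
const before = { 'story1.eaf': 100, 'story1.mp4': 50 };
const after = { 'story1.eaf': 120, 'story1_final.mp4': 50 };
let buildCount = 0;
const changed = rebuildIfNeeded(before, after, () => { buildCount += 1; });
console.log(changed); // [ 'story1.eaf', 'story1.mp4', 'story1_final.mp4' ]
```

In a real deployment, `runBuild` would presumably invoke `npm run quick-build`, and the check could run on a timer such as the hourly schedule suggested above. How the snapshot is obtained depends on where the source files live, which is the harder part.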

Update 2021-02-16: Since CLA is an archive, updates will be quite rare, so this is less of an issue.
