A: Maintainability and adoption #3

ml-evs · 2021-12-21T15:42:27Z

ml-evs
Dec 21, 2021
Maintainer

This thread is for discussing, finalizing and organizing a breakout with the theme in the title above.

Maintainability and adoption:

How can we make sure that this project will not die and will be used?

What can we learn from OPTIMADE or similar efforts (maybe ORD, AnIML)?

What made the CIF a success story?

What do we think about models such as the one of the Allotrope Foundation?

Should there be software user facilities, analogous to the other user facilities that universities run?

Centralization vs. collaborative custodianship (i.e., in collaborative projects [BIG-MAP, NCCR, …] do we aim for one centralized data lake/store or do we only have a data registry?)

How do we solve the collective action problem that everyone sees the need for high-quality data, but no one wants to start sharing their data in re-usable form?

Do we need journals/editors to buy in?

Do we need changes in education?

kjappelbaum · 2021-12-23T09:40:40Z

kjappelbaum
Dec 23, 2021
Maintainer

A quite fuzzy vision, inspired by https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html: "Can we manage scientific data as open source software?"
Can we arrive at a point where I can make a "merge request" to some dataset, or fork it to build on top of it, while the tooling keeps track of the "credit record" on the fly and Google Scholar also shows this as a form of output?
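
To make the fork/merge/credit idea a bit more concrete, here is a minimal sketch in Python. Everything in it (the `Dataset` and `Change` classes, the method names) is hypothetical and only illustrates the bookkeeping; a real system would presumably identify changes by content hash rather than by author and message.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    """One contribution to a dataset (hypothetical, illustration only)."""
    author: str
    message: str


@dataclass
class Dataset:
    name: str
    history: list = field(default_factory=list)

    def commit(self, author, message):
        self.history.append(Change(author, message))

    def fork(self, new_name):
        # A fork starts with a copy of the full history, so credit is preserved.
        return Dataset(new_name, list(self.history))

    def merge(self, other):
        # The "merge request": append only the changes not already present.
        seen = set(self.history)
        self.history.extend(c for c in other.history if c not in seen)

    def credit_record(self):
        # Number of recorded changes per author.
        return Counter(change.author for change in self.history)


base = Dataset("base")
base.commit("alice", "initial measurements")
fork = base.fork("fork")
fork.commit("bob", "add ten new samples")
base.merge(fork)
credit = base.credit_record()
```

After the merge, `credit` lists one change each for `alice` and `bob`, which is the kind of record a citation index could in principle consume.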

1 reply

ml-evs Dec 23, 2021
Maintainer Author

There is certainly a discussion to be had about designing for dynamic/shifting schemas, the technicalities of semantic versioning, "actually semantic" versioning (i.e. an ontology of versioning that goes beyond that for software; this was discussed briefly at OMDI 2021), and version negotiation. There are some nice notes from Ink & Switch on API versioning, decentralized schemas and lenses into different versions: https://www.inkandswitch.com/cambria/

Seems to me that this combines all of the worst topics in software engineering: dependency management, versioning, database migrations...
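
As a rough illustration of the "lenses into different versions" idea, the sketch below expresses a schema change as a pair of functions that translate a record forwards and backwards. It is not the Cambria API; `rename_lens`, `add_field_lens`, `compose` and the field names are all made up for this example.

```python
def rename_lens(old, new):
    """Lens for a field rename: returns (forward, backward) functions."""
    def forward(doc):
        out = dict(doc)
        if old in out:
            out[new] = out.pop(old)
        return out

    def backward(doc):
        out = dict(doc)
        if new in out:
            out[old] = out.pop(new)
        return out

    return forward, backward


def add_field_lens(name, default):
    """Lens for a newly added field: fill in a default going forwards, drop it going back."""
    def forward(doc):
        out = dict(doc)
        out.setdefault(name, default)
        return out

    def backward(doc):
        out = dict(doc)
        out.pop(name, None)
        return out

    return forward, backward


def compose(*lenses):
    """Chain lenses; the backward direction applies them in reverse order."""
    def forward(doc):
        for fwd, _ in lenses:
            doc = fwd(doc)
        return doc

    def backward(doc):
        for _, bwd in reversed(lenses):
            doc = bwd(doc)
        return doc

    return forward, backward


# Hypothetical v1 -> v2 migration of a measurement record.
to_v2, to_v1 = compose(
    rename_lens("temp", "temperature_K"),
    add_field_lens("units_checked", False),
)
v1_record = {"id": "s1", "temp": 300}
v2_record = to_v2(v1_record)
round_trip = to_v1(v2_record)
```

The point of the design is that old and new clients can keep their own view of the data, with the translation kept in one place instead of in every reader.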

kjappelbaum · 2021-12-23T12:04:02Z

kjappelbaum
Dec 23, 2021
Maintainer

What are the incentives for maintaining something like this? I have had a couple of proposals killed because "it is not science". How can we, as a community, fix the incentive structure for the developers and maintainers of data infrastructure?

0 replies

ml-evs · 2022-01-17T11:47:40Z

ml-evs
Jan 17, 2022
Maintainer Author

On the adoption front, training/education plays a big part. What standards/guidelines/technologies/tools would you teach new undergrads today?

0 replies

ml-evs · 2022-01-17T18:50:36Z

ml-evs
Jan 17, 2022
Maintainer Author

Also: what are the minimum technical meta-requirements for a standard/ontology/schema/data model to be adopted?

Semantic versioning
Detailed changelog
Machine-readable specification
Tooling for validation: both local and remote (e.g., the OPTIMADE validator from optimade-python-tools, which can be run locally, as a GitHub action, or automated across all providers at https://optimade.org/providers-dashboard)
Ideally, open source with transparent decision making

I think these could apply to things we typically think of as standards, but also simply datasets/databases published on the web.
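
A small sketch of how three of the items above (semantic versioning, a machine-readable specification, local validation) could fit together. The spec format, field names and compatibility rule are assumptions made for illustration; this is not the OPTIMADE validator or any existing tool.

```python
import re

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

# Hypothetical machine-readable specification (could equally live in a JSON file).
SPEC = {
    "version": "1.2.0",
    "fields": {
        "id": {"type": "string", "required": True},
        "temperature_K": {"type": "number", "required": True},
        "comment": {"type": "string", "required": False},
    },
}

TYPE_MAP = {"string": str, "number": (int, float)}


def parse_version(text):
    match = SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_compatible(client, provider):
    """Assumed rule: same MAJOR, and the provider's MINOR is at least the client's.

    This ignores pre-release tags and the special status of 0.x versions.
    """
    c_major, c_minor, _ = parse_version(client)
    p_major, p_minor, _ = parse_version(provider)
    return c_major == p_major and p_minor >= c_minor


def validate(record, spec):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, rule in spec["fields"].items():
        if name not in record:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], TYPE_MAP[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
    for name in record:
        if name not in spec["fields"]:
            errors.append(f"unknown field: {name}")
    return errors


good = validate({"id": "s1", "temperature_K": 300.0}, SPEC)
bad = validate({"id": "s1", "pressure": 1.0}, SPEC)
```

The same `validate` function could be run locally, in CI, or against a remote provider, which is the "local and remote" property listed above.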

Does adoption mean something different for software, specifications and data?

0 replies

helgestein · 2022-02-03T06:22:03Z

helgestein
Feb 3, 2022

I guess that for experimental data this is a large challenge, but some groups have done something to get there. There is (or used to be) HTE-MEAD, then there is HTEM and others. At my university we have something called KADI4MAT. Most ELNs are append-only, which would make them amenable to something like merge requests etc. What I find more important though, and what could be a future vision, is that you absolutely need to store the analysis and plotting code alongside your analysed (experimental) data. The gold standard would probably be a versioned Docker container of some micro data-analysis and plotting service with which you can just rerun the analysis of old data.
The bigger challenge with all of this is funding, credit and making a career. You need a dedicated person/team for this that breaks the 3-4y PhD cycle.
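
One lightweight way to approximate "store the analysis code alongside the data", short of a full container workflow, is a manifest that pins data files, code files and the container image tag together. The sketch below uses only the Python standard library; the file names and the image name are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, code_files, container_image):
    """Record which data, which code and which image belong together."""
    return {
        "container_image": container_image,
        "data": {str(p): sha256_of(p) for p in data_files},
        "code": {str(p): sha256_of(p) for p in code_files},
    }


def verify_manifest(manifest):
    """Return the paths that are missing or whose contents have changed."""
    changed = []
    for section in ("data", "code"):
        for path, expected in manifest[section].items():
            if not Path(path).exists() or sha256_of(path) != expected:
                changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as workdir:
    data = Path(workdir) / "measurement.csv"
    code = Path(workdir) / "analyse.py"
    data.write_text("time_s,current_A\n0,0.1\n")
    code.write_text("print('plot goes here')\n")

    # The image name is a placeholder, not a real registry entry.
    manifest = build_manifest([data], [code], "example.org/analysis-service:1.0.0")
    manifest_json = json.dumps(manifest, indent=2)

    unchanged = verify_manifest(manifest)   # nothing edited yet
    data.write_text("time_s,current_A\n0,0.2\n")
    changed = verify_manifest(manifest)     # the data file now differs
```

Such a manifest does not replace a schema for the data itself, but it at least makes silent drift between data and analysis code detectable.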

2 replies

kjappelbaum Feb 3, 2022
Maintainer

We will also have a talk from the KADI4MAT developers! 👍🏽

Most ELN are append only which would make them amenable for something like merge requests etc.

Indeed, this is a huge challenge. And many commercial ones will also lock you into their ecosystem.

What I find more important though and what could be a future vision is that you absolutely need to store the analysis and plotting code alongside your (experimental) analysed data. Gold

And in this context, how does one deal with legacy software/tooling that is too hard to rewrite as open source tools? How do we capture those steps in a kind of "provenance graph of the sample"?

Bigger challenge with all of this is: funding, credit and making a career. You need a dedicated person/team for this that breaks the 3-4y PhD cycle.

In Germany you're lucky with the NFDIs ;) but even there, what happens after the funding period? This links perhaps a bit to #3 (comment): would tooling that keeps track of the "credit record" and spits out a list of citations you can put into your paper help (given that Google Scholar at some point hopefully indexes citations for datasets and code)?

ml-evs Feb 3, 2022
Maintainer Author

Gold standard would probably be a versioned docker container of some micro data analysis and plotting service with that you can just rerun the analysis of old data.

This is something we wrote as part of our (@kjappelbaum, me, et al.) BIG-MAP stakeholder project: what tooling would be required to make this step easy/possible for non-experts? As this becomes quite technical, what technologies can we rely on for this kind of archival work? We'd obviously also like to encourage that the data itself is provided in a sensible (or even machine-actionable) way, independent of the code used for analysis. Does providing a Docker image without any additional schema/standards really add much more than the classic zip file with a README? Perhaps one for the hacking session too...

kjappelbaum · 2022-02-05T13:27:10Z

kjappelbaum
Feb 5, 2022
Maintainer

Might be a bit too philosophical, but I really like how Nielsen phrases the problem as a collective action problem.

The way to overcome this according to Olson is:

small groups: social pressure
big players in the group
external force: incentives, making it compulsory

It also seems intuitive that these would be the key ingredients for something that works in the sciences.

An interesting tangent, however, is how this plays with The Cathedral and the Bazaar.

Can we come up with the key elements of good governance for such data infrastructure in the sciences? Completely disorganized does not work, but completely centralized does not work either.

0 replies

ml-evs · 2022-02-07T15:05:43Z

ml-evs
Feb 7, 2022
Maintainer Author

From @jschrier's keynote:

We must demonstrate value for rank-and-file experimental chemists.

Carrot: case studies showing an increased rate of discovery/productivity

Stick: policies of journals, funders and user facilities

Do not be afraid of 80% solutions:

Technology activation barrier is the major challenge

What can a single user adopt that creates value?

Lightweight overlays on top of existing tools (e.g. spreadsheets) might be enough

Making onerous tasks easier might be a way to attract users

0 replies

kjappelbaum · 2022-02-07T15:22:47Z

kjappelbaum
Feb 7, 2022
Maintainer

From chat (Michael)

Why do you think it is easier for a chemist to learn programming than for a computer scientist to learn chemistry? For reusable code you need a good code structure, which is a competence of a computer scientist. But a computer scientist does not understand chemistry in its entirety. Only the needs of a chemist need to be understood. Motivation is more important.

2 replies

kjappelbaum Feb 7, 2022
Maintainer

from chat (Ulrich)

Because - also related to the current talk - the language of chemistry is much more complex and also full of "fuzzy" concepts than any programming language - you need experience to judge data and grasp in particular the borderline cases

kjappelbaum Feb 7, 2022
Maintainer

Michael

Yes, but you do not understand all the words if you write a program. You need to understand how to write a program to combine these words. The words need to be entered by a chemist, not the code. Think of a database system: the programmer does not even know that a chemist could use it, but there is already code for it. The same applies to ELN and RDM...

kjappelbaum · 2022-02-07T15:23:05Z

kjappelbaum
Feb 7, 2022
Maintainer

From Chat (Chloe)

Thank you for a great talk. Also arguably need to emphasise importance for more established PIs who won't ever have the time to learn and use new systems, but need to understand their value so that they allow their students to spend time on it which may mean less time 'at the bench'

1 reply

kjappelbaum Feb 7, 2022
Maintainer

from chat (Steven)

^ Totally agree with Chloe! In my experience there’s a great deal of ‘oral tradition’ in science where people tend to do things (such as in computational workflows) the way their mentoring grad student / postdoc taught them how to do. This would then emphasize the importance of people who train early career scientists being on board with doing things a certain way.

kjappelbaum · 2022-02-07T15:30:33Z

kjappelbaum
Feb 7, 2022
Maintainer

in chat (Kenneth)

(More relevant to the first talk) With my work with industry, while we always have some enthusiastic early adopters, the majority of material scientists in a given organization are very allergic to code. Those staff are also very important people to extract and structure knowledge from, because those are the folks that have developed sophisticated intuition for how to build systems, even if they cannot articulate it. Can you propose strategies for engaging with those colleagues?

0 replies

kjappelbaum · 2022-02-07T15:36:24Z

kjappelbaum
Feb 7, 2022
Maintainer

@shyamd from chat

What if we started using a publishing format that enabled/required machine-readable formats alongside their human-readable equivalents? For example, a picture would also have its CSV embedded, etc.

0 replies

giovannipizzi · 2022-02-09T10:48:09Z

giovannipizzi
Feb 9, 2022

As I just mentioned in the breakout session, I think an interesting model to look at is the one of 2i2c that is developing models to make sustainable the development and deployment of open-source infrastructure for interactive computing in research and education (mostly coming from the Jupyter ecosystem, but I think they are open to more).

Pinging here the director Chris Holdgraf, so he's aware of this workshop, and also he might be a person to be in contact with for further discussions: @choldgraf

1 reply

choldgraf Feb 10, 2022

This sounds like an awesome discussion that is near and dear to my heart :-)

Obviously I do not have any easy answers, but do have some experiments! 2i2c's basic assumption is that many organizations in research/education want access to open source interactive computing environments that are built with community-owned infrastructure, BUT that they don't want to manage cloud environments themselves. So by forming a non-profit that both respects the community's Right to Replicate their infrastructure and provides managed Jupyter distributions in the cloud, we will be able to meet a need of the research community without locking them in to a vendor-specific platform. Because we're a non-profit, we can also constrain ourselves to re-invest our resources back into the open source ecosystem.

I hope that this model of "the technology is free, open source, and community driven, but organizations may offer services around managing the technology that are paid, in order to sustain themselves" can be replicated to other fields / tech / etc as well. Effectively, 2i2c is an experiment in pooling the resources from many universities (in the form of "service contracts"), so that universities can pay for 10% of a cloud devops FTE, rather than 100% of one. 2i2c is an experiment in this direction, and I would love to see others in this space similarly experimenting.

Also wanna ping a collaborator and 2i2c co-founder Ryan Abernathey (@rabernat), who has shown a lot of leadership in the data+cloud space via the Pangeo project. He has a project called Pangeo Forge that feels relevant to the conversation here as well.

I have no idea if any of the above is useful or interesting, but I think it's a super important topic to protect this space from vertical stacks and vendor lock-in!
