Questions about training data composition and data processing pipeline #16

@study-overflow

Hi, thank you for your great work on lingbot-world — it’s an impressive and inspiring project.

I’m particularly interested in understanding more about the training data used in the project. I fully understand that some details may not be publicly shareable, but I would greatly appreciate it if you could clarify any of the following points that you are comfortable discussing:

  1. Data availability

    • Is there any plan to release the training data, or a subset / example version of it, either now or in the future?
  2. Data processing pipeline

    • Are there plans to open-source any of the code related to data processing, such as data collection, annotation, filtering, or preprocessing?
  3. Data composition
    If possible, could you share high-level information about how the data is composed? For example:

    • The rough proportion of real-world data versus synthetic / simulated / game-based data
    • Which public datasets were used, if any
    • The types or categories of games / simulation environments involved (at a high level)
  4. Data scale and statistics

    • The approximate scale of the dataset (e.g., number of videos)
    • A rough distribution of video lengths (e.g., short clips vs. long trajectories)

I believe this information would be extremely valuable for researchers and practitioners who are interested in understanding the design choices behind the project, as well as for those attempting to build upon or reproduce similar systems.

Thanks again for your excellent work, and I really appreciate your time and effort in maintaining this project!

Best regards
