Merged
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -38,7 +38,7 @@ source venv/bin/activate

# 3. Download experiment data (optional; some experiments do not need it)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
cd ../..
```

2 changes: 1 addition & 1 deletion docs/guide/quick-start.md
@@ -27,7 +27,7 @@ source venv/bin/activate

# 3. Download experiment data (optional; some experiments do not need it)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
cd ../..
```

2 changes: 1 addition & 1 deletion en/ROADMAP.md
@@ -39,7 +39,7 @@ source venv/bin/activate

# 3. Download experiment data (optional, some experiments do not need it)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
cd ../..
```

2 changes: 1 addition & 1 deletion en/docs/guide/quick-start.md
@@ -20,7 +20,7 @@ source venv/bin/activate

# 3. Download experiment datasets (optional)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
cd ../..
```

2 changes: 1 addition & 1 deletion en/modules/01-foundation/index.md
@@ -125,7 +125,7 @@ For each module:

A: Check the following:
1. Did you activate the virtual environment? `source venv/bin/activate`
-2. Did you download data? `cd modules/common && python datasets.py --download-all`
+2. Did you download data? `cd modules/common && python data_sources.py --download-all`
3. Are you in the correct folder? Experiments must run inside `experiments/`

**Q: Experiments are too slow?**
31 changes: 27 additions & 4 deletions en/modules/index.md
@@ -58,6 +58,23 @@ _(Planned)_

---

## 📋 System Requirements

### Python Version
- **Recommended**: Python 3.10+
- **Minimum**: Python 3.10

Some utility code uses Python 3.10+ union type syntax (e.g., `str | list`). Earlier versions will not work.

### Dependencies
```bash
pip install torch requests datasets matplotlib numpy
```

See: [Environment Setup Guide](../docs/guide/environment-setup.md)

---

## ⚡ Quick Start

### Environment setup
@@ -68,7 +85,7 @@ source venv/bin/activate

# 2. Download experiment data (~60 MB)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
```

### 30-minute quick experience
@@ -170,10 +187,10 @@ Each design choice answers:

Shared tools live in `modules/common/`:

-### datasets.py - Dataset manager
+### data_sources.py - Dataset manager

```python
-from modules.common.datasets import get_experiment_data
+from modules.common.data_sources import get_experiment_data

# TinyShakespeare
text = get_experiment_data('shakespeare')
@@ -204,7 +221,13 @@ from modules.common.visualization import (
)
```

-See docstrings in each file for details.
+See docstrings in each file or [`modules/common/README.md`](../modules/common/README.md) for details.

#### ⚠️ Migration Notice

**2026-02**: `datasets.py` has been renamed to `data_sources.py` to avoid a naming conflict with the HuggingFace `datasets` library.

For a detailed migration guide, see [modules/common/README.md](../modules/common/README.md) or [PR #20](https://github.com/joyehuang/minimind-notes/pull/20).

---

2 changes: 1 addition & 1 deletion modules/01-foundation/README.md
@@ -125,7 +125,7 @@ keywords: Transformer building blocks, normalization, positional encoding, attention mechanism, 前

A: Check the following:
1. Did you activate the virtual environment? `source venv/bin/activate`
-2. Did you download the data? `cd modules/common && python datasets.py --download-all`
+2. Did you download the data? `cd modules/common && python data_sources.py --download-all`
3. Are you in the correct directory? Experiments must be run inside `experiments/`

**Q: What if the experiments are too slow?**
52 changes: 48 additions & 4 deletions modules/README.md
@@ -58,6 +58,23 @@ _(Planned)_

---

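## 📋 System Requirements

### Python Version
- **Recommended**: Python 3.10+
- **Minimum**: Python 3.10

Some utility code uses Python 3.10+ type annotation syntax (e.g., `str | list`) and will not run on earlier versions.

### Installing Dependencies
```bash
pip install torch requests datasets matplotlib numpy
```

See: [Environment Setup Guide](../docs/guide/environment-setup.md)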

---

## ⚡ Quick Start

### Prepare the environment
@@ -68,7 +85,7 @@ source venv/bin/activate

# 2. Download experiment data (about 60 MB)
cd modules/common
-python datasets.py --download-all
+python data_sources.py --download-all
```

### 30-minute quick experience
@@ -170,10 +187,10 @@ python exp_xxx.py --help

The modules provide the following shared tools (in `modules/common/`):

-### datasets.py - Dataset management
+### data_sources.py - Dataset management

```python
-from modules.common.datasets import get_experiment_data
+from modules.common.data_sources import get_experiment_data

# Get TinyShakespeare
text = get_experiment_data('shakespeare')
@@ -204,7 +221,34 @@ from modules.common.visualization import (
)
```

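-See the docstrings in each file for detailed documentation.
+See the docstrings in each file or [`modules/common/README.md`](./common/README.md) for detailed documentation.

#### ⚠️ Migration Notice

**2026-02**: `datasets.py` has been renamed to `data_sources.py`.

If your code uses the old import:
```python
# Old code (raises an error)
from modules.common.datasets import get_experiment_data
```

Update it to:
```python
# New code
from modules.common.data_sources import get_experiment_data
```

Command-line usage also needs updating:
```bash
# Old command
python datasets.py --download-all

# New command
python data_sources.py --download-all
```

**Reason for the change**: To avoid a naming conflict with the HuggingFace `datasets` library. See the [common utilities documentation](./common/README.md#重要变更说明) for details.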

---

122 changes: 122 additions & 0 deletions modules/common/README.md
@@ -0,0 +1,122 @@
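# Common Utilities

This directory contains utility code shared by all experiment modules.

## 📋 System Requirements

### Python Version
- **Required**: Python 3.10+

**Note**: The code uses Python 3.10+ union type syntax (`str | list`) and will not run on earlier versions.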

### Dependencies
- `torch` - PyTorch deep learning framework
- `requests` - HTTP requests (used for downloading data)
- `datasets` - HuggingFace datasets library (used for downloading TinyStories)

To install:
```bash
pip install torch requests datasets
```

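## 📦 Available Tools

### data_sources.py - Dataset management

Provides a unified interface to experiment data, supporting:
- TinyShakespeare (classic character-level data, 1 MB)
- TinyStories (modern English, with subset selection)
- Synthetic data (for visualization experiments)

**Usage example**: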
```python
from modules.common.data_sources import get_experiment_data

# Get TinyShakespeare
text = get_experiment_data('shakespeare')

# Get a TinyStories subset (10 MB)
texts = get_experiment_data('tinystories', size_mb=10)

# Generate synthetic data
text = get_experiment_data('synthetic', size_mb=1)
```

**Command-line usage**:
```bash
cd modules/common

# Download all datasets
python data_sources.py --download-all

# Test a single dataset
python data_sources.py --dataset shakespeare
```

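### experiment_base.py - Experiment base class

Provides a unified experiment framework, including:
- Automatic device detection (CPU/MPS/CUDA)
- Saving results (plots + metrics)
- Progress display
- Reproducibility (fixed random seeds)

**Usage example**: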
```python
from modules.common.experiment_base import Experiment

class MyExperiment(Experiment):
    def __init__(self):
        super().__init__(
            name="my_experiment",
            output_dir="experiments/results"
        )

    def run(self):
        # Your experiment code goes here
        metrics = {'accuracy': 0.95}
        self.print_metrics(metrics)
        self.save_metrics(metrics)

exp = MyExperiment()
exp.run()
```
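
The automatic device detection (CPU/MPS/CUDA) listed among the framework's features could look roughly like the sketch below. This is an assumption about the general approach, not the code in `experiment_base.py`; `detect_device` is a hypothetical helper name, and the function falls back to `"cpu"` when PyTorch is not installed.

```python
def detect_device() -> str:
    """Pick the best available device name: "cuda", then "mps", then "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # The MPS backend only exists in PyTorch builds that support Apple Silicon.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(f"Using device: {detect_device()}")
```

Checking CUDA before MPS is a design choice in this sketch; the two backends do not normally coexist on one machine, so the order rarely matters in practice.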

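### visualization.py - Visualization tools

Provides commonly used visualization functions.

**Note**: This file has not been created yet; it is planned for a later module.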

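## ⚠️ Important Change Notice

### datasets.py has been renamed to data_sources.py (2026-02)

**Reason**: To avoid a naming conflict with the HuggingFace `datasets` library, which caused TinyStories downloads to fail.

**Background**: When resolving imports, Python searches the running script's directory first (the current directory when you run `python datasets.py` from `modules/common`). If a local `datasets.py` exists there, `from datasets import load_dataset` imports the local file instead of the HuggingFace library.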

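**How to migrate**:

| Old code | New code |
|--------|--------|
| `from modules.common.datasets import ...` | `from modules.common.data_sources import ...` |
| `python datasets.py --download-all` | `python data_sources.py --download-all` |

**Notes**:
- `datasets.py` has been deleted entirely (it no longer exists in the repository)
- Using the old import raises a standard `ModuleNotFoundError`
- All official documentation and experiment code have been updated to the new file name
- The old file can still be traced in Git history via `git log -- modules/common/datasets.py`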
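
For scripts that need to run against checkouts from both before and after the rename, one possible approach is to try the new module name first and fall back to the old one. This is only a sketch under that assumption; `import_first` is a hypothetical helper and is not part of the repository.

```python
import importlib
from types import ModuleType


def import_first(*module_names: str) -> ModuleType:
    """Import and return the first module in `module_names` that can be found."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError:
            continue  # this candidate is missing; try the next one
    raise ModuleNotFoundError(
        f"none of these modules could be imported: {module_names}"
    )


# Intended use inside the repository (new name first, old name as a fallback):
# data_module = import_first("modules.common.data_sources", "modules.common.datasets")
# get_experiment_data = data_module.get_experiment_data
```

Note that this sketch also skips a candidate whose own dependencies are missing, so updating the import as shown in the table above remains the simpler fix.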

**Related information**:
- Issue tracking: GitHub Issue #19
- Detailed discussion: GitHub Pull Request #20

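## 📝 Contributing

When adding a new tool, please:
1. Describe the tool in this README
2. Add a clear docstring at the top of the file
3. Provide a usage example
4. Make sure the tool is general-purpose and reusable across multiple modules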