mufassirkazi/docflow

Built with Pochi by TabbyML

⚡ Completely built with Pochi


DocFlow

DocFlow is a CLI tool that fetches documents from Lark (Feishu), Notion, or Google Docs, converts them into a structured AST, uploads all embedded images/assets to S3 (or local disk), and outputs production-ready .mdx files — ready to drop into any Next.js, Docusaurus, or MDX-powered docs site.

┌─────────────┐     fetch      ┌─────────┐    upload    ┌──────────┐    write    ┌──────────┐
│ Lark/Notion │ ─────────────► │   AST   │ ───────────► │  S3 / FS │ ──────────► │  .mdx    │
│ Google Docs │                │ (typed) │              │ (assets) │             │ (output) │
└─────────────┘                └─────────┘              └──────────┘             └──────────┘
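
The flow in the diagram can be sketched in a few lines. This is a simplified illustration rather than DocFlow's actual code: the node shapes, `rewriteImages`, `toMdx`, and the synchronous `upload` callback are hypothetical names chosen for this example, and the real AST types in src/core/ast.ts may look different.

```typescript
// Hypothetical, simplified AST. The real node types in src/core/ast.ts may differ.
type AstNode =
  | { type: "heading"; level: number; text: string }
  | { type: "paragraph"; text: string }
  | { type: "image"; src: string; alt: string };

// Replace every image source with the URL returned by the (assumed) upload step.
function rewriteImages(nodes: AstNode[], upload: (src: string) => string): AstNode[] {
  return nodes.map((node) =>
    node.type === "image" ? { ...node, src: upload(node.src) } : node
  );
}

// Serialize the simplified AST to Markdown/MDX text, one block per node.
function toMdx(nodes: AstNode[]): string {
  return nodes
    .map((node) => {
      switch (node.type) {
        case "heading":
          return `${"#".repeat(node.level)} ${node.text}`;
        case "paragraph":
          return node.text;
        case "image":
          return `![${node.alt}](${node.src})`;
      }
    })
    .join("\n\n");
}
```

Keeping the upload step as a pure transformation over the AST means the serializer never sees temporary source URLs, which matches the fetch → upload → write order shown above.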

Features

  • 3 source adapters — Lark (Feishu), Notion, and Google Docs
  • 2 asset backends — AWS S3 (and S3-compatible: R2, MinIO) or local disk
  • Full MDX output — frontmatter, headings, code blocks, tables, callouts, images
  • Concurrent asset uploads — configurable parallelism with retries
  • Git integration — optional auto-commit after publish
  • Dry-run mode — inspect the parsed AST without writing any files
  • Env-variable overrides — no secrets ever need to be hardcoded in config
  • No system dependencies beyond Node.js (runs on any machine with Node 18+; npm packages are installed by npm install)
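
To illustrate what "configurable parallelism with retries" can mean in practice, here is a minimal sketch of a bounded worker pool. The function name `mapWithConcurrency` and its parameters are assumptions made for this example; DocFlow's AssetManager may be implemented differently.

```typescript
// Run `task` over `items` with at most `limit` tasks in flight,
// retrying each failed item up to `retries` additional times.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  retries: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim the next item; safe because JS runs one callback at a time
      for (let attempt = 0; ; attempt++) {
        try {
          results[index] = await task(items[index]);
          break;
        } catch (err) {
          if (attempt >= retries) throw err; // give up after the last retry
        }
      }
    }
  }

  // Start at least one worker so an empty input still resolves cleanly.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With `limit` set to the configured `concurrency` value, results come back in input order even though uploads finish out of order.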

Installation

# Clone and install
git clone https://github.com/YOUR_USERNAME/docflow.git
cd docflow
npm install

# Build
npm run build

# Link globally (optional)
npm link

Or use directly via npx once published:

npx docflow fetch <docId>

Quick Start

1. Copy the example config:

cp docflow.config.yaml my-project/docflow.config.yaml

2. Fill in your credentials (or set environment variables — see below):

adapter: lark

lark:
  appId: "YOUR_LARK_APP_ID"
  appSecret: "YOUR_LARK_APP_SECRET"

assets:
  backend: s3
  s3:
    bucket: "my-docs-bucket"
    region: "us-east-1"

output:
  dir: "./content/docs"

3. Fetch a document:

docflow fetch <your-lark-doc-id>

That's it — your .mdx file is written to ./content/docs/ with all images uploaded and replaced with permanent S3 URLs.


Configuration

DocFlow is configured via docflow.config.yaml in your project root. Every value can also be overridden with environment variables (see Environment Variables).

Lark (Feishu)

Create a custom app in the Lark Open Platform (open.feishu.cn):

  1. Go to My Apps → Create App
  2. Enable the Docs API permissions: docx:document:readonly, drive:drive:readonly
  3. Copy your App ID and App Secret

lark:
  appId: "YOUR_LARK_APP_ID"           # or env DOCFLOW_LARK_APP_ID
  appSecret: "YOUR_LARK_APP_SECRET"   # or env DOCFLOW_LARK_APP_SECRET

The document must be shared with the app (or the app must have tenant-wide read access).
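
For context, Lark custom apps exchange the App ID and App Secret for a short-lived tenant access token before calling the Docs API. The sketch below shows that exchange against the Lark Open Platform endpoint; the helper names `buildTenantTokenRequest` and `fetchTenantToken` are hypothetical, and this README does not show how DocFlow's Lark adapter performs the step internally.

```typescript
interface TokenRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the HTTP request that trades app credentials for a tenant access token.
function buildTenantTokenRequest(appId: string, appSecret: string): TokenRequest {
  return {
    url: "https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json; charset=utf-8" },
      body: JSON.stringify({ app_id: appId, app_secret: appSecret }),
    },
  };
}

// Send the request using the global fetch available in Node 18+.
async function fetchTenantToken(appId: string, appSecret: string): Promise<string> {
  const { url, init } = buildTenantTokenRequest(appId, appSecret);
  const res = await fetch(url, init);
  const data = (await res.json()) as {
    code: number;
    msg: string;
    tenant_access_token?: string;
  };
  // Lark signals success with code 0; anything else carries an error message.
  if (data.code !== 0 || !data.tenant_access_token) {
    throw new Error(`Lark auth failed: ${data.msg}`);
  }
  return data.tenant_access_token;
}
```

A failure at this step usually points to a wrong App ID or App Secret rather than to document sharing, which is checked later when the document itself is requested.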


Notion

Create an internal integration at notion.so/my-integrations:

  1. Click New Integration, pick a name, select your workspace
  2. Copy the Integration Token (starts with secret_..., or ntn_... for newer integrations)
  3. Share each page/database with your integration via the Share menu in Notion

notion:
  apiKey: "secret_..."   # or env NOTION_API_KEY

Google Docs

Uses a Service Account (no OAuth, no user login required):

  1. In Google Cloud Console, create a service account
  2. Give it no project roles (it just needs Docs API access)
  3. Download the JSON key file
  4. Enable the Google Docs API and Google Drive API in your project
  5. Share your document with the service account's client_email

googleDocs:
  serviceAccountPath: "./service-account.json"  # or env GOOGLE_APPLICATION_CREDENTIALS

⚠️ Never commit your service-account.json to git. It is already in .gitignore.


Asset Storage

S3 (recommended for production)

assets:
  backend: s3
  concurrency: 5          # max parallel uploads

  s3:
    bucket: "YOUR_S3_BUCKET_NAME"
    region: "us-east-1"
    keyPrefix: "docs/"                  # optional prefix for all keys
    publicBaseUrl: "https://cdn.example.com"   # optional CDN base URL
    # endpoint: "https://..."           # optional: R2, MinIO, etc.

AWS credentials are picked up from environment variables (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) or from the standard AWS credential chain (IAM role, ~/.aws/credentials, etc.).
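
The way `keyPrefix` and `publicBaseUrl` could combine into the final image URL is sketched below. The function `publicUrlFor` and the exact joining rules are assumptions for illustration; the default URL shape follows the virtual-hosted-style S3 address used in the Output Format example.

```typescript
interface S3UrlOptions {
  bucket: string;
  region: string;
  keyPrefix?: string;
  publicBaseUrl?: string;
}

// Compute the public URL for an uploaded asset (hypothetical helper).
function publicUrlFor(filename: string, opts: S3UrlOptions): string {
  // Object key = optional prefix + filename, with each path segment URL-encoded.
  const key = `${opts.keyPrefix ?? ""}${filename}`
    .split("/")
    .map(encodeURIComponent)
    .join("/");
  // Prefer the CDN base URL when configured; otherwise fall back to the bucket URL.
  const base = opts.publicBaseUrl
    ? opts.publicBaseUrl.replace(/\/+$/, "")
    : `https://${opts.bucket}.s3.${opts.region}.amazonaws.com`;
  return `${base}/${key}`;
}
```

S3-compatible services configured through `endpoint` (R2, MinIO) generally use different host layouts, so a `publicBaseUrl` is likely needed there.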

Local Disk (for development)

assets:
  backend: local

  local:
    outputDir: "./public/assets"
    publicBaseUrl: "/assets"

Git Integration

DocFlow can auto-commit the output .mdx files after publishing:

git:
  autoCommit: false
  commitMessage: "docs: publish {{title}} ({{date}})"

Or pass --commit at the CLI to commit a specific run without changing the config.
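
The `{{title}}` and `{{date}}` placeholders in `commitMessage` suggest simple template substitution. A minimal sketch follows, assuming unknown placeholders are left untouched; the function name `renderCommitMessage` and the date format are hypothetical.

```typescript
// Substitute {{name}} placeholders with values from `vars`.
// Placeholders without a matching value are left as-is.
function renderCommitMessage(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match
  );
}
```

For example, the default template rendered with title "My Document Title" and date "2026-01-15" yields `docs: publish My Document Title (2026-01-15)`.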


CLI Reference

Usage: docflow [command] [options]

Commands:
  fetch <docId>    Fetch a document and publish it as MDX
  adapters         List registered source adapters
  backends         List registered asset backends

Options for `fetch`:
  --config <path>    Path to docflow.config.yaml (default: ./docflow.config.yaml)
  --adapter <name>   Source adapter: lark | notion | google-docs
  --backend <name>   Asset backend: s3 | local
  --output <dir>     Output directory for .mdx files
  --overwrite        Overwrite existing .mdx files (default: false)
  --commit           Auto-commit output files via git
  --dry-run          Parse and show summary without writing files or uploading assets

Examples

# Fetch a Lark doc
docflow fetch ABC123XYZ

# Dry-run: inspect the parsed AST without any side-effects
docflow fetch ABC123XYZ --dry-run

# Use a different config file
docflow fetch ABC123XYZ --config ./configs/prod.yaml

# Override adapter and backend on the fly
docflow fetch ABC123XYZ --adapter notion --backend local --output ./out

# Fetch and auto-commit the result
docflow fetch ABC123XYZ --commit

# List what adapters are registered
docflow adapters

# List asset backends
docflow backends

Environment Variables

All credentials can (and should) be provided via environment variables instead of the config file:

| Variable | Description |
|---|---|
| DOCFLOW_LARK_APP_ID | Lark app ID |
| DOCFLOW_LARK_APP_SECRET | Lark app secret |
| NOTION_API_KEY | Notion internal integration token |
| GOOGLE_APPLICATION_CREDENTIALS | Path to Google service account JSON |
| AWS_ACCESS_KEY_ID | AWS access key ID |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| DOCFLOW_S3_BUCKET | S3 bucket name (overrides config) |
| DOCFLOW_S3_REGION | S3 region (overrides config) |
| DOCFLOW_S3_KEY_PREFIX | S3 key prefix (overrides config) |
| DOCFLOW_S3_PUBLIC_BASE_URL | S3 / CDN public base URL (overrides config) |

Example .env file (never commit this):

DOCFLOW_LARK_APP_ID=cli_xxxxxxxxxxxx
DOCFLOW_LARK_APP_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
DOCFLOW_S3_BUCKET=my-docs-assets
DOCFLOW_S3_REGION=us-east-1
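
The "overrides config" behavior can be pictured as a merge in which a set environment variable wins over the YAML value. This is a sketch under that assumption; `applyS3EnvOverrides` is a hypothetical name, and the actual loader in src/core/config.ts may differ (for example, in how it treats empty strings).

```typescript
interface S3Config {
  bucket?: string;
  region?: string;
  keyPrefix?: string;
  publicBaseUrl?: string;
}

// Return a copy of `config` where each DOCFLOW_S3_* variable, if set and non-empty,
// replaces the corresponding value from the config file.
function applyS3EnvOverrides(
  config: S3Config,
  env: Record<string, string | undefined>
): S3Config {
  return {
    ...config,
    bucket: env.DOCFLOW_S3_BUCKET || config.bucket,
    region: env.DOCFLOW_S3_REGION || config.region,
    keyPrefix: env.DOCFLOW_S3_KEY_PREFIX || config.keyPrefix,
    publicBaseUrl: env.DOCFLOW_S3_PUBLIC_BASE_URL || config.publicBaseUrl,
  };
}
```

In a Node.js process the second argument would typically be `process.env`; passing it explicitly keeps the function easy to test.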

Output Format

Each fetched document produces a single .mdx file:

---
title: My Document Title
slug: my-document-title
date: '2026-01-15'
---

Content paragraph here...

## Heading 2

![Image alt](https://your-bucket.s3.us-east-1.amazonaws.com/docs/image-abc123.png)

| Column A | Column B |
|---|---|
| Value 1 | Value 2 |

  • Frontmatter: title, slug, date
  • Headings — H1–H6
  • Text formatting — bold, italic, strikethrough, inline code
  • Code blocks — with language detection
  • Images — uploaded to your asset backend; URLs replaced with permanent links
  • Tables — GFM markdown tables
  • Callouts / quotes — blockquote format
  • Lists — ordered and unordered, nested
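
The relationship between `title` and `slug` in the frontmatter above can be sketched as follows. Both `slugify` and `renderFrontmatter` are hypothetical helpers written for this example, and the slug rules (lowercase, ASCII, hyphen-separated) are an assumption inferred from the sample output.

```typescript
// Derive a URL-friendly slug from a document title.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining accent marks
    .replace(/[^a-z0-9]+/g, "-") // collapse everything else into single hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Render a minimal frontmatter block. Simplification: the title is not YAML-escaped,
// so titles containing ":" or "#" would need quoting in real use.
function renderFrontmatter(title: string, date: string): string {
  return [
    "---",
    `title: ${title}`,
    `slug: ${slugify(title)}`,
    `date: '${date}'`,
    "---",
  ].join("\n");
}
```

A production serializer would more likely delegate to a YAML library so that special characters in titles are escaped correctly.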

Architecture

src/
├── cli.ts                      # CLI entry point (Commander.js)
├── core/
│   ├── config.ts               # YAML config loader + env overrides
│   ├── pipeline.ts             # Main orchestration: fetch → upload → publish
│   ├── registry.ts             # Plugin registry for adapters/backends/publishers
│   └── ast.ts                  # Shared AST node types
├── adapters/
│   ├── ISourceAdapter.ts       # Adapter interface
│   ├── lark/                   # Lark (Feishu) adapter
│   ├── notion/                 # Notion adapter
│   └── google-docs/            # Google Docs adapter
├── assets/
│   ├── AssetManager.ts         # Concurrent upload orchestration
│   ├── IAssetBackend.ts        # Backend interface
│   └── backends/
│       ├── S3Backend.ts        # AWS S3 / S3-compatible
│       └── LocalDiskBackend.ts # Local filesystem
├── publishers/
│   ├── IPublisher.ts           # Publisher interface
│   └── MdxPublisher.ts         # MDX serializer
└── git/
    └── GitIntegration.ts       # simple-git auto-commit

Adding a new adapter

Implement ISourceAdapter and register it in cli.ts:

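import type { ISourceAdapter, DocflowDocument } from "./adapters/ISourceAdapter.js";

class MyAdapter implements ISourceAdapter {
  info = { name: "my-adapter", version: "1.0.0" };
  // Obtain credentials for the source API
  async authenticate() { /* ... */ }
  // Fetch the document and convert it to the shared AST
  async fetchDocument(id: string): Promise<DocflowDocument> {
    throw new Error(`fetchDocument(${id}) is not implemented yet`);
  }
  // Map a raw asset URL to a downloadable one (pass-through here)
  async resolveAssetUrl(rawUrl: string): Promise<string> { return rawUrl; }
}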

Adding a new asset backend

Implement IAssetBackend:

import type { IAssetBackend } from "./assets/IAssetBackend.js";

class MyBackend implements IAssetBackend {
  // Upload the asset and return its public URL
  async upload(buffer: Buffer, filename: string, mimeType: string): Promise<string> {
    throw new Error(`upload(${filename}) is not implemented yet`);
  }
}

Contributing

Contributions are welcome! Please open an issue first to discuss larger changes.

# Install dependencies
npm install

# Build
npm run build

# Watch mode during development
npm run build:watch

# Run directly from source (no build needed)
npm run dev -- fetch <docId>

License

ISC License — see LICENSE for details.


