Document · zuwiizu · Free

pdf-to-word

Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.

View on GitHub: github.com/zuwiizu/pdf-to-word

§ 01 — Stats

Stars: 1

Prior: 1788

Quality: 72.0

Score: n/a

Tasks: n/a

§ 02 — Install

Get pdf-to-word.

Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.

One-line install · Claude Code

$ npx versuz@latest install pdf-to-word

Or clone the repo

$ git clone https://github.com/zuwiizu/pdf-to-word.git

Or copy the skill folder manually

$ cp -r pdf-to-word/ ~/.claude/skills/pdf-to-word/

§ 05 — Challenge

Think you can beat it?

$ npx versuz challenge pdf-to-word

Quality 72.0 · The file provides a comprehensive guide to converting PDF files to Word documents, but repetitive content weakens its clarity and structure; more concise language and better organization would improve it. Overall, it remains a useful resource for this specific conversion task.

SKILL.md content (~767 tokens)

---
name: pdf-to-word
description: Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.
user-invocable: true
allowed-tools: Bash, Read, Write, Glob, Grep
argument-hint: [path-to-pdf-or-folder]
---

# PDF to Editable Word Document Converter

Convert PDF files into editable Word documents where text remains as real text (not drawings/images). Images and layout are preserved.

## Requirements

Install dependencies if not already present:

```bash
pip3 install pymupdf==1.24.14 pdf2docx python-docx Pillow
```

**Important:** pdf2docx requires pymupdf < 1.25 for compatibility.
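
The version pin can be checked before converting. The following is a minimal sketch that uses only the standard library; the helper names `is_compatible` and `installed_pymupdf_version` are illustrative and not part of pymupdf or pdf2docx, and the 1.25 ceiling is taken from the note above.

```python
import re
from importlib import metadata


def is_compatible(version, ceiling=(1, 25)):
    """Return True if a version string such as '1.24.14' is below `ceiling`."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)  # tolerate suffixes such as '25rc1'
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts) < ceiling


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is missing."""
    try:
        return metadata.version("PyMuPDF")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pymupdf_version()
    if found is None:
        print("pymupdf is not installed")
    elif is_compatible(found):
        print(f"pymupdf {found} is below 1.25")
    else:
        print(f"pymupdf {found} is too new for pdf2docx; reinstall the pinned version")
```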

## Conversion Steps

1. **Identify the input**: Check if `$ARGUMENTS` is a single PDF file, a folder of PDFs, or a zip file containing PDFs.

2. **If zip file**: Extract it first to a temporary directory.

3. **Analyze PDFs**: Open each PDF with pymupdf and check for text extractability and image colorspaces.

4. **Apply the colorspace patch**: Before running pdf2docx, patch `ImagesExtractor.py` to handle images with CMYK, ICCBased, or None colorspaces. The patch adds a PIL fallback to three methods:
   - `_to_raw_dict()` - wraps `image.tobytes()` in try/except and falls back to PIL conversion
   - `_pixmap_to_cv_image()` - applies the same PIL fallback to the OpenCV conversion
   - `_recover_pixmap()` - changes CMYK detection from a string match to `pix.n == 4` and adds PIL handling for a None colorspace

   See [colorspace-patch.md](colorspace-patch.md) for the exact patch code.

5. **Convert with pdf2docx**:
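```python
from pdf2docx import Converter

# pdf_path and output_path come from the earlier steps
cv = Converter(pdf_path)
try:
    cv.convert(output_path)  # writes the .docx file
finally:
    cv.close()  # release the PDF even if conversion fails
```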

6. **Output**: Save `.docx` files to a `converted_word_docs/` directory next to the input.
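
Steps 1, 2, and 6 can be sketched with the standard library alone. The function names `collect_pdfs` and `output_path_for` are hypothetical, and two choices here are assumptions rather than anything the skill specifies: folders and extracted archives are searched recursively, and for a folder input `converted_word_docs/` is placed inside that folder (for a file or zip input it sits beside the file).

```python
import tempfile
import zipfile
from pathlib import Path


def _pdfs_under(folder):
    """Return every .pdf file below `folder`, sorted for a stable order."""
    return sorted(p for p in folder.rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")


def collect_pdfs(argument):
    """Resolve a PDF file, a folder, or a zip archive into a list of PDF paths."""
    src = Path(argument)
    if src.is_dir():
        return _pdfs_under(src)
    if src.is_file() and src.suffix.lower() == ".zip":
        # Step 2: extract the archive to a temporary directory first.
        tmp = Path(tempfile.mkdtemp(prefix="pdf_to_word_"))
        with zipfile.ZipFile(src) as archive:
            archive.extractall(tmp)
        return _pdfs_under(tmp)
    if src.is_file() and src.suffix.lower() == ".pdf":
        return [src]
    raise ValueError(f"not a PDF, folder, or zip: {src}")


def output_path_for(pdf, argument):
    """Return the .docx path inside converted_word_docs/ for one PDF."""
    src = Path(argument)
    base = src if src.is_dir() else src.parent  # assumption, see the text above
    return base / "converted_word_docs" / (Path(pdf).stem + ".docx")
```

The caller is expected to create the output directory (for example with `path.parent.mkdir(parents=True, exist_ok=True)`) before passing the path to pdf2docx. Two PDFs with the same file name in different subfolders would map to the same output path in this sketch.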

## Colorspace Patch Details

The patch is needed because many professionally designed PDFs use:
- **ICCBased CMYK** profiles (e.g., "U.S. Web Coated (SWOP) v2") - pymupdf can't convert these to PNG directly
- **None colorspace** with n=1 (grayscale images with broken metadata)
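
To see whether a given PDF contains either kind of image, it can be scanned before conversion. This is a rough sketch, assuming the pinned pymupdf is installed and importable as `fitz`; `classify_image` and `scan_pdf` are hypothetical helpers, and the labels they return are made up for illustration.

```python
def classify_image(n, alpha, colorspace_name):
    """Label an image by the colorspace cases described above."""
    if colorspace_name is None:
        return "no-colorspace"  # broken metadata, e.g. n=1 grayscale
    if n - alpha == 4:
        return "cmyk"  # four color channels once alpha is excluded
    return "ok"


def scan_pdf(pdf_path):
    """Return (has_text, labels) for one PDF; requires pymupdf."""
    import fitz  # pymupdf, imported lazily so classify_image works without it

    labels = []
    has_text = False
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            if page.get_text().strip():
                has_text = True  # at least one page has extractable text
            for info in page.get_images(full=True):
                pix = fitz.Pixmap(doc, info[0])  # info[0] is the image xref
                cs = pix.colorspace
                name = cs.name if cs else None
                labels.append(classify_image(pix.n, int(pix.alpha), name))
    finally:
        doc.close()
    return has_text, labels
```

A result with any label other than `"ok"` suggests the PDF will hit the code paths that the patch is meant to cover.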

The fix: catch `ValueError`/`RuntimeError` from `pixmap.tobytes()` and fall back to PIL:
- n=1 -> PIL "L" mode (grayscale)
- n=3 -> PIL "RGB" mode
- n=4 -> PIL "CMYK" mode, then convert to RGB
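
The fallback described above can be sketched as a standalone function. This is not the code from colorspace-patch.md; it is an illustration of the same idea, assuming Pillow is installed and that the pixmap has no alpha channel (so `n` counts color channels only). The names `pil_mode_for`, `samples_to_png`, and `pixmap_to_png` are hypothetical.

```python
import io

PIL_MODES = {1: "L", 3: "RGB", 4: "CMYK"}


def pil_mode_for(n):
    """Map a channel count to the PIL mode listed above."""
    if n not in PIL_MODES:
        raise ValueError(f"unsupported channel count: {n}")
    return PIL_MODES[n]


def samples_to_png(samples, width, height, n):
    """Build PNG bytes from raw pixel samples using PIL."""
    from PIL import Image  # Pillow, imported lazily

    image = Image.frombytes(pil_mode_for(n), (width, height), samples)
    if image.mode == "CMYK":
        image = image.convert("RGB")  # PNG cannot store CMYK
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def pixmap_to_png(pix):
    """Try pymupdf's own encoder first, then fall back to PIL."""
    try:
        return pix.tobytes()  # pymupdf's default output format is PNG
    except (ValueError, RuntimeError):
        return samples_to_png(pix.samples, pix.width, pix.height, pix.n)
```

Keeping pymupdf's encoder as the first attempt means images with ordinary colorspaces take the same path as before, and only the failing ones go through PIL.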

## File Locations

The patch must be applied to:
```
{site-packages}/pdf2docx/image/ImagesExtractor.py
```

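Find the package location with the command below. It prints the path of the package's `__init__.py`; the `image/` directory sits in the same folder.
```bash
python3 -c "import pdf2docx; print(pdf2docx.__file__)"
```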
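
The full path to the file can also be built from Python without importing the package. This is a small sketch using `importlib.util.find_spec`; the helper name `extractor_path` is hypothetical, and it only constructs the expected path without checking that the file exists.

```python
import importlib.util
from pathlib import Path


def extractor_path(package="pdf2docx"):
    """Return the expected ImagesExtractor.py path, or None if the package is missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent / "image" / "ImagesExtractor.py"


if __name__ == "__main__":
    path = extractor_path()
    print(path if path is not None else "pdf2docx is not installed")
```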

## Notes

- pdf2docx is used because it preserves layout well for designed/illustrated PDFs
- File sizes can be large for image-heavy PDFs (the original images are embedded)
- Text that is extractable from the PDF stays editable and selectable in the output, ready for translation
- The converter handles: text extraction, image preservation, CMYK conversion, heading detection, font styling
