---
name: pdf-to-word
description: Convert PDF files to editable Word documents (.docx). Preserves layout, images, and keeps text editable. Handles CMYK, ICCBased, and broken colorspace images. Ideal for translation workflows.
user-invocable: true
allowed-tools: Bash, Read, Write, Glob, Grep
argument-hint: [path-to-pdf-or-folder]
---
# PDF to Editable Word Document Converter
Convert PDF files into editable Word documents where text remains as real text (not drawings/images). Images and layout are preserved.
## Requirements
Install dependencies if not already present:
```bash
pip3 install pymupdf==1.24.14 pdf2docx python-docx Pillow
```
**Important:** pdf2docx requires pymupdf < 1.25 for compatibility.
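The version constraint can be checked before converting. The following is a minimal sketch using only the standard library; `is_supported_pymupdf` and `installed_pymupdf_ok` are hypothetical helper names, and the check assumes a plain `major.minor.patch` version string.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_pymupdf(version_string: str) -> bool:
    """Return True when the version is below 1.25 (the limit stated above)."""
    # Assumption: the string starts with numeric "major.minor" components.
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) < (1, 25)


def installed_pymupdf_ok() -> bool:
    """Check the installed pymupdf distribution; False if it is missing."""
    try:
        return is_supported_pymupdf(version("pymupdf"))
    except PackageNotFoundError:
        return False
```

If `installed_pymupdf_ok()` returns `False`, re-running the `pip3 install` command above should pin a compatible version.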
## Conversion Steps
1. **Identify the input**: Check if `$ARGUMENTS` is a single PDF file, a folder of PDFs, or a zip file containing PDFs.
2. **If zip file**: Extract it first to a temporary directory.
3. **Analyze PDFs**: Open each PDF with pymupdf and check for text extractability and image colorspaces.
4. **Apply the colorspace patch**: Before running pdf2docx, patch `ImagesExtractor.py` to handle CMYK/ICCBased/None colorspace images. The patch touches three methods:
- `_to_raw_dict()` - wraps `image.tobytes()` in try/except, falls back to PIL conversion
- `_pixmap_to_cv_image()` - same PIL fallback for opencv conversion
- `_recover_pixmap()` - changes CMYK detection from string match to `pix.n == 4`, adds PIL handling for None colorspace
See [colorspace-patch.md](colorspace-patch.md) for the exact patch code.
5. **Convert with pdf2docx**:
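```python
from pdf2docx import Converter

# pdf_path is the input PDF; output_path is the destination .docx
cv = Converter(pdf_path)
try:
    cv.convert(output_path)
finally:
    cv.close()  # release the PDF even if conversion fails
```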
6. **Output**: Save `.docx` files to a `converted_word_docs/` directory next to the input.
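Steps 1, 2, and 6 can be sketched with the standard library alone. `collect_pdfs` and `output_path_for` are hypothetical helper names, not part of pdf2docx. The sketch assumes that "next to the input" means the parent directory of the file or folder that was passed in, and that folder inputs only need a non-recursive match on lowercase `.pdf` names.

```python
import tempfile
import zipfile
from pathlib import Path


def collect_pdfs(argument: str) -> list[Path]:
    """Return the PDFs named by a file, folder, or zip archive path."""
    source = Path(argument)
    if source.is_dir():
        # Assumption: only top-level, lowercase *.pdf files are wanted.
        return sorted(source.glob("*.pdf"))
    if source.suffix.lower() == ".zip":
        # Extract to a temporary directory; the caller should remove it later.
        extract_dir = Path(tempfile.mkdtemp(prefix="pdf_to_word_"))
        with zipfile.ZipFile(source) as archive:
            archive.extractall(extract_dir)
        return sorted(extract_dir.rglob("*.pdf"))
    if source.suffix.lower() == ".pdf":
        return [source]
    return []


def output_path_for(pdf: Path, original_input: Path) -> Path:
    """Place the .docx in converted_word_docs/ beside the original input."""
    out_dir = original_input.parent / "converted_word_docs"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / (pdf.stem + ".docx")
```

Each path returned by `collect_pdfs` can then be handed to the `Converter` call from step 5, with `output_path_for(pdf, Path(argument))` as the destination.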
## Colorspace Patch Details
The patch is needed because many professionally designed PDFs use:
- **ICCBased CMYK** profiles (e.g., "U.S. Web Coated (SWOP) v2") - pymupdf can't convert these to PNG directly
- **None colorspace** with n=1 (grayscale images with broken metadata)
The fix: catch `ValueError`/`RuntimeError` from `pixmap.tobytes()` and fall back to PIL:
- n=1 -> PIL "L" mode (grayscale)
- n=3 -> PIL "RGB" mode
- n=4 -> PIL "CMYK" mode, then convert to RGB
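The fallback described above might look roughly like the sketch below. It is not the patch itself (that lives in colorspace-patch.md); `pil_mode_for`, `samples_to_png`, and `pixmap_to_png` are hypothetical names, and the code assumes the pixmap has no alpha channel, so that `n` alone identifies the colorspace.

```python
import io

# Channel count -> PIL mode, following the mapping listed above.
PIL_MODES = {1: "L", 3: "RGB", 4: "CMYK"}


def pil_mode_for(n: int) -> str:
    """Return the PIL mode for a pixmap with n channels (no alpha assumed)."""
    try:
        return PIL_MODES[n]
    except KeyError:
        raise ValueError(f"unsupported channel count: {n}") from None


def samples_to_png(samples: bytes, width: int, height: int, n: int) -> bytes:
    """Encode raw pixel samples as PNG through PIL, converting CMYK to RGB."""
    from PIL import Image  # imported lazily so pil_mode_for needs no Pillow

    image = Image.frombytes(pil_mode_for(n), (width, height), samples)
    if image.mode == "CMYK":
        image = image.convert("RGB")  # PNG cannot store CMYK data
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def pixmap_to_png(pixmap) -> bytes:
    """Try pymupdf's own PNG encoder first, then fall back to PIL."""
    try:
        return pixmap.tobytes()
    except (ValueError, RuntimeError):
        return samples_to_png(
            pixmap.samples, pixmap.width, pixmap.height, pixmap.n
        )
```

PIL's plain `convert("RGB")` does not apply the embedded ICC profile, so colors may shift slightly compared with a color-managed conversion.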
## File Locations
The patch must be applied to:
```
{site-packages}/pdf2docx/image/ImagesExtractor.py
```
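Find the package directory with the command below. It prints the path to the package's `__init__.py`; `ImagesExtractor.py` is in the `image/` subdirectory of that same directory.
```bash
python3 -c "import pdf2docx; print(pdf2docx.__file__)"
```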
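The same lookup can be done from Python without importing the package. This is a small sketch; `find_images_extractor` is a hypothetical helper, and it only builds the expected path from the layout shown above without checking that the file exists.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_images_extractor(package: str = "pdf2docx") -> Optional[Path]:
    """Return the expected ImagesExtractor.py path, or None if not found."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent / "image" / "ImagesExtractor.py"
```

Calling `find_images_extractor()` in the environment where pdf2docx is installed gives the file to patch; a `None` result means the package is not importable there.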
## Notes
- pdf2docx produces the best layout-preserving results for designed/illustrated PDFs
- File sizes will be large for image-heavy PDFs (the original images are embedded)
- All text in the output is editable and selectable - ready for translation
- The converter handles: text extraction, image preservation, CMYK conversion, heading detection, font styling