---
name: stata-c-plugins
description: >-
  Develop high-performance C/C++ plugins for Stata using the stplugin.h SDK.
  Use when the user asks to create a Stata plugin, write C/C++ code for Stata,
  accelerate a Stata command with C, build cross-platform Stata plugins,
  or translate/port a Python or R package into Stata. Covers the full
  lifecycle: SDK setup, data flow, memory safety, .ado wrappers with
  preserve/merge, cross-platform compilation, performance optimization
  (pthreads, pre-sorted indices, XorShift RNG), debugging, and distribution
  via net install. Also includes a translation workflow for porting Python/R
  packages to Stata: wrapping existing C++ backends when available, or
  writing C from scratch when not.
---
# Stata C/C++ Plugin Development
Build high-performance C/C++ plugins for Stata. This skill covers the full lifecycle from SDK setup through cross-platform distribution, based on real experience building production Stata plugins for statistical imputation, random forests, string matching, and causal inference.
**This skill assumes macOS (Apple Silicon or Intel) as the development platform.** Build commands, cross-compilation workflows, and Docker instructions are all Mac-oriented. The plugins themselves target all four platforms (macOS ARM64, macOS x86_64, Linux x86_64, Windows x86_64), but the *development environment* is macOS. If you need to develop on Linux or Windows natively, adapt the compilation and Docker sections accordingly.
## How to Approach Every Task
**Before writing any code, enter plan mode.** A good plan covers:
1. **Complete inventory** — every feature, option, and component to build (for translation: exhaustive catalog of the source package's API)
2. **Architecture decisions** — wrap C++ backend vs. write C from scratch vs. pure Stata
3. **Relevant reference files** — identify up front which of this skill's reference files contain info you'll need, and cite them explicitly in the plan steps so they get loaded at the right time:
   - `references/translation_workflow.md`: full translation workflow, test repurposing, fidelity audit
   - `references/testing_strategy.md`: test layers, reference data generation, Layer 0 (repurpose original tests)
   - `references/performance_patterns.md`: pthreads, XorShift RNG, quickselect, pre-sorted indices
   - `references/packaging_and_help.md`: .toc/.pkg/.sthlp templates, build scripts
   - `references/cpp_plugins.md`: C++ wrapping, extern "C", exception safety, compilation
4. **Phase-by-phase steps** with dependencies between them
5. **For each step:** what gets built, what tests get written, and that the review loop runs before proceeding
6. **For translation projects:** a final fidelity audit as the last step (see `translation_workflow.md`)
**Implement sequentially across components, in parallel within each component.** Once an interface is defined, dispatch independent sub-tasks as parallel subagents (e.g., C plugin implementation, .ado wrapper, and test suite can run simultaneously). Merge their work, run the full test suite, then proceed to the review loop before moving to the next component.
**Run the review loop after every component:**
- Default: dispatch 2-3 review agents in parallel, ideally from different models (e.g., Claude + GPT + Gemini) for diversity of perspective. Use whatever multi-model tools are available in your environment.
- If only one model is available: dispatch 2-3 agents with different review focuses (correctness, completeness, architecture). Different prompts approximate the diversity of different models.
- Each agent reviews the diff, test results, and requirements — instruction: "List any gaps, bugs, or issues. Say LGTM if everything looks correct."
- Fix all issues raised, re-dispatch, loop until all agents say LGTM. Then proceed.
## Wrap First, Write From Scratch Second
**When translating a package, always check for an existing C/C++ backend before writing any algorithm code.** Many R packages have C++ in `src/`. Many Python packages have Cython or vendored C/C++ libraries. Standalone C++ libraries exist for string matching, linear algebra, tree algorithms, and more.
**If a C++ implementation exists, wrap it.** Do not reimplement the algorithm in C. Wrapping gives you identical output (same code path), production-grade performance, and a fraction of the code. The plugin is just a thin `extern "C"` glue layer between Stata's SDK and the library's API. Binary size is irrelevant — statically link everything (`-static-libstdc++ -static-libgcc`) and ship whatever size the binary turns out to be, even 10-15 MB on Windows. Users don't care about plugin file size; they care about correct results.
See `references/cpp_plugins.md` for the full pattern and `references/translation_workflow.md` for the workflow. Working examples of this approach (wrapping C++ backends, multi-plugin dispatching, save/load for scoring on new data) can be found in the repos listed in the project CLAUDE.md under "Example Applications."
For translation projects, also: repurpose the original package's test suite and data (see `references/testing_strategy.md` Layer 0), write additional Stata-specific tests, and end the plan with a multi-agent fidelity audit. See `references/translation_workflow.md` for the complete workflow.
## The Plugin SDK
Download `stplugin.h` and `stplugin.c` from: https://www.stata.com/plugins/
These two files define the interface between your C code and Stata:
| Function/Macro | Purpose |
|---------------|---------|
| `SF_vdata(var, obs, &val)` | Read variable value (1-indexed!) |
| `SF_vstore(var, obs, val)` | Write variable value (1-indexed!) |
| `SF_nobs()` | Number of observations in current dataset |
| `SF_nvar()` | Number of variables in the **entire dataset** (not just plugin call) |
| `SF_is_missing(val)` | Check for Stata missing value (`.`) |
| `SV_missval` | The missing value constant |
| `SF_display(msg)` | Print informational text in Stata |
| `SF_error(msg)` | Print red error text in Stata |
**Indexing is 1-based.** Both variable indices and observation indices start at 1, not 0. Off-by-one errors here are usually silent and catastrophic: an index that is off by one but still in range reads the wrong variable's data with no warning.
## Memory Safety
**A crash in your plugin kills the entire Stata session.** No save prompt, no recovery. The user loses all unsaved work. This is the single most important thing to internalize.
- Check every `malloc()`/`calloc()` return for `NULL`
- Validate `argc` before accessing `argv[]`
- Build with `-fsanitize=address` during development
- Test on small data first, scale up gradually
- Pre-allocate all memory upfront in `stata_call()`, free at the end
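The checklist above can be sketched as a small allocation helper with one cleanup path. This is a minimal sketch, not part of the Stata SDK: `alloc_doubles`, `Workspace`, `workspace_init`, and `workspace_free` are hypothetical names, and the layout assumes a row-major `nobs x p` feature matrix stored as plain `double` arrays.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of stplugin.h): allocate rows*cols zeroed
   doubles. Returns NULL on non-positive dimensions, size overflow, or
   allocation failure. */
static double *alloc_doubles(long rows, long cols) {
    if (rows <= 0 || cols <= 0) return NULL;
    size_t r = (size_t)rows, c = (size_t)cols;
    if (r > SIZE_MAX / sizeof(double) / c) return NULL;  /* r*c*8 would overflow */
    return calloc(r * c, sizeof(double));
}

/* Hypothetical workspace: every buffer the plugin needs, allocated upfront. */
typedef struct {
    double *X;     /* nobs x p features, row-major */
    double *y;     /* nobs outcomes */
    double *pred;  /* nobs predictions */
} Workspace;

/* Single cleanup path, safe to call on a partially built workspace. */
static void workspace_free(Workspace *w) {
    free(w->X);      /* free(NULL) is a no-op */
    free(w->y);
    free(w->pred);
    w->X = w->y = w->pred = NULL;
}

/* Returns 0 on success or 909 (the out-of-memory code used in this guide).
   On failure nothing is left allocated. */
static int workspace_init(Workspace *w, long nobs, long p) {
    w->X    = alloc_doubles(nobs, p);
    w->y    = alloc_doubles(nobs, 1);
    w->pred = alloc_doubles(nobs, 1);
    if (!w->X || !w->y || !w->pred) {
        workspace_free(w);
        return 909;
    }
    return 0;
}
```

Under these assumptions, `stata_call()` would call `workspace_init` once after parsing its arguments, return the code it reports on failure, and call `workspace_free` on every exit path.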
## The stata_call() Entry Point
Every plugin implements one function. **Plugins can also be written in C++** — the entry point just needs `extern "C"` linkage so Stata can find it; everything else can be full C++. The obvious case for C++ is when existing C++ code is available to wrap (e.g., an R package's `src/` directory). C++ also helps when you need complex data structures or threading via `std::thread`. For practical C++ guidance — the `extern "C"` pattern, exception safety, compilation commands, wrapping libraries — see `references/cpp_plugins.md`. The rest of this file focuses on C because it's the simpler default.
```c
#include <stdlib.h>
#include "stplugin.h"

// Your algorithm, defined elsewhere. Returns 0 on success.
int my_algorithm(const double *X, const double *y, double *pred,
                 int n_train, int n_test, int p, int seed);

// For C++ plugins, wrap the entry point with extern "C":
// extern "C" {
//     STDLL stata_call(int argc, char *argv[]) { ... }
// }
STDLL stata_call(int argc, char *argv[]) {
    // 0. Validate arguments BEFORE accessing argv[]
    if (argc < 4) {
        SF_error("myplugin requires 4 arguments: n_train n_test seed p\n");
        return 198;  // Stata's "syntax error" code
    }

    // 1. Parse arguments (all strings: use atoi/atof)
    int n_train = atoi(argv[0]);
    int n_test  = atoi(argv[1]);
    int seed    = atoi(argv[2]);

    // 2. Get dimensions
    ST_int nobs = SF_nobs();
    // CAUTION: SF_nvar() returns ALL variables in the dataset, not just
    // the ones passed to `plugin call`. If the .ado creates tempvars
    // (touse, merge_id, etc.) the count will be higher than expected.
    // Pass the variable count via argv instead of relying on SF_nvar().
    int p = atoi(argv[3]);  // safer: pass feature count explicitly
    if (nobs < 1 || p < 1) {
        SF_error("myplugin: need at least one observation and one feature\n");
        return 198;
    }

    // 3. Allocate memory (size_t arithmetic so nobs * p cannot overflow int)
    double *X    = calloc((size_t)nobs * (size_t)p, sizeof(double));
    double *y    = calloc((size_t)nobs, sizeof(double));
    double *pred = calloc((size_t)nobs, sizeof(double));
    if (!X || !y || !pred) {
        SF_error("myplugin: out of memory\n");
        free(X); free(y); free(pred);  // free(NULL) is a no-op
        return 909;
    }

    // 4. Read data from Stata (1-indexed!)
    ST_double val;
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vdata(1, obs, &val);              // var 1 = depvar
        y[obs - 1] = val;
        for (int j = 0; j < p; j++) {
            SF_vdata(j + 2, obs, &val);      // vars 2..p+1 = features
            X[(size_t)(obs - 1) * p + j] = val;
        }
    }

    // 5. Run your algorithm
    int rc = my_algorithm(X, y, pred, n_train, n_test, p, seed);
    if (rc != 0) {
        SF_error("myplugin: algorithm failed\n");
        free(X); free(y); free(pred);
        return 909;
    }

    // 6. Write results back to Stata
    for (ST_int obs = 1; obs <= nobs; obs++) {
        SF_vstore(p + 2, obs, pred[obs - 1]);  // var p+2 = output (last in the varlist)
    }

    free(X); free(y); free(pred);
    return 0;  // 0 = success
}
```
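The template leaves `my_algorithm` to the reader. As a purely illustrative stand-in (not the algorithm of any particular plugin), here is a minimal sketch of one function with a compatible signature: nearest-neighbour donor imputation. It assumes training rows come first and test rows second, that `X` is row-major, and that `seed` is unused because the method is deterministic.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical stand-in for my_algorithm(): 1-nearest-neighbour donor
   imputation. Rows 0..n_train-1 are donors (observed y); rows
   n_train..n_train+n_test-1 are recipients. X is row-major, n x p.
   Returns 0 on success, non-zero on invalid input. */
int my_algorithm(const double *X, const double *y, double *pred,
                 int n_train, int n_test, int p, int seed) {
    (void)seed;  /* deterministic method: no random numbers needed */
    if (!X || !y || !pred || n_train < 1 || n_test < 0 || p < 1) return 1;

    /* Donors keep their own observed outcome. */
    for (int i = 0; i < n_train; i++) pred[i] = y[i];

    /* Each recipient copies the outcome of its closest donor
       (squared Euclidean distance over the p features). */
    for (int t = n_train; t < n_train + n_test; t++) {
        const double *xt = X + (size_t)t * (size_t)p;
        double best = DBL_MAX;
        int best_i = 0;
        for (int i = 0; i < n_train; i++) {
            const double *xi = X + (size_t)i * (size_t)p;
            double d2 = 0.0;
            for (int j = 0; j < p; j++) {
                double diff = xt[j] - xi[j];
                d2 += diff * diff;
            }
            if (d2 < best) { best = d2; best_i = i; }
        }
        pred[t] = y[best_i];
    }
    return 0;
}
```

Because it touches no SDK functions, a function like this can be compiled and tested on its own before it is linked into the plugin.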
### Return Codes
- `0` — success
- `198` — syntax error (bad arguments)
- `909` — insufficient memory
- `601` — file not found
- Any non-zero triggers a Stata error
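Since every argument arrives as a string, a stricter parser than `atoi` can turn malformed input into a `198` instead of a silent zero. The following is a minimal sketch; `parse_int_arg` and `parse_int_args` are hypothetical helper names, not SDK functions.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse one argv[] string as a base-10 int.
   Returns 0 on success, 198 if the string is empty, has trailing junk,
   or is out of int range. atoi() gives no error signal in those cases. */
static int parse_int_arg(const char *s, int *out) {
    if (!s || *s == '\0') return 198;
    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE || *end != '\0' || v < INT_MIN || v > INT_MAX) return 198;
    *out = (int)v;
    return 0;
}

/* Hypothetical helper: check argc first, then parse the first n arguments. */
static int parse_int_args(int argc, char *argv[], int n, int *out) {
    if (argc < n) return 198;
    for (int i = 0; i < n; i++) {
        int rc = parse_int_arg(argv[i], &out[i]);
        if (rc) return rc;
    }
    return 0;
}
```

In a plugin, `stata_call()` could call `parse_int_args(argc, argv, 4, vals)` and return its result directly when it is non-zero, so the `.ado` wrapper sees a syntax error.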
## The .ado Wrapper Pattern
Users never call `plugin call` directly. An `.ado` file provides the Stata-native interface.
### The Preserve/Merge Pattern
This is the core pattern for plugins that operate on a subset of data:
```stata
program define mycommand, rclass
    syntax varlist(min=2) [if] [in], GENerate(name) [SEED(integer 12345) REPlace]
    gettoken depvar indepvars : varlist
    local p : word count `indepvars'
    if "`replace'" != "" {
        capture drop `generate'
    }
    confirm new variable `generate'

    // Mark sample: novarlist ALLOWS missing depvar (critical for imputation)
    marksample touse, novarlist
    markout `touse' `indepvars'   // but DO exclude missing predictors

    // Stable merge key: create BEFORE any sorting or subsetting
    tempvar merge_id
    quietly gen long `merge_id' = _n

    // Count subsets
    quietly count if `touse' & !missing(`depvar')
    local n_train = r(N)
    quietly count if `touse' & missing(`depvar')
    local n_test = r(N)

    // Create output variable (all missing initially)
    quietly gen double `generate' = .

    // Preserve, subset, call plugin
    preserve
    quietly keep if `touse'

    // Sort if plugin requires it (donors first, test second);
    // merge_id breaks ties so the row order is reproducible
    tempvar sort_order
    quietly gen byte `sort_order' = missing(`depvar')
    sort `sort_order' `merge_id'

    // Call plugin; argv[] = n_train, n_test, seed, p (predictor count)
    plugin call myplugin `depvar' `indepvars' `generate', ///
        `n_train' `n_test' `seed' `p'

    // Save results and restore
    tempfile results
    quietly keep `merge_id' `generate'
    quietly save `results'
    restore

    // Merge predictions back (update replaces missing with non-missing)
    quietly merge 1:1 `merge_id' using `results', nogenerate update
end
```
**Why `update` works:** The `generate` variable is all-missing before preserve. After restore, it's still all-missing. The `update` option replaces missing values with non-missing ones from the merge file. The `replace` option is handled earlier via `capture drop`, so by merge time the variable is always freshly created.
### Plugin Sorting Contract
**CRITICAL:** Some plugins expect data sorted a specific way (training rows first, test rows second). Others handle missing data internally. Sorting mismatches are among the most dangerous bugs — the plugin silently reads the wrong data, producing garbage output with no error message. A mismatched sort order can drop prediction quality dramatically (e.g., correlation going from 0.99 to 0.38) because the plugin treats test observations as training data and vice versa.
- If the plugin checks `SF_is_missing()` internally: do NOT sort in the .ado wrapper
- If the plugin expects `n_train` contiguous rows then `n_test` rows: sort by `missing(depvar)` before calling
Document which pattern your plugin uses.
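For plugins that rely on the contiguous-rows contract, a cheap defensive check on the C side can turn a silent mismatch into an explicit error. This is a minimal sketch under stated assumptions: `first_sort_violation` is a hypothetical helper, and the `is_missing` flags are assumed to have been filled on the main thread (for example from `SF_is_missing()` on each outcome value) before it is called.

```c
/* Hypothetical check: is_missing[i] is 1 when the outcome in row i
   (0-based) is missing. Returns -1 if the first n_train rows are all
   observed and the next n_test rows are all missing; otherwise returns
   the 0-based index of the first row that violates the contract. */
static long first_sort_violation(const unsigned char *is_missing,
                                 long n_train, long n_test) {
    for (long i = 0; i < n_train; i++)
        if (is_missing[i]) return i;        /* test row inside the training block */
    for (long i = n_train; i < n_train + n_test; i++)
        if (!is_missing[i]) return i;       /* training row inside the test block */
    return -1;
}
```

When the result is not `-1`, the plugin can report the offending row and return a non-zero code so the `.ado` wrapper aborts instead of returning garbage predictions.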
### Plugin Loading (Cross-Platform)
Use the **gtools-style OS detection pattern**. This detects the OS via `c(os)` and constructs a bare filename. The bare filename is resolved via Stata's adopath, which is reliable across all platforms.
```stata
/* ---- Load plugin (gtools-style: detect OS, bare filename) ---- */
if ( inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac") ) local c_os_ macosx
else local c_os_: di lower("`c(os)'")
cap program drop myplugin
program myplugin, plugin using("myplugin_`c_os_'.plugin")
```
This resolves to `myplugin_macosx.plugin`, `myplugin_windows.plugin`, or `myplugin_unix.plugin` depending on platform.
**WARNING — DO NOT use `findfile` + absolute paths.** The following pattern is BROKEN on Windows and must never be used:
```stata
* BROKEN — DO NOT USE
capture findfile myplugin.plugin
capture program myplugin, plugin using("`r(fn)'")
```
`findfile` returns an absolute path (e.g., `C:\ado\plus\m\myplugin.plugin`). On Windows, Stata's `LoadLibrary` call fails when given certain absolute paths via `using()`. The gtools-style pattern avoids this by passing a **bare filename** (no path), which Stata resolves via the adopath — exactly how gtools, ftools, and other major packages work.
Similarly, **do not use a nested if/else cascade** trying each `platform-arch` suffix. This was the old pattern in several packages and fails for the same reason if `findfile` is involved, plus it's fragile and verbose.
**Plugin file naming:** `pluginname_os.plugin` where `os` is one of `macosx`, `unix`, `windows`. Examples: `qrf_plugin_macosx.plugin`, `grf_plugin_windows.plugin`.
**Note:** `clear all` wipes loaded plugin definitions. If a test script starts with `clear all`, all `program ... plugin` definitions are gone. Reload them.
## Cross-Platform Compilation
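Build three plugin files, one each for macOS, Linux, and Windows. On macOS, note that Rosetta translates whole processes: a native Apple Silicon Stata cannot load an x86_64 plugin, and an Intel Stata cannot load an arm64 one. To cover both architectures with the single `_macosx` file, build a universal binary by passing both `-arch arm64 -arch x86_64` to Apple's compiler (`clang`, which `gcc` aliases by default on macOS). Install the Windows cross-compiler first: `brew install mingw-w64`.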
| Target OS | Output name suffix | Compiler | `-D` flag | Link flag | pthreads |
|-----------|-------------------|----------|-----------|-----------|----------|
| macOS (ARM64) | `_macosx` | `gcc -arch arm64` | `-DSYSTEM=APPLEMAC` | `-bundle` | `-pthread` |
| Linux (x86_64) | `_unix` | `gcc` | `-DSYSTEM=OPUNIX` | `-shared` | `-pthread` |
| Windows (x86_64) | `_windows` | `x86_64-w64-mingw32-gcc` | `-DSYSTEM=STWIN32` | `-shared` | `-lwinpthread` |
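All platforms: `-O3 -fPIC` for release. For development builds on macOS and Linux, add `-g -fsanitize=address`; MinGW GCC toolchains generally do not ship AddressSanitizer, so rely on the macOS and Linux builds for sanitizer coverage.
**For C++ plugins:** use `g++` instead of `gcc` (`x86_64-w64-mingw32-g++` for Windows). Pass `-std=` with the standard the library requires, for example `-std=c++17` (check its docs; C++11, C++14, and C++17 are all common). Header-only C++ libraries can be vendored into `c_source/` and included with `-I.`. Always use `-static-libstdc++ -static-libgcc` on Windows and Linux.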
Naming convention: `pluginname_os.plugin` (e.g., `qrf_plugin_macosx.plugin`, `grf_plugin_windows.plugin`). The `os` suffix must match what the gtools-style loader produces: `macosx`, `unix`, or `windows`.
macOS note: use `-bundle`, NOT `-shared`. This is a common mistake.
### Linux from macOS (Docker Required)
There is no native Linux cross-compiler on macOS. Use Docker via Colima (`brew install colima docker`, then `colima start`). Build with a one-liner:
```bash
docker run --rm --platform linux/amd64 -v "$(pwd):/build" -w /build ubuntu:18.04 \
bash -c "apt-get update -qq && apt-get install -y -qq g++ gcc make > /dev/null 2>&1 && make linux"
```
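**glibc compatibility:** Build on Ubuntu 18.04 (glibc 2.27) for maximum compatibility. A plugin built there often requires only GLIBC 2.14, which works on any Linux from ~2012+; the exact floor depends on which libc functions the code calls, so check it with `objdump -T myplugin_unix.plugin | grep GLIBC_`. Building on Ubuntu 22.04+ requires GLIBC 2.34, which excludes RHEL 8, Ubuntu 20.04, and many HPC environments.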
## Performance Optimization
See `references/performance_patterns.md` for detailed code examples of:
1. **Pre-sorted feature indices** — Sort feature values once, scan linearly at each tree node. O(n) per split instead of O(n log n).
2. **Precomputed distance norms** — Exploit ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a'b for KNN.
3. **Quickselect** — O(n) partial sort for finding k-th nearest neighbor.
4. **Parallel ensemble training (pthreads)** — Train multiple models concurrently. Each thread gets its own data copy and RNG state. **Never call Stata SDK functions (`SF_vdata`, `SF_vstore`, `SF_display`) from worker threads** — read all data on the main thread first, dispatch computation to workers, write results back on the main thread after joining.
5. **XorShift RNG** — C plugins cannot access Stata's internal RNG (`runiform()`). XorShift128+ is fast, statistically sound, and thread-safe (each thread gets its own state). Seed from `argv[]` for reproducibility.
6. **Dense arrays for trees** — Flat node arrays instead of linked lists for cache locality.
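To make item 5 concrete, here is a minimal sketch of an XorShift128+ generator with per-thread state. The type and function names (`xs128p_state`, `xs128p_seed`, `xs128p_next`, `xs128p_uniform`) are hypothetical, as is the seeding scheme: SplitMix64 expands the user seed and a stream id (for example the worker index) into the 128-bit state. That scheme yields distinct streams but is an assumption of this sketch, not a guarantee of non-overlapping sequences.

```c
#include <stdint.h>

/* Per-thread generator state: give each worker its own copy. */
typedef struct { uint64_t s[2]; } xs128p_state;

/* SplitMix64, used only to expand one integer seed into 128 bits of state. */
static uint64_t splitmix64(uint64_t *x) {
    uint64_t z = (*x += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Seed from the user-supplied seed plus a stream id (e.g. thread index).
   Mixing in the stream id this way is an illustrative choice. */
static void xs128p_seed(xs128p_state *st, uint64_t seed, uint64_t stream) {
    uint64_t x = seed ^ (stream * 0xD1B54A32D192ED03ULL);
    st->s[0] = splitmix64(&x);
    st->s[1] = splitmix64(&x);
    if (st->s[0] == 0 && st->s[1] == 0) st->s[0] = 1;  /* all-zero state is invalid */
}

/* One XorShift128+ step (shift constants 23, 18, 5). */
static uint64_t xs128p_next(xs128p_state *st) {
    uint64_t s1 = st->s[0];
    const uint64_t s0 = st->s[1];
    const uint64_t result = s0 + s1;
    st->s[0] = s0;
    s1 ^= s1 << 23;
    st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

/* Uniform double in [0, 1): keep the top 53 bits and scale by 2^-53. */
static double xs128p_uniform(xs128p_state *st) {
    return (double)(xs128p_next(st) >> 11) * (1.0 / 9007199254740992.0);
}
```

Each worker would call `xs128p_seed` once with the seed parsed from `argv[]` and its own stream id, then draw from its private state, so no locking is needed and a given seed reproduces the same draws.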
## Debugging
Debugging is hard because you can't attach a debugger to Stata's plugin host.
### Strategies
1. **Printf via SF_display():**
   ```c
   char buf[256];
   snprintf(buf, sizeof(buf), "Debug: n=%d, p=%d\n", n, p);
   SF_display(buf);
   ```
2. **Write diagnostic files:**
   ```c
   FILE *f = fopen("plugin_debug.log", "w");
   if (f) {  // fopen returns NULL on failure (e.g. unwritable directory)
       fprintf(f, "value at [%d][%d] = %f\n", i, j, val);
       fclose(f);
   }
   ```
3. **Test standalone first.** Write a `main()` that reads CSV and calls your algorithm. Debug with normal tools (gdb, valgrind, sanitizers). Then adapt for the plugin interface.
4. **Build with sanitizers during development:** `-g -fsanitize=address`
5. **Check SF_vdata() return values.** It returns `RC` (0=success). Non-zero means invalid obs/var index.
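Strategies 1 and 3 combine well if the algorithm logs through a small shim instead of calling `SF_display()` directly, so the same code runs inside Stata and in a standalone harness. The following is a minimal sketch; `debug_sink_fn`, `debug_set_sink`, `debugf`, `stdout_sink`, and `capture_sink` are all hypothetical names.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical logging shim: the sink decides where messages go. */
typedef void (*debug_sink_fn)(const char *msg);

static debug_sink_fn g_debug_sink = NULL;   /* NULL = logging disabled */

static void debug_set_sink(debug_sink_fn fn) { g_debug_sink = fn; }

/* printf-style logging. Output is truncated to the buffer size, never
   overflowed. Returns the length vsnprintf would have written in full. */
static int debugf(const char *fmt, ...) {
    char buf[256];
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    if (n >= 0 && g_debug_sink) g_debug_sink(buf);
    return n;
}

/* Sink for a standalone harness: print to stdout. */
static void stdout_sink(const char *msg) { fputs(msg, stdout); }

/* Sink for unit tests: remember the most recent message. */
static char g_last_msg[256];
static void capture_sink(const char *msg) {
    snprintf(g_last_msg, sizeof(g_last_msg), "%s", msg);
}
```

Inside the plugin, a one-line wrapper that forwards the message to `SF_display` can serve as the sink; in the standalone harness, `debug_set_sink(stdout_sink)` sends the same messages to the terminal.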
### Common Failure Modes
| Symptom | Likely Cause |
|---------|-------------|
| Stata crashes silently | Segfault: buffer overflow, bad argv access, NULL deref |
| Plugin returns all missing | Wrong variable count, wrong obs indexing, plugin not loaded |
| Results are garbage | Sorting mismatch, 0-vs-1 indexing error, unnormalized inputs |
| "plugin not found" | Wrong filename, `clear all` wiped definition, wrong platform |
| Works on Mac, fails on another platform | Integer size difference (for example, `long` is 32 bits on 64-bit Windows but 64 bits on macOS and Linux); use `int32_t`/`int64_t` from `<stdint.h>` |
## Packaging and Distribution
**Use platform-specific `.pkg` files** so users only download the binary for their OS. Stata's `net install` has no conditional logic, so the way to avoid shipping every platform's binary to every user is to offer separate packages per platform. All packages install the same `.ado` and `.sthlp` files; only the `.plugin` binary differs.
```
mypackage/
├── stata.toc # lists all package variants
├── mypackage.pkg # all platforms (for users who don't care)
├── mypackage_mac.pkg # macOS only
├── mypackage_linux.pkg # Linux only
├── mypackage_win.pkg # Windows only
├── mycommand.sthlp # overview help file (short name!)
├── mycommand.ado # user-facing command
├── myplugin_macosx.plugin
├── myplugin_unix.plugin
├── myplugin_windows.plugin
└── c_source/ # NOT distributed, for building
    ├── build.py
    ├── stplugin.c
    ├── stplugin.h
    └── algorithm.c
```
Users install their platform's package:
```stata
* macOS
net install mypackage_mac, from("https://raw.githubusercontent.com/user/repo/main") replace
* Linux
net install mypackage_linux, from("https://raw.githubusercontent.com/user/repo/main") replace
* Windows
net install mypackage_win, from("https://raw.githubusercontent.com/user/repo/main") replace
```
All platform binaries ship via the all-platform .pkg, or users can install platform-specific packages. Stata loads only the matching plugin at runtime via gtools-style OS detection. Windows C++ binaries can be 10-15MB due to static linking, which is normal.
See `references/packaging_and_help.md` for `.toc`, `.pkg`, `.sthlp` templates and SMCL formatting.
## Common Pitfalls
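1. **Sorting destroys row order.** After a sort inside `preserve`/`restore`, `_n` no longer identifies the original rows, so a merge key built from it afterwards links predictions to the wrong observations. Always create merge_id BEFORE preserve and before any sort.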
2. **1-indexed everything.** `SF_vdata(var, obs, &val)` — both var and obs start at 1. Off-by-one errors are silent.
3. **`marksample` excludes missing by default.** For imputation (where missing depvar IS the point), use `marksample touse, novarlist`.
4. **macOS `c(os)` returns "MacOSX".** Use the gtools pattern to detect Mac: ``inlist("`c(os)'", "MacOSX") | strpos("`c(machine_type)'", "Mac")``. For other platforms, `lower(c(os))` gives `"windows"` or `"unix"`.
5. **argv[] has no bounds checking.** Accessing `argv[3]` when `argc == 2` is a segfault. Always check `argc` first.
6. **`clear all` wipes plugins.** Reload plugin definitions after `clear all` in test scripts.
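7. **Stata finds a command by its filename.** Calling `mycommand` makes Stata load `mycommand.ado`; any other `program define` in that file is a helper local to the file and is not auto-discovered by name. Subprograms that must be callable on their own need their own .ado files or an explicit `run` to load.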
8. **Normalize inputs when the algorithm requires it** (neural networks, gradient-based methods, distance-based methods like KNN). Scale to mean=0, sd=1 in the .ado wrapper, denormalize predictions after. The plugin should receive clean, normalized data — let the .ado handle the scaling.
9. **pthreads on Windows needs `-lwinpthread`.** Use conditional linker flags.
10. **Memory errors crash Stata with no recovery.** Pre-allocate everything, check every allocation, build with sanitizers during development.
11. **glibc version mismatch.** Building Linux plugins on a modern distro produces binaries that won't load on older systems. Use Ubuntu 18.04 in Docker for maximum compatibility.
12. **`SF_nvar()` returns total dataset variables.** It counts ALL variables in the dataset, not just the ones in the `plugin call` varlist. If the .ado creates tempvars (`touse`, `merge_id`, sort keys), the count will be higher than expected. Never use `SF_nvar()` to validate argument counts — pass the expected count via `argv` instead.
13. **`findfile` + absolute paths breaks on Windows.** `findfile` returns an absolute path that Stata's `LoadLibrary` can't resolve on Windows. Use the gtools-style OS detection pattern instead (see Plugin Loading section above) — it constructs a bare filename that Stata resolves via the adopath.
## Naming Conventions
- Use `method()` not `model()` for method selection options
- Use `generate()` (abbreviation `gen()`) for output variable naming
- Use `replace` as a flag option, not `replace()`
- Plugin files: `algorithm_plugin_os.plugin` where os is `macosx`, `unix`, or `windows`
- .ado files: lowercase, underscores for multi-word
- Stata option convention: options lowercase, abbreviations capitalized (`GENerate`, `MAXDepth`)
- Target Stata 14.0+ (`version 14.0`) for plugin support
- **Help files use the short command name, not the repo name.** If the repo is called `mypackage_stata`, the overview help file should still be `mypackage.sthlp` (so `help mypackage` works). Don't append "stata" to help file or command names — the user is already in Stata.