Catch the structured data.
Documentation • GitHub • PyPI
Catchfly automates schema discovery → structured extraction → normalization from unstructured text at scale. Interchangeable strategies at each stage let you go from raw documents to clean, normalized data with minimal effort.
pip install catchfly[openai,clustering]
from catchfly import Pipeline
from catchfly.demo import load_samples
docs = load_samples("product_reviews")
pipeline = Pipeline.quick(model="gpt-5.4-mini")
results = pipeline.run(
documents=docs,
domain_hint="Electronics product reviews",
normalize_fields=["pros"],
)
results.to_dataframe() # → pandas DataFrame
| Stage | Strategy | Description |
|---|---|---|
| Discovery | SinglePassDiscovery |
One LLM call → JSON Schema from sample docs |
ThreeStageDiscovery |
3-stage progressive refinement (initial → refine → expand) | |
SchemaOptimizer |
PARSE-style iterative field enrichment (descriptions, examples, synonyms) | |
| Extraction | LLMDirectExtraction |
Per-document extraction with tool calling, retries, chunking |
| Normalization | CascadeNormalization |
Chain strategies with confidence-based routing + self-learning |
OntologyMapping |
Embed → NN search → LLM rerank against HPO/custom ontologies | |
LLMCanonicalization |
LLM groups synonyms, map-reduce for large sets (>200 values) | |
EmbeddingClustering |
Embed → HDBSCAN/agglomerative → canonical selection | |
| Infrastructure | SchemaRegistry |
Version, diff, and persist schemas across runs |
Map clinical terms to ontology entries using local SapBERT embeddings (zero API cost) with optional LLM reranking:
from catchfly.normalization import CascadeNormalization, OntologyMapping
from catchfly.providers import SentenceTransformerEmbeddingClient
# SapBERT embeddings — 0.802 Acc@1 on BC5CDR, beats OpenAI embeddings
embed_client = SentenceTransformerEmbeddingClient() # default: SapBERT
normalizer = OntologyMapping(
ontology="hpo",
embedding_client=embed_client,
augment_queries=True, # LLM generates alternative phrasings (+10-20pp recall)
)
result = await normalizer.anormalize(
["seizures", "high temperature", "low muscle tone"],
context_field="phenotype",
)
# result.mapping: {"seizures": "Seizure", "high temperature": "Fever", ...}
# Self-learning cascade — learns from results, cheaper on re-runs
cascade = CascadeNormalization.default(
dictionary={"ALT": "Alanine aminotransferase"},
ontology="hpo",
use_confidence=True, # confidence-based routing between steps
)
result = await cascade.anormalize(values, context_field="phenotype")
cascade.learn(result) # next run resolves known mappings instantly ($0)
Requires: pip install catchfly[embeddings,medical]
pipeline = Pipeline.quick(
model="qwen3.5",
base_url="http://localhost:11434/v1",
)
Works with any OpenAI-compatible endpoint: Ollama, vLLM, LMStudio, llama.cpp.
Each stage works independently — use one, two, or all three:
# Discovery
from catchfly.discovery.single_pass import SinglePassDiscovery
schema = SinglePassDiscovery(model="gpt-5.4-mini").discover(docs, domain_hint="...")
# Extraction (bring your own schema)
from catchfly.extraction.llm_direct import LLMDirectExtraction
records = LLMDirectExtraction(model="gpt-5.4-mini").extract(schema=MyModel, documents=docs)
# Normalization (bring your own data)
from catchfly.normalization.embedding_cluster import EmbeddingClustering
mapping = EmbeddingClustering().normalize(values=["NYC", "New York", "NY"], context_field="city")
Iteratively enrich field descriptions for better extraction and normalization:
from catchfly.discovery.optimizer import SchemaOptimizer
optimizer = SchemaOptimizer(model="gpt-5.4-mini", num_iterations=3)
enriched = optimizer.optimize(schema=MyModel, test_documents=docs[:10])
# enriched.field_metadata has descriptions, examples, synonyms per field
The core novel contribution — bridge schema optimization and normalization:
from catchfly.normalization.kllmeans import KLLMeansClustering
normalizer = KLLMeansClustering(
num_clusters=5,
seed_from_schema=True, # use enriched field descriptions as initial centroids
summarize_every=3, # LLM generates textual centroids every 3 iterations
)
result = normalizer.normalize(
values=messy_values,
context_field="medication",
field_metadata=enriched.field_metadata["medication"],
)
# Cost control
results = pipeline.run(documents=docs, max_cost_usd=20.0)
# Checkpoint/resume (for 1000+ documents)
results = pipeline.run(documents=large_corpus, checkpoint_dir="./state/")
# Error handling
extractor = LLMDirectExtraction(model="gpt-5.4-mini", on_error="collect")
results = extractor.extract(schema=MyModel, documents=docs)
print(results.errors) # failed documents collected, not raised
# Export
results.to_dataframe()
results.to_csv("output.csv")
results.to_parquet("output.parquet")
All strategies are async-first with sync wrappers (Jupyter-safe):
# Async
results = await pipeline.arun(documents=docs)
# Sync (auto-detects running event loop in notebooks)
results = pipeline.run(documents=docs)
pip install catchfly # Core only (~5 MB)
pip install catchfly[openai] # + OpenAI SDK
pip install catchfly[embeddings] # + sentence-transformers (SapBERT, local)
pip install catchfly[clustering] # + scikit-learn, numpy, umap
pip install catchfly[export] # + pandas, pyarrow
pip install catchfly[medical] # + ontology loaders (HPO)
pip install catchfly[all] # Everything
Or with uv:
uv add catchfly[openai,clustering,export]
catchfly
├── discovery/
│ ├── SinglePassDiscovery # 1-shot schema from samples
│ ├── ThreeStageDiscovery # Progressive 3-stage refinement
│ └── SchemaOptimizer # PARSE-style field enrichment
├── extraction/
│ └── LLMDirectExtraction # Tool calling + retry + chunking
├── normalization/
│ ├── CascadeNormalization # Chain strategies, confidence routing, learn()
│ ├── OntologyMapping # Embed → NN → LLM rerank (RAG augmentation)
│ ├── LLMCanonicalization # LLM synonym grouping (map-reduce)
│ ├── LearnedDictionaryCache # Persist mappings for reuse across runs
│ └── EmbeddingClustering # Embed → cluster → canonicalize
├── providers/
│ ├── OpenAICompatibleClient # Any OpenAI-compatible LLM endpoint
│ ├── OpenAIEmbeddingClient # API embeddings with caching
│ └── SentenceTransformerEmbeddingClient # Local embeddings (SapBERT)
├── schema/
│ ├── SchemaRegistry # Version + diff + persist
│ └── converters # JSON Schema ↔ Pydantic roundtrip
└── Pipeline # Orchestrator: quick(), run(), arun()
Apache 2.0 — see LICENSE.