Hi Peeps,
I’m excited to announce that Kreuzberg v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year, in just a couple of weeks' time. For now, v4.0.0-rc.8 has been released to all channels.
What is Kreuzberg?
Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.
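To give a feel for the API, here is a minimal extraction sketch using the Python binding. It reuses the `extract_bytes` call and `ExtractionConfig` shown in the embeddings example further down; the `result.content` and `result.metadata` attributes are illustrative assumptions about the result shape, not quoted from the docs.

```python
# Minimal extraction sketch (Python binding).
# extract_bytes and ExtractionConfig mirror the embeddings example below;
# the result attributes accessed here (content, metadata) are assumptions.
import kreuzberg
from kreuzberg import ExtractionConfig

with open("report.pdf", "rb") as f:
    pdf_bytes = f.read()

result = kreuzberg.extract_bytes(pdf_bytes, config=ExtractionConfig())

print(result.content[:500])  # extracted text
print(result.metadata)       # document metadata (title, author, ...)
```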
What’s new in V4?
A Complete Rust Rewrite with Polyglot Bindings
The new version of Kreuzberg represents a massive architectural evolution. Kreuzberg has been completely rewritten in Rust - leveraging Rust’s memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That’s right - it’s no longer just a Python library.
Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:
- Rust (native library)
- Python (PyO3 native bindings)
- TypeScript - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)
- Ruby (Magnus FFI)
- Java 25+ (Panama Foreign Function & Memory API)
- C# (P/Invoke)
- Go (cgo bindings)
Post v4.0.0 roadmap includes:
- PHP
- Elixir (via Rustler - with Erlang and Gleam interop)
Additionally, it's available as a CLI (installable via cargo or Homebrew), as an HTTP REST API server, as a Model Context Protocol (MCP) server for Claude Desktop/Continue.dev, and as public Docker images.
Why the Rust Rewrite? Performance and Architecture
The Rust rewrite wasn’t just about performance - though that’s a major benefit. It was an opportunity to fundamentally rethink the architecture:
Architectural improvements:
- Zero-copy operations via Rust’s ownership model
- True async concurrency with Tokio runtime (no GIL limitations)
- Streaming parsers for constant memory usage on multi-GB files
- SIMD-accelerated text processing for token reduction and string operations
- Memory-safe FFI boundaries for all language bindings
- Plugin system with trait-based extensibility
v3 vs v4: What Changed?
| Aspect | v3 (Python) | v4 (Rust Core) |
|---|---|---|
| Core Language | Pure Python | Rust 2024 edition |
| File Formats | 30-40+ (via Pandoc) | 56+ (native parsers) |
| Language Support | Python only | 7 languages (Rust/Python/TS/Ruby/Java/Go/C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies (all native) |
| Embeddings | Not supported | ✓ FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | Via semantic-text-splitter library | ✓ Built-in (text + markdown-aware) |
| Token Reduction | Built-in (TF-IDF based) | ✓ Enhanced with 3 modes |
| Language Detection | Optional (fast-langdetect) | ✓ Built-in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | ✓ Built-in (YAKE + RAKE algorithms) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + better integration |
| Plugin System | Limited extractor registry | Full trait-based (4 plugin types) |
| Page Tracking | Character-based indices | Byte-based with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP-SSE |
| Installation Size | ~100 MB base | 16-31 MB complete |
| Memory Model | Python heap management | RAII with streaming |
| Concurrency | asyncio (GIL-limited) | Tokio work-stealing |
Replacement of Pandoc - Native Performance
Kreuzberg v3 relied on Pandoc, an excellent tool, but one that had to be invoked as a subprocess because of its GPL license. This had significant drawbacks:
v3 Pandoc limitations:
- System dependency (installation required)
- Subprocess overhead on every document
- No streaming support
- Limited metadata extraction
- ~500MB+ installation footprint
v4 native parsers:
- Zero external dependencies - everything is native Rust
- Direct parsing with full control over extraction
- Substantially more metadata extracted (e.g., DOCX document properties, section structure, style information)
- Streaming support for massive files (tested on multi-GB XML documents with stable memory)
- Example: the PPTX extractor is now a fully streaming parser capable of handling gigabyte-scale presentations with constant memory usage and high throughput
New File Format Support
v4 expanded format support from ~20 to 56+ file formats, including:
Added legacy format support:
- .doc (Word 97-2003)
- .ppt (PowerPoint 97-2003)
- .xls (Excel 97-2003)
- .eml (email messages)
- .msg (Outlook messages)
Added academic/technical formats:
- LaTeX (.tex)
- BibTeX (.bib)
- Typst (.typ)
- JATS XML (scientific articles)
- DocBook XML
- FictionBook (.fb2)
- OPML (.opml)
Better Office support:
- XLSB, XLSM (Excel binary/macro formats)
- Better structured metadata extraction from DOCX/PPTX/XLSX
- Full table extraction from presentations
- Image extraction with deduplication
New Features: Full Document Intelligence Solution
The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for RAG applications and LLM workflows:
1. Embeddings (NEW)
- FastEmbed integration with full ONNX Runtime acceleration
- Three presets: “fast” (384d), “balanced” (512d), “quality” (768d/1024d)
- Custom model support (bring your own ONNX model)
- Local generation (no API calls, no rate limits)
- Automatic model downloading and caching
- Per-chunk embedding generation
```python
import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True,
    )
)

result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains vectors for each chunk
```
2. Semantic Text Chunking (NOW BUILT-IN)
Now integrated directly into the core (v3 used the external semantic-text-splitter library); a configuration sketch follows the list below:
- Structure-aware chunking that respects document semantics
- Two strategies:
- Generic text chunker (whitespace/punctuation-aware)
- Markdown chunker (preserves headings, lists, code blocks, tables)
- Configurable chunk size and overlap
- Unicode-safe (handles CJK, emojis correctly)
- Automatic chunk-to-page mapping
- Per-chunk metadata with byte offsets
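A rough sketch of what enabling the built-in chunker could look like from Python. `ChunkingConfig` and its field names are illustrative assumptions based on the feature list above; only the `ExtractionConfig`/`extract_bytes` pattern is taken from the embeddings example.

```python
# Hypothetical chunking configuration sketch.
# ChunkingConfig and its fields are assumed names, not the documented API.
import kreuzberg
from kreuzberg import ExtractionConfig, ChunkingConfig  # ChunkingConfig is assumed

with open("notes.md", "rb") as f:
    doc_bytes = f.read()

config = ExtractionConfig(
    chunking=ChunkingConfig(   # assumed option names
        max_characters=1200,   # target chunk size
        overlap=200,           # overlap between adjacent chunks
        markdown_aware=True,   # use the markdown-aware chunker
    )
)

result = kreuzberg.extract_bytes(doc_bytes, config=config)
# Assumed result shape: chunks carrying byte offsets and page mapping.
for chunk in result.chunks:
    print(chunk.byte_start, chunk.byte_end, chunk.content[:80])
```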
3. Byte-Accurate Page Tracking (BREAKING CHANGE)
This is a critical improvement for LLM applications:
- v3: Character-based indices (char_start/char_end) - incorrect for UTF-8 multi-byte characters
- v4: Byte-based indices (byte_start/byte_end) - correct for all string operations
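The distinction matters as soon as a document contains non-ASCII text. A quick standalone Python illustration (independent of Kreuzberg's API):

```python
# Character offsets and byte offsets diverge once UTF-8 multi-byte
# characters appear, so char-based indices can point at the wrong place
# when slicing the encoded text.
text = "Grüße – 数据"          # umlauts, an en dash, and CJK characters
data = text.encode("utf-8")

print(len(text))  # 10 characters
print(len(data))  # 18 bytes

# Slicing by byte offsets is exact on the encoded form:
start = data.find("数据".encode("utf-8"))
print(data[start:].decode("utf-8"))  # -> 数据
```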
Additional page features:
- O(1) lookup: “which page is byte offset X on?” → instant answer
- Per-page content extraction
- Page markers in combined text (e.g., — Page 5 —)
- Automatic chunk-to-page mapping for citations
4. Enhanced Token Reduction for LLM Context
Enhanced from v3 with three configurable modes to save on LLM costs:
- Light mode: ~15% reduction (preserve most detail)
- Moderate mode: ~30% reduction (balanced)
- Aggressive mode: ~50% reduction (key information only)
Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.
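For intuition, here is a simplified, pure-Python illustration of TF-IDF sentence scoring with a position bias. This is not Kreuzberg's actual implementation (which is SIMD-accelerated Rust with language-specific stopwords), just the general idea behind the feature:

```python
# Simplified illustration of TF-IDF sentence scoring with position weighting.
# Not Kreuzberg's implementation, just the general idea behind the feature.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}

def tokenize(sentence: str) -> list[str]:
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def reduce_text(text: str, keep_ratio: float = 0.7) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency per term

    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        tfidf = sum(count * math.log((n + 1) / (df[w] + 1)) for w, count in tf.items())
        position_weight = 1.0 + 0.5 * (1 - i / max(n - 1, 1))  # earlier sentences weigh more
        scored.append((tfidf * position_weight, i))

    keep = max(1, int(n * keep_ratio))
    kept_indices = sorted(i for _, i in sorted(scored, reverse=True)[:keep])
    return " ".join(sentences[i] for i in kept_indices)

print(reduce_text("Kreuzberg extracts text. It also extracts tables. "
                  "The weather is nice today.", keep_ratio=0.66))
```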
5. Language Detection (NOW BUILT-IN)
- 68 language support with confidence scoring
- Multi-language detection (documents with mixed languages)
- ISO 639-1 and ISO 639-3 code support
- Configurable confidence thresholds
6. Keyword Extraction (NOW BUILT-IN)
Now built into the core (previously an optional KeyBERT integration in v3); a configuration sketch follows the list below:
- YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent
- RAKE (Rapid Automatic Keyword Extraction): Fast statistical method
- Configurable n-grams (1-3 word phrases)
- Relevance scoring with language-specific stopwords
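A hypothetical configuration sketch; `KeywordConfig` and its fields are assumed names based on the list above, not the documented API:

```python
# Hypothetical keyword-extraction configuration sketch.
# KeywordConfig and its fields are assumed names, not the documented API.
import kreuzberg
from kreuzberg import ExtractionConfig, KeywordConfig  # KeywordConfig is assumed

with open("whitepaper.pdf", "rb") as f:
    doc_bytes = f.read()

config = ExtractionConfig(
    keywords=KeywordConfig(  # assumed option names
        algorithm="yake",    # or "rake"
        max_ngram=3,         # 1-3 word phrases
        top_k=10,
    )
)

result = kreuzberg.extract_bytes(doc_bytes, config=config)
for keyword, score in result.keywords:  # assumed result shape
    print(f"{score:.3f}  {keyword}")
```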
7. Plugin System (NEW)
Four extensible plugin types for customization:
- DocumentExtractor - Custom file format handlers
- OcrBackend - Custom OCR engines (integrate your own Python models)
- PostProcessor - Data transformation and enrichment
- Validator - Pre-extraction validation
Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.
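As a sketch, a custom post-processor defined from Python could look roughly like this; only the plugin type name comes from the list above, while the registration entry point and hook signature are assumptions:

```python
# Hypothetical PostProcessor plugin sketch.
# The registration call and method signature are assumptions; only the
# plugin type (PostProcessor) comes from the list above.
import re
import kreuzberg

class RedactEmails:
    """Post-processor that strips email addresses from extracted text."""

    def process(self, result):  # assumed hook signature
        result.content = re.sub(r"\S+@\S+\.\S+", "[redacted]", result.content)
        return result

# Assumed registration entry point for Python-defined plugins.
kreuzberg.register_post_processor(RedactEmails())
```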
8. Production-Ready Servers (NEW)
- HTTP REST API: Production-grade Axum server with OpenAPI docs
- MCP Server: Direct integration with Claude Desktop, Continue.dev, and other MCP clients
- MCP-SSE Transport (RC.8): Server-Sent Events for cloud deployments without WebSocket support
- All three modes support the same feature set: extraction, batch processing, caching
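For example, calling the HTTP REST API from Python might look like this; the `/extract` route, port, and multipart field name are assumptions modeled on the v3 API, so check the OpenAPI docs the server exposes for the exact contract:

```python
# Calling the HTTP REST API from Python.
# The route, port, and multipart field name are assumptions; consult the
# server's OpenAPI docs for the exact contract.
import requests

with open("contract.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",  # assumed default host/port and route
        files={"data": ("contract.pdf", f, "application/pdf")},
    )

response.raise_for_status()
print(response.json())
```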
Performance: Benchmarked Against the Competition
We maintain continuous benchmarks comparing Kreuzberg against the leading OSS alternatives:
Benchmark Setup
- Platform: Ubuntu 22.04 (GitHub Actions)
- Test Suite: 30+ documents covering all formats
- Metrics: Latency (p50, p95), throughput …
Content cut off. Read original on old.reddit.com/…/kreuzberg_v400rc8_is_available/