Hi Peeps,
I’m excited to announce that Kreuzberg v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year, in just a couple of weeks' time. For now, v4.0.0-rc.8 has been released to all channels.
What is Kreuzberg?
Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.
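To give a feel for the API, here is a minimal extraction sketch using the Python binding. It reuses the `extract_bytes` call and `ExtractionConfig` shown in the embeddings example further down; the `result.content` and `result.metadata` attributes are illustrative assumptions about the result shape, not quoted from the docs.

```python
# Minimal extraction sketch (Python binding).
# extract_bytes and ExtractionConfig mirror the embeddings example below;
# the result attributes accessed here (content, metadata) are assumptions.
import kreuzberg
from kreuzberg import ExtractionConfig

with open("report.pdf", "rb") as f:
    pdf_bytes = f.read()

result = kreuzberg.extract_bytes(pdf_bytes, config=ExtractionConfig())

print(result.content[:500])  # extracted text
print(result.metadata)       # document metadata (title, author, ...)
```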
What’s new in V4?
A Complete Rust Rewrite with Polyglot Bindings
The new version of Kreuzberg represents a massive architectural evolution. Kreuzberg has been completely rewritten in Rust - leveraging Rust’s memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That’s right - it’s no longer just a Python library.
Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:
- Rust (native library)
- Python (PyO3 native bindings)
- TypeScript - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)
- Ruby (Magnus FFI)
- Java 25+ (Panama Foreign Function & Memory API)
- C# (P/Invoke)
- Go (cgo bindings)
Post v4.0.0 roadmap includes:
- PHP
- Elixir (via Rustler - with Erlang and Gleam interop)
Additionally, it's available as a CLI (installable via cargo or Homebrew), as an HTTP REST API server, as a Model Context Protocol (MCP) server for Claude Desktop/Continue.dev, and as public Docker images.
Why the Rust Rewrite? Performance and Architecture
The Rust rewrite wasn’t just about performance - though that’s a major benefit. It was an opportunity to fundamentally rethink the architecture:
Architectural improvements:
- Zero-copy operations via Rust’s ownership model
- True async concurrency with Tokio runtime (no GIL limitations)
- Streaming parsers for constant memory usage on multi-GB files
- SIMD-accelerated text processing for token reduction and string operations
- Memory-safe FFI boundaries for all language bindings
- Plugin system with trait-based extensibility
v3 vs v4: What Changed?
| Aspect | v3 (Python) | v4 (Rust Core) |
|---|---|---|
| Core Language | Pure Python | Rust 2024 edition |
| File Formats | 30-40+ (via Pandoc) | 56+ (native parsers) |
| Language Support | Python only | 7 languages (Rust/Python/TS/Ruby/Java/Go/C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies (all native) |
| Embeddings | Not supported | ✓ FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | Via semantic-text-splitter library | ✓ Built-in (text + markdown-aware) |
| Token Reduction | Built-in (TF-IDF based) | ✓ Enhanced with 3 modes |
| Language Detection | Optional (fast-langdetect) | ✓ Built-in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | ✓ Built-in (YAKE + RAKE algorithms) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + better integration |
| Plugin System | Limited extractor registry | Full trait-based (4 plugin types) |
| Page Tracking | Character-based indices | Byte-based with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP-SSE |
| Installation Size | ~100 MB base | 16-31 MB complete |
| Memory Model | Python heap management | RAII with streaming |
| Concurrency | asyncio (GIL-limited) | Tokio work-stealing |
Replacement of Pandoc - Native Performance
Kreuzberg v3 relied on Pandoc, an excellent tool, but one that had to be invoked as a subprocess because of its GPL license. This had significant drawbacks:
v3 Pandoc limitations:
- System dependency (installation required)
- Subprocess overhead on every document
- No streaming support
- Limited metadata extraction
- ~500MB+ installation footprint
v4 native parsers:
- Zero external dependencies - everything is native Rust
- Direct parsing with full control over extraction
- Substantially more metadata extracted (e.g., DOCX document properties, section structure, style information)
- Streaming support for massive files (tested on multi-GB XML documents with stable memory)
- Example: the PPTX extractor is now a fully streaming parser capable of handling gigabyte-scale presentations with constant memory usage and high throughput
New File Format Support
v4 expanded format support from ~20 to 56+ file formats, including:
Added legacy format support:
- .doc (Word 97-2003)
- .ppt (PowerPoint 97-2003)
- .xls (Excel 97-2003)
- .eml (email messages)
- .msg (Outlook messages)
Added academic/technical formats:
- LaTeX (.tex)
- BibTeX (.bib)
- Typst (.typ)
- JATS XML (scientific articles)
- DocBook XML
- FictionBook (.fb2)
- OPML (.opml)
Better Office support:
- XLSB, XLSM (Excel binary/macro formats)
- Better structured metadata extraction from DOCX/PPTX/XLSX
- Full table extraction from presentations
- Image extraction with deduplication
New Features: Full Document Intelligence Solution
The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for RAG applications and LLM workflows:
1. Embeddings (NEW)
- FastEmbed integration with full ONNX Runtime acceleration
- Three presets: “fast” (384d), “balanced” (512d), “quality” (768d/1024d)
- Custom model support (bring your own ONNX model)
- Local generation (no API calls, no rate limits)
- Automatic model downloading and caching
- Per-chunk embedding generation
```python
import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True,
    )
)

result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains vectors for each chunk
```
2. Semantic Text Chunking (NOW BUILT-IN)
Now integrated directly into the core (v3 used the external semantic-text-splitter library); a configuration sketch follows the list below:
- Structure-aware chunking that respects document semantics
- Two strategies:
- Generic text chunker (whitespace/punctuation-aware)
- Markdown chunker (preserves headings, lists, code blocks, tables)
- Configurable chunk size and overlap
- Unicode-safe (handles CJK, emojis correctly)
- Automatic chunk-to-page mapping
- Per-chunk metadata with byte offsets
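A rough sketch of what enabling the built-in chunker could look like from Python. `ChunkingConfig` and its field names are illustrative assumptions based on the feature list above; only the `ExtractionConfig`/`extract_bytes` pattern is taken from the embeddings example.

```python
# Hypothetical chunking configuration sketch.
# ChunkingConfig and its fields are assumed names, not the documented API.
import kreuzberg
from kreuzberg import ExtractionConfig, ChunkingConfig  # ChunkingConfig is assumed

with open("notes.md", "rb") as f:
    doc_bytes = f.read()

config = ExtractionConfig(
    chunking=ChunkingConfig(   # assumed option names
        max_characters=1200,   # target chunk size
        overlap=200,           # overlap between adjacent chunks
        markdown_aware=True,   # use the markdown-aware chunker
    )
)

result = kreuzberg.extract_bytes(doc_bytes, config=config)
# Assumed result shape: chunks carrying byte offsets and page mapping.
for chunk in result.chunks:
    print(chunk.byte_start, chunk.byte_end, chunk.content[:80])
```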
3. Byte-Accurate Page Tracking (BREAKING CHANGE)
This is a critical improvement for LLM applications:
- v3: Character-based indices (char_start/char_end) - incorrect for UTF-8 multi-byte characters
- v4: Byte-based indices (byte_start/byte_end) - correct for all string operations
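The distinction matters as soon as a document contains non-ASCII text. A quick standalone Python illustration (independent of Kreuzberg's API):

```python
# Character offsets and byte offsets diverge once UTF-8 multi-byte
# characters appear, so char-based indices can point at the wrong place
# when slicing the encoded text.
text = "Grüße – 数据"          # umlauts, an en dash, and CJK characters
data = text.encode("utf-8")

print(len(text))  # 10 characters
print(len(data))  # 18 bytes

# Slicing by byte offsets is exact on the encoded form:
start = data.find("数据".encode("utf-8"))
print(data[start:].decode("utf-8"))  # -> 数据
```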
Additional page features:
- O(1) lookup: “which page is byte offset X on?” → instant answer
- Per-page content extraction
- Page markers in combined text (e.g., — Page 5 —)
- Automatic chunk-to-page mapping for citations
4. Enhanced Token Reduction for LLM Context
Enhanced from v3 with three configurable modes to save on LLM costs:
- Light mode: ~15% reduction (preserve most detail)
- Moderate mode: ~30% reduction (balanced)
- Aggressive mode: ~50% reduction (key information only)
Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.
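For intuition, here is a simplified, pure-Python illustration of TF-IDF sentence scoring with a position bias. This is not Kreuzberg's actual implementation (which is SIMD-accelerated Rust with language-specific stopwords), just the general idea behind the feature:

```python
# Simplified illustration of TF-IDF sentence scoring with position weighting.
# Not Kreuzberg's implementation, just the general idea behind the feature.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}

def tokenize(sentence: str) -> list[str]:
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def reduce_text(text: str, keep_ratio: float = 0.7) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency per term

    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        tfidf = sum(count * math.log((n + 1) / (df[w] + 1)) for w, count in tf.items())
        position_weight = 1.0 + 0.5 * (1 - i / max(n - 1, 1))  # earlier sentences weigh more
        scored.append((tfidf * position_weight, i))

    keep = max(1, int(n * keep_ratio))
    kept_indices = sorted(i for _, i in sorted(scored, reverse=True)[:keep])
    return " ".join(sentences[i] for i in kept_indices)

print(reduce_text("Kreuzberg extracts text. It also extracts tables. "
                  "The weather is nice today.", keep_ratio=0.66))
```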
5. Language Detection (NOW BUILT-IN)
- 68 language support with confidence scoring
- Multi-language detection (documents with mixed languages)
- ISO 639-1 and ISO 639-3 code support
- Configurable confidence thresholds
6. Keyword Extraction (NOW BUILT-IN)
Now built into the core (previously an optional KeyBERT integration in v3); a configuration sketch follows the list below:
- YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent
- RAKE (Rapid Automatic Keyword Extraction): Fast statistical method
- Configurable n-grams (1-3 word phrases)
- Relevance scoring with language-specific stopwords
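A hypothetical configuration sketch; `KeywordConfig` and its fields are assumed names based on the list above, not the documented API:

```python
# Hypothetical keyword-extraction configuration sketch.
# KeywordConfig and its fields are assumed names, not the documented API.
import kreuzberg
from kreuzberg import ExtractionConfig, KeywordConfig  # KeywordConfig is assumed

with open("whitepaper.pdf", "rb") as f:
    doc_bytes = f.read()

config = ExtractionConfig(
    keywords=KeywordConfig(  # assumed option names
        algorithm="yake",    # or "rake"
        max_ngram=3,         # 1-3 word phrases
        top_k=10,
    )
)

result = kreuzberg.extract_bytes(doc_bytes, config=config)
for keyword, score in result.keywords:  # assumed result shape
    print(f"{score:.3f}  {keyword}")
```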
7. Plugin System (NEW)
Four extensible plugin types for customization:
- DocumentExtractor - Custom file format handlers
- OcrBackend - Custom OCR engines (integrate your own Python models)
- PostProcessor - Data transformation and enrichment
- Validator - Pre-extraction validation
Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.
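As a sketch, a custom post-processor defined from Python could look roughly like this; only the plugin type name comes from the list above, while the registration entry point and hook signature are assumptions:

```python
# Hypothetical PostProcessor plugin sketch.
# The registration call and method signature are assumptions; only the
# plugin type (PostProcessor) comes from the list above.
import re
import kreuzberg

class RedactEmails:
    """Post-processor that strips email addresses from extracted text."""

    def process(self, result):  # assumed hook signature
        result.content = re.sub(r"\S+@\S+\.\S+", "[redacted]", result.content)
        return result

# Assumed registration entry point for Python-defined plugins.
kreuzberg.register_post_processor(RedactEmails())
```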
8. Production-Ready Servers (NEW)
- HTTP REST API: Production-grade Axum server with OpenAPI docs
- MCP Server: Direct integration with Claude Desktop, Continue.dev, and other MCP clients
- MCP-SSE Transport (RC.8): Server-Sent Events for cloud deployments without WebSocket support
- All three modes support the same feature set: extraction, batch processing, caching
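For example, calling the HTTP REST API from Python might look like this; the `/extract` route, port, and multipart field name are assumptions modeled on the v3 API, so check the OpenAPI docs the server exposes for the exact contract:

```python
# Calling the HTTP REST API from Python.
# The route, port, and multipart field name are assumptions; consult the
# server's OpenAPI docs for the exact contract.
import requests

with open("contract.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/extract",  # assumed default host/port and route
        files={"data": ("contract.pdf", f, "application/pdf")},
    )

response.raise_for_status()
print(response.json())
```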
Performance: Benchmarked Against the Competition
We maintain continuous benchmarks comparing Kreuzberg against the leading OSS alternatives:
Benchmark Setup
- Platform: Ubuntu 22.04 (GitHub Actions)
- Test Suite: 30+ documents covering all formats
- Metrics: Latency (p50, p95), throughput …
Content cut off. Read original on old.reddit.com/…/kreuzberg_v400rc8_is_available/