The original was posted on /r/opensource by /u/Goldziher on 2025-12-15 08:33:17+00:00.


Hi Peeps,

I’m excited to announce that Kreuzberg v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year - in just a couple of weeks. For now, v4.0.0-rc.8 has been released to all channels.

What is Kreuzberg?

Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.

What’s new in V4?

A Complete Rust Rewrite with Polyglot Bindings

The new version of Kreuzberg represents a massive architectural evolution. Kreuzberg has been completely rewritten in Rust - leveraging Rust’s memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That’s right - it’s no longer just a Python library.

Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:

  • Rust (native library)
  • Python (PyO3 native bindings)
  • TypeScript - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)
  • Ruby (Magnus FFI)
  • Java 25+ (Panama Foreign Function & Memory API)
  • C# (P/Invoke)
  • Go (cgo bindings)

The post-v4.0.0 roadmap includes:

  • PHP
  • Elixir (via Rustler - with Erlang and Gleam interop)

Additionally, it’s available as a CLI (installable via Cargo or Homebrew), an HTTP REST API server, a Model Context Protocol (MCP) server for Claude Desktop/Continue.dev, and public Docker images.

Why the Rust Rewrite? Performance and Architecture

The Rust rewrite wasn’t just about performance - though that’s a major benefit. It was an opportunity to fundamentally rethink the architecture:

Architectural improvements:

  • Zero-copy operations via Rust’s ownership model
  • True async concurrency with Tokio runtime (no GIL limitations)
  • Streaming parsers for constant memory usage on multi-GB files
  • SIMD-accelerated text processing for token reduction and string operations
  • Memory-safe FFI boundaries for all language bindings
  • Plugin system with trait-based extensibility

v3 vs v4: What Changed?

| Aspect | v3 (Python) | v4 (Rust Core) |
|---|---|---|
| Core Language | Pure Python | Rust 2024 edition |
| File Formats | 30-40+ (via Pandoc) | 56+ (native parsers) |
| Language Support | Python only | 7 languages (Rust/Python/TS/Ruby/Java/Go/C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies (all native) |
| Embeddings | Not supported | ✓ FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | Via semantic-text-splitter library | ✓ Built-in (text + markdown-aware) |
| Token Reduction | Built-in (TF-IDF based) | ✓ Enhanced with 3 modes |
| Language Detection | Optional (fast-langdetect) | ✓ Built-in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | ✓ Built-in (YAKE + RAKE algorithms) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + better integration |
| Plugin System | Limited extractor registry | Full trait-based (4 plugin types) |
| Page Tracking | Character-based indices | Byte-based with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP-SSE |
| Installation Size | ~100MB base | 16-31 MB complete |
| Memory Model | Python heap management | RAII with streaming |
| Concurrency | asyncio (GIL-limited) | Tokio work-stealing |

Replacement of Pandoc - Native Performance

Kreuzberg v3 relied on Pandoc - an amazing tool, but one that had to be invoked via subprocess because of its GPL license. This had significant impacts:

v3 Pandoc limitations:

  • System dependency (installation required)
  • Subprocess overhead on every document
  • No streaming support
  • Limited metadata extraction
  • ~500MB+ installation footprint

v4 native parsers:

  • Zero external dependencies - everything is native Rust
  • Direct parsing with full control over extraction
  • Substantially more metadata extracted (e.g., DOCX document properties, section structure, style information)
  • Streaming support for massive files (tested on multi-GB XML documents with stable memory)
  • Example: PPTX extractor is now a fully streaming parser capable of handling gigabyte-scale presentations with constant memory usage and high throughput
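To make the idea of streaming with stable memory concrete, here is a minimal sketch using only Python's standard library. It is not Kreuzberg's parser (which is native Rust); the helper name `iter_text` and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_text(source: BinaryIO, tag: str) -> Iterator[str]:
    """Yield the text of each <tag> element without building the full tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield "".join(elem.itertext()).strip()
            # Drop the element's children and text once it has been consumed.
            # The emptied element stays attached to its parent, so memory
            # stays roughly flat rather than strictly constant.
            elem.clear()


# Hypothetical in-memory sample; a real caller would pass an open file.
sample = io.BytesIO(b"<doc><p>first</p><p>second <b>bold</b></p></doc>")
paragraphs = list(iter_text(sample, "p"))
```

The key point is that each element is handled and released as soon as its closing tag arrives, so peak memory depends on the largest single element rather than on file size.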

New File Format Support

v4 expanded format support from ~20 to 56+ file formats, including:

Added legacy format support:

  • .doc (Word 97-2003)
  • .ppt (PowerPoint 97-2003)
  • .xls (Excel 97-2003)
  • .eml (Email messages)
  • .msg (Outlook messages)

Added academic/technical formats:

  • LaTeX (.tex)
  • BibTeX (.bib)
  • Typst (.typ)
  • JATS XML (scientific articles)
  • DocBook XML
  • FictionBook (.fb2)
  • OPML (.opml)

Better Office support:

  • XLSB, XLSM (Excel binary/macro formats)
  • Better structured metadata extraction from DOCX/PPTX/XLSX
  • Full table extraction from presentations
  • Image extraction with deduplication

New Features: Full Document Intelligence Solution

The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for RAG applications and LLM workflows:

1. Embeddings (NEW)

  • FastEmbed integration with full ONNX Runtime acceleration
  • Three presets: “fast” (384d), “balanced” (512d), “quality” (768d/1024d)
  • Custom model support (bring your own ONNX model)
  • Local generation (no API calls, no rate limits)
  • Automatic model downloading and caching
  • Per-chunk embedding generation

```python
import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True,
    )
)
# pdf_bytes holds the raw bytes of a PDF loaded elsewhere
result = kreuzberg.extract_bytes(pdf_bytes, config=config)

# result.embeddings contains vectors for each chunk
```
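Once per-chunk vectors exist, a typical RAG step is ranking chunks against a query vector. The sketch below is plain Python with toy 3-dimensional vectors standing in for real embeddings; the helper names (`l2_normalize`, `top_k`) are hypothetical and it makes no assumption about the exact shape of `result.embeddings`. It relies on one fact: for L2-normalized vectors (the `normalize=True` case), the dot product equals cosine similarity.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], chunks: Sequence[Sequence[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    scored = sorted(((dot(query, vec), i) for i, vec in enumerate(chunks)), reverse=True)
    return [i for _score, i in scored[:k]]


# Toy vectors, not real model output.
chunk_vectors = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0])]
query_vector = l2_normalize([1.0, 0.0, 0.0])
ranked = top_k(query_vector, chunk_vectors, k=2)
```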

2. Semantic Text Chunking (NOW BUILT-IN)

Now integrated directly into the core (v3 used the external semantic-text-splitter library):

  • Structure-aware chunking that respects document semantics
  • Two strategies:
      • Generic text chunker (whitespace/punctuation-aware)
      • Markdown chunker (preserves headings, lists, code blocks, tables)
  • Configurable chunk size and overlap
  • Unicode-safe (handles CJK, emojis correctly)
  • Automatic chunk-to-page mapping
  • Per-chunk metadata with byte offsets
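As a rough illustration of chunk size, overlap, and per-chunk byte offsets working together, here is a small pure-Python chunker. It is a simplified stand-in, not Kreuzberg's algorithm: it only splits on spaces, and the `Chunk` class and `chunk_text` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    byte_start: int  # offset into the UTF-8 encoding of the full text
    byte_end: int


def chunk_text(text: str, max_chars: int = 24, overlap: int = 8) -> list[Chunk]:
    """Split on spaces into chunks of at most max_chars, with some overlap."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text) and text[end] != " ":
            # Back up to the last space so a word is not cut in half.
            cut = text.rfind(" ", start + 1, end)
            if cut != -1:
                end = cut
        piece = text[start:end]
        # Convert the character position into a UTF-8 byte offset.
        byte_start = len(text[:start].encode("utf-8"))
        chunks.append(Chunk(piece, byte_start, byte_start + len(piece.encode("utf-8"))))
        if end == len(text):
            break
        # Begin the next chunk at a word boundary inside the overlap window.
        nxt = text.find(" ", max(end - overlap, start + 1), end)
        start = nxt + 1 if nxt != -1 else end
        while start < len(text) and text[start] == " ":
            start += 1
    return chunks


sample = "Kreuzberg extracts text from naïve café menus and 日本語 documents alike"
chunks = chunk_text(sample)
```

Slicing `sample.encode("utf-8")` with a chunk's `byte_start` and `byte_end` gives back exactly that chunk's text, including for the accented and CJK characters.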

3. Byte-Accurate Page Tracking (BREAKING CHANGE)

This is a critical improvement for LLM applications:

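  • v3: Character-based indices (char_start/char_end) - these stop lining up with byte positions as soon as the text contains multi-byte UTF-8 characters
  • v4: Byte-based indices (byte_start/byte_end) - offsets into the UTF-8 encoded text, so slicing the encoded bytes returns the intended span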
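A quick way to see the difference between the two index types, using nothing but Python built-ins (the variable names are only for illustration):

```python
text = "naïve café"
needle = "café"

# Character index: counts code points, so "ï" counts as one.
char_start = text.index(needle)

# Byte index: counts UTF-8 bytes, so "ï" counts as two.
data = text.encode("utf-8")
byte_start = data.index(needle.encode("utf-8"))
byte_end = byte_start + len(needle.encode("utf-8"))

# Slicing the encoded bytes with byte offsets recovers the exact span.
recovered = data[byte_start:byte_end].decode("utf-8")

# Using the character index on the bytes lands one byte too early.
misaligned = data[char_start:char_start + len(needle)]
```

Here `char_start` is 6 while `byte_start` is 7, and the gap grows with every multi-byte character that precedes the span. Anyone migrating from v3 offsets needs to slice the encoded bytes, not the Python `str`.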

Additional page features:

  • O(1) lookup: “which page is byte offset X on?” → instant answer
  • Per-page content extraction
  • Page markers in combined text (e.g., — Page 5 —)
  • Automatic chunk-to-page mapping for citations
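For intuition on mapping a byte offset back to a page, here is a sketch using a sorted list of page start offsets and binary search. Note the assumption: this is O(log n) per lookup, not the O(1) structure described above, and the function names are hypothetical.

```python
from bisect import bisect_right


def build_page_starts(pages: list[str]) -> list[int]:
    """Return the UTF-8 byte offset at which each page begins."""
    starts, offset = [], 0
    for page in pages:
        starts.append(offset)
        offset += len(page.encode("utf-8"))
    return starts


def page_for_offset(starts: list[int], byte_offset: int) -> int:
    """Return the 1-based page number containing byte_offset.

    Offsets past the end of the document map to the last page.
    """
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return bisect_right(starts, byte_offset)


# Toy pages; the first one contains a two-byte character.
pages = ["Première page. ", "Second page. ", "Third page."]
starts = build_page_starts(pages)
```

Combined with per-chunk byte offsets, a lookup like this is what lets a citation say "this chunk came from page 2".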

4. Enhanced Token Reduction for LLM Context

Enhanced from v3 with three configurable modes to save on LLM costs:

  • Light mode: ~15% reduction (preserve most detail)
  • Moderate mode: ~30% reduction (balanced)
  • Aggressive mode: ~50% reduction (key information only)

Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.
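The general technique can be sketched in a few lines of plain Python. This is a simplified illustration, not Kreuzberg's scoring: the stopword list, the smoothing in the IDF term, the position weight, and the `keep_ratio` parameter are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be language-specific.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def split_sentences(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOPWORDS]


def reduce_text(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each word appear?
    df = Counter(w for toks in tokens for w in set(toks))

    def score(pos: int) -> float:
        toks = tokens[pos]
        if not toks:
            return 0.0
        tf = Counter(toks)
        tfidf = sum((c / len(toks)) * math.log(1 + n / df[w]) for w, c in tf.items())
        # Position-aware weighting: earlier sentences get a small boost.
        return tfidf * (1.0 + 0.5 / (pos + 1))

    keep = max(1, math.ceil(n * keep_ratio))
    best = sorted(range(n), key=score, reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))


doc = (
    "Kreuzberg extracts text from many document formats. "
    "The text is nice. "
    "Streaming parsers keep memory stable on large XML documents. "
    "Thanks for reading."
)
reduced = reduce_text(doc, keep_ratio=0.5)
```

In this sketch, a mode like "moderate" would loosely correspond to a `keep_ratio` around 0.7, though sentence counts and token counts are not the same thing.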

5. Language Detection (NOW BUILT-IN)

  • Support for 68 languages with confidence scoring
  • Multi-language detection (documents with mixed languages)
  • ISO 639-1 and ISO 639-3 code support
  • Configurable confidence thresholds

6. Keyword Extraction (NOW BUILT-IN)

Now built into the core (in v3 this required the optional KeyBERT dependency):

  • YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent
  • RAKE (Rapid Automatic Keyword Extraction): Fast statistical method
  • Configurable n-grams (1-3 word phrases)
  • Relevance scoring with language-specific stopwords
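For readers unfamiliar with RAKE, here is a compact sketch of the classic algorithm in plain Python: candidate phrases are the word runs between stopwords and punctuation, each word is scored as degree divided by frequency, and a phrase scores the sum of its words. The stopword list and the `rake` function are illustrative only and are not Kreuzberg's API.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be language-specific.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "the", "to"}


def rake(text: str, max_words: int = 3) -> list[tuple[str, float]]:
    """Return (phrase, score) pairs, best first."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    phrases: list[list[str]] = []
    current: list[str] = []
    for tok in tokens:
        # Stopwords and punctuation act as phrase delimiters.
        if tok in STOPWORDS or not tok.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    # Limit candidates to n-grams of at most max_words words.
    phrases = [p for p in phrases if len(p) <= max_words]

    freq: dict[str, int] = defaultdict(int)
    degree: dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


keywords = rake(
    "Kreuzberg is a document intelligence toolkit for extracting text and metadata from documents."
)
```

Longer phrases made of rare words float to the top, which is why RAKE tends to surface multi-word technical terms.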

7. Plugin System (NEW)

Four extensible plugin types for customization:

  • DocumentExtractor - Custom file format handlers
  • OcrBackend - Custom OCR engines (integrate your own Python models)
  • PostProcessor - Data transformation and enrichment
  • Validator - Pre-extraction validation

Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.
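To give a feel for what a trait-style plugin with thread-safe registration looks like from the Python side, here is a self-contained sketch. Everything in it is hypothetical: the `PostProcessor` protocol, `Registry` class, and method names only mirror the plugin type names listed above and are not the actual Kreuzberg plugin API.

```python
import threading
from typing import Protocol


class PostProcessor(Protocol):
    """Structural interface, playing the role a trait plays in Rust."""

    name: str

    def process(self, text: str) -> str: ...


class Registry:
    """Holds processors and applies them in registration order."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._processors: dict[str, PostProcessor] = {}

    def register(self, processor: PostProcessor) -> None:
        with self._lock:
            self._processors[processor.name] = processor

    def run(self, text: str) -> str:
        # Snapshot under the lock so callbacks run without holding it.
        with self._lock:
            processors = list(self._processors.values())
        for processor in processors:
            text = processor.process(text)
        return text


class CollapseWhitespace:
    name = "collapse-whitespace"

    def process(self, text: str) -> str:
        return " ".join(text.split())


registry = Registry()
registry.register(CollapseWhitespace())
cleaned = registry.run("extracted   text\n with  gaps")
```

Any object with a matching `name` and `process` satisfies the protocol without inheriting from it.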

8. Production-Ready Servers (NEW)

  • HTTP REST API: Production-grade Axum server with OpenAPI docs
  • MCP Server: Direct integration with Claude Desktop, Continue.dev, and other MCP clients
  • MCP-SSE Transport (RC.8): Server-Sent Events for cloud deployments without WebSocket support
  • All three modes support the same feature set: extraction, batch processing, caching

Performance: Benchmarked Against the Competition

We maintain continuous benchmarks comparing Kreuzberg against the leading OSS alternatives:

Benchmark Setup

  • Platform: Ubuntu 22.04 (GitHub Actions)
  • Test Suite: 30+ documents covering all formats
  • Metrics: Latency (p50, p95), throughput …

Content cut off. Read original on old.reddit.com/…/kreuzberg_v400rc8_is_available/