November 17

Today's featured AI news is below 👇

🎙️ Ultravox: An open-source multimodal real-time speech model that understands text and speech directly in multiple languages, responds in about 150 ms, and is built on the Llama3.1 8B model.

👗 Comfyui_Object_Migration: A stable ComfyUI clothing migration workflow, enabling virtual try-on and anime-to-realistic style clothing transfer.

📑 MinerU: A powerful PDF document extraction tool, supporting structured extraction of various content, multilingual OCR, cross-platform usage, suitable for document processing scenarios.

Cutting-edge Technology

1. An open-source multimodal real-time speech model: Ultravox.

It understands text and human speech directly, without a separate automatic speech recognition (ASR) stage. Its response time is about 150 milliseconds, and it outputs 60 tokens per second when using the Llama3.1 8B model.

Detailed introduction: https://www.ultravox.ai/blog/ultravox-an-open-weight-alternative-to-gpt-4o-realtime

Online demo: https://huggingface.co/spaces/freddyaboulton/talk-to-ultravox

Currently, it can accept audio and output text, supporting multiple languages including Chinese, English, German, and more.
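
To put the latency figures quoted above in perspective, here is a rough back-of-the-envelope sketch of how long a reply of a given length might take. It assumes the 150 ms figure is the delay before the first token and that the remaining tokens stream at a steady 60 tokens per second; the function name and the reply lengths are illustrative only, and real latency will vary with hardware and load.

```python
def estimated_response_seconds(num_tokens: int,
                               first_token_latency_s: float = 0.150,
                               tokens_per_second: float = 60.0) -> float:
    """Rough time until the last of `num_tokens` tokens has been generated.

    Assumption: the first token arrives after `first_token_latency_s`,
    and every later token arrives at a constant `tokens_per_second` rate.
    """
    if num_tokens <= 0:
        return 0.0
    # One token is covered by the initial latency; the rest stream afterwards.
    return first_token_latency_s + (num_tokens - 1) / tokens_per_second


if __name__ == "__main__":
    # Hypothetical reply lengths: a single token, a short sentence, a paragraph.
    for n in (1, 30, 120):
        print(f"{n:>4} tokens: ~{estimated_response_seconds(n):.2f} s")
```

Under these assumptions, the generation rate rather than the initial delay dominates the total time for longer replies.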

Open Source Projects

1. A very stable ComfyUI clothing migration workflow: Comfyui_Object_Migration.

Given a single clothing photo, it transfers the garment onto a model while keeping the clothing consistent, producing natural, realistic results with fine details preserved, which makes it suitable for virtual try-on.

GitHub: https://github.com/TTPlanetPig/Comfyui_Object_Migration

It can also perform style transfer, converting anime-style clothing into realistic clothing, and the demo results are impressive.

2. A powerful open-source PDF document extraction tool: MinerU.

It extracts images, text, tables, footnotes, and other content while preserving the original PDF document structure, and it automatically recognizes formulas and tables, converting them to LaTeX and HTML respectively.

Main features include:

Removes headers, footers, footnotes, page numbers, etc., ensuring semantic coherence
Outputs text in human-readable order, suitable for single-column, multi-column, and complex layouts
Preserves original document structure, including titles, paragraphs, lists, etc.
Extracts images, image descriptions, tables, table captions, and footnotes
Automatically recognizes and converts formulas to LaTeX format
Automatically recognizes and converts tables to HTML format
Automatically detects scanned PDFs and corrupted PDFs, enabling OCR functionality
OCR supports detection and recognition of 84 languages
Supports multiple output formats, such as Markdown for multimodal and NLP use, JSON sorted by reading order, and an information-rich intermediate format
Supports various visualization results, including layout visualization, span visualization, etc., for efficient output verification and quality inspection
Supports CPU and GPU environments

GitHub: https://github.com/opendatalab/MinerU

It runs on Windows, macOS, and Linux. Give it a try if you need it.
