Today's picks from the cutting edge of AI 👇
🎙️ Ultravox: An open-source multimodal real-time speech model that understands text and speech directly in multiple languages, responds in about 150 ms, and is built on the Llama3.1 8B model.
👗 Comfyui_Object_Migration: A stable ComfyUI clothing migration workflow, enabling virtual try-on and anime-to-realistic style clothing transfer.
📑 MinerU: A powerful PDF document extraction tool, supporting structured extraction of various content, multilingual OCR, cross-platform usage, suitable for document processing scenarios.
Cutting-edge Technology
1. An open-source multimodal real-time speech model: Ultravox.
It understands text and human speech directly, with no separate automatic speech recognition (ASR) stage. Built on the Llama3.1 8B model, it responds in about 150 milliseconds and outputs 60 tokens per second.
Detailed introduction: https://www.ultravox.ai/blog/ultravox-an-open-weight-alternative-to-gpt-4o-realtime
Online demo: https://huggingface.co/spaces/freddyaboulton/talk-to-ultravox
It currently accepts audio as input and outputs text, and it supports multiple languages, including Chinese, English, and German.
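To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 150 ms figure is the delay before the first token arrives and that the 60 tokens per second rate holds steady afterwards; the function name and the reply lengths are illustrative and not part of Ultravox.

```python
def estimated_response_seconds(num_tokens: int,
                               first_token_latency_s: float = 0.150,
                               tokens_per_second: float = 60.0) -> float:
    """Rough time to receive a full reply of num_tokens tokens.

    Assumption: total time = delay before the first token
    + time to stream the tokens at a constant rate.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return first_token_latency_s + num_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative reply lengths, not measurements.
    for n in (30, 60, 120):
        print(f"{n} tokens: ~{estimated_response_seconds(n):.2f} s")
```

Under these assumptions, a short spoken-style reply of a few dozen tokens would finish streaming in roughly a second.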
Open Source Projects
1. A very stable clothing migration ComfyUI workflow: Comfyui_Object_Migration.
Given a single clothing photo, it transfers the garment onto a model while keeping the clothing consistent, natural, and realistic, with fine details preserved, which makes it suitable for virtual try-on.
GitHub: https://github.com/TTPlanetPig/Comfyui_Object_Migration
It can also perform style transfer, converting anime-style clothing into realistic clothing, and the demonstration results are impressive.
2. A powerful open-source PDF document extraction tool: MinerU.
It extracts images, text, tables, footnotes, and other content while preserving the original structure of the PDF, and it automatically converts formulas to LaTeX and tables to HTML.
Main features include:
- Removes headers, footers, footnotes, page numbers, etc., ensuring semantic coherence
- Outputs text in human-readable order, suitable for single-column, multi-column, and complex layouts
- Preserves original document structure, including titles, paragraphs, lists, etc.
- Extracts images, image descriptions, tables, table captions, and footnotes
- Automatically recognizes and converts formulas to LaTeX format
- Automatically recognizes and converts tables to HTML format
- Automatically detects scanned and corrupted PDFs and enables OCR for them
- OCR supports detection and recognition of 84 languages
- Supports multiple output formats, such as Markdown for multimodal and NLP use, JSON sorted by reading order, and an information-rich intermediate format
- Supports various visualization results, including layout visualization, span visualization, etc., for efficient output verification and quality inspection
- Supports CPU and GPU environments
GitHub: https://github.com/opendatalab/MinerU
It runs on Windows, macOS, and Linux. Give it a try if you need it.
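To illustrate what "human-readable order" means for a multi-column page, here is a deliberately simplified sketch. It is not MinerU's algorithm or API: the `Block` type, the `reading_order` function, and the fixed equal-width column assumption are all hypothetical, chosen only to show the idea of reading each column top to bottom before moving to the next.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block with the position of its top-left corner."""
    text: str
    x: float  # left edge, increasing to the right
    y: float  # top edge, increasing downward


def reading_order(blocks, page_width, columns=2):
    """Sort blocks column by column, top to bottom within each column.

    Assumption: the page is split into equal-width columns, which real
    layouts often violate (headings spanning columns, figures, and so on).
    """
    col_width = page_width / columns

    def key(block):
        # Clamp so a block at the right page edge stays in the last column.
        col = min(int(block.x // col_width), columns - 1)
        return (col, block.y)

    return sorted(blocks, key=key)


if __name__ == "__main__":
    page = [
        Block("right column, top", 320, 50),
        Block("left column, bottom", 40, 400),
        Block("left column, top", 40, 50),
    ]
    for block in reading_order(page, page_width=600):
        print(block.text)
```

A naive top-to-bottom sort would interleave the two columns line by line; grouping by column first is what keeps each paragraph contiguous in the extracted text.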