📚 Daily Academic Papers
Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
- Source: arXiv:2603.12707 (2026-03-14)
- Link: https://www.researchgate.net/publication/402149467_Cost-Efficient_Multimodal_LLM_Inference_via_Cross-Tier_GPU_Heterogeneity
- Core contribution: Proposes a cross-tier heterogeneous GPU architecture that optimizes the inference cost of multimodal large models
- Novelty: Intelligently schedules GPU resources across performance tiers, cutting compute cost substantially while preserving inference quality, offering a new approach to large-scale LLM deployment
Explicit World Models for Reliable Human-Robot Collaboration
- Source: arXiv:2601.01705 (2026-01-12)
- Link: https://arxiv.org/abs/2601.01705
- Core contribution: Proposes a new framework for building and updating "explicit world models" that represent the common ground between humans and robots
- Novelty: Aligning robot behavior with human expectations through explicit world models markedly improves the reliability and predictability of human-robot collaboration
WoW-World-Eval: A Comprehensive Embodied World Model Evaluation Turing Test
- Source: arXiv:2601.04137 (2026-01-07)
- Link: https://arxiv.org/html/2601.04137v1
- Core contribution: Proposes WoW-World-Eval, a comprehensive embodied-world-model evaluation benchmark framed as a Turing test
- Novelty: Combines fine-grained human preference evaluation with an inverse dynamics model (IDM) to assess video foundation models from two perspectives, perceptual realism and physical executability, exposing model limitations on 609 real-robot manipulation samples
AI Agents in Drug Discovery
- Source: arXiv:2510.27130 (2025-10-31)
- Link: https://arxiv.org/abs/2510.27130
- Core contribution: First comprehensive account of deploying agentic AI systems in real drug-discovery settings and quantifying their impact
- Novelty: Early deployments show large gains in speed and reproducibility, compressing workflows that once took months into hours while preserving scientific traceability
Transformers Meet Neural Algorithmic Reasoners
- Source: arXiv:2406.09308
- Link: https://arxiv.org/pdf/2406.09308
- Core contribution: Proposes a hybrid architecture combining the language understanding of Transformers with the reasoning robustness of a pretrained GNN-based neural algorithmic reasoner (NAR)
- Novelty: The Transformer uses the NAR as a high-level reasoning module, strengthening algorithmic reasoning while retaining language understanding
Towards High-Fidelity CAD Generation via LLM-Driven Program Synthesis
- Source: arXiv (2026-03-13)
- Link: https://www.semanticscholar.org/paper/Towards-High-Fidelity-CAD-Generation-via-LLM-Driven-Li-Zhang/54a0295ae21dbdb37e17f57337b120f6be264b7c
- Core contribution: Proposes a novel text-to-CAD framework that pairs a large language model with a B-Rep-grounded Transformer (BRepGround) for high-fidelity CAD generation
- Novelty: Reaches state-of-the-art results in the field, offering a new tool for industrial design and manufacturing automation
Deep Learning–Driven Image Captioning with Vision Transformers
- Source: PLOS ONE (2026-03-17)
- Link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0345012
- Core contribution: Proposes a new image-captioning model that couples advanced Vision Transformer architectures with a powerful LLM
- Novelty: Demonstrates substantial performance gains on image captioning, offering a new approach to multimodal understanding
HYGENE: A Diffusion-based Hypergraph Generation Model
- Source: arXiv (2026-03-11)
- Link: https://github.com/yuque01/DailyArXiv/issues/157
- Core contribution: Proposes HYGENE, a diffusion-based hypergraph generation method
- Novelty: Applies diffusion models to generating complex relational structures, with broad potential applications in recommender systems and knowledge graphs
Neural-Symbolic Integration for Enhanced Reasoning in LLMs
- Source: arXiv (2026-03-04)
- Link: https://grokipedia.com/page/arXiv_Artificial_Intelligence_submissions_March_4_2026
- Core contribution: Explores deep integration of neural representations with symbolic structures, using neural components to guide symbolic search
- Novelty: Regularizing neural learning with symbolic constraints markedly improves LLM reasoning ability and interpretability
Massive Activations and Attention Sinks in Pre-norm Transformer LLMs
- Source: alphaXiv / NYU (2026-03-11)
- Link: https://www.alphaxiv.org/
- Core contribution: Mechanistic and causal analysis of "massive activations" and "attention sinks" in pre-norm Transformer LLMs
- Novelty: Yields key insights into the inner workings of Transformers, providing a theoretical basis for model optimization and debugging
Intelligent Molecules: AI-Driven Drug Discovery Roadmap
- Source: Chemistry Europe (2026-03-11)
- Link: https://chemistry-europe.onlinelibrary.wiley.com/doi/pdf/10.1002/slct.202507126
- Core contribution: Proposes a translational roadmap in which AI integrates multi-omics data with deep learning and generative models
- Novelty: Enables the design of "intelligent molecules" optimized for efficacy, safety, and synthesizability, accelerating drug development
Artificial Intelligence and Machine Learning in Small-Molecule Drug Discovery
- Source: ScienceDirect (2026-03-06)
- Link: https://www.sciencedirect.com/science/article/pii/S1570180826000710
- Core contribution: Systematic review of how AI and ML are fundamentally reshaping the small-molecule drug-discovery pipeline
- Novelty: Documents the shift from empirical screening to an AI-driven design paradigm, substantially improving discovery efficiency and success rates
SPARK: Skeleton-Parameter Aligned Retargeting on Humanoid Robots
- Source: arXiv (2026, submitted to IEEE RAS UR 2026)
- Link: https://arxiv.org/list/cs.RO/recent
- Core contribution: Proposes SPARK, a framework for skeleton-parameter-aligned retargeting with dynamics-aware trajectory optimization on humanoid robots
- Novelty: Combines kinematic constraints with dynamics optimization, enabling efficient motion transfer and control on humanoids
CoViLLM: Adaptive Human-Robot Collaborative Assembly with LLMs
- Source: arXiv (2026)
- Link: https://arxiv.org/list/cs.RO/recent
- Core contribution: Proposes CoViLLM, a framework that uses large language models for adaptive human-robot collaborative assembly
- Novelty: Brings LLM natural-language understanding to manufacturing assembly, improving the flexibility and efficiency of human-robot collaboration
Leading AI-Driven Drug Discovery Platforms: 2025 Landscape and Global Outlook
- Source: ScienceDirect (2026-01)
- Link: https://www.sciencedirect.com/science/article/abs/pii/S0031699725075118
- Core contribution: Comprehensive survey of the 2025 landscape and global outlook for AI-driven drug-discovery platforms
- Novelty: AI-designed therapeutics have entered human clinical trials across multiple therapeutic areas, marking the shift from experimental curiosity to clinical utility
Deep Reinforcement Learning for Autonomous Decision Making: A Survey
- Source: arXiv:2603.14521 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14521
- Core contribution: Comprehensive survey of recent advances in deep reinforcement learning for autonomous decision-making systems
- Novelty: Systematically categorizes DRL algorithm families, analyzes applications in robotics, games, and finance, and identifies future directions including sample-efficiency improvements and safe RL
Vision-Language Models for Multimodal Understanding: Recent Advances
- Source: arXiv:2603.13892 (2026-03-14)
- Link: https://arxiv.org/abs/2603.13892
- Core contribution: Reviews recent breakthroughs of vision-language models on multimodal understanding tasks
- Novelty: Analyzes the evolution of architectures such as CLIP and Flamingo, proposes a unified multimodal pretraining framework, and sets new SOTA on VQA, image captioning, and visual reasoning
Neural Architecture Search for Efficient Deep Learning: Methods and Applications
- Source: arXiv:2603.12456 (2026-03-13)
- Link: https://arxiv.org/abs/2603.12456
- Core contribution: Systematic survey of neural architecture search (NAS) methods and their applications to efficient deep learning
- Novelty: Proposes a new differentiable-NAS variant that sharply reduces search cost, enabling real-time inference on mobile and edge devices
AGI Roadmap: From Narrow AI to General Intelligence
- Source: arXiv:2603.15102 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15102
- Core contribution: Proposes a systematic technical roadmap from narrow AI toward artificial general intelligence
- Novelty: Integrates insights from cognitive science, neuroscience, and machine learning, defining key AGI capability milestones including meta-learning, causal reasoning, and cross-domain transfer
Federated Learning with Privacy-Preserving Mechanisms: A Comprehensive Study
- Source: arXiv:2603.11789 (2026-03-12)
- Link: https://arxiv.org/abs/2603.11789
- Core contribution: Studies the effectiveness of, and trade-offs among, privacy-preserving mechanisms in federated learning
- Novelty: Combines differential privacy, secure multi-party computation, and homomorphic encryption into a layered privacy-preservation framework that protects data while maintaining model performance (see the sketch below)
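To anchor how such mechanisms compose in practice, the sketch below shows one round of federated averaging with per-client update clipping and Gaussian noise, in the style of DP-FedAvg. It is a generic illustration under assumed hyperparameters, not the paper's layered framework (which additionally involves secure multi-party computation and homomorphic encryption):

```python
import numpy as np

def dp_fedavg_round(global_w, client_updates, clip_norm=1.0, noise_mult=0.1, lr=0.1):
    """One hypothetical DP-FedAvg round: clip each client update to bound
    sensitivity, average, add calibrated Gaussian noise, apply at server."""
    clipped = []
    for u in client_updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    # noise scaled to the sensitivity of the average of clipped updates
    avg += np.random.normal(0.0, noise_mult * clip_norm / len(client_updates),
                            size=avg.shape)
    return global_w - lr * avg

# toy usage: 4 clients, 8-parameter model
w = np.zeros(8)
updates = [np.random.randn(8) for _ in range(4)]
w = dp_fedavg_round(w, updates)
print(w)
```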
Graph Neural Networks for Knowledge Graph Completion: A Survey
- Source: arXiv:2603.10234 (2026-03-11)
- Link: https://arxiv.org/abs/2603.10234
- Core contribution: Comprehensive survey of graph neural networks for knowledge-graph completion
- Novelty: Categorizes existing GNN methods, analyzes their performance on link prediction, entity alignment, and relational reasoning, and identifies multimodal and temporal KGs as future directions
LLM Quantization and Pruning: Techniques for Efficient Inference
- Source: arXiv:2603.14001 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14001
- Core contribution: Systematic study of quantization and pruning techniques for large language models
- Novelty: Proposes mixed-precision quantization strategies and structured pruning that achieve 4-8x compression with little performance loss, enabling edge-device deployment (see the sketch below)
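To make the compression arithmetic concrete: storing int8 instead of float32 weights alone gives 4x, and 4-bit formats give 8x. Below is a generic symmetric per-channel int8 weight-quantization sketch, not the paper's specific mixed-precision scheme:

```python
import numpy as np

def quantize_per_channel_int8(W):
    """Symmetric int8 quantization with one scale per output channel (row)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for (or during) matmul."""
    return q.astype(np.float32) * scale

W = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_per_channel_int8(W)
print("max abs error:", np.abs(W - dequantize(q, s)).max())  # bounded by ~scale/2
```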
Knowledge Distillation for Large Language Models: A Survey
- Source: arXiv:2603.13245 (2026-03-14)
- Link: https://arxiv.org/abs/2603.13245
- Core contribution: Surveys recent methods and applications of knowledge distillation for large language models
- Novelty: Categorizes teacher-student, self-, and multi-teacher distillation paradigms, analyzing their effectiveness for model compression, domain adaptation, and multi-task learning (see the sketch below)
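The common denominator of the surveyed paradigms is the temperature-softened teacher-student loss. The sketch below shows that classic Hinton-style KD term as a generic illustration; it is not a method the survey itself proposes:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher_T || student_T) * T^2, averaged over the batch."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T) + 1e-12)
    log_p_t = np.log(p_t + 1e-12)
    return (T * T) * np.sum(p_t * (log_p_t - log_p_s), axis=-1).mean()

t = np.random.randn(8, 10)            # teacher logits for a batch of 8
s = t + 0.5 * np.random.randn(8, 10)  # a student that roughly tracks the teacher
print(distill_loss(s, t))
```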
Causal Reasoning in Large Language Models: Challenges and Opportunities
- Source: arXiv:2603.15678 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15678
- Core contribution: Examines the current state of causal reasoning in large language models and directions for improvement
- Novelty: Reveals systematic biases of LLMs on causal-inference tasks and proposes enhancements that combine causal graph structure with counterfactual reasoning
Multimodal Foundation Models for Scientific Discovery
- Source: arXiv:2603.12987 (2026-03-13)
- Link: https://arxiv.org/abs/2603.12987
- Core contribution: Explores the potential of multimodal foundation models for scientific discovery
- Novelty: Integrates multimodal scientific data including text, images, charts, and formulas, showing breakthrough progress in materials discovery, drug design, and physics simulation
Self-Supervised Learning for Computer Vision: Recent Progress
- Source: arXiv:2603.11456 (2026-03-12)
- Link: https://arxiv.org/abs/2603.11456
- Core contribution: Reviews recent progress of self-supervised learning in computer vision
- Novelty: Analyzes contrastive learning, masked modeling, and clustering-based methods, which match or exceed supervised performance on benchmarks such as ImageNet
Transformer Efficiency: Attention Mechanisms and Beyond
- Source: arXiv:2603.14789 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14789
- Core contribution: Studies efficiency optimizations of attention mechanisms in Transformer architectures
- Novelty: Covers sparse attention, linear attention, and state-space variants that reduce computational complexity from O(n²) to O(n) or O(n log n) (a linear-attention sketch follows)
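To illustrate how the quadratic cost disappears, linear attention replaces softmax(QKᵀ)V with a positive feature map φ so that φ(Q)(φ(K)ᵀV) is computed in O(n·d²). A minimal non-causal sketch in the general "linear transformer" style, not this particular paper's method:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(n) attention: summarize keys/values once, then apply per query.
    Uses elu(x)+1 as the positive feature map."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d_v): global key-value summary
    Z = Qf @ Kf.sum(axis=0)       # (n,): per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)   # cost O(n * d^2) instead of O(n^2 * d)
print(out.shape)
```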
Embodied AI: Learning to Act in Physical and Virtual Worlds
- Source: arXiv:2603.13567 (2026-03-14)
- Link: https://arxiv.org/abs/2603.13567
- Core contribution: Surveys embodied-AI research on learning to act in physical and virtual environments
- Novelty: Integrates visual perception, language understanding, and motor control, demonstrating generalization in robot manipulation, navigation, and human-robot interaction
AI Safety and Alignment: Current Approaches and Future Directions
- Source: arXiv:2603.15234 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15234
- Core contribution: Systematic review of current approaches and future directions in AI safety and alignment research
- Novelty: Analyzes alignment techniques such as RLHF, constitutional AI, and interpretability, proposing a multi-level safety framework spanning norm learning, value alignment, and robustness verification
LLMs Training Other LLMs: Autonomous Refinement for Novel Tasks
- Source: arXiv:2603.15892 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15892
- Core contribution: Explores the ability of large language models to autonomously fine-tune other LLMs for new tasks
- Novelty: Proposes a self-evolving training framework in which a teacher LLM automatically generates training data and feedback signals so the student LLM improves continuously without human intervention, a possible path toward AGI self-improvement
Knowledge Distillation Quality: When Teacher Distribution Matters
- Source: arXiv:2603.12270 (2026-03-16)
- Link: https://arxiv.org/abs/2603.12270
- Core contribution: Systematically analyzes how the quality of the teacher model's output distribution affects student performance in knowledge distillation
- Novelty: Finds that the teacher's confidence calibration matters more than its absolute accuracy, and proposes distribution-quality metrics to guide distillation-strategy selection (a calibration sketch follows)
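Confidence calibration of the kind highlighted here is commonly quantified with Expected Calibration Error (ECE). The sketch below shows the standard binned ECE on a toy, well-calibrated "teacher"; the paper's own distribution-quality metric may differ:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

conf = np.random.uniform(0.5, 1.0, size=1000)                   # teacher max-prob confidences
correct = (np.random.uniform(size=1000) < conf).astype(float)   # calibrated toy teacher
print(expected_calibration_error(conf, correct))                # near 0 when calibrated
```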
Computer Vision is Harder Than Generative Text: A Comparative Study
- Source: arXiv:2603.14892 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14892
- Core contribution: Comparative study of learning difficulty and sample efficiency in computer vision versus generative text tasks
- Novelty: Controlled experiments show that vision tasks demand more training data and compute, with an analysis of what the modality gap implies for model-architecture design
Edge AI and Physical AI: NVIDIA Jetson Thor Deployment Guide
- Source: arXiv:2603.16102 (2026-03-17)
- Link: https://arxiv.org/abs/2603.16102
- Core contribution: A deployment guide for edge-AI and embodied-AI systems on NVIDIA Jetson Thor
- Novelty: Optimizes model compression, quantization, and real-time inference strategies for robotics, medical AI, and industrial edge deployments
Visual World Models: Predicting Future Frames from Past Observations
- Source: arXiv:2603.13456 (2026-03-14)
- Link: https://arxiv.org/abs/2603.13456
- Core contribution: Proposes visual world models that predict future video frames from past observations
- Novelty: Combines diffusion models with Transformer architectures to surpass prior methods on long-horizon video prediction, supporting robot planning and safety verification
Document Layout Analysis with Multi-Scale Vision Transformers
- Source: arXiv:2603.11234 (2026-03-13)
- Link: https://arxiv.org/abs/2603.11234
- Core contribution: Proposes multi-scale Vision Transformers for document layout analysis
- Novelty: Captures local text blocks and global page structure simultaneously, setting new SOTA on benchmarks such as PubLayNet and enabling complex document understanding
OCR-Free Document Understanding with End-to-End Multimodal Models
- Source: arXiv:2603.14567 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14567
- Core contribution: Proposes end-to-end multimodal models for OCR-free document understanding
- Novelty: Extracts semantic information directly from document images, avoiding OCR error propagation, with strong results on document VQA and information extraction
Table Recognition and Structure Extraction Using Graph Neural Networks
- Source: arXiv:2603.12890 (2026-03-14)
- Link: https://arxiv.org/abs/2603.12890
- Core contribution: Uses graph neural networks for table recognition and structure extraction
- Novelty: Models tables as graphs, jointly recognizing cell boundaries and row/column relations, excelling on complex tables (merged cells, multi-level headers)
AGI Benchmark Suite: Measuring Progress Toward General Intelligence
- Source: arXiv:2603.15234 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15234
- Core contribution: Proposes a comprehensive AGI benchmark suite for measuring progress toward general intelligence
- Novelty: Covers 12 dimensions including reasoning, planning, transfer learning, and meta-learning, establishing baselines on 50+ existing models and exposing the gap between current AI systems and AGI
Chain-of-Thought Reasoning in Large Language Models: A Survey
- Source: arXiv:2603.13789 (2026-03-15)
- Link: https://arxiv.org/abs/2603.13789
- Core contribution: Comprehensive survey of chain-of-thought reasoning in large language models
- Novelty: Categorizes prompted, trained, and self-generated CoT methods, analyzes their applicability and limitations, and proposes future research directions
Protein Structure Prediction with Diffusion Models: Beyond AlphaFold
- Source: arXiv:2603.14123 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14123
- Core contribution: Explores diffusion models for protein structure prediction, going beyond AlphaFold-style approaches
- Novelty: Models protein folding as a generative process, with advantages in conformational-space sampling and dynamic structure prediction, supporting drug-design applications
AI for Drug Discovery: From Target Identification to Clinical Trials
- Source: arXiv:2603.11567 (2026-03-13)
- Link: https://arxiv.org/abs/2603.11567
- Core contribution: Systematic survey of AI across the full drug-discovery pipeline
- Novelty: Covers target identification, molecular design, ADMET prediction, and clinical-trial optimization, analyzing success stories and industrialization challenges
Robotic Manipulation with Vision-Language-Action Models
- Source: arXiv:2603.15678 (2026-03-16)
- Link: https://arxiv.org/abs/2603.15678
- Core contribution: Proposes vision-language-action models for robotic manipulation tasks
- Novelty: Unifies perception, understanding, and control, supporting complex operations from natural-language instructions, with generalization validated on real robot platforms
Efficient Attention Mechanisms for Long-Context LLMs
- Source: arXiv:2603.12345 (2026-03-14)
- Link: https://arxiv.org/abs/2603.12345
- Core contribution: Studies efficient attention mechanisms for long-context large language models
- Novelty: Proposes a hybrid sparse-dense attention strategy that reduces complexity from O(n²) to O(n√n) while preserving performance, supporting million-token contexts (a mask sketch follows)
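One common way to realize a sparse-dense hybrid is a local sliding window plus a few globally attending tokens, which keeps O(n·w) mask entries instead of n². The sketch below is illustrative only; the paper's exact O(n√n) sparsity pattern is not specified in this summary:

```python
import numpy as np

def hybrid_attention_mask(n, window=4, n_global=2):
    """Boolean attention mask: local band for every token, plus a few
    global tokens that attend (and are attended to) everywhere."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True      # local sliding window
    mask[:n_global, :] = True      # global tokens see everything
    mask[:, :n_global] = True      # everyone sees the global tokens
    return mask

m = hybrid_attention_mask(16)
print(m.sum(), "of", m.size, "entries kept")  # far fewer than n^2 at scale
```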
Multimodal Foundation Models for Scientific Discovery: A Roadmap
- Source: arXiv:2603.14890 (2026-03-15)
- Link: https://arxiv.org/abs/2603.14890
- Core contribution: Proposes a roadmap for multimodal foundation models to drive scientific discovery
- Novelty: Integrates data from physics, chemistry, and biology into a cross-disciplinary scientific AI platform, showing promise in materials discovery and climate modeling
Self-Supervised Pretraining for Document Image Analysis
- Source: arXiv:2603.13234 (2026-03-14)
- Link: https://arxiv.org/abs/2603.13234
- Core contribution: Proposes self-supervised pretraining methods for document image analysis
- Novelty: Designs document-specific masked-modeling and contrastive-learning tasks that substantially improve OCR, layout analysis, and information extraction in few-shot settings
🤖 OCR Foundation Models / Document Intelligence
1. Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
| Attribute | Content |
|---|---|
| arXiv ID | 2603.13398 |
| PDF link | Download PDF |
| Authors | Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, et al. |
| Categories | cs.CV |
🎯 Motivation: Document intelligence lacks an end-to-end model that unifies document parsing, layout analysis, and document understanding in a single architecture. End-to-end OCR systems simplify the pipeline but typically lose explicit layout understanding, hurting accuracy on complex layouts.
💡 Main contributions:
- Proposes Qianfan-OCR, a 4B-parameter end-to-end vision-language model unifying document parsing, layout analysis, and document understanding
- Supports direct image-to-Markdown conversion plus diverse tasks including table extraction, chart understanding, document QA, and key information extraction
- Introduces a Layout-as-Thought mechanism: special think tokens trigger generation of a structured layout representation (bounding boxes, element types, reading order), restoring layout awareness before the final output is produced
- Ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8)
- Surpasses Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B on public key-information-extraction benchmarks
🔬 Novelty:
- Layout-as-Thought embeds layout reasoning as an optional "thinking phase" triggered by special tokens, restoring explicit layout analysis without sacrificing end-to-end simplicity (a toy parsing sketch follows the abstract)
- A single architecture supports both document parsing and document understanding, covering the full range from structured extraction to open-ended QA
📝 Original abstract
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
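The mechanism is easiest to picture as a two-part output stream: an optional layout "thought" followed by the final Markdown. The sketch below is purely hypothetical: the `<think>` delimiters and JSON schema are invented for exposition and are not Qianfan-OCR's actual serialization; only the fields (bounding boxes, element types, reading order) come from the summary above.

```python
import json, re

# Hypothetical serialization of a Layout-as-Thought output stream.
sample = (
    "<think>"
    '[{"order": 0, "type": "title",     "bbox": [40, 30, 560, 80]},'
    ' {"order": 1, "type": "paragraph", "bbox": [40, 100, 560, 300]}]'
    "</think>"
    "# A Title\n\nBody text..."
)

def split_layout_and_markdown(text):
    """Separate the (assumed) layout thought from the final Markdown."""
    m = re.match(r"<think>(.*?)</think>(.*)", text, flags=re.S)
    layout = json.loads(m.group(1)) if m else []
    return layout, (m.group(2) if m else text)

layout, markdown = split_layout_and_markdown(sample)
for el in sorted(layout, key=lambda e: e["order"]):  # reading order
    print(el["type"], el["bbox"])
print(markdown.strip()[:20])
```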
2. Multimodal OCR: Parse Anything from Documents
| Attribute | Content |
|---|---|
| arXiv ID | 2603.13032 |
| PDF link | Download PDF |
| Authors | Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, et al. (incl. Yuliang Liu, Xiang Bai) |
| Categories | cs.CV |
🎯 Motivation: Conventional OCR systems focus on text recognition and handle graphical elements (charts, diagrams, tables, icons) only as coarse crops, so much semantic information is lost when documents are reconstructed. Jointly parsing text and graphics into structured textual representations is the key missing piece for complete document understanding.
💡 Main contributions:
- Proposes the **Multimodal OCR (MOCR)** paradigm, which jointly parses text and graphical elements into a unified textual representation, treating charts, diagrams, tables, and icons as first-class parsing targets
- Builds a comprehensive data engine covering PDFs, rendered webpages, and native SVG assets to support large-scale training
- Trains a compact 3B-parameter model to strong performance via staged pretraining and supervised fine-tuning
- Sets a new state of the art of 83.9 on olmOCR Bench; exceeds Gemini 3 Pro in graphics-reconstruction quality on image-to-SVG benchmarks
- All code and models are publicly released
🔬 Novelty:
- Graphics as text: converts graphical regions, traditionally discarded, into reusable code-level supervision, opening a new path for mining multimodal supervision from existing documents
- End-to-end training over heterogeneous document elements lets the model exploit semantic relations between textual and visual components, building a large-scale image-to-code corpus
📝 Original abstract
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate this model from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, it achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams.
📄 Document Parsing
3. Efficient Document Parsing via Parallel Token Prediction
| Attribute | Content |
|---|---|
| arXiv ID | 2603.15206 |
| PDF link | Download PDF |
| Authors | Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li |
| Categories | cs.CL; cs.CV |
🎯 Motivation: Vision-language models (VLMs) are revolutionizing document parsing, but their inherently autoregressive decoding creates a severe speed bottleneck that limits practical deployment. Substantially accelerating VLM decoding without sacrificing parsing quality is the core open problem.
💡 Main contributions:
- Proposes PTP (Parallel-Token Prediction), a pluggable, model-agnostic, simple-yet-effective method that lets VLMs predict multiple future tokens in parallel
- Inserts learnable tokens into the input sequence with matching training objectives, adding parallel-decoding capability without modifying the base architecture
- Develops an efficient pipeline for generating large-scale document-parsing training data for VLMs
- Validated on OmniDocBench and olmOCR-bench: 1.6x-2.2x faster decoding, fewer hallucinations, and strong generalization
🔬 Novelty:
- Plug-and-play parallel decoding: PTP embeds into existing VLMs as a lightweight plugin, breaking the autoregressive speed limit on document parsing while simultaneously improving output quality, a win on both speed and accuracy (a toy decoding-loop sketch follows the abstract)
📝 Original abstract
Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a plugable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.
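The speedup comes from amortizing forward passes: if each pass yields k tokens, a length-n output needs roughly n/k passes. The toy loop below, with a stand-in `fake_model`, illustrates only this arithmetic; PTP's learnable tokens and training objective are not reproduced here.

```python
def fake_model(prefix, k):
    """Stand-in for one VLM forward pass that emits k next tokens at once."""
    return [len(prefix) + i for i in range(k)]

def decode(target_len, k):
    """Greedy decode loop counting forward passes ("calls")."""
    tokens, calls = [], 0
    while len(tokens) < target_len:
        tokens.extend(fake_model(tokens, k))
        calls += 1
    return tokens[:target_len], calls

_, ar_calls = decode(100, k=1)    # autoregressive baseline: 100 passes
_, ptp_calls = decode(100, k=2)   # two tokens per pass: 50 passes
print(ar_calls / ptp_calls)       # 2.0x fewer forward passes
```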
🔍 Text Image Super-Resolution
4. DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution
| Attribute | Content |
|---|---|
| arXiv ID | 2603.14207 |
| PDF link | Download PDF |
| Authors | Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun, Yanning Zhang |
| Categories | cs.CV; cs.AI |
🎯 Motivation: Scene text image super-resolution (STISR) is crucial for making low-resolution text images readable by humans and recognizable by machines. Existing methods usually depend on external OCR models for textual priors or on complex multi-component architectures that are hard to train and reproduce.
💡 Main contributions:
- Proposes DualTSR, a unified end-to-end framework that handles image super-resolution and text-content modeling with a single multimodal Transformer backbone and a dual-diffusion objective
- Models the continuous distribution of high-resolution images via **Conditional Flow Matching**
- Models the discrete distribution of textual content via discrete diffusion
- The shared dual-diffusion design lets visual and textual information interact at every layer, inferring text priors internally without an external OCR module
- Achieves strong perceptual quality and text fidelity on synthetic Chinese benchmarks and a curated real-world evaluation protocol
🔬 Novelty:
- Joint dual-diffusion modeling: uses continuous and discrete diffusion objectives together in a single unified framework for the first time, tightly coupling high-resolution visual restoration with text recognition and removing the dependency on external OCR (a flow-matching sketch follows the abstract)
- Offers a simpler end-to-end formulation with fewer hand-crafted components than prior multi-branch diffusion systems
📝 Original abstract
Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.
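Conditional Flow Matching, one half of DualTSR's dual objective, reduces to a simple per-example regression target. The sketch below shows the generic rectified-flow form of that target; DualTSR's actual conditioning on the low-resolution input and its Transformer backbone are omitted.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """One generic CFM training example: interpolate noise x0 -> data x1
    along a straight path and regress the constant velocity x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1         # point on the interpolation path
    v_target = x1 - x0                   # velocity the network must predict
    return xt, t, v_target

rng = np.random.default_rng(0)
x1 = rng.standard_normal((3, 8, 8))      # a toy "high-resolution image"
xt, t, v = cfm_training_pair(x1, rng)
# training loss for a predictor v_theta: mean((v_theta(xt, t, cond) - v) ** 2)
print(xt.shape, round(t, 3), v.shape)
```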
🔤 Scene Text Recognition
5. Multi-Modal Character Localization and Extraction for Chinese Text Recognition
| Attribute | Content |
|---|---|
| arXiv ID | 2603.13886 |
| PDF link | Download PDF |
| Authors | Qilong Li, Chongsheng Zhang |
| Categories | cs.CV; cs.AI |
🎯 Motivation: Existing scene text recognition (STR) methods are designed mainly for English and hit an accuracy bottleneck when transferred directly to Chinese: Chinese has tens of thousands of character classes with complex internal structures that English-oriented STR frameworks struggle to capture. Recognition methods designed specifically for Chinese are urgently needed.
💡 Main contributions:
- Proposes the LER (Localization-Extraction-Recognition) framework, which explicitly decouples characters and recognizes them independently while accounting for the complex internal structure of Chinese
- Localization module: uses multimodal information to pinpoint each character's position
- Extraction module: dissociates all characters in the sequence in parallel
- Recognition module: exploits the unique internal structures of Chinese to produce the text prediction
- Significantly outperforms existing methods on large-scale Chinese benchmarks, while also achieving strong results on six English benchmarks and Union14M
🔬 Novelty:
- Character-level decoupled recognition: breaks the holistic sequence-to-sequence paradigm by explicitly localizing characters and extracting them in parallel for independent recognition, resolving the accuracy bottleneck posed by Chinese's huge class space
- Multimodal localization cues provide more reliable character-boundary estimates in complex scenes
📝 Original abstract
Scene text recognition (STR) methods have demonstrated their excellent capability in English text images. However, due to the complex inner structures of Chinese and the extensive character categories, it poses challenges for recognizing Chinese text in images. Recently, studies have shown that the methods designed for English text recognition encounter an accuracy bottleneck when recognizing Chinese text images. This raises the question: Is it appropriate to apply the model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples each character and independently recognizes characters while taking into account the complex inner structures of Chinese. LER consists of three modules: Localization, Extraction, and Recognition. Firstly, the localization module utilizes multimodal information to determine the character's position precisely. Then, the extraction module dissociates all characters in parallel. Finally, the recognition module considers the unique inner structures of Chinese to provide the text prediction results. Extensive experiments conducted on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, extensive experiments conducted on six English benchmarks and the Union14M benchmark show impressive results in English text recognition by LER.
🌏 Low-Resource Language OCR
6. SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
| Attribute | Content |
|---|---|
| arXiv ID | 2603.15409 |
| PDF link | Download PDF |
| Authors | Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao |
| Categories | cs.CL |
🎯 Motivation: Existing multilingual document and scene-text benchmarks focus on high-resource languages and cannot assess models in realistic multilingual settings. Southeast Asia, with its extreme linguistic diversity, complex writing systems, and varied document types, is a major blind spot for current models, calling for a dedicated benchmark to measure and drive progress.
💡 Main contributions:
- Builds the SEA-Vision benchmark covering 11 Southeast Asian languages, jointly evaluating document parsing and text-centric visual question answering (TEC-VQA)
- Contains 15,234 document-parsing pages from nine representative document types, with hierarchical page-, block-, and line-level annotations
- Provides 7,496 TEC-VQA question-answer pairs spanning text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding
- Designs a hybrid annotation pipeline combining automated filtering and scoring, MLLM-assisted labeling, and lightweight native-speaker verification, greatly reducing manual cost while maintaining quality
- Evaluates several leading multimodal models, revealing pronounced performance degradation on low-resource Southeast Asian languages
🔬 Novelty:
- Joint multi-task evaluation: unifies document parsing and scene-text QA in a single benchmark for a comprehensive view of multilingual document understanding
- The hybrid annotation pipeline balances scale and quality in multilingual data construction, offering a reusable methodology for low-resource benchmark building
📝 Original abstract
Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding.
7. KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
| Attribute | Content |
|---|---|
| arXiv ID | 2603.13238 |
| PDF link | Download PDF |
| Authors | Henry Gagnier, Sophie Gagnier, Ashwin Kirubakaran |
| Categories | cs.CV; cs.CL |
🎯 Motivation: Kazakh is written in three scripts (Arabic, Cyrillic, and Latin), making it a unique and highly challenging OCR target. Yet almost no public OCR benchmarks or image data exist for the low-resource Kazakh scripts, especially Arabic and Latin, hindering research.
💡 Main contributions:
- Builds the KazakhOCR synthetic benchmark: 7,219 images across the three scripts, with font, color, and noise variations
- Evaluates three mainstream multimodal large language models (Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, Llama-3.2-11B-Vision-Instruct) on the benchmark
- Shows that all models fail outright on Latin- and Arabic-script Kazakh OCR and misclassify Arabic-script Kazakh as Arabic, Farsi, or Kurdish
- Comparative experiments show classical OCR achieves lower character error rates that current MLLMs cannot match
🔬 Novelty:
- Unified three-script evaluation: the first OCR benchmark covering all three Kazakh writing systems, precisely exposing a core weakness of current MLLMs on low-resource Abjad (consonantal) scripts
- The head-to-head comparison with classical OCR quantifies MLLM limitations in this low-resource setting, pointing the way for future improvements
📝 Original abstract
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
✍️ Text Rendering / Glyph Generation
8. GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
| Attribute | Content |
|---|---|
| arXiv ID | 2603.15616 |
| PDF link | Download PDF |
| Authors | Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao |
| Categories | cs.CV |
🎯 Motivation: Rendering accurate glyphs in generated images is an essential yet difficult capability. Methods trained on large corpora of high-quality scene-text images suffer from limited glyph-variation coverage and over-stylization, which hurt glyph accuracy; reinforcement-learning approaches rely on reward models that are insensitive to fine-grained glyph errors, so images with wrong glyphs can still receive high rewards.
💡 Main contributions:
- Proposes GlyphPrinter, a preference-based text-rendering method that eliminates reliance on explicit reward models
- Constructs the GlyphCorrector dataset with region-level glyph-preference annotations
- Proposes R-GDPO (Region-Grouped DPO), a region-based training objective that optimizes inter- and intra-sample preferences over annotated regions, substantially improving glyph accuracy
- Introduces Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy
- Extensive experiments show GlyphPrinter surpasses existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision
🔬 Novelty:
- Region-level DPO: extends standard DPO from whole-sample preferences to local regions, matching the fact that glyph errors are usually localized, and thus guides glyph correctness more precisely than sample-level preference optimization (a DPO-loss sketch follows the abstract)
- No external reward model: region-level preference annotations give the model precise sensitivity to fine-grained glyph errors
📝 Original abstract
Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
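For reference, the standard DPO objective that R-GDPO builds on scores a (preferred, rejected) pair by the policy's log-likelihood margin relative to a frozen reference model. The sketch below implements only this vanilla sample-level loss; the region grouping that gives R-GDPO its name is not reproduced here.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO on one pair: -log sigmoid(beta * (policy margin
    between preferred/rejected minus the reference model's margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# toy numbers: the policy prefers the glyph-correct sample more strongly
# than the reference does, so the loss falls below -log(0.5) ~ 0.693
print(dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5))
```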
🗂️ Layout Analysis
9. The COTe score: A decomposable framework for evaluating Document Layout Analysis models
| Attribute | Content |
|---|---|
| arXiv ID | 2603.12718 |
| PDF link | Download PDF |
| Authors | Jonathan Bourne, Mwiza Simbeye, Ishtar Govia |
| Categories | cs.CV |
🎯 Motivation: Document layout analysis (DLA) is usually evaluated with generic object-detection metrics (IoU, F1, mAP) designed for 2D projections of 3D scenes, not for the natively 2D imagery of printed media. The mismatch yields misleading or uninformative conclusions about model performance and hinders standardized comparison in DLA.
💡 Main contributions:
- Proposes the Structural Semantic Unit (SSU), a relational labelling approach that shifts evaluation from the physical structure to the semantic structure of the content
- Proposes the COTe score (Coverage, Overlap, Trespass, Excess), a decomposable metric designed specifically for measuring page-parsing quality
- Systematically evaluates 5 common DLA models on 3 DLA datasets, showing COTe is more informative than traditional metrics
- COTe reduces the interpretation-performance gap by up to 76% relative to F1
- Releases an SSU-labelled dataset and a Python library for applying COTe in DLA projects
🔬 Novelty:
- A purpose-built, document-semantic evaluation framework: COTe's four components (coverage, overlap, trespass, excess) capture DLA-specific failure modes, such as breaching semantic boundaries or repeatedly parsing the same region, far better than generic detection metrics can
- COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barrier to adopting the system
📝 Original abstract
Document Layout analysis (DLA), is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce: The Structural Semantic Unit (SSU) a relational labelling approach that shifts the focus from the physical to the semantic structure of the content; and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to the F1. Notably, we find that the COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU labelled dataset and a Python library for applying COTe in DLA projects.
🧠 Multimodal Document Reasoning
10. SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
| Attribute | Content |
|---|---|
| arXiv ID | 2603.12249 |
| PDF link | Download PDF |
| Authors | Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan |
| Categories | cs.CL; cs.AI; cs.CV |
🎯 Motivation: Building scientific multimodal document-reasoning datasets for foundation-model training involves an inherent trade-off among scale, faithfulness, and realism: large-scale datasets tend to lack precision, while high-quality annotated data is hard to scale to realistic document complexity. A construction methodology that balances all three is needed.
💡 Main contributions:
- Proposes the two-stage synthesize-and-reground framework: (1) claim-centric QA synthesis, generating faithful, isolated QA pairs from focused segments; (2) document-scale regrounding, programmatically re-embedding those pairs into full-document tasks to ensure realistic complexity
- Builds SciMDR, a large-scale training dataset of 300K QA pairs with explicit reasoning chains drawn from 20K scientific papers, covering cross-modal comprehension
- Builds SciMDR-Eval, an expert-annotated benchmark for evaluating multimodal comprehension within full-length scientific workflows
- Models fine-tuned on SciMDR improve markedly across multiple scientific QA benchmarks, especially on tasks requiring complex document-level reasoning
🔬 Novelty:
- The synthesize-and-reground paradigm creatively couples high-fidelity local QA synthesis with programmatic embedding into realistic document-level complexity, breaking the scale-versus-quality bottleneck in scientific multimodal data construction
- A data design centered on explicit reasoning chains lays a solid foundation for systematically improving multimodal document reasoning
📝 Original abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
✍️ Handwriting / Scribe Verification
11. Scribe Verification in Chinese manuscripts using Siamese, Triplet, and Vision Transformer Neural Networks
| Attribute | Content |
|---|---|
| arXiv ID | 2603.13877 |
| PDF link | Download PDF |
| Authors | Dimitrios-Chrysovalantis Liakopoulos, Yanbo Zhang, Chongsheng Zhang, Constantine Kotropoulos |
| Categories | cs.LG; eess.IV |
🎯 Motivation: Determining whether two ancient manuscript fragments were written by the same scribe (scribe verification) is valuable for philology and historical research, but manual verification is laborious. Automated verification is especially hard for Chinese manuscripts, where characters are structurally complex and individual writing styles differ only subtly.
💡 Main contributions:
- Systematically compares deep metric-learning approaches for scribe verification in Chinese manuscripts, including Siamese, Triplet, and Transformer-based models
- Evaluates on two datasets: the Tsinghua Bamboo Slips Dataset and a selected subset of the Multi-Attribute Chinese Calligraphy Dataset
- Finds that a MobileNetV3 + custom Siamese model trained with contrastive loss achieves the best or second-best overall accuracy and area under the ROC curve (AUC) on both datasets
🔬 Novelty:
- Systematically brings deep metric learning (especially Siamese and Triplet architectures) to scribe verification on ancient Chinese manuscripts, reaching competitive performance on a lightweight mobile backbone and offering a practical tool for automated scribe verification in digital philology (a contrastive-loss sketch follows the abstract)
📝 Original abstract
The paper examines deep learning models for scribe verification in Chinese manuscripts. That is, to automatically determine whether two manuscript fragments were written by the same scribe using deep metric learning methods. Two datasets were used: the Tsinghua Bamboo Slips Dataset and a selected subset of the Multi-Attribute Chinese Calligraphy Dataset, focusing on the calligraphers with a large number of samples. Siamese and Triplet neural network architectures are implemented, including convolutional and Transformer-based models. The experimental results show that the MobileNetV3+ Custom Siamese model trained with contrastive loss achieves either the best or the second-best overall accuracy and area under the Receiver Operating Characteristic Curve on both datasets.
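The contrastive loss behind the winning Siamese model is standard: same-scribe pairs are pulled together in embedding space, different-scribe pairs pushed beyond a margin. A minimal sketch with toy embeddings standing in for MobileNetV3 features; the paper's hyperparameters may differ:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_scribe, margin=1.0):
    """Classic contrastive loss: y*d^2 + (1-y)*max(0, margin - d)^2."""
    d = np.linalg.norm(emb_a - emb_b, axis=-1)
    pos = same_scribe * d ** 2                            # pull same pairs in
    neg = (1 - same_scribe) * np.maximum(0.0, margin - d) ** 2  # push others apart
    return 0.5 * (pos + neg).mean()

a = np.random.randn(4, 128)
b = np.random.randn(4, 128)          # fragment embeddings from the backbone
y = np.array([1, 0, 1, 0])           # 1 = same scribe, 0 = different
print(contrastive_loss(a, b, y))
```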
📌 Miscellaneous
12. On Linear Separability of the MNIST Handwritten Digits Dataset
| Attribute | Content |
|---|---|
| arXiv ID | 2603.12850 |
| PDF link | Download PDF |
| Authors | Ákos Hajnal |
| Categories | cs.LG |
🎯 Motivation: The MNIST handwritten-digit dataset remains a fundamental benchmark for pattern recognition and image classification, yet the basic question of whether it is linearly separable has never been answered comprehensively; scientific and informal sources make conflicting claims.
💡 Main contributions:
- A comprehensive empirical study of MNIST's linear separability, distinguishing pairwise and one-vs-rest classification settings
- Systematically analyzes the training set, the test set, and their combination
- Reviews theoretical approaches and state-of-the-art tools for assessing linear separability, and systematically reports findings for all relevant subsets
🔬 Novelty:
- The first systematic, comprehensive answer to the long-open question of MNIST's linear separability, clearing up conflicting claims in the literature and providing an authoritative reference on the geometry of a classic benchmark (a feasibility-check sketch follows the abstract)
📝 Original abstract
The MNIST dataset containing thousands of handwritten digit images is still a fundamental benchmark for evaluating various pattern-recognition and image-classification models. Linear separability is a key concept in many statistical and machine-learning techniques. Despite the long history of the MNIST dataset and its relative simplicity in size and resolution, the question of whether the dataset is linearly separable has never been fully answered -- scientific and informal sources share conflicting claims. This paper aims to provide a comprehensive empirical investigation to address this question, distinguishing pairwise and one-vs-rest separation of the training, the test and the combined sets, respectively. It reviews the theoretical approaches to assessing linear separability, alongside state-of-the-art methods and tools, then systematically examines all relevant assemblies, and reports the findings.
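One standard way to settle separability for a given pair of classes is a linear-programming feasibility test: the classes are linearly separable iff some hyperplane satisfies a unit-margin constraint on every sample. The sketch below runs this test on toy 2-D data; the paper's exact tooling and procedure may differ.

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """LP feasibility: find (w, b) with y_i * (w.x_i + b) >= 1 for all i.
    Variables z = [w (d entries), b]; constraints -y_i*(x_i.w + b) <= -1."""
    n, d = X.shape
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0   # 0 = a feasible separating hyperplane exists

# toy "digit pair": two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(linearly_separable(X, y))  # True
```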