LangChain runs into a Chroma bug

xiaowangxiansen · December 31, 2025, 07:08

With chromadb 1.4.0, calling vector_store = Chroma.from_documents(test_documents, embeddings) raises no exception. The process simply ends with: Process finished with exit code -1073741819 (0xC0000005)

admin · January 2, 2026, 06:41

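When the process exits immediately with code -1073741819 (0xC0000005) on Windows with chromadb 1.4.0 (released in December 2025), you are looking at a classic Access Violation, the Windows counterpart of a segmentation fault.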

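It means that a native C++/Rust library called from Python tried to access memory it does not own (for example, unallocated memory). With vector databases on Windows, this kind of problem usually comes down to one of the causes below.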
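To see why the negative number and the hex value are the same code, here is a minimal sketch that reinterprets the signed exit code as an unsigned 32-bit value. The helper name to_ntstatus is made up for illustration.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# The IDE shows the signed form; Windows documents the unsigned hex form.
print(to_ntstatus(-1073741819))  # 0xC0000005
```

Some tools report the same crash as 3221225477, which is simply the unsigned form of the same value.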

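Here are the fixes for a chromadb 1.4.0 environment, in recommended order:

1. Downgrade ChromaDB (most recommended)

chromadb 1.4.0 is a fairly recent release in the 1.x line, which brought major changes such as the core being rewritten in Rust. langchain or the langchain-chroma wrapper may not be fully adapted to this version's binary interface yet, which can lead to a memory crash at call time.

I suggest downgrading to the very stable 0.5.x series: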


pip install chromadb==0.5.23
# or
pip install chromadb==0.5.5

After downgrading, run your code again. This usually resolves the crash right away.
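To confirm which versions are actually active in the current environment after the downgrade, a small sketch using only the standard library can help. The package list is an assumption about what is installed; adjust it to your setup.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Package names below are assumptions about a typical LangChain + Chroma setup.
for name, ver in installed_versions(
    ["chromadb", "langchain-chroma", "langchain-core", "onnxruntime"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Run it with the same interpreter (or virtual environment) that runs your failing script, otherwise the reported versions may not match.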

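2. Check the onnxruntime dependency

If you rely on the default embedding model (that is, you did not pass a custom embedding to from_documents, or the one you pass runs on ONNX), the crash may well come from an incompatibility between onnxruntime and your CPU's instruction set (AVX, for example).

Try installing or updating onnxruntime manually: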


pip install --upgrade onnxruntime

If you are not using a GPU, make sure onnxruntime-gpu is not installed alongside it, since the two can conflict.
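A quick way to check for that conflict is to list every installed distribution whose name starts with onnxruntime. This is only a sketch; the function name find_onnxruntime_builds is hypothetical.

```python
from importlib.metadata import distributions


def find_onnxruntime_builds(names=None):
    """Return the sorted, normalized names of installed onnxruntime variants."""
    if names is None:
        # Collect the names of all distributions in the current environment.
        names = [dist.metadata["Name"] for dist in distributions()]
    normalized = {n.lower().replace("_", "-") for n in names if n}
    return sorted(n for n in normalized if n.startswith("onnxruntime"))


builds = find_onnxruntime_builds()
print("onnxruntime builds found:", builds or "none")
if len(builds) > 1:
    print("More than one variant is installed; consider keeping only one.")
```

If both onnxruntime and onnxruntime-gpu show up, uninstalling both and reinstalling only the one you need is a reasonable next step.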

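3. Reduce the write batch size

Windows is fairly sensitive in how it handles memory-mapped files. If test_documents is large (say, more than 1000 documents), writing everything in one go may trigger a crash while the underlying HNSW index is being built.

Try limiting the number of documents per batch. Instead of calling from_documents on the whole dataset, add the documents in batches: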


from langchain_chroma import Chroma

# Initialize an empty Chroma instance
vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,  # make sure your embedding function is passed in here
    persist_directory="./chroma_db"
)

# Add the documents in batches (for example, 100 at a time)
batch_size = 100
for i in range(0, len(test_documents), batch_size):
    batch = test_documents[i : i + batch_size]
    vector_store.add_documents(batch)
    print(f"Inserted documents {i} to {i + len(batch) - 1}")

print("Done")

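4. Avoid sqlite3 version conflicts

ChromaDB depends on a sufficiently recent sqlite3. Python on Windows sometimes loads an older SQLite from a system DLL, which can crash newer Chroma versions.
You can try installing pysqlite3-binary and overriding the default library. Note that, as far as I know, pysqlite3-binary mainly ships Linux wheels, so the install may fail on Windows; if it does, skip this step.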


pip install pysqlite3-binary

Then add the following at the very top of your code (before any other import):


__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import chromadb
# ... the rest of your code
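If pysqlite3 might not be installed, a guarded variant of the swap avoids an ImportError and also prints which SQLite library version is actually loaded. This is a sketch; the 3.35.0 threshold is the minimum I believe Chroma's troubleshooting notes mention, so treat it as an assumption.

```python
import sys

try:
    import pysqlite3  # noqa: F401  (optional; only present if installed via pip)
    # Make every later "import sqlite3" resolve to pysqlite3 instead.
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    pass  # fall back to the sqlite3 module bundled with Python

import sqlite3

# sqlite_version is the version of the SQLite library, not of the Python module.
print("SQLite library version:", sqlite3.sqlite_version)

# Assumed minimum for Chroma; verify against the docs of your chromadb version.
sqlite_ok = sqlite3.sqlite_version_info >= (3, 35, 0)
print("Meets assumed minimum (3.35.0):", sqlite_ok)
```

If the reported version is already recent enough, SQLite is probably not the cause and you can skip the override.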

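In a setup like this, error code 0xC0000005 is most often caused by a conflict between binary dependencies. Since 1.4.0 is fairly new, option 1 (downgrading to 0.5.x) is the safest quick fix. If you have to stay on the new version, be sure to try option 3 (batched writes).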
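To narrow down which binary dependency is crashing, one approach is to import each suspect module in a separate child interpreter, so a native crash kills only the child. The sketch below uses the standard -X faulthandler option to get a low-level traceback on stderr; probe_import and the module list are illustrative assumptions.

```python
import subprocess
import sys


def probe_import(module: str, timeout: float = 120.0) -> int:
    """Import `module` in a fresh interpreter and return the child's exit code."""
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        # Show the tail of stderr: a Python traceback or a faulthandler dump.
        print(proc.stderr[-2000:])
    return proc.returncode


if __name__ == "__main__":
    # Module names are assumptions; list whatever your script imports.
    for name in ["sqlite3", "onnxruntime", "chromadb", "langchain_chroma"]:
        print(name, "->", probe_import(name))
```

An exit code of 1 usually just means the module is not installed, while 3221225477 (the unsigned form of 0xC0000005) points at the import that triggers the access violation. If all imports succeed, the crash happens later, during embedding or index writes.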