Metadata-Version: 2.2
Name: m3docrag
Version: 0.0.1
Summary: Multimodal Document Understanding with RAG
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: accelerate==1.1.0
Requires-Dist: loguru
Requires-Dist: requests
Requires-Dist: setuptools==69.5
Requires-Dist: transformers
Requires-Dist: tokenizers
Requires-Dist: flash-attn==2.5.8
Requires-Dist: bitsandbytes==0.43.1
Requires-Dist: safetensors
Requires-Dist: gpustat
Requires-Dist: icecream
Requires-Dist: pdf2image
Requires-Dist: numpy==1.26.4
Requires-Dist: torchvision
Requires-Dist: jsonlines
Requires-Dist: editdistance
Requires-Dist: einops
Requires-Dist: fire
Requires-Dist: peft
Requires-Dist: timm
Requires-Dist: sentencepiece
Requires-Dist: colpali-engine==0.3.1
Requires-Dist: easyocr
Requires-Dist: qwen-vl-utils
Requires-Dist: faiss-cpu
Requires-Dist: word2number
Requires-Dist: datasets>=3.0.0
Requires-Dist: python-dotenv

# M3DocRAG

Code for [M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding](https://m3docrag.github.io/)

by [Jaemin Cho](https://j-min.io/), [Debanjan Mahata](https://sites.google.com/a/ualr.edu/debanjan-mahata/), [Ozan İrsoy](https://wtimesx.com/), [Yujie He](https://scholar.google.com/citations?user=FbeAZGgAAAAJ&hl=en), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/)

# Summary

## Comparison with previous approaches

<img src='./assets/m3docrag_teaser.png' >

Comparison of multi-modal document understanding pipelines. Previous works focus on (a) **Single-page DocVQA**, which cannot handle many long documents, or (b) **Text-based RAG**, which ignores visual information. Our (c) **M3DocRAG** framework retrieves relevant documents and answers questions using multi-modal retrieval and multi-modal language model (MLM) components, so that it can efficiently handle many long documents while preserving visual information.

## M3DocRAG framework

<img src='./assets/method.png' >

Our **M3DocRAG** framework consists of three stages: (1) document embedding, (2) page retrieval, and (3) question answering.

- In (1) document embedding, we extract visual embeddings (with ColPali) to represent each page of every PDF document.
- In (2) page retrieval, we retrieve the top-K pages with the highest relevance (MaxSim scores) to the text query; a minimal sketch of this scoring appears after this list. In an open-domain setting, we create approximate page indices for faster search.
- In (3) question answering, we conduct visual question answering with a multi-modal LM (e.g., Qwen2-VL) to obtain the final answer.
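
To make the retrieval stage concrete, below is a minimal sketch of ColPali-style MaxSim (late-interaction) scoring. It is illustrative only: the tensor shapes and random inputs are assumptions rather than actual model outputs, and the repository's retrieval code may organize this differently.

```python
# Minimal sketch of MaxSim (late-interaction) relevance scoring over
# multi-vector page embeddings. Shapes and inputs are illustrative.
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (n_query_tokens, dim); page_emb: (n_page_patches, dim)."""
    sim = query_emb @ page_emb.T        # similarity of every query-token/patch pair
    return sim.max(dim=1).values.sum()  # best patch per query token, summed

# Toy example: score 3 pages against one query and pick the most relevant.
torch.manual_seed(0)
query = torch.randn(16, 128)                        # 16 query tokens, 128-dim
pages = [torch.randn(1030, 128) for _ in range(3)]  # 1030 patch vectors per page
scores = torch.stack([maxsim_score(query, p) for p in pages])
print("top-1 page:", scores.argmax().item())
```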

# Setup

## Package

We assume that conda is already installed.

```bash
git clone <REPO_URL>
cd m3docrag-release
pip install -e .

# Install Poppler (for pdf2image; check https://pdf2image.readthedocs.io/en/latest/installation.html for details)
# conda install -y poppler
# or
# apt-get install poppler-utils
```

## Code structure

```bash
examples/               # scripts to run PDF embedding / RAG
src/m3docrag/
    datasets/           # data loaders for existing datasets
    retrieval/          # retrieval model (e.g., ColPali)
    vqa/                # VQA model (e.g., Qwen2-VL)
    rag/                # RAG model that combines the retrieval and VQA models
    utils/              # misc utility methods
m3docvqa/               # how to set up the M3DocVQA dataset
```

## Paths: Data, Embeddings, Model checkpoints, Outputs

```bash
# in .env
LOCAL_DATA_DIR="/job/datasets"         # where to store data
LOCAL_EMBEDDINGS_DIR="/job/embeddings" # where to store embeddings
LOCAL_MODEL_DIR="/job/model"           # where to store model checkpoints
LOCAL_OUTPUT_DIR="/job/output"         # where to store model outputs
```

You can adjust the variables in [`.env`](.env) to change where data, embeddings, model checkpoints, and outputs are stored by default. They are loaded in [`src/m3docrag/utils/paths.py`](./src/m3docrag/utils/paths.py) via [python-dotenv](https://github.com/theskumar/python-dotenv).
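
As an illustration of how these variables reach the code, a dotenv-based loader along the following lines would work; the actual contents of `paths.py` may differ in details such as defaults and types.

```python
# Hedged sketch of a .env-backed paths module (in the spirit of
# src/m3docrag/utils/paths.py); the fallbacks below mirror the .env values.
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into os.environ

LOCAL_DATA_DIR = Path(os.environ.get("LOCAL_DATA_DIR", "/job/datasets"))
LOCAL_EMBEDDINGS_DIR = Path(os.environ.get("LOCAL_EMBEDDINGS_DIR", "/job/embeddings"))
LOCAL_MODEL_DIR = Path(os.environ.get("LOCAL_MODEL_DIR", "/job/model"))
LOCAL_OUTPUT_DIR = Path(os.environ.get("LOCAL_OUTPUT_DIR", "/job/output"))
```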

## Download M3DocVQA dataset

Please see [m3docvqa/README.md](m3docvqa/README.md) for the download instructions.

## Download model checkpoints

By default, we use colpali-v1.2 for retrieval and Qwen2-VL-7B-Instruct for question answering.

At `$LOCAL_MODEL_DIR`, download the [colpali-v1.2](https://huggingface.co/vidore/colpali-v1.2), [colpaligemma-3b-pt-448-base](https://huggingface.co/vidore/colpaligemma-3b-pt-448-base), and [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) checkpoints.

```bash
cd $LOCAL_MODEL_DIR

git clone https://huggingface.co/vidore/colpaligemma-3b-pt-448-base # ColPali backbone
git clone https://huggingface.co/vidore/colpali-v1.2 # ColPali adapter
git clone https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct # VQA
```

# Example usage

Below, we describe example usage of M3DocRAG on the M3DocVQA dataset.

## 1. Extract PDF embeddings

```bash
DATASET_NAME="m3-docvqa"
RETRIEVAL_MODEL_TYPE="colpali"
RETRIEVAL_MODEL_NAME="colpaligemma-3b-pt-448-base"
RETRIEVAL_ADAPTER_MODEL_NAME="colpali-v1.2"
SPLIT="dev"
EMBEDDING_NAME=$RETRIEVAL_ADAPTER_MODEL_NAME"_"$DATASET_NAME"_"$SPLIT # where to save embeddings

accelerate launch --num_processes=1 --mixed_precision=bf16 examples/run_page_embedding.py \
    --use_retrieval \
    --retrieval_model_type=$RETRIEVAL_MODEL_TYPE \
    --data_name=$DATASET_NAME \
    --split=$SPLIT \
    --loop_unique_doc_ids=True \
    --output_dir=/job/embeddings/$EMBEDDING_NAME \
    --retrieval_model_name_or_path=$RETRIEVAL_MODEL_NAME \
    --retrieval_adapter_model_name_or_path=$RETRIEVAL_ADAPTER_MODEL_NAME
```
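
If you want to embed a few PDF pages interactively rather than through the script above, a sketch along the following lines should work with the pinned `colpali-engine==0.3.1`. The class names and processing calls follow the colpali-engine documentation, but treat them as assumptions that may need adjustment; the file name is a placeholder.

```python
# Hedged sketch: embedding the pages of one PDF with colpali-engine 0.3.x.
import torch
from pdf2image import convert_from_path  # requires Poppler (see Setup)
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # or a local path under $LOCAL_MODEL_DIR
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = convert_from_path("example.pdf")  # one PIL image per PDF page
batch = processor.process_images(pages).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch)  # multi-vector embeddings, one set per page
```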

## 2. Indexing

```bash
DATASET_NAME="m3-docvqa"
RETRIEVAL_MODEL_TYPE="colpali"
RETRIEVAL_ADAPTER_MODEL_NAME="colpali-v1.2"
SPLIT="dev"
FAISS_INDEX_TYPE='ivfflat'
EMBEDDING_NAME=$RETRIEVAL_ADAPTER_MODEL_NAME"_"$DATASET_NAME"_"$SPLIT
INDEX_NAME=$EMBEDDING_NAME"_pageindex_"$FAISS_INDEX_TYPE # where to save resulting index
echo $EMBEDDING_NAME
echo $FAISS_INDEX_TYPE

python examples/run_indexing_m3docvqa.py \
    --use_retrieval \
    --retrieval_model_type=$RETRIEVAL_MODEL_TYPE \
    --data_name=$DATASET_NAME \
    --split=$SPLIT \
    --loop_unique_doc_ids=False \
    --embedding_name=$EMBEDDING_NAME \
    --faiss_index_type=$FAISS_INDEX_TYPE \
    --output_dir=/job/embeddings/$INDEX_NAME
```
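
For intuition about what the `ivfflat` option involves, here is a self-contained faiss sketch that builds an IVFFlat index over toy page vectors; the dimensions, cluster count, and data are illustrative assumptions, and the script's actual index construction may differ.

```python
# Hedged sketch: building an approximate IVFFlat index with faiss.
import faiss
import numpy as np

dim = 128
rng = np.random.default_rng(0)
page_vectors = rng.standard_normal((10_000, dim)).astype("float32")  # toy embeddings

nlist = 100  # number of IVF clusters (Voronoi cells)
quantizer = faiss.IndexFlatIP(dim)  # inner-product coarse quantizer
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(page_vectors)  # IVF indices must be trained before adding vectors
index.add(page_vectors)

index.nprobe = 10  # search 10 nearest cells; trades recall for speed
query = rng.standard_normal((1, dim)).astype("float32")
scores, ids = index.search(query, k=4)  # top-4 approximate matches
print(ids[0])
```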

## 3. RAG

```bash
BACKBONE_MODEL_NAME="Qwen2-VL-7B-Instruct"
RETRIEVAL_MODEL_TYPE="colpali"
RETRIEVAL_MODEL_NAME="colpaligemma-3b-pt-448-base"
RETRIEVAL_ADAPTER_MODEL_NAME="colpali-v1.2"
EMBEDDING_NAME="colpali-v1.2_m3-docvqa_dev" # from Step 1 Embedding
SPLIT="dev"
DATASET_NAME="m3-docvqa"
FAISS_INDEX_TYPE='ivfflat'
N_RETRIEVAL_PAGES=1
INDEX_NAME="${EMBEDDING_NAME}_pageindex_$FAISS_INDEX_TYPE" # from Step 2 Indexing
OUTPUT_SAVE_NAME="${RETRIEVAL_ADAPTER_MODEL_NAME}_${BACKBONE_MODEL_NAME}_${DATASET_NAME}" # where to save RAG results
BITS=16 # BITS=4 for 4-bit quantization on low-memory GPUs

python examples/run_rag_m3docvqa.py \
    --use_retrieval \
    --retrieval_model_type=$RETRIEVAL_MODEL_TYPE \
    --load_embedding=True \
    --split=$SPLIT \
    --bits=$BITS \
    --n_retrieval_pages=$N_RETRIEVAL_PAGES \
    --data_name=$DATASET_NAME \
    --model_name_or_path=$BACKBONE_MODEL_NAME \
    --embedding_name=$EMBEDDING_NAME \
    --retrieval_model_name_or_path=$RETRIEVAL_MODEL_NAME \
    --retrieval_adapter_model_name_or_path=$RETRIEVAL_ADAPTER_MODEL_NAME \
    --output_dir=/job/eval_outputs/$OUTPUT_SAVE_NAME
```
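
To illustrate the final answering stage, the following sketch asks Qwen2-VL about a single retrieved page image using the standard `transformers` + `qwen-vl-utils` pattern from the model card. The image path and question are placeholders, and the repo's VQA wrapper may differ from this.

```python
# Hedged sketch: VQA over one retrieved page with Qwen2-VL-7B-Instruct.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The top-1 retrieved page image plus the question (placeholder values).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "retrieved_page.png"},
        {"type": "text", "text": "Who wrote this report?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
trimmed = generated[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```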

# Citation

Please cite our paper if you use our dataset and/or method in your projects.

```bibtex
@article{Cho2024M3DocRAG,
  author = {Jaemin Cho and Ozan İrsoy and Debanjan Mahata and Yujie He and Mohit Bansal},
  title  = {M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding},
  year   = {2024},
}
```