DVS - DuckDB Vector Similarity Search

A Python library for vector similarity search powered by DuckDB and OpenAI embeddings.

Features

Fast Vector Search: Efficient similarity search using DuckDB's vector capabilities
OpenAI Integration: Automatic embedding generation with OpenAI models
Caching: Built-in embedding cache for improved performance
Simple API: Easy-to-use Python interface
Flexible Storage: Store documents with metadata

Installation

pip install dvs-py

Quick Start

Basic Usage

import asyncio
import tempfile
import openai_embeddings_model as oai_emb_model
from dvs import DVS

# Initialize DVS with a database file and model
dvs = DVS(
    tempfile.NamedTemporaryFile(suffix=".duckdb").name,
    model="text-embedding-3-small",
    model_settings=oai_emb_model.ModelSettings(dimensions=1536)
)

# Add documents
dvs.add("Apple announced new iPhone features with upgraded camera and A16 chip.")
dvs.add("Microsoft updated Azure with enhanced AI tools and security features.")

# Search
results = asyncio.run(dvs.search("What are the new iPhone features?"))
print(f"Found {len(results)} results")
for point, document, score in results:
    print(f"Score: {score:.3f} - {document.content[:100]}...")

Advanced Configuration

import asyncio
import pathlib
import diskcache
import openai
import openai_embeddings_model as oai_emb_model
from dvs import DVS

# Configure with custom cache and model settings
dvs = DVS(
    "./my_database.duckdb",
    model=oai_emb_model.OpenAIEmbeddingsModel(
        model="text-embedding-3-small",
        openai_client=openai.OpenAI(),
        cache=diskcache.Cache("./cache/embeddings.cache"),
    ),
    model_settings=oai_emb_model.ModelSettings(dimensions=1536),
    verbose=True
)

# Add documents with metadata
from dvs.types.document import Document

doc = Document.from_content(
    "Latest developments in artificial intelligence...",
    name="AI Research Paper",
    metadata={"author": "John Doe", "year": 2024}
)
dvs.add(doc)

# Search with more results
results = asyncio.run(dvs.search("artificial intelligence", top_k=10))

Configuration

Set your OpenAI API key:

export OPENAI_API_KEY="your-api-key"

Document Management

Adding Documents

# Add single document
dvs.add("Your document content here")

# Add multiple documents
documents = [
    "First document content",
    "Second document content",
    "Third document content"
]
dvs.add(documents)

# Add documents with metadata
from dvs.types.document import Document

docs = [
    Document.from_content("Content 1", name="Doc 1", metadata={"category": "tech"}),
    Document.from_content("Content 2", name="Doc 2", metadata={"category": "science"})
]
dvs.add(docs)

Searching Documents

# Basic search
results = asyncio.run(dvs.search("your query"))

# Search with more results
results = asyncio.run(dvs.search("your query", top_k=10))

# Search with embeddings included
results = asyncio.run(dvs.search("your query", with_embedding=True))

Removing Documents

# Get document ID from search results
results = asyncio.run(dvs.search("some query"))
doc_id = results[0][1].document_id

# Remove document
dvs.remove(doc_id)

# Remove multiple documents
dvs.remove([doc_id1, doc_id2, doc_id3])

Development

Install development dependencies:

make install-all

Run tests:

make pytest

Format code:

make format-all

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

If you encounter any issues or have questions, please open an issue on GitHub.

GraphRAG Strategies (Overview)

Below is a concise overview of four GraphRAG strategies used in dvs. Each diagram mirrors the inline docstrings to aid quick understanding.

Strategy 1: Vector Expansion

High-level: Vector expansion search and ranking.

flowchart TD
    Q[Query] --> Expand[perform_vector_expansion_search]
    Expand --> Rank[Rank and Format]
    Rank --> TopK[Top-k Results]

Strategy 2: Graph-Guided

High-level: PageRank-guided candidate expansion and ranking.

flowchart TD
    Q[Query] --> PR[PageRank RelatedTo]
    PR --> Expand[Expand and Collect]
    Expand --> Score[Score and Rank]
    Score --> TopK[Top-k Results]

High-level: Iterate baseline + LLM expansions until no improvement.

flowchart TD
    Q[Query] --> Base[Baseline Vector Expansion]
    Base --> Loop{Improved}
    Loop -- Yes --> LLM[LLM Query Expand]
    LLM --> Refine[Refined Vector Expansion]
    Refine --> Loop
    Loop -- No --> TopK[Top-k Results]

Strategy 4: Context-Aware

High-level: Baseline search with context-aware scoring and filtering.

flowchart TD
    Q[Query] --> Base[Baseline Search]
    Base --> Score[Context-Aware Score and Filter]
    Score --> TopK[Top-k Results]