DVS - DuckDB Vector Similarity Search
A Python library for vector similarity search powered by DuckDB and OpenAI embeddings.
Features
- Fast Vector Search: Efficient similarity search using DuckDB's vector capabilities
- OpenAI Integration: Automatic embedding generation with OpenAI models
- Caching: Built-in embedding cache for improved performance
- Simple API: Easy-to-use Python interface
- Flexible Storage: Store documents with metadata
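To make "vector similarity search" concrete, here is a minimal pure-Python sketch of ranking vectors against a query. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors; the metric DVS actually uses and how it delegates the work to DuckDB are not shown here, and the helper names `cosine_similarity` and `top_k` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (made up for illustration).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
ranked = top_k([1.0, 0.0, 0.0], docs, k=2)
print(ranked)
```

A real embedding has hundreds or thousands of dimensions (1536 in the examples in this README), which is why the search is pushed down into the database rather than done in a Python loop.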
Installation
pip install dvs-py
Quick Start
Basic Usage
import asyncio
import tempfile
import openai_embeddings_model as oai_emb_model
from dvs import DVS
# Initialize DVS with a database file and model
dvs = DVS(
    tempfile.NamedTemporaryFile(suffix=".duckdb").name,
    model="text-embedding-3-small",
    model_settings=oai_emb_model.ModelSettings(dimensions=1536),
)
# Add documents
dvs.add("Apple announced new iPhone features with upgraded camera and A16 chip.")
dvs.add("Microsoft updated Azure with enhanced AI tools and security features.")
# Search
results = asyncio.run(dvs.search("What are the new iPhone features?"))
print(f"Found {len(results)} results")
for point, document, score in results:
    print(f"Score: {score:.3f} - {document.content[:100]}...")
Advanced Configuration
import asyncio
import pathlib
import diskcache
import openai
import openai_embeddings_model as oai_emb_model
from dvs import DVS
# Configure with custom cache and model settings
dvs = DVS(
    "./my_database.duckdb",
    model=oai_emb_model.OpenAIEmbeddingsModel(
        model="text-embedding-3-small",
        openai_client=openai.OpenAI(),
        cache=diskcache.Cache("./cache/embeddings.cache"),
    ),
    model_settings=oai_emb_model.ModelSettings(dimensions=1536),
    verbose=True,
)
# Add documents with metadata
from dvs.types.document import Document
doc = Document.from_content(
    "Latest developments in artificial intelligence...",
    name="AI Research Paper",
    metadata={"author": "John Doe", "year": 2024},
)
dvs.add(doc)
# Search with more results
results = asyncio.run(dvs.search("artificial intelligence", top_k=10))
Configuration
Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key"
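If you prefer to fail early with a clear message when the key is missing, a small check like the following can run before constructing `DVS`. This is a sketch, not part of the library: `require_openai_api_key` is a hypothetical helper, and it only assumes the `OPENAI_API_KEY` environment variable named above.

```python
import os


def require_openai_api_key(env=None):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating DVS."
        )
    return key


# Demonstration with a fake environment mapping instead of the real one.
demo_key = require_openai_api_key({"OPENAI_API_KEY": "sk-example"})
```

Calling `require_openai_api_key()` with no arguments reads from the real process environment.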
Document Management
Adding Documents
# Add single document
dvs.add("Your document content here")
# Add multiple documents
documents = [
    "First document content",
    "Second document content",
    "Third document content",
]
dvs.add(documents)
# Add documents with metadata
from dvs.types.document import Document
docs = [
    Document.from_content("Content 1", name="Doc 1", metadata={"category": "tech"}),
    Document.from_content("Content 2", name="Doc 2", metadata={"category": "science"}),
]
dvs.add(docs)
Searching Documents
# Basic search
results = asyncio.run(dvs.search("your query"))
# Search with more results
results = asyncio.run(dvs.search("your query", top_k=10))
# Search with embeddings included
results = asyncio.run(dvs.search("your query", with_embedding=True))
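Because `search` is a coroutine, several queries can be awaited together inside one event loop instead of calling `asyncio.run` once per query. The sketch below shows the pattern with `asyncio.gather`. `fake_search` is a hypothetical stand-in so the example runs without a database or API key, and whether a single `DVS` instance is safe to query concurrently is an assumption you should verify before relying on it.

```python
import asyncio


async def fake_search(query, top_k=3):
    """Hypothetical stand-in for an async search call such as dvs.search."""
    await asyncio.sleep(0)  # Yield control, as a real I/O-bound call would.
    return [f"{query} result {i}" for i in range(top_k)]


async def search_many(search, queries, **kwargs):
    """Run one search per query concurrently and map each query to its results."""
    results = await asyncio.gather(*(search(q, **kwargs) for q in queries))
    return dict(zip(queries, results))


batches = asyncio.run(search_many(fake_search, ["iphone", "azure"], top_k=2))
print(batches)
```

To try it against a real instance, pass `dvs.search` in place of `fake_search`.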
Removing Documents
# Get document ID from search results
results = asyncio.run(dvs.search("some query"))
doc_id = results[0][1].document_id
# Remove document
dvs.remove(doc_id)
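# Remove multiple documents (IDs taken from the remaining search results)
doc_ids = [document.document_id for _point, document, _score in results[1:4]]
dvs.remove(doc_ids)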
Development
Install development dependencies:
make install-all
Run tests:
make pytest
Format code:
make format-all
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
If you encounter any issues or have questions, please open an issue on GitHub.
GraphRAG Strategies (Overview)
This section summarizes the four GraphRAG strategies used in dvs. Each diagram mirrors the corresponding inline docstring.
Strategy 1: Vector Expansion
- High-level: Vector expansion search and ranking.
flowchart TD
Q[Query] --> Expand[perform_vector_expansion_search]
Expand --> Rank[Rank and Format]
Rank --> TopK[Top-k Results]
Strategy 2: Graph-Guided
- High-level: PageRank-guided candidate expansion and ranking.
flowchart TD
Q[Query] --> PR[PageRank RelatedTo]
PR --> Expand[Expand and Collect]
Expand --> Score[Score and Rank]
Score --> TopK[Top-k Results]
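For readers unfamiliar with the PageRank step in the diagram above, here is a minimal power-iteration sketch over a small directed graph. It is a generic textbook version, not the dvs implementation: the node names, edges, damping factor, and iteration count are all made up for illustration, and every neighbour is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of out-neighbours."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for nb in neighbours:
                    new_rank[nb] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy graph: an edge could stand for a "related to" link between documents.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
best = max(ranks, key=ranks.get)
print(best, round(sum(ranks.values()), 6))
```

In a graph-guided strategy, scores like these can be used to decide which neighbours of the initial hits are worth expanding into before the final ranking.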
Strategy 3: Iterative Refinement
- High-level: Repeat baseline search plus LLM query expansion until results stop improving.
flowchart TD
Q[Query] --> Base[Baseline Vector Expansion]
Base --> Loop{Improved}
Loop -- Yes --> LLM[LLM Query Expand]
LLM --> Refine[Refined Vector Expansion]
Refine --> Loop
Loop -- No --> TopK[Top-k Results]
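The loop in the diagram above can be sketched as a plain control-flow skeleton. Everything here is hypothetical: `search`, `expand_query`, and `quality` are placeholder callables (a keyword match, a fixed synonym table instead of an LLM, and a result count), chosen only so the example runs on its own. The actual stopping criterion and expansion logic in dvs may differ.

```python
def iterative_refinement(query, search, expand_query, quality, max_rounds=5):
    """Repeat search with expanded queries until the quality score stops improving."""
    best_results = search(query)
    best_score = quality(best_results)
    for _ in range(max_rounds):
        query = expand_query(query, best_results)
        candidate = search(query)
        score = quality(candidate)
        if score <= best_score:
            break  # No improvement: keep the previous best.
        best_results, best_score = candidate, score
    return best_results


# Toy stand-ins (assumptions for illustration only).
CORPUS = ["phone camera upgrade", "smartphone chip", "cloud security"]
SYNONYMS = {"phone": "smartphone"}


def search(query):
    """Return documents sharing at least one word with the query."""
    words = set(query.split())
    return [doc for doc in CORPUS if words & set(doc.split())]


def expand_query(query, _results):
    """Append synonyms not already in the query (stands in for an LLM call)."""
    words = query.split()
    extra = [SYNONYMS[w] for w in words if w in SYNONYMS and SYNONYMS[w] not in words]
    return " ".join([query, *extra])


def quality(results):
    """Score a result set by its size."""
    return len(results)


final = iterative_refinement("phone", search, expand_query, quality)
print(final)
```

The `max_rounds` cap is a design choice worth keeping in any real version, since each round with an LLM costs time and tokens even when it yields no improvement.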
Strategy 4: Context-Aware
- High-level: Baseline search with context-aware scoring and filtering.
flowchart TD
Q[Query] --> Base[Baseline Search]
Base --> Score[Context-Aware Score and Filter]
Score --> TopK[Top-k Results]