UK Legal Corpus Embeddings — 500k Document Chunks
Description
Pre-computed vector embeddings for 500,000 text chunks drawn from public UK legal sources, packaged for use in retrieval-augmented generation (RAG) pipelines.
**Coverage:**
- UK case law (Supreme Court, Court of Appeal, High Court — 2000–2024)
- UK statutes and statutory instruments (all currently in force)
- Legal commentary from open-access journals
- GDPR and data protection guidance (ICO)
**Technical details:**
- Embedding model: text-embedding-3-large (OpenAI)
- Dimensions: 3072
- Chunk size: 512 tokens with 50-token overlap
- Format: Parquet files + FAISS index included
- Total size: ~8.2 GB
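
To make the chunking and size figures concrete, here is a minimal sketch of a 512-token sliding window with a 50-token overlap, plus a rough estimate of the raw embedding footprint. The `chunk_tokens` helper is hypothetical and not the script used to build this dataset, the token list is a stand-in for real tokenizer output, and the float32 storage type is an assumption (the listing does not state the dtype).

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration; not the dataset's actual build code.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # each new window starts 462 tokens after the last
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


# Stand-in for token IDs from a real tokenizer.
tokens = list(range(1200))
chunks = chunk_tokens(tokens)
chunk_lengths = [len(c) for c in chunks]

# Rough size of the raw vectors, ASSUMING float32 storage (4 bytes per value).
num_chunks = 500_000
dimensions = 3072
raw_embedding_bytes = num_chunks * dimensions * 4
raw_embedding_gb = raw_embedding_bytes / 1e9
```

Under the float32 assumption the vectors alone come to about 6.1 GB, which leaves the rest of the ~8.2 GB total for chunk text, metadata, and the index.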
**What you get:**
- FAISS index for fast similarity search
- Parquet files (chunk text + metadata + embeddings)
- Python retrieval script (works with LangChain and LlamaIndex)
- Example RAG pipeline notebook
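
The bundled retrieval script is not reproduced here. As a rough illustration of what similarity search over these vectors involves, the sketch below runs a brute-force cosine top-k lookup in NumPy. The `top_k_cosine` function is hypothetical, and random vectors stand in for the real embeddings; in practice the embeddings would come from the Parquet files and the query vector from the same embedding model.

```python
import numpy as np


def top_k_cosine(query, embeddings, k=5):
    """Return (indices, scores) of the k rows most similar to `query` by cosine similarity.

    Hypothetical helper for illustration; not part of the shipped retrieval script.
    """
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q                      # cosine similarity for every row
    k = min(k, len(scores))
    idx = np.argsort(-scores)[:k]       # highest scores first
    return idx, scores[idx]


# Random placeholder data with the listing's dimensionality (3072).
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 3072)).astype(np.float32)

# A query that is a slightly perturbed copy of row 42, so row 42 should rank first.
query = embeddings[42] + 0.01 * rng.standard_normal(3072).astype(np.float32)

idx, scores = top_k_cosine(query, embeddings, k=3)
```

A full scan like this is fine for a quick sanity check on a sample. For the complete 500,000-chunk set, the included FAISS index is the intended route, typically loaded with `faiss.read_index` rather than scanning every row.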
**Use cases:** Legal AI assistants, contract analysis, compliance tools, legal research