UK Legal Corpus Embeddings — 500k Document Chunks

Description

Pre-computed vector embeddings for 500,000 text chunks drawn from public UK legal sources, ready to load into a retrieval-augmented generation (RAG) pipeline.

**Coverage:**

- UK case law (Supreme Court, Court of Appeal, High Court — 2000–2024)

- UK statutes and statutory instruments (all currently in force)

- Legal commentary from open-access journals

- GDPR and data protection guidance (ICO)

**Technical details:**

- Embedding model: text-embedding-3-large (OpenAI)

- Dimensions: 3072

- Chunk size: 512 tokens with 50-token overlap

- Format: Parquet files + FAISS index included

- Total size: ~8.2 GB
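To illustrate the chunking figures above, here is a minimal sketch of a sliding window of 512 tokens with a 50-token overlap, which means each window starts 462 tokens after the previous one. It is not the script used to build this dataset: the tokenizer, the handling of the final short chunk, and the function name `chunk_tokens` are assumptions for illustration, and a plain list of integers stands in for real token ids.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 512, overlap: int = 50) -> List[List[int]]:
    """Split a token sequence into windows of `size` that share `overlap` tokens.

    Hypothetical helper: the listing states the window and overlap sizes,
    but not how the original chunker treats document boundaries or the tail.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    stride = size - overlap  # 512 - 50 = 462 new tokens per window
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Stand-in for a tokenized document of 1,000 tokens.
document = list(range(1000))
chunks = chunk_tokens(document)
print([len(c) for c in chunks])
```

In this sketch the last window is simply shorter than 512 tokens; a real pipeline might instead pad it or merge it into the previous chunk.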

**What you get:**

- FAISS index for fast similarity search

- Parquet files (chunk text + metadata + embeddings)

- Python retrieval script (works with LangChain and LlamaIndex)

- Example RAG pipeline notebook
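As a rough idea of how the index and Parquet files might be used together, the sketch below loads a FAISS index, searches it with one query embedding, and looks up the matching chunk text. It is not the bundled retrieval script. The file paths, the `text` column name, and the assumption that FAISS ids equal Parquet row positions are all hypothetical, so check them against the actual files. The query vector must come from the same model as the corpus (text-embedding-3-large, 3072 dimensions), otherwise the scores are meaningless.

```python
from typing import List, Sequence

EMBEDDING_DIM = 3072  # from the listing: text-embedding-3-large, 3072 dimensions


def check_query(vector: Sequence[float], dim: int = EMBEDDING_DIM) -> List[float]:
    """Validate that a query embedding has the dimensionality the index expects."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(vector)}")
    return [float(x) for x in vector]


def search(query_vector: Sequence[float], index_path: str, parquet_path: str, k: int = 5):
    """Return up to k nearest chunks as a DataFrame with an added 'score' column.

    ASSUMPTIONS (not confirmed by the listing):
      - the chunk text lives in a Parquet column named 'text'
      - FAISS ids correspond to row positions in the Parquet data
    """
    # Imported lazily so check_query works without these packages installed.
    import faiss
    import numpy as np
    import pandas as pd

    index = faiss.read_index(index_path)

    # FAISS expects a float32 matrix of shape (n_queries, dim).
    query = np.asarray([check_query(query_vector, index.d)], dtype="float32")
    scores, ids = index.search(query, k)

    chunks = pd.read_parquet(parquet_path, columns=["text"])

    # FAISS pads with -1 when fewer than k neighbours are found.
    valid = ids[0] >= 0
    hits = chunks.iloc[ids[0][valid]].copy()
    # A distance or a similarity, depending on the metric the index was built with.
    hits["score"] = scores[0][valid]
    return hits
```

A call might look like `search(query_embedding, "legal.index", "chunks.parquet", k=5)`, where both file names are placeholders and `query_embedding` is a 3072-value vector obtained from the OpenAI embeddings API.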

**Use cases:** Legal AI assistants, contract analysis, compliance tools, legal research
