Prepare high-quality datasets for fine-tuning language models.
## Data Collection Strategies
### Dataset Requirements

Quality > quantity: a smaller, carefully curated dataset generally outperforms a larger, noisier one.
Minimum viable dataset sizes:

- Simple tasks: 50-100 examples
- Complex tasks: 500-1000 examples
- Domain expertise: 1000-5000 examples
- Production quality: 5000+ examples
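The tiers above can be encoded as a simple preflight check before launching a fine-tuning run. This is a hypothetical helper (the tier names and `meets_minimum` function are not part of any library, just an illustration of the thresholds):

```python
# Minimum example counts per task tier, mirroring the list above.
MIN_EXAMPLES = {
    "simple": 50,
    "complex": 500,
    "domain": 1000,
    "production": 5000,
}

def meets_minimum(dataset, tier):
    """Return True if the dataset reaches the minimum size for the tier."""
    return len(dataset) >= MIN_EXAMPLES[tier]
```

Adjust the thresholds for your own task; they are rules of thumb, not hard limits.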
### Data Formats

Chat format (recommended):

```json
{"messages": [
  {"role": "system", "content": "You are a helpful legal assistant."},
  {"role": "user", "content": "What is a tort?"},
  {"role": "assistant", "content": "A tort is a civil wrong..."}
]}
```
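Malformed records in a chat-format JSONL file can silently degrade a fine-tuning run, so it is worth validating each line before training. A minimal sketch (the `validate_chat_example` function and its checks are an assumption, not a standard API):

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_example(line):
    """Parse one JSONL line and check the minimal chat-format contract."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            return False
    # The sequence should end with the assistant turn the model learns to produce.
    return messages[-1]["role"] == "assistant"
```

Run this over every line and drop (or log) failures before the data reaches the trainer.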
Instruction format:

```json
{"instruction": "Summarize this contract", "input": "...", "output": "..."}
```
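Since the chat format is the recommended target, instruction-format records are often converted into it. A minimal sketch of that mapping (the `instruction_to_chat` helper and default system prompt are assumptions for illustration):

```python
def instruction_to_chat(record, system_prompt="You are a helpful assistant."):
    """Convert an instruction-format record into the chat format."""
    user_content = record["instruction"]
    # When an input field is present, append it below the instruction.
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]}
```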
## Data Quality Pipeline
```python
class DataQualityPipeline:
    def process(self, raw_data):
        # Step 1: Deduplication
        data = self.deduplicate(raw_data, threshold=0.85)
        # Step 2: Quality filtering
        data = self.filter_quality(data, min_score=0.7)
        # Step 3: Format validation
        data = self.validate_format(data)
        # Step 4: PII removal
        data = self.remove_pii(data)
        # Step 5: Length filtering
        data = self.filter_length(data, min_tokens=10, max_tokens=4096)
        return data
```
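The deduplication step above takes a similarity threshold, which suggests near-duplicate (not just exact) matching. One simple way to sketch it, assuming Jaccard similarity over whitespace tokens (the concrete `jaccard`/`deduplicate` functions here are illustrative, not the pipeline's actual implementation):

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def deduplicate(examples, threshold=0.85):
    """Greedily keep an example only if it stays below the similarity
    threshold against everything already kept."""
    kept = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept
```

This is O(n²) and fine for small datasets; at scale, approximate methods such as MinHash/LSH are the usual replacement.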
## Synthetic Data Generation

```python
def generate_training_data(seed_examples, target_count):
    """Expand seed examples into a larger synthetic set, then filter.
    Assumes an `llm` client and a `quality_filter` helper are defined."""
    synthetic_data = []
    for seed in seed_examples:
        # Ask the model for variations; moderate temperature for diversity.
        variations = llm.generate(
            f"Create 5 variations of this example: {seed}",
            temperature=0.8,
        )
        synthetic_data.extend(variations)
    # Keep only high-scoring generations, capped at the requested count.
    filtered = quality_filter(synthetic_data, threshold=0.8)
    return filtered[:target_count]
```