## Data Foundations for AI
### NumPy: Numerical Computing
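```python
import numpy as np

# Embeddings are NumPy arrays (example values for illustration)
embedding1 = np.array([0.1, 0.5, -0.3, 0.8])
embedding2 = np.array([0.2, 0.4, -0.1, 0.9])

# Cosine similarity: dot product divided by the product of the vector norms
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embedding1, embedding2)

# Row-wise softmax; subtracting the row max avoids overflow in exp
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Matrix operations for attention (random placeholder queries and keys)
d_k = 4
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, d_k))
keys = rng.normal(size=(3, d_k))
attention_scores = np.matmul(queries, keys.T) / np.sqrt(d_k)
attention_weights = softmax(attention_scores)  # each row sums to 1
```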
### Pandas: Data Preparation
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load and explore training data
df = pd.read_csv("training_data.csv")
print(df.describe())
print(df.isnull().sum())  # missing values per column

# Clean and prepare
df = df.dropna(subset=["text", "label"])
df["text"] = df["text"].str.lower().str.strip()
df["text_length"] = df["text"].str.len()

# Split for training
train, test = train_test_split(df, test_size=0.2, random_state=42)
```
### Common AI Data Tasks
- Tokenization stats: Analyze token distributions with Pandas
- Embedding analysis: Compute similarities with NumPy
- Dataset balancing: Sample or augment underrepresented classes
- Feature engineering: Create numeric features from text
- Evaluation: Calculate metrics across test sets
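
The tasks listed above can be sketched on a toy dataset. This is a minimal illustration, not a recommended pipeline: the rows, the `pred` column, the 2-D embeddings, and the whitespace split standing in for a real tokenizer are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with "text" and "label" columns
df = pd.DataFrame({
    "text": ["great product", "terrible", "works fine",
             "love it", "broke after a day", "okay"],
    "label": ["pos", "neg", "pos", "pos", "neg", "pos"],
})

# Tokenization stats: whitespace split as a stand-in for a real tokenizer
df["n_tokens"] = df["text"].str.split().str.len()
token_stats = df["n_tokens"].describe()

# Feature engineering: a simple numeric feature derived from text
df["n_chars"] = df["text"].str.len()

# Dataset balancing: oversample every class up to the largest class size
max_count = df["label"].value_counts().max()
parts = [
    group.sample(n=max_count, replace=True, random_state=42)
    for _, group in df.groupby("label")
]
balanced = pd.concat(parts, ignore_index=True)

# Embedding analysis: pairwise cosine similarity via normalized rows
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # made-up vectors
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

# Evaluation: overall and per-class accuracy on made-up predictions
results = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "pred": ["pos", "neg", "neg", "neg"],
})
correct = results["label"] == results["pred"]
accuracy = correct.mean()
per_class_accuracy = correct.groupby(results["label"]).mean()
```

Oversampling with replacement duplicates minority rows, so it is usually applied to the training split only; otherwise copies of the same row can land in both train and test.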