Experimental Results & Model Performance
Repository & Resources
Model Performance Summary
Key Findings
- Model Size Impact: Models with fewer than 7B parameters lack basic understanding of the task and struggle to output correctly structured data. Team consensus is to use models no smaller than 7B parameters.
- Prompt Engineering: Removing CSV format references from prompts improves model performance, as the format was confusing some models.
- Conversation Clustering Optimization: The human-labeled data contains one topic per conversation. Prompts should be adjusted to account for this when optimizing for benchmarks.
- Data Processing Improvements:
- Added functionality to remove
<think>
segments in deep thinking models (e.g., deepseek
)
- Fixed issue with
message_id
handling for small models
- Hardware Requirements: More GPU capacity is needed to enable parallel experimentation. Current setup with 1 weak GPU makes parallel testing impossible.
- Day 3 Results, Findings & Improvement Suggestions
Technical Improvements
Script Enhancements
- Prompt Preprocessing:
- Added script to remove
<thinking>
in preprocessing CSV files
- Implemented in both
playground_http
and model_playground.py