42 Hackathon - AI Spam Detection

Experimental Results & Model Performance

Main Repository: https://github.com/cere-io/42-hackathon-nlp-llm-conversation-analytics
Front-end Application: https://conversation-detection.stage.cere.io/
Shared Tracking Sheet: https://docs.google.com/spreadsheets/d/1DEibaw8xmpz7khFlnJcM1_MMoWP0Kd4mZ3429aUF7No/edit?usp=sharing

Model Size Impact: Models with fewer than 7B parameters lack basic understanding of the task and struggle to output correctly structured data. Team consensus is to use models no smaller than 7B parameters.
Prompt Engineering: Removing CSV format references from prompts improves model performance, as the format was confusing some models.
Conversation Clustering Optimization: The human-labeled data contains one topic per conversation. Prompts should be adjusted to account for this when optimizing for benchmarks.
Data Processing Improvements:
- Added functionality to remove <think> segments in deep thinking models (e.g., deepseek)
- Fixed issue with message_id handling for small models
Hardware Requirements: More GPU capacity is needed to enable parallel experimentation. Current setup with 1 weak GPU makes parallel testing impossible.

Prompt Preprocessing:
- Added script to remove <thinking> in preprocessing CSV files
- Implemented in both playground_http and model_playground.py