Documentation
Transform any data into training-ready datasets for Thai language models.
What You'll Learn
How to take raw sources such as PDFs and CSVs through a 6-step notebook pipeline and export them in training-ready formats like Alpaca and ShareGPT.
🚀 Quick Start
Go from download to your first dataset in four steps.
1. Download the notebook
No local installation is needed; the notebook installs its own dependencies.
2. Run the setup cell
The first cell installs and verifies all required packages.
3. Upload your data
PDF, CSV, and other source files are supported.
4. Process and export
Run the 6-step pipeline and export in your chosen training format.
```python
# First cell - setup and dependency check
!pip install -q pythainlp pandas numpy
from lib import PipelineManager
print("🚀 Ready to process your data!")
```
📦 Installation
Prerequisites
- Python 3.8+: Required for all processing
- Jupyter Notebook: Or Google Colab for interactive use
- Internet connection: For package installation and LLM features
💡 No Installation Required! The notebook auto-installs all dependencies. Just download and run!
🔄 Pipeline Steps
Six steps take your data from raw input to training-ready export.
Step 1: Upload
Bring in your source files — PDFs, CSVs, or other documents.
Step 2: Extract
Pull text out of the sources, with options such as page-by-page PDF extraction.
Step 3: Process
Normalize, tokenize, and clean Thai text.
Step 4: Enhance
Generate additional examples and variations with LLM assistance.
Step 5: Review
Inspect the generated examples before export.
Step 6: Export
Save the dataset in a training-ready format (Alpaca, ShareGPT, Vicuna).
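The steps above form a linear chain in which each stage transforms the output of the previous one. A minimal sketch of that composition pattern — the step functions here are hypothetical stand-ins for illustration, not the notebook's actual API:

```python
def run_pipeline(data, steps):
    """Run data through each pipeline step in order."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical stand-in steps, illustrating the chaining only
steps = [
    lambda d: d.strip(),               # extract/clean
    lambda d: d.split(),               # tokenize
    lambda d: [t.lower() for t in d],  # normalize
]
print(run_pipeline("  Hello World  ", steps))  # → ['hello', 'world']
```

Because every step takes and returns the working dataset, steps can be reordered or skipped without touching the runner.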
🇹🇭 Thai Language Support
Built on PyThaiNLP for robust handling of Thai and mixed Thai-English text.
🔤 Text Processing Features
- Word Segmentation: Using PyThaiNLP with newmm and attacut tokenizers
- Syllable Counting: Accurate Thai syllable detection for model training
- Text Normalization: Clean and standardize Thai text automatically
- Mixed Language Support: Handle Thai-English mixed content seamlessly
```python
# Thai text processing example
from lib import ThaiTextProcessor  # provided by the notebook's helper library

processor = ThaiTextProcessor()
text = "สวัสดี Hello โลก World"
cleaned = processor.normalize(text)             # standardize the Thai text
tokens = processor.tokenize(cleaned)            # word segmentation (newmm/attacut)
syllables = processor.count_syllables(cleaned)  # Thai syllable count
```
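Mixed-language handling hinges on telling Thai script apart from Latin. PyThaiNLP does real word segmentation; as a stdlib-only sketch of the underlying idea, Thai runs can be detected via the U+0E00–U+0E7F Unicode block (the function name here is my own, not part of any library):

```python
import re

THAI_RUN = r"[\u0E00-\u0E7F]+"

def split_scripts(text):
    """Naively split mixed text into Thai and non-Thai runs."""
    return [m.group(0)
            for m in re.finditer(rf"{THAI_RUN}|[^\u0E00-\u0E7F\s]+", text)]

print(split_scripts("สวัสดี Hello โลก World"))
# → ['สวัสดี', 'Hello', 'โลก', 'World']
```

Note this only separates scripts; Thai has no spaces between words, so splitting Thai runs into words still requires a dictionary-based tokenizer like newmm.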
📤 Export Formats
Choose from multiple training-ready formats:
🦙 Open-Source Model Formats
| Format | Use Case | Example Output |
|---|---|---|
| Alpaca/Llama | Instruction tuning | `{"instruction": "...", "input": "...", "output": "..."}` |
| ShareGPT | Conversation training | `{"conversations": [{"from": "human", "value": "..."}]}` |
| Vicuna | Chat models | `{"id": "...", "conversations": [...]}` |
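All of these formats are plain JSON records, so converting between them is mechanical. A sketch of building the first two shapes from a single Q&A pair — field names come from the table; `"gpt"` as the assistant role is the common ShareGPT convention, and the converter functions themselves are illustrative:

```python
import json

def to_alpaca(question, answer, context=""):
    """Alpaca/Llama instruction-tuning record."""
    return {"instruction": question, "input": context, "output": answer}

def to_sharegpt(question, answer):
    """ShareGPT conversation record."""
    return {"conversations": [
        {"from": "human", "value": question},
        {"from": "gpt", "value": answer},
    ]}

pair = ("สวัสดีแปลว่าอะไร", "It means 'hello' in Thai.")
print(json.dumps(to_alpaca(*pair), ensure_ascii=False))
print(json.dumps(to_sharegpt(*pair), ensure_ascii=False))
```

`ensure_ascii=False` keeps Thai characters readable in the exported JSON instead of escaping them to `\uXXXX` sequences.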
💡 Examples
📄 PDF to Training Data
```python
# 1. Upload a PDF annual report
# 2. Enable page-by-page extraction
# 3. Generate 8 Thai training examples per page
# 4. Export as Alpaca format
# Result: ~960 training examples (8 × 120 pages) from a 120-page report
```
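The per-page arithmetic above reduces to a simple nested loop. A sketch, where `generate_for_page` is a hypothetical stand-in for the notebook's LLM generation step:

```python
def examples_from_pages(pages, per_page=8):
    """Collect per_page generated examples for each extracted page."""
    records = []
    for page in pages:
        for _ in range(per_page):
            records.append(generate_for_page(page))  # hypothetical LLM call
    return records

def generate_for_page(page_text):
    # Stand-in: a real implementation would prompt an LLM with page_text
    return {"instruction": "", "input": page_text, "output": ""}

print(len(examples_from_pages(["page text"] * 120)))  # → 960
```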
💬 CSV to Conversation Data
```python
# 1. Upload customer service CSV
# 2. Map columns: question → human, answer → assistant
# 3. Generate variations with LLM enhancement
# 4. Export as ShareGPT format
# Result: 5,000+ conversation pairs for chatbot training
```
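The column mapping in step 2 can be sketched with the stdlib `csv` module. The `question`/`answer` column names come from the example above; your CSV's headers may differ, which is why they are parameters:

```python
import csv
import io

def csv_to_sharegpt(csv_text, q_col="question", a_col="answer"):
    """Map one CSV row to one ShareGPT conversation record."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{"conversations": [
        {"from": "human", "value": row[q_col]},
        {"from": "gpt", "value": row[a_col]},
    ]} for row in rows]

sample = (
    "question,answer\n"
    "How do I reset my password?,Click 'Forgot password' on the login page.\n"
)
data = csv_to_sharegpt(sample)
print(data[0]["conversations"][0]["value"])  # → How do I reset my password?
```

In practice you would read the file with `open(path, newline='', encoding='utf-8')` rather than a string, but the row-to-record mapping is the same.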
🎉 You're Ready! You now have everything needed to transform any data into training-ready datasets. Start with the notebook and follow the 6-step pipeline.