Documentation
Transform any data into training-ready datasets for Thai language models.
What You'll Learn
How to take raw sources such as PDFs and CSVs through a 6-step notebook pipeline and export them in training-ready formats like Alpaca and ShareGPT.
🚀 Quick Start
Go from download to your first dataset in four steps.
1. Download the notebook
No local installation is needed; the notebook installs its own dependencies.
2. Run the setup cell
The first cell installs and verifies all required packages.
3. Upload your data
PDF, CSV, and other source files are supported.
4. Process and export
Run the 6-step pipeline and export in your chosen training format.
```python
# First cell - setup and dependency check
!pip install -q pythainlp pandas numpy
from lib import PipelineManager
print("🚀 Ready to process your data!")
```
📦 Installation
Prerequisites
- Python 3.8+: Required for all processing
- Jupyter Notebook: Or Google Colab for interactive use
- Internet connection: For package installation and LLM features
💡 No Installation Required! The notebook auto-installs all dependencies. Just download and run!
🔄 Pipeline Steps
Six steps take your data from raw input to training-ready export.
Step 1: Upload
Bring in your source files — PDFs, CSVs, or other documents.
Step 2: Extract
Pull text out of the sources, with options such as page-by-page PDF extraction.
Step 3: Process
Normalize, tokenize, and clean Thai text.
Step 4: Enhance
Generate additional examples and variations with LLM assistance.
Step 5: Review
Inspect the generated examples before export.
Step 6: Export
Save the dataset in a training-ready format (Alpaca, ShareGPT, Vicuna).
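The steps above form a linear chain in which each stage transforms the output of the previous one. A minimal sketch of that composition pattern — the step functions here are hypothetical stand-ins for illustration, not the notebook's actual API:

```python
def run_pipeline(data, steps):
    """Run data through each pipeline step in order."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical stand-in steps, illustrating the chaining only
steps = [
    lambda d: d.strip(),               # extract/clean
    lambda d: d.split(),               # tokenize
    lambda d: [t.lower() for t in d],  # normalize
]
print(run_pipeline("  Hello World  ", steps))  # → ['hello', 'world']
```

Because every step takes and returns the working dataset, steps can be reordered or skipped without touching the runner.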
🇹🇭 Thai Language Support
Built on PyThaiNLP for robust handling of Thai and mixed Thai-English text.
🔤 Text Processing Features
- Word Segmentation: Using PyThaiNLP with newmm and attacut tokenizers
- Syllable Counting: Accurate Thai syllable detection for model training
- Text Normalization: Clean and standardize Thai text automatically
- Mixed Language Support: Handle Thai-English mixed content seamlessly
```python
# Thai text processing example
from lib import ThaiTextProcessor  # provided by the notebook's helper library

processor = ThaiTextProcessor()
text = "สวัสดี Hello โลก World"
cleaned = processor.normalize(text)             # standardize the Thai text
tokens = processor.tokenize(cleaned)            # word segmentation (newmm/attacut)
syllables = processor.count_syllables(cleaned)  # Thai syllable count
```
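Mixed-language handling hinges on telling Thai script apart from Latin. PyThaiNLP does real word segmentation; as a stdlib-only sketch of the underlying idea, Thai runs can be detected via the U+0E00–U+0E7F Unicode block (the function name here is my own, not part of any library):

```python
import re

THAI_RUN = r"[\u0E00-\u0E7F]+"

def split_scripts(text):
    """Naively split mixed text into Thai and non-Thai runs."""
    return [m.group(0)
            for m in re.finditer(rf"{THAI_RUN}|[^\u0E00-\u0E7F\s]+", text)]

print(split_scripts("สวัสดี Hello โลก World"))
# → ['สวัสดี', 'Hello', 'โลก', 'World']
```

Note this only separates scripts; Thai has no spaces between words, so splitting Thai runs into words still requires a dictionary-based tokenizer like newmm.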
📤 Export Formats
Choose from multiple training-ready formats:
🦙 Open-Source Model Formats
| Format | Use Case | Example Output |
|---|---|---|
| Alpaca/Llama | Instruction tuning | `{"instruction": "...", "input": "...", "output": "..."}` |
| ShareGPT | Conversation training | `{"conversations": [{"from": "human", "value": "..."}]}` |
| Vicuna | Chat models | `{"id": "...", "conversations": [...]}` |
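All of these formats are plain JSON records, so converting between them is mechanical. A sketch of building the first two shapes from a single Q&A pair — field names come from the table; `"gpt"` as the assistant role is the common ShareGPT convention, and the converter functions themselves are illustrative:

```python
import json

def to_alpaca(question, answer, context=""):
    """Alpaca/Llama instruction-tuning record."""
    return {"instruction": question, "input": context, "output": answer}

def to_sharegpt(question, answer):
    """ShareGPT conversation record."""
    return {"conversations": [
        {"from": "human", "value": question},
        {"from": "gpt", "value": answer},
    ]}

pair = ("สวัสดีแปลว่าอะไร", "It means 'hello' in Thai.")
print(json.dumps(to_alpaca(*pair), ensure_ascii=False))
print(json.dumps(to_sharegpt(*pair), ensure_ascii=False))
```

`ensure_ascii=False` keeps Thai characters readable in the exported JSON instead of escaping them to `\uXXXX` sequences.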
💡 Examples
📄 PDF to Training Data
```python
# 1. Upload a PDF annual report
# 2. Enable page-by-page extraction
# 3. Generate 8 Thai training examples per page
# 4. Export as Alpaca format
# Result: ~960 training examples (8 × 120 pages) from a 120-page report
```
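The per-page arithmetic above reduces to a simple nested loop. A sketch, where `generate_for_page` is a hypothetical stand-in for the notebook's LLM generation step:

```python
def examples_from_pages(pages, per_page=8):
    """Collect per_page generated examples for each extracted page."""
    records = []
    for page in pages:
        for _ in range(per_page):
            records.append(generate_for_page(page))  # hypothetical LLM call
    return records

def generate_for_page(page_text):
    # Stand-in: a real implementation would prompt an LLM with page_text
    return {"instruction": "", "input": page_text, "output": ""}

print(len(examples_from_pages(["page text"] * 120)))  # → 960
```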
💬 CSV to Conversation Data
```python
# 1. Upload customer service CSV
# 2. Map columns: question → human, answer → assistant
# 3. Generate variations with LLM enhancement
# 4. Export as ShareGPT format
# Result: 5,000+ conversation pairs for chatbot training
```
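The column mapping in step 2 can be sketched with the stdlib `csv` module. The `question`/`answer` column names come from the example above; your CSV's headers may differ, which is why they are parameters:

```python
import csv
import io

def csv_to_sharegpt(csv_text, q_col="question", a_col="answer"):
    """Map one CSV row to one ShareGPT conversation record."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{"conversations": [
        {"from": "human", "value": row[q_col]},
        {"from": "gpt", "value": row[a_col]},
    ]} for row in rows]

sample = (
    "question,answer\n"
    "How do I reset my password?,Click 'Forgot password' on the login page.\n"
)
data = csv_to_sharegpt(sample)
print(data[0]["conversations"][0]["value"])  # → How do I reset my password?
```

In practice you would read the file with `open(path, newline='', encoding='utf-8')` rather than a string, but the row-to-record mapping is the same.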
🎉 You're Ready! You now have everything needed to transform any data into training-ready datasets. Start with the notebook and follow the 6-step pipeline.