📚 Documentation

Turn your documents and data into training-ready datasets for fine-tuning language models.

🎯 What You'll Learn

How to run the 6-step pipeline that transforms sources such as PDFs and CSVs into Thai training datasets: setting up the notebook, processing Thai text, generating examples, and exporting them in a model-ready format.

⚡ Quick Start

Get up and running in four steps.

1. Download the notebook

Get the pipeline notebook. No separate installation is needed.

2. Open it in Jupyter or Colab

Use a local Jupyter Notebook or Google Colab.

3. Run the first cell

The setup cell installs all dependencies automatically.

4. Start processing

Upload your data and work through the pipeline cells.

# First cell - Setup and check dependencies
!pip install -q pythainlp pandas numpy
from lib import PipelineManager
print("🚀 Ready to process your data!")

🔧 Installation

Prerequisites

  • Python 3.8+: Required for all processing
  • Jupyter Notebook or Google Colab: For running the pipeline interactively
  • Internet connection: For package installation and LLM features
💡 No Installation Required! The notebook auto-installs all dependencies. Just download and run!
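
Want to check your environment first? A quick sanity check like the one below works (a minimal sketch; the notebook performs its own dependency setup when you run the first cell):

# Optional pre-flight check (a sketch; the notebook's first cell handles setup)
import sys

assert sys.version_info >= (3, 8), "Python 3.8+ is required"

try:
    import pythainlp, pandas, numpy  # auto-installed by the notebook's first cell
    print("All core dependencies are available")
except ImportError as exc:
    print(f"Missing dependency: {exc.name} (the notebook will install it)")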

⚙️ Pipeline Steps

The notebook walks you through six steps that take raw input all the way to an exported dataset: you upload your source data (PDF, CSV, and more), the pipeline extracts its content, cleans and normalizes the text, generates training examples (optionally enhanced with an LLM), and exports them in your chosen format.

🇹🇭 Thai Language Support

The pipeline is built for Thai text, with first-class handling of word segmentation, syllables, and mixed-language content.

🔤 Text Processing Features

  • Word Segmentation: Using PyThaiNLP with newmm and attacut tokenizers
  • Syllable Counting: Accurate Thai syllable detection for model training
  • Text Normalization: Clean and standardize Thai text automatically
  • Mixed Language Support: Handle Thai-English mixed content seamlessly
# Thai text processing example
from lib import ThaiTextProcessor  # assumed export from the project's lib module, like PipelineManager

processor = ThaiTextProcessor()
text = "สวัสดี Hello โลก World"
cleaned = processor.normalize(text)             # standardize the raw text
tokens = processor.tokenize(cleaned)            # segment into Thai words
syllables = processor.count_syllables(cleaned)  # count Thai syllables
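
If you want to experiment outside the notebook, the same operations can be approximated with PyThaiNLP directly (a minimal sketch; ThaiTextProcessor wraps additional project-specific logic on top of this):

# Standalone sketch using PyThaiNLP directly
from pythainlp import word_tokenize
from pythainlp.util import normalize

text = "สวัสดี Hello โลก World"
cleaned = normalize(text)                        # clean up redundant or ill-ordered Thai characters
tokens = word_tokenize(cleaned, engine="newmm")  # dictionary-based word segmentation
print(tokens)                                    # Thai words and English tokens, segmented together

The newmm engine is PyThaiNLP's default tokenizer; pass engine="attacut" to use the attacut tokenizer mentioned above instead.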

📤 Export Formats

Choose from multiple training-ready formats:

🦙 Open-Source Model Formats

Format         Use Case               Example Output
Alpaca/Llama   Instruction tuning     {"instruction": "...", "input": "...", "output": "..."}
ShareGPT       Conversation training  {"conversations": [{"from": "human", "value": "..."}]}
Vicuna         Chat models            {"id": "...", "conversations": [...]}
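
To make the Alpaca format concrete, a record could be written to a JSON Lines file like this (a sketch with a made-up record and file name; the notebook's export step writes these files for you):

import json

# Hypothetical Alpaca-format record; real records come from the export step
record = {
    "instruction": "สรุปข้อความต่อไปนี้",  # "Summarize the following text"
    "input": "ข้อความต้นฉบับ...",          # source passage (placeholder)
    "output": "บทสรุป...",                 # target summary (placeholder)
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line

ensure_ascii=False keeps the Thai characters readable in the output file instead of escaping them.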

💡 Examples

📄 PDF to Training Data

# 1. Upload a PDF annual report
# 2. Enable page-by-page extraction
# 3. Generate 8 Thai training examples per page
# 4. Export as Alpaca format
# Result: ~1,000 training examples from a 120-page report (8 × 120 = 960)
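
Page-by-page extraction can be approximated with pypdf (an illustration only; the file name is hypothetical and the notebook ships its own extraction step):

from pypdf import PdfReader

reader = PdfReader("annual_report.pdf")  # hypothetical input file
pages = [page.extract_text() or "" for page in reader.pages]  # one string per page
print(f"Extracted {len(pages)} pages")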

💬 CSV to Conversation Data

# 1. Upload customer service CSV
# 2. Map columns: question → human, answer → assistant
# 3. Generate variations with LLM enhancement
# 4. Export as ShareGPT format
# Result: 5,000+ conversation pairs for chatbot training
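
The column mapping in step 2 boils down to something like this (a sketch with hypothetical file and column names; the notebook lets you choose the mapping interactively):

import pandas as pd

df = pd.read_csv("support_tickets.csv")  # hypothetical CSV with question/answer columns
sharegpt = [
    {
        "conversations": [
            {"from": "human", "value": row["question"]},
            {"from": "gpt", "value": row["answer"]},  # ShareGPT labels the assistant side "gpt"
        ]
    }
    for _, row in df.iterrows()
]
print(f"Built {len(sharegpt)} conversation records")
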
🎉 You're Ready! You now have everything needed to transform any data into training-ready datasets. Start with the notebook and follow the 6-step pipeline.