Amazon Polly

Amazon Polly is a text-to-speech (TTS) service that uses deep learning to produce natural, human-sounding speech. It offers more than 60 voices across 30+ languages. The Thai examples in this guide use the voice ID "Naja" (a female voice) with the language code th-TH; the AWS voice list documents Naja as a Danish (da-DK) voice, so confirm with DescribeVoices which voices and languages are available before relying on Thai output. This guide covers three Polly engines: Standard TTS, Neural TTS (more natural-sounding), and Long-form TTS (suited to long content).

Beyond plain text-to-speech, Polly supports SSML (Speech Synthesis Markup Language) for fine-grained control over delivery: emphasizing words, inserting pauses, changing pitch and speaking rate, and even whispering. It also provides Speech Marks, which report the timing of each word for lip-sync and word highlighting.

AWS Docs: https://docs.aws.amazon.com/polly/latest/dg/what-is.html

Architecture

Key features

Standard TTS Voices

The original TTS engine, based on concatenative synthesis. It produces clear, intelligible speech and offers a wide choice of voices per language. It suits low-cost, high-volume use cases such as IVR systems, notifications, and basic narration.

Neural TTS Voices

Uses neural networks to produce speech that sounds more natural than Standard TTS, with smoother pacing, more accurate emphasis, and more realistic emotional tone. It suits applications where user experience matters, such as audiobooks, e-learning, and virtual assistants.

Long-form TTS Voices

An engine aimed at especially long content such as news articles, podcasts, and long audiobooks. It is the most natural-sounding of the three engines covered here, with realistic breathing and pacing, and suits professional-grade audio content.

Thai Voice - Naja

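The Thai examples in this guide use a female voice named "Naja", described here as supporting both Standard and Neural TTS and as pronouncing Thai tones and vowels correctly. The AWS voice list documents Naja as a Danish (da-DK) standard voice, so verify voice, language, and engine support with DescribeVoices before building a Thai-language application on it.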

SSML (Speech Synthesis Markup Language)

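Control speech in detail with XML tags. Support varies by engine: AWS documents <emphasis>, the pitch attribute of <prosody>, and the whispered effect as unavailable for neural voices.

<break> inserts a pause between words or sentences
<emphasis> emphasizes a word or phrase
<prosody rate="slow"> controls speaking rate, pitch, and volume
<say-as interpret-as="digits"> sets how numbers, dates, and phone numbers are read
<phoneme> specifies pronunciation in phonetic notation
<amazon:effect name="whispered"> speaks the enclosed text as a whisper
<amazon:effect name="drc"> applies dynamic range compression for noisy environments
<lang xml:lang="en-US"> switches language within a single text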
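The tags above are plain XML, so user-supplied text has to be escaped before it is embedded in them. The following is a minimal sketch of that idea; the helper names (`ssml_text`, `ssml_break`, `ssml_prosody`, `ssml_speak`) are hypothetical and not part of boto3 or Polly. It leaves out `pitch` so the output stays usable with neural voices.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_text(text: str) -> str:
    """Escape plain text so characters such as & and < remain valid XML."""
    return escape(text)


def ssml_break(ms: int) -> str:
    """Return a <break> tag that pauses for the given number of milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def ssml_prosody(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap escaped text in <prosody>; pitch is intentionally omitted."""
    return (
        f"<prosody rate={quoteattr(rate)} volume={quoteattr(volume)}>"
        f"{escape(text)}</prosody>"
    )


def ssml_speak(*parts: str) -> str:
    """Join already-built SSML fragments inside a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


doc = ssml_speak(
    ssml_text("Sale: 50% off & free shipping"),
    ssml_break(500),
    ssml_prosody("Ask at the information counter", rate="slow", volume="soft"),
)
print(doc)
```

The resulting string can be passed as `Text` with `TextType='ssml'`.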

Custom Lexicons

Create a dictionary that defines how to pronounce specialized terms the system might otherwise mispronounce, such as brand names, abbreviations, or technical vocabulary. Lexicons use the PLS (Pronunciation Lexicon Specification) format.

Speech Marks

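Metadata describing the synthesized speech, requested with the json output format in a call separate from the audio request. Each mark carries the timestamp of a word, sentence, viseme (mouth shape), or SSML mark. Typical uses:

Lip-sync: synchronize an avatar's mouth movements with the audio
Word highlighting: highlight the word currently being spoken while the audio plays
Karaoke-style display: show lyrics in time with the audio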

Real-time Streaming

Stream audio in real time without waiting for the whole file to download, which reduces latency for interactive applications such as voice assistants and IVR systems. Supported formats are MP3, OGG, and PCM.
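As a rough illustration of consuming the audio incrementally, the sketch below copies a stream in fixed-size chunks instead of calling `read()` once on the whole body. The function name `copy_audio_stream` and the 4096-byte chunk size are assumptions made for this example. It relies only on a `read(size)` method, which the `AudioStream` body returned by boto3 provides; here an in-memory `io.BytesIO` stands in for it so the example runs offline.

```python
import io
from typing import BinaryIO


def copy_audio_stream(stream, out: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy a file-like audio stream to `out` chunk by chunk; return bytes written."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        out.write(chunk)
        total += len(chunk)
    return total


# Stand-in for response['AudioStream']; real audio bytes would come from Polly.
fake_stream = io.BytesIO(b"\x00" * 10000)
sink = io.BytesIO()
written = copy_audio_stream(fake_stream, sink)
print(written)
```

In an application, `out` could be a file, a socket, or an audio player's input buffer, so playback can begin before synthesis has finished.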

Asynchronous Synthesis Tasks

Generate large audio files asynchronously from long text, with the result written directly to S3. This suits producing audiobooks or podcasts ahead of time.

Installation and setup

1. Enable via the AWS Console

Open the AWS Console and search for "Amazon Polly"
Try Text-to-Speech directly on the Console page
Choose a voice, language, and engine (Standard/Neural/Long-form)

2. Install boto3 and dependencies

pip install boto3 pygame  # pygame for audio playback
# or
pip install boto3 playsound

3. Required IAM permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "polly:SynthesizeSpeech",
        "polly:StartSpeechSynthesisTask",
        "polly:GetSpeechSynthesisTask",
        "polly:ListSpeechSynthesisTasks",
        "polly:PutLexicon",
        "polly:GetLexicon",
        "polly:ListLexicons",
        "polly:DescribeVoices"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-audio-output/*"
    }
  ]
}

4. Example boto3 setup

import boto3

polly = boto3.client(
    service_name='polly',
    region_name='ap-southeast-1'
)
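Before hard-coding a voice, it may help to ask Polly which voices exist for a language and which engines each one supports. The sketch below pages through `describe_voices` and filters on the client side. The function name `list_voices` is hypothetical, and `FakePolly` is an offline stand-in with illustrative data so the example runs without AWS credentials; in practice you would pass the `polly` client created above.

```python
def list_voices(polly, language_prefix: str) -> list:
    """Return (voice id, language code, engines) for matching voices."""
    matches = []
    kwargs = {}
    while True:
        page = polly.describe_voices(**kwargs)
        for voice in page.get("Voices", []):
            if voice["LanguageCode"].startswith(language_prefix):
                engines = voice.get("SupportedEngines", [])
                matches.append((voice["Id"], voice["LanguageCode"], engines))
        token = page.get("NextToken")
        if not token:  # no further pages
            return matches
        kwargs = {"NextToken": token}


class FakePolly:
    """Offline stand-in returning two illustrative pages, not real API output."""

    def describe_voices(self, **kwargs):
        if "NextToken" not in kwargs:
            return {
                "Voices": [{"Id": "Joanna", "LanguageCode": "en-US",
                            "SupportedEngines": ["neural", "standard"]}],
                "NextToken": "page-2",
            }
        return {"Voices": [{"Id": "Naja", "LanguageCode": "da-DK",
                            "SupportedEngines": ["standard"]}]}


print(list_voices(FakePolly(), "da"))
print(list_voices(FakePolly(), "th"))
```

An empty result for a prefix such as `th` means no voice for that language was returned, which is worth checking before running the examples that follow.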

Usage

Basic synthesis (Neural TTS, Thai)

import boto3

polly = boto3.client('polly', region_name='ap-southeast-1')

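# Synthesize Thai speech.
# Check describe_voices() first: the AWS voice list documents Naja as a Danish (da-DK) voice.
response = polly.synthesize_speech(
    Text="สวัสดีค่ะ ยินดีต้อนรับสู่ Amazon Web Services ประเทศไทย",  # "Hello, welcome to Amazon Web Services Thailand"
    OutputFormat='mp3',
    VoiceId='Naja',        # voice used throughout this guide (female)
    Engine='neural',       # Neural TTS sounds more natural than Standard
    LanguageCode='th-TH'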
)

# Save the audio file
with open('thai-greeting.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())
print("Audio saved: thai-greeting.mp3")
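A single `synthesize_speech` request accepts a limited amount of text (AWS documents 3,000 billed characters), so longer input has to be split first. The sketch below is one simple way to do that: it packs whitespace-separated pieces greedily and hard-splits any piece that exceeds the limit on its own. The function name `split_text` is hypothetical. It collapses runs of whitespace, and for Thai, which usually has spaces only between phrases, the hard split may land mid-word.

```python
def split_text(text: str, max_chars: int = 3000) -> list:
    """Split text into chunks of at most max_chars, breaking at whitespace."""
    chunks = []
    current = ""
    for word in text.split():
        # A single piece longer than the limit is cut into max_chars slices.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_text("aa bb cc dd", max_chars=5))
print(split_text("abcdefgh", max_chars=3))
```

Each chunk can then be synthesized with its own request and the resulting audio concatenated, or the whole text can be handed to an asynchronous task instead.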

Use SSML for advanced control

# Use SSML to control how the text is spoken
ssml_text = """
<speak>
    <p>
        ยินดีต้อนรับสู่บริการของเรา
        <break time="500ms"/>
        กรุณาฟังตัวเลือกต่อไปนี้
    </p>
    <p>
        กด <emphasis level="strong">หนึ่ง</emphasis> สำหรับบริการลูกค้า
        <break time="300ms"/>
        กด <emphasis level="strong">สอง</emphasis> สำหรับแผนกบัญชี
        <break time="300ms"/>
        กด <emphasis level="strong">ศูนย์</emphasis> เพื่อคุยกับเจ้าหน้าที่
    </p>
    <p>
        <prosody rate="slow">
            เพื่อฟังซ้ำ กรุณากด ดอกจัน
        </prosody>
    </p>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml_text,
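    TextType='ssml',       # tell Polly the input is SSML
    OutputFormat='mp3',
    VoiceId='Naja',
    Engine='neural',       # AWS documents <emphasis> as unsupported for neural voices; use 'standard' if emphasis matters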
    LanguageCode='th-TH'
)

with open('ivr-menu.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())

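Controlling prosody (rate, pitch, volume)

# AWS documents the prosody pitch attribute as unsupported for neural voices;
# switch to Engine='standard' if the pitch changes matter.
ssml_with_prosody = """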
<speak>
    <prosody rate="fast" pitch="+5%" volume="loud">
        ข่าวด่วน! ลดราคาสินค้าทุกชิ้น 50% วันนี้เท่านั้น!
    </prosody>
    <break time="1s"/>
    <prosody rate="slow" pitch="-5%" volume="soft">
        สอบถามรายละเอียดเพิ่มเติมได้ที่เคาน์เตอร์ประชาสัมพันธ์
    </prosody>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml_with_prosody,
    TextType='ssml',
    OutputFormat='mp3',
    VoiceId='Naja',
    Engine='neural',
    LanguageCode='th-TH'
)

Speech Marks for word highlighting

import json

# Request speech marks for each word
response = polly.synthesize_speech(
    Text="สวัสดีครับ ผมชื่อสมชาย ทำงานที่ AWS",  # "Hello, my name is Somchai. I work at AWS."
    OutputFormat='json',       # speech marks are returned as JSON lines
    VoiceId='Naja',
    LanguageCode='th-TH',
    SpeechMarkTypes=['word', 'sentence']  # request word- and sentence-level marks
)

# Parse Speech Marks
speech_marks = []
for line in response['AudioStream'].read().decode('utf-8').strip().split('\n'):
    if line:
        mark = json.loads(line)
        speech_marks.append(mark)
        print(f"Type: {mark['type']} | Time: {mark['time']}ms | Value: {mark.get('value', '')}")

# Use this data to drive word highlighting in the UI
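To turn parsed marks into highlighting, a player needs to know which word is active at the current playback position. The sketch below shows one way to do that with a binary search over the word timestamps. The function names and the sample marks are made up for illustration; the fields (`time`, `type`, `start`, `end`, `value`) follow the shape of the JSON lines parsed above.

```python
import bisect
import json


def parse_word_marks(jsonl: str) -> list:
    """Parse JSON-lines speech marks and keep word marks sorted by time."""
    marks = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    words = [m for m in marks if m.get("type") == "word"]
    return sorted(words, key=lambda m: m["time"])


def active_word(words: list, playback_ms: int):
    """Return the last word whose start time is <= playback_ms, or None."""
    times = [m["time"] for m in words]
    index = bisect.bisect_right(times, playback_ms) - 1
    return words[index] if index >= 0 else None


# Illustrative sample, not real Polly output.
sample = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":11,"value":"Hello world"}',
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}',
    '{"time":370,"type":"word","start":6,"end":11,"value":"world"}',
])
words = parse_word_marks(sample)
print(active_word(words, 400)["value"])
```

The `start` and `end` fields of the returned mark give the offsets of the word in the input text, which a UI can use to decide which span to highlight.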

Neural TTS for multiple languages

# Commonly used voices
voices = {
    'th': {'id': 'Naja', 'engine': 'neural', 'lang': 'th-TH'},
    'en': {'id': 'Joanna', 'engine': 'neural', 'lang': 'en-US'},
    'ja': {'id': 'Takumi', 'engine': 'neural', 'lang': 'ja-JP'},
    'zh': {'id': 'Zhiyu', 'engine': 'neural', 'lang': 'cmn-CN'},
    'ko': {'id': 'Seoyeon', 'engine': 'neural', 'lang': 'ko-KR'},
}

def text_to_speech(text, language='th', output_file='output.mp3'):
    voice = voices.get(language, voices['en'])
    
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat='mp3',
        VoiceId=voice['id'],
        Engine=voice['engine'],
        LanguageCode=voice['lang']
    )
    
    with open(output_file, 'wb') as f:
        f.write(response['AudioStream'].read())
    return output_file

# Generate audio in several languages
text_to_speech("สวัสดีครับ", 'th', 'greeting-th.mp3')
text_to_speech("Hello, welcome!", 'en', 'greeting-en.mp3')
text_to_speech("こんにちは", 'ja', 'greeting-ja.mp3')

Async task for a long audiobook

# Create a large audiobook asynchronously
with open('book-chapter-1.txt', 'r', encoding='utf-8') as f:
    long_text = f.read()

response = polly.start_speech_synthesis_task(
    Text=long_text,
    OutputFormat='mp3',
    VoiceId='Naja',
    Engine='long-form',    # long-form engine; confirm the chosen voice lists it in SupportedEngines
    LanguageCode='th-TH',
    OutputS3BucketName='my-audio-output',  # same bucket the IAM policy above grants s3:PutObject on
    OutputS3KeyPrefix='thai-novel/chapter-1'
)

task_id = response['SynthesisTask']['TaskId']
print(f"Task started: {task_id}")

# Poll the task status
import time
while True:
    result = polly.get_speech_synthesis_task(TaskId=task_id)
    status = result['SynthesisTask']['TaskStatus']
    print(f"Status: {status}")
    
    if status == 'completed':
        print(f"Output: {result['SynthesisTask']['OutputUri']}")
        break
    elif status == 'failed':
        print(f"Error: {result['SynthesisTask']['TaskStatusReason']}")
        break
    time.sleep(10)

Custom lexicon for domain-specific terms

# Create a custom lexicon for domain-specific pronunciations
lexicon_content = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa"
         xml:lang="th-TH">
  <lexeme>
    <grapheme>AWS</grapheme>
    <alias>เอ ดับเบิลยู เอส</alias>
  </lexeme>
  <lexeme>
    <grapheme>API</grapheme>
    <alias>เอ พี ไอ</alias>
  </lexeme>
  <lexeme>
    <grapheme>CPU</grapheme>
    <alias>ซี พี ยู</alias>
  </lexeme>
</lexicon>"""

# Upload the lexicon (names must be alphanumeric, up to 20 characters)
polly.put_lexicon(
    Name='TechTermsTH',
    Content=lexicon_content
)

# Use the lexicon during synthesis
response = polly.synthesize_speech(
    Text="ระบบ AWS ใช้ API และมี CPU ประมวลผล",  # "The AWS system uses an API and has a CPU for processing"
    OutputFormat='mp3',
    VoiceId='Naja',
    Engine='neural',
    LanguageCode='th-TH',
    LexiconNames=['TechTermsTH']
)
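When the term list lives in code or a database, the lexicon XML can be generated rather than written by hand, which also takes care of escaping. The sketch below builds an alias-only PLS document with the standard library. The function name `build_lexicon` and the sample entries are hypothetical, and the `alphabet="ipa"` attribute is included on the assumption that the service expects it, as the AWS examples do.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(aliases: dict, lang: str) -> str:
    """Build a PLS lexicon mapping each grapheme to a spoken alias."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    root = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": "ipa",
        f"{{{XML_NS}}}lang": lang,
    })
    for grapheme, alias in aliases.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = alias
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


lexicon_xml = build_lexicon({"AWS": "A W S", "R&D": "research and development"}, "en-US")
print(lexicon_xml)
```

The returned string can be passed as `Content` to `put_lexicon`.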

Pricing (estimates in Thai baht)

Assumed exchange rate: 1 USD = 35 THB

Engine	Price (USD)	Price (THB)	Best suited for
Standard TTS	$4.00/1M chars	140 THB/1M chars	IVR, notifications, basic narration
Neural TTS	$16.00/1M chars	560 THB/1M chars	Apps, e-learning, virtual assistants
Long-form TTS	$100.00/1M chars	3,500 THB/1M chars	Audiobooks, podcasts, news
Speech Marks	$4.00/1M chars	140 THB/1M chars	Lip-sync, word highlighting

Free Tier:

Standard voices: 5,000,000 characters/month for the first 12 months
Neural voices: 1,000,000 characters/month for the first 12 months

Example calculations:

E-learning narration for 100 courses averaging 50,000 characters per course
Total = 100 × 50,000 = 5M characters; with Neural TTS, 5M × $16/1M = $80 (~2,800 THB)
An IVR system handling 10,000 calls/day at an average of 200 characters/call = 2M characters/day
Standard TTS cost = 2M × $4/1M = $8/day (~280 THB/day)
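The arithmetic above can be wrapped in a small estimator. This is a sketch that uses only the per-million rates and the 35 THB/USD rate quoted in this section; it ignores the Free Tier, and the names `estimate_cost`, `USD_PER_MILLION`, and `THB_PER_USD` are hypothetical.

```python
from decimal import Decimal

# Rates as quoted in the pricing table above (USD per 1M characters).
USD_PER_MILLION = {
    "standard": Decimal("4.00"),
    "neural": Decimal("16.00"),
    "long-form": Decimal("100.00"),
}
THB_PER_USD = Decimal("35")  # exchange rate assumed in this section


def estimate_cost(characters: int, engine: str) -> tuple:
    """Return (USD, THB) for synthesizing the given number of characters."""
    usd = Decimal(characters) / Decimal(1_000_000) * USD_PER_MILLION[engine]
    return usd, usd * THB_PER_USD


elearning = estimate_cost(100 * 50_000, "neural")
ivr_daily = estimate_cost(10_000 * 200, "standard")
print(elearning, ivr_daily)
```

Using `Decimal` keeps the currency figures exact instead of accumulating floating-point rounding.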

Best suited for

IVR and call center systems: automated voice responses and phone menus
E-learning and education: produce lesson narration in multiple languages without hiring voice actors
News and article apps: add a "read aloud" feature so users can listen to content while commuting
Accessibility: a high-quality screen reader for visually impaired users
Audiobooks and podcasts: produce audio content at lower cost than hiring voice actors
Voice assistants and chatbots: give a virtual assistant natural-sounding spoken replies

Used with other AWS services

AWS Service	Integration
Amazon Translate	Translate text, then have Polly read it aloud in other languages
Amazon Transcribe	Transcribe speech to text, then have Polly read it back
Amazon Lex	Build a voice chatbot that replies through Polly
Amazon S3	Store synthesized audio files
Amazon CloudFront	Serve audio files with low latency worldwide
AWS Lambda	Build a serverless TTS pipeline
Amazon Connect	Build a contact center IVR with natural-sounding speech
Amazon DynamoDB	Cache audio file URLs to reduce API calls

Example use cases

1. Smart IVR for a call center

A large bank uses Polly Neural TTS to generate natural-sounding IVR prompts in place of its earlier pre-recorded voice. The system can speak dynamic data on the fly, such as account balances, payment due dates, and recent transactions, with no need to record audio in advance. SSML formats numbers so they are read correctly, for example as "twenty-five thousand baht" rather than "2, 5, 0, 0, 0". Amazon Lex takes the customer's voice input and Polly speaks the reply. Customers get a much smoother experience: the IVR's CSAT score rose from 3.2 to 4.1 out of 5.

2. Multilingual e-learning platform

An EdTech company produces 50 new courses per month. It used to hire voice actors for each language; it now uses Polly Neural TTS to generate narration in 6 languages automatically. As soon as the content team uploads a script, a Lambda function calls Translate to translate the content and then calls Polly to synthesize audio in every language. Speech Marks highlight the word currently being spoken in the transcript, which makes it easier for learners to follow. Audio production costs fell by more than 75%, and the time to release a course dropped from 2 weeks to 2 days.
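The translate-then-synthesize step described above could be sketched as follows. This is an illustration under assumptions, not the company's actual pipeline: the function name `narrate_in_languages` is hypothetical, the voice map reuses voice IDs from this guide's multi-language example, and the two `Fake*` classes are offline stand-ins for the boto3 `translate` and `polly` clients so the example runs without AWS.

```python
import io

# Voice per target language, taken from the multi-language example in this guide.
VOICES = {"en": "Joanna", "ja": "Takumi", "ko": "Seoyeon"}


def narrate_in_languages(translate, polly, script: str,
                         source_lang: str, targets: list) -> dict:
    """Translate a script into each target language and synthesize MP3 audio."""
    audio = {}
    for lang in targets:
        if lang == source_lang:
            text = script  # no translation needed
        else:
            text = translate.translate_text(
                Text=script,
                SourceLanguageCode=source_lang,
                TargetLanguageCode=lang,
            )["TranslatedText"]
        response = polly.synthesize_speech(
            Text=text, OutputFormat="mp3",
            VoiceId=VOICES[lang], Engine="neural",
        )
        audio[lang] = response["AudioStream"].read()
    return audio


class FakeTranslate:
    """Offline stand-in: tags the text instead of translating it."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        return {"TranslatedText": f"[{TargetLanguageCode}] {Text}"}


class FakePolly:
    """Offline stand-in: returns the voice and text as fake audio bytes."""

    def synthesize_speech(self, **kwargs):
        payload = f"{kwargs['VoiceId']}:{kwargs['Text']}".encode("utf-8")
        return {"AudioStream": io.BytesIO(payload)}


audio = narrate_in_languages(FakeTranslate(), FakePolly(), "Welcome", "en", ["en", "ja"])
print(sorted(audio))
```

Inside a Lambda handler, the real clients would be created once with `boto3.client('translate')` and `boto3.client('polly')` and the audio written to S3.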

3. News reader app with audio

A popular news app adds a "listen to the news" feature so users can hear articles while driving or exercising. It uses Polly Long-form TTS for long articles that need the most natural-sounding speech. The system pre-generates audio for popular stories and caches it in S3 behind CloudFront for low latency. For new articles, Lambda generates audio on demand the first time a user opens the article to listen, then caches the result. The feature raised engagement time per user by 40%, because users can consume content while doing other activities.
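Caching synthesized audio, as this use case describes, needs a key that changes whenever the text or voice settings change. One possible approach, sketched below, is to hash those inputs into an object key. The function name `audio_cache_key`, the `tts-cache` prefix, and the key layout are assumptions made for this example.

```python
import hashlib


def audio_cache_key(text: str, voice_id: str, engine: str,
                    output_format: str = "mp3",
                    prefix: str = "tts-cache") -> str:
    """Derive a deterministic object key from everything that affects the audio."""
    # The unit separator keeps the fields from running into each other.
    fingerprint = "\x1f".join([voice_id, engine, output_format, text])
    digest = hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
    return f"{prefix}/{voice_id}/{engine}/{digest}.{output_format}"


key_a = audio_cache_key("Breaking news: markets rally", "Joanna", "neural")
key_b = audio_cache_key("Breaking news: markets rally", "Joanna", "neural")
key_c = audio_cache_key("Breaking news: markets fall", "Joanna", "neural")
print(key_a)
```

An on-demand handler can check whether an object with this key already exists and call Polly only on a miss, so edited articles get fresh audio while unchanged ones are served from the cache.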
