AWS HealthOmics

AWS HealthOmics is a cloud-native service for storing, processing, and analyzing genomics and multi-omics data at scale. It is designed for genomics researchers, biotech and pharma companies, and hospitals that offer precision medicine, and it handles omics data ranging from raw sequencing reads to variants and annotations.

AWS HealthOmics includes a workflow engine that runs standard bioinformatics pipelines (WDL, Nextflow, CWL) while AWS manages all of the compute infrastructure. It also offers validated Ready2Run workflows such as GATK Best Practices and DeepVariant. The service is HIPAA eligible and supports GxP compliance, so it can be used for clinical research and regulatory submissions.

AWS Docs: https://docs.aws.amazon.com/omics/latest/dev/what-is-service.html

Architecture

Key Features

Sequence Stores (FASTQ, BAM, CRAM)

Stores raw sequencing data in FASTQ, BAM, and CRAM formats efficiently. The service compresses and indexes the data automatically, which lowers storage cost and speeds up retrieval. It supports paired-end reads, long reads (PacBio, Oxford Nanopore), and single-cell sequencing.

Variant Stores (VCF)

Stores genomic variants in VCF (Variant Call Format) with lossless compression that substantially reduces file size. It supports small variants (SNPs, INDELs) and structural variants, and the variants can be queried with SQL through Amazon Athena for large-scale population analysis.

Annotation Stores (GFF, TSV, VCF)

Stores genomic annotations in GFF, BED, TSV, and VCF formats, such as gene annotations, regulatory elements, population frequency databases (gnomAD), and clinical significance databases (ClinVar). Variants can be cross-referenced against these annotations for functional interpretation.

Reference Stores (Genome References)

Manages reference genome assemblies such as GRCh38, GRCh37, and T2T-CHM13 in one place. Every workflow reads its reference from the central store, so the whole project uses the same reference. AWS public reference genomes are available at no charge.

Omics Workflows (WDL/Nextflow/CWL)

Runs bioinformatics workflows written in WDL (Workflow Description Language), Nextflow, or CWL (Common Workflow Language). AWS handles compute provisioning, scaling, and job scheduling, so there is no HPC cluster to maintain.

Ready2Run Workflows (GATK, DeepVariant)

Pre-built, validated workflows are available, including:

GATK Best Practices - germline variant calling, somatic variant calling
DeepVariant - deep learning-based variant calling from Google
RNA-seq Analysis - transcript quantification and differential expression
DRAGEN - Illumina's accelerated secondary analysis pipeline

Analytics (Query Variants at Scale with Athena)

Query genomic variants in variant stores with SQL through Amazon Athena for population-scale analysis, for example finding variants seen in cancer patients, analyzing allele frequencies in the Thai population, or cross-referencing ClinVar pathogenicity.

Cross-Account Sharing

Share omics data securely between AWS accounts without moving the data. This suits consortium research in which several institutions analyze the same data while the data owner keeps control over access.

Compliance (HIPAA, GxP)

HealthOmics is a HIPAA eligible service that can be covered under an AWS Business Associate Addendum (BAA). It supports GxP (Good Practice) requirements for pharmaceutical research and clinical trials, including work that feeds into FDA regulatory submissions.

Installation and Setup

Enable via the Console

Open AWS Console > AWS HealthOmics
Create a Reference Store and upload a reference genome
Create a Sequence Store and import sequencing data
Create a Variant Store for VCF files
Create a Workflow (upload a WDL/Nextflow script)
Run the workflow and monitor its progress

Install the SDK

pip install boto3
# For workflow development
pip install miniwdl
# Nextflow is a Java application with its own installer; it is not an npm package
curl -s https://get.nextflow.io | bash

Required IAM Permissions

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "omics:*",
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "iam:PassRole",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
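
The policy above allows `iam:PassRole` and the S3 actions on every resource, which is convenient for a first test but broader than most accounts want. The following is a minimal sketch of a tighter variant, not an official AWS template. The function name, the role ARN, and the bucket name are hypothetical; `iam:PassedToService` is a standard IAM condition key, and `omics.amazonaws.com` is assumed to be the service principal that the role is passed to.

```python
import json


def build_omics_policy(role_arn: str, bucket: str) -> dict:
    """Build an IAM policy document scoped to one service role and one bucket.

    role_arn and bucket are hypothetical inputs; trim the action lists to
    what your own workflows need.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # HealthOmics API calls (still broad; narrow further if possible)
                "Effect": "Allow",
                "Action": ["omics:*"],
                "Resource": "*",
            },
            {   # Allow passing only this role, and only to HealthOmics
                "Effect": "Allow",
                "Action": ["iam:PassRole"],
                "Resource": role_arn,
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "omics.amazonaws.com"}
                },
            },
            {   # Object-level access inside the staging bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {   # ListBucket applies to the bucket itself, not to its objects
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # CloudWatch Logs, unchanged from the policy above
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    policy = build_omics_policy(
        "arn:aws:iam::123456789012:role/OmicsWorkflowRole", "my-omics-bucket"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or attached with your usual tooling.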

Usage

Create Stores and Import Data

import boto3

omics = boto3.client('omics', region_name='us-east-1')

# Create a Reference Store
ref_store_response = omics.create_reference_store(
    name='genome-references',
    description='Human genome reference assemblies',
    sseConfig={
        'type': 'KMS',
        'keyArn': 'arn:aws:kms:us-east-1:123456789:key/my-key'
    }
)
reference_store_id = ref_store_response['id']

# Import reference genome (GRCh38)
ref_import = omics.start_reference_import_job(
    referenceStoreId=reference_store_id,
    roleArn='arn:aws:iam::123456789:role/OmicsRole',
    sources=[
        {
            'sourceFile': 's3://my-omics-bucket/references/hg38.fa',
            'name': 'GRCh38',
            'description': 'Human Reference Genome GRCh38/hg38',
            'tags': {'version': 'hg38', 'organism': 'human'}
        }
    ]
)

# Create a Sequence Store
seq_store_response = omics.create_sequence_store(
    name='patient-sequencing-data',
    description='WGS and WES patient data',
    sseConfig={
        'type': 'KMS',
        'keyArn': 'arn:aws:kms:us-east-1:123456789:key/my-key'
    },
    fallbackLocation='s3://my-omics-bucket/fallback/'
)
sequence_store_id = seq_store_response['id']

# Import sequencing reads (FASTQ)
reads_import = omics.start_read_set_import_job(
    sequenceStoreId=sequence_store_id,
    roleArn='arn:aws:iam::123456789:role/OmicsRole',
    sources=[
        {
            'sourceFiles': {
                'source1': 's3://my-omics-bucket/samples/patient001_R1.fastq.gz',
                'source2': 's3://my-omics-bucket/samples/patient001_R2.fastq.gz'
            },
            'sourceFileType': 'FASTQ',
            'subjectId': 'PATIENT_001',
            'sampleId': 'SAMPLE_001',
            'name': 'patient001_wgs',
            'description': 'Whole Genome Sequencing - Patient 001',
            'referenceArn': f'arn:aws:omics:us-east-1:123456789:referenceStore/{reference_store_id}/reference/REF_ID'
        }
    ]
)
print(f"Import Job ID: {reads_import['id']}")
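
The import call above only submits the job; the read set is not usable until the job finishes. A minimal polling sketch follows, assuming the `get_read_set_import_job` call of the boto3 `omics` client and its `status` field. The helper name, the poll interval, and the retry limit are illustrative choices, and the client is passed in so the function can be exercised without AWS access.

```python
import time

# States after which a read set import job no longer changes
TERMINAL_STATES = {"COMPLETED", "COMPLETED_WITH_FAILURES", "FAILED", "CANCELLED"}


def wait_for_read_set_import(client, sequence_store_id, job_id,
                             poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_read_set_import_job until the job reaches a terminal state.

    client is a boto3 'omics' client (or any object with the same method).
    Returns the final status string; raises TimeoutError after max_polls tries.
    """
    for _ in range(max_polls):
        job = client.get_read_set_import_job(
            sequenceStoreId=sequence_store_id, id=job_id
        )
        status = job["status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"import job {job_id} still running after {max_polls} polls")


# Usage with the objects created above (requires AWS credentials):
# final = wait_for_read_set_import(omics, sequence_store_id, reads_import['id'])
# if final != "COMPLETED":
#     raise RuntimeError(f"read set import ended with status {final}")
```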

Create and Run a Workflow (WDL)

# Read the zipped workflow package (it must contain variant_calling.wdl)
with open('workflow_package.zip', 'rb') as f:
    workflow_zip = f.read()

workflow_response = omics.create_workflow(
    name='gatk-germline-variant-calling',
    description='GATK Best Practices Germline Variant Calling',
    engine='WDL',
    definitionZip=workflow_zip,
    main='variant_calling.wdl',
    parameterTemplate={
        'input_bam': {
            'description': 'Input BAM file from alignment',
            'optional': False
        },
        'reference_fasta': {
            'description': 'Reference genome FASTA',
            'optional': False
        },
        'sample_name': {
            'description': 'Sample identifier',
            'optional': False
        }
    },
    tags={
        'pipeline': 'gatk-best-practices',
        'version': '4.4.0'
    }
)
workflow_id = workflow_response['id']

# Run the workflow for a patient sample
run_response = omics.start_run(
    workflowId=workflow_id,
    workflowType='PRIVATE',
    roleArn='arn:aws:iam::123456789:role/OmicsWorkflowRole',
    name='patient001-variant-calling',
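    # READSET_ID and REF_ID are placeholders for real resource IDs; confirm the
    # exact omics:// URI format for read sets and references in the HealthOmics docs
    parameters={
        'input_bam': f'omics://seq-store/{sequence_store_id}/readset/READSET_ID',
        'reference_fasta': f'omics://ref-store/{reference_store_id}/reference/REF_ID',
        'sample_name': 'PATIENT_001'
    },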
    outputUri='s3://my-omics-bucket/workflow-outputs/',
    requestId='unique-request-001'
)
run_id = run_response['id']
print(f"Workflow Run ID: {run_id}")
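
The `requestId` passed to `start_run` is there so that a retried request is not started twice, which means a hard-coded value such as `'unique-request-001'` has to be changed by hand for every new run. One possible approach, sketched here under the assumption that a run is fully identified by its workflow ID and parameters, is to derive the ID from those inputs. The helper name and the `run-` prefix are hypothetical.

```python
import hashlib
import json


def make_request_id(workflow_id: str, parameters: dict, prefix: str = "run") -> str:
    """Derive a stable request ID from the workflow ID and its parameters.

    The same inputs always give the same ID, so a retried start_run call
    reuses the token; different inputs give a different ID.
    """
    # Sort keys so that dict ordering does not change the hash
    canonical = json.dumps(
        {"workflow": workflow_id, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:32]}"


if __name__ == "__main__":
    params = {"sample_name": "PATIENT_001", "input_bam": "omics://placeholder"}
    print(make_request_id("1234567", params))
```

If the same sample ever needs to be rerun on purpose, adding a counter or date to the parameters passed to the helper yields a fresh ID.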

Use a Ready2Run Workflow

# List the available Ready2Run workflows
ready2run = omics.list_workflows(type='READY2RUN')

print("Available Ready2Run Workflows:")
for workflow in ready2run.get('items', []):
    print(f"- {workflow['name']}: {workflow['id']}")

# Run the DeepVariant Ready2Run workflow (replace the placeholder workflow ID)
deepvariant_run = omics.start_run(
    workflowId='READY2RUN_DEEPVARIANT_ID',
    workflowType='READY2RUN',
    roleArn='arn:aws:iam::123456789:role/OmicsWorkflowRole',
    name='patient001-deepvariant',
    parameters={
        'input_cram': f'omics://seq-store/{sequence_store_id}/readset/READSET_ID',
        'reference': f'omics://ref-store/{reference_store_id}/reference/REF_ID',
        'output_vcf': 's3://my-omics-bucket/deepvariant-output/patient001.vcf.gz'
    },
    outputUri='s3://my-omics-bucket/run-outputs/',
    storageCapacity=100  # GB
)
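
The `READY2RUN_DEEPVARIANT_ID` placeholder has to be replaced with a real workflow ID, and a single `list_workflows` call may return only the first page of results. A small sketch of looking the ID up by name follows, assuming boto3 exposes a paginator for `list_workflows` on the `omics` client. Both helper names and the search keyword are hypothetical, and the actual workflow names in your region may differ.

```python
def list_ready2run_workflows(client):
    """Return (name, id) pairs for every Ready2Run workflow, across all pages."""
    paginator = client.get_paginator("list_workflows")
    found = []
    for page in paginator.paginate(type="READY2RUN"):
        for wf in page.get("items", []):
            found.append((wf["name"], wf["id"]))
    return found


def find_workflow_id(workflows, keyword):
    """Return the ID of the first workflow whose name contains keyword.

    The match is case-insensitive; returns None when nothing matches.
    """
    needle = keyword.lower()
    for name, wf_id in workflows:
        if needle in name.lower():
            return wf_id
    return None


# Usage with the client created earlier (requires AWS credentials):
# workflows = list_ready2run_workflows(omics)
# deepvariant_id = find_workflow_id(workflows, "deepvariant")
```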

Query Variants with Athena

# Create a Variant Store
variant_store_response = omics.create_variant_store(
    name='population-variants',
    reference={
        'referenceArn': f'arn:aws:omics:us-east-1:123456789:referenceStore/{reference_store_id}/reference/REF_ID'
    },
    sseConfig={'type': 'KMS', 'keyArn': 'arn:aws:kms:us-east-1:123456789:key/my-key'},
    tags={'project': 'thai-genome-project'}
)

# Import VCF files after variant calling
omics.start_variant_import_job(
    destinationName='population-variants',
    roleArn='arn:aws:iam::123456789:role/OmicsRole',
    items=[
        {'source': 's3://my-omics-bucket/vcf/patient001.vcf.gz'},
        {'source': 's3://my-omics-bucket/vcf/patient002.vcf.gz'}
    ]
)

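# Query the variants with Athena
# (the table and column names in the query are illustrative; match them to the
# schema that your variant store exposes in Athena)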
athena = boto3.client('athena')

query = """
SELECT 
    chromosome,
    start,
    "end",
    reference_allele,
    alternate_allele,
    quality,
    filter,
    info.CLNSIG as clinical_significance,
    COUNT(*) as sample_count
FROM 
    healthomics.population_variants
WHERE 
    chromosome = 'chr17'
    AND start BETWEEN 43044295 AND 43125364  -- BRCA1 gene region
    AND filter = 'PASS'
    AND info.AF > 0.01  -- variants found in more than 1% of the population
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8
ORDER BY sample_count DESC
LIMIT 50
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'healthomics'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/omics/'}
)
print(f"Athena Query: {response['QueryExecutionId']}")
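
`start_query_execution` returns as soon as the query is queued, so the results still have to be fetched separately. The sketch below waits for the query and converts the first page of results into dictionaries, using the Athena `get_query_execution` and `get_query_results` calls. The helper names and polling defaults are illustrative, and larger result sets would need further pages, which this sketch does not fetch.

```python
import time


def rows_to_dicts(result):
    """Convert a get_query_results response into a list of dicts.

    For SELECT queries the first row of the first page holds the column names.
    Cells that are SQL NULL carry no 'VarCharValue' key and map to None.
    All other values arrive as strings.
    """
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def fetch_query_rows(athena, query_id, poll_seconds=2, max_polls=150,
                     sleep=time.sleep):
    """Wait for an Athena query to finish, then return its first page of rows."""
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return rows_to_dicts(
                athena.get_query_results(QueryExecutionId=query_id)
            )
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish in time")


# Usage with the objects created above (requires AWS credentials):
# for row in fetch_query_rows(athena, response['QueryExecutionId']):
#     print(row)
```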

Pricing (Estimates in Thai Baht)

Item	Price (USD)	Price (THB, at 1 USD = 35 THB)
Sequence storage	$0.0088/GB/month	~0.31 THB/GB/month
Variant storage	$0.04/GB/month	~1.40 THB/GB/month
Annotation storage	$0.025/GB/month	~0.88 THB/GB/month
Reference storage	Free (AWS public references)	Free
Workflow runs	$0.30/1,000 vCPU-seconds	~10.50 THB/1,000 vCPU-seconds
Active run storage	$0.04/GB/hour	~1.40 THB/GB/hour

Cost example: Whole Genome Sequencing, 1 sample (100 GB FASTQ)

Sequence storage: 100 GB x $0.0088 = $0.88/month (~30.80 THB)
Workflow run (GATK, ~4 hours): ~2,000 vCPU-minutes = 120,000 vCPU-seconds x $0.30/1,000 = ~$36 (~1,260 THB)
Output VCF storage: ~5 GB x $0.04 = $0.20/month (~7.00 THB)
Analysis cost per sample: roughly 1,300 THB
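
The arithmetic in this example can be reproduced with a short script, which also makes it easy to try other sample sizes. This is only a sketch: the rates and the 35 THB exchange rate are copied from the estimates in this section, not from a live price list, and the function name is hypothetical.

```python
USD_TO_THB = 35.0  # exchange rate assumed in this section

# Rates copied from the pricing estimates in this section
SEQUENCE_STORAGE_USD_PER_GB_MONTH = 0.0088
VARIANT_STORAGE_USD_PER_GB_MONTH = 0.04
WORKFLOW_USD_PER_1000_VCPU_SECONDS = 0.30


def estimate_sample_cost(fastq_gb, vcpu_minutes, vcf_gb):
    """Estimate first-month cost for one sample: storage plus one workflow run."""
    sequence_storage = fastq_gb * SEQUENCE_STORAGE_USD_PER_GB_MONTH
    vcpu_seconds = vcpu_minutes * 60  # the rate is quoted per vCPU-second
    workflow_run = vcpu_seconds / 1000 * WORKFLOW_USD_PER_1000_VCPU_SECONDS
    variant_storage = vcf_gb * VARIANT_STORAGE_USD_PER_GB_MONTH
    total_usd = sequence_storage + workflow_run + variant_storage
    return {
        "sequence_storage_usd": round(sequence_storage, 2),
        "workflow_run_usd": round(workflow_run, 2),
        "variant_storage_usd": round(variant_storage, 2),
        "total_usd": round(total_usd, 2),
        "total_thb": round(total_usd * USD_TO_THB, 2),
    }


if __name__ == "__main__":
    # 100 GB FASTQ, ~2,000 vCPU-minutes of compute, ~5 GB of output VCF
    for item, value in estimate_sample_cost(100, 2000, 5).items():
        print(f"{item}: {value}")
```

For the inputs shown, the total comes to about 37 USD, or roughly 1,300 THB, in line with the figures above.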

Who It Is For

Genomics researchers who want to scale DNA/RNA sequencing analysis without maintaining an HPC cluster
Biotech and pharma companies doing drug discovery from genomic data
Hospitals that offer precision medicine and need clinical-grade genomic analysis
Research institutes working on large-scale population genomics
Labs that need CLIA/CAP-validated bioinformatics pipelines
Organizations that need GxP-compliant genomics data management

Related AWS Services

Amazon S3 - staging area for raw data and analysis outputs
Amazon Athena - SQL analytics on population-scale variant stores
Amazon SageMaker - train ML models on genomic features
Amazon Bedrock - protein structure prediction and biological question answering
AWS Glue - ETL pipelines for genomic data transformation
Amazon EC2 (GPU) - additional compute for deep learning genomics models
Amazon QuickSight - dashboards for genomic analytics
AWS Lake Formation - secure cross-account data sharing

Example Use Cases

1. Cancer Research Institute Analyzes Whole Genome Sequencing

The National Cancer Institute uses AWS HealthOmics to analyze WGS data from more than 5,000 lung cancer patients in Thailand. It runs the GATK Ready2Run workflow for somatic variant calling and uses Athena queries to find driver mutations that are common in the Thai population. A workflow that used to take 2 weeks on an on-premises cluster finishes in 8 hours with HealthOmics. The analysis uncovered EGFR variants specific to the Thai population that had not been reported before.

2. Hospital Implements a Precision Medicine Clinic

Ramathibodi Hospital uses AWS HealthOmics for clinical genomics of hereditary cancer patients, running the GATK Germline pipeline on WES (Whole Exome Sequencing) data from patients and their families. The system integrates with a ClinVar annotation store to interpret the pathogenicity of variants. Physicians receive the genetic counseling report within 5 business days, compared with 6 weeks under the previous process.

3. Pharma Company Searches for Drug Response Biomarkers

A Thai pharmaceutical company, together with a university, uses AWS HealthOmics for a pharmacogenomics study of warfarin in the Thai population. The study sequences DNA from 10,000 volunteers and correlates variants in CYP2C9 and VKORC1 with optimal dosing. The results were used to develop a pharmacogenomics-guided dosing algorithm for the Thai population, reducing the risk of bleeding complications from anticoagulation therapy by 40%.
