Synthex User Guide

Welcome to the Synthex User Guide — your comprehensive resource for understanding and using Inwire's synthetic data generation and data management capabilities. Whether you're generating privacy-preserving training data, augmenting imbalanced datasets, or creating test fixtures, this guide will help you get the most out of Synthex.


Table of Contents

  1. Introduction to Synthex
  2. Core Concepts
  3. Getting Started
  4. Working with Datasets
  5. Data Profiles
  6. Data Recipes
  7. Synthetic Data Generation
  8. Generator Configurations
  9. Integration with Model Training
  10. Example Scenarios
  11. Best Practices
  12. Limitations & Future Directions
  13. API Reference

Introduction to Synthex

What is Synthex?

Synthex is Inwire's data management and synthetic data generation service. Think of it as the "data twin" of the Model Training service — while Model Training handles experiments and model development, Synthex handles everything related to data: ingestion, profiling, transformation, and generation.

Why Synthetic Data?

Synthetic data addresses several critical challenges in ML development:

Challenge              How Synthex Helps
─────────────────────  ─────────────────────────────────────────────────────────────────────────────────────
Privacy Compliance     Generate data that preserves statistical properties without exposing real individuals
Data Scarcity          Create additional training samples when real data is limited
Class Imbalance        Generate samples for underrepresented classes
Testing & Development  Create realistic test data without production data access
Edge Cases             Generate specific scenarios that rarely occur in real data
Data Sharing           Share synthetic versions of sensitive datasets

Synthex in the ML Lifecycle

┌─────────────────────────────────────────────────────────────────────────────┐
│                           Data Flow in Inwire                                │
└─────────────────────────────────────────────────────────────────────────────┘

  External Sources              Synthex                    Model Training
  ───────────────              ─────────                   ──────────────
  ┌─────────┐                ┌───────────┐               ┌───────────────┐
  │  CSV    │───┐            │           │               │               │
  └─────────┘   │            │  Dataset  │               │   Experiment  │
  ┌─────────┐   │  Import    │  Catalog  │    Select     │     Setup     │
  │ Parquet │───┼───────────>│           │──────────────>│               │
  └─────────┘   │            │           │               │   - Dataset   │
  ┌─────────┐   │            └─────┬─────┘               │   - Version   │
  │   S3    │───┘                  │                     │   - Recipe    │
  └─────────┘                      │                     └───────────────┘
                                   │
                            ┌──────┴──────┐
                            │             │
                            ▼             ▼
                    ┌───────────┐  ┌───────────┐
                    │  Profile  │  │  Generate │
                    │  & Clean  │  │ Synthetic │
                    └───────────┘  └───────────┘

Core Concepts

Before diving into workflows, let's understand the key concepts in Synthex:

Datasets

A Dataset is a collection of data that you've imported into Synthex. Datasets are the foundation of all data operations.

Dataset Properties:

Dataset States:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Pending  │───>│Profiling │───>│  Ready   │───>│ Archived │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
                                      │
                                      ▼
                               ┌──────────┐
                               │Processing│ (when recipe applied)
                               └──────────┘
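
The state diagram above can be read as a small transition table. The following is a minimal sketch, not Synthex code: the names `DatasetState`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical, and the return edge from Processing back to Ready is an assumption, since the diagram only shows the Ready to Processing direction.

```python
from enum import Enum


class DatasetState(Enum):
    PENDING = "Pending"
    PROFILING = "Profiling"
    READY = "Ready"
    PROCESSING = "Processing"
    ARCHIVED = "Archived"


# Transitions read off the diagram. Processing -> Ready is an assumption:
# the diagram does not show where a dataset goes once a recipe finishes.
ALLOWED_TRANSITIONS = {
    DatasetState.PENDING: {DatasetState.PROFILING},
    DatasetState.PROFILING: {DatasetState.READY},
    DatasetState.READY: {DatasetState.ARCHIVED, DatasetState.PROCESSING},
    DatasetState.PROCESSING: {DatasetState.READY},
    DatasetState.ARCHIVED: set(),
}


def can_transition(current: DatasetState, target: DatasetState) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


print(can_transition(DatasetState.READY, DatasetState.PROCESSING))
print(can_transition(DatasetState.PENDING, DatasetState.READY))
```

Under these assumptions, a dataset cannot skip profiling: Pending only leads to Profiling, and only Ready datasets can have a recipe applied or be archived.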

Versions

Every dataset maintains a version history. Each transformation or generation creates a new version while preserving the original.

Dataset: customer_transactions
├── v1.0 (original import)
├── v1.1 (cleaned nulls)
├── v1.2 (normalized amounts)
└── v2.0 (synthetic augmentation)

Version Types:

Type         Description
───────────  ─────────────────────────────
Original     Raw imported data
Cleaned      After data cleaning recipes
Transformed  After transformation recipes
Synthetic    Generated synthetic data
Augmented    Original + synthetic combined

Data Profiles

A Data Profile defines the schema and statistical properties of a dataset. Profiles are used for:

Profile Components:

profile:
  name: customer_transactions
  columns:
    - name: customer_id
      type: string
      constraints:
        - unique
        - not_null

    - name: transaction_amount
      type: float
      statistics:
        min: 0.01
        max: 99999.99
        mean: 150.75
        distribution: log_normal

    - name: is_fraud
      type: boolean
      distribution:
        true: 0.01
        false: 0.99
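
To make the role of constraints and statistics concrete, here is a minimal sketch of how rows could be checked against a profile like the one above. It is an illustration only: the `PROFILE` dictionary mirrors part of the YAML by hand, and `validate_rows` is a hypothetical helper, not a Synthex API.

```python
# Hand-written mirror of part of the profile above (hypothetical structure).
PROFILE = {
    "name": "customer_transactions",
    "columns": [
        {"name": "customer_id", "type": "string",
         "constraints": ["unique", "not_null"]},
        {"name": "transaction_amount", "type": "float",
         "statistics": {"min": 0.01, "max": 99999.99}},
        {"name": "is_fraud", "type": "boolean"},
    ],
}


def validate_rows(rows, profile):
    """Return a list of (row_index, column, problem) tuples."""
    problems = []
    seen = {col["name"]: set() for col in profile["columns"]}
    for i, row in enumerate(rows):
        for col in profile["columns"]:
            name = col["name"]
            value = row.get(name)
            constraints = col.get("constraints", [])
            if value is None:
                if "not_null" in constraints:
                    problems.append((i, name, "null"))
                continue
            if "unique" in constraints:
                if value in seen[name]:
                    problems.append((i, name, "duplicate"))
                seen[name].add(value)
            # Treat the profiled min/max as hard bounds (an assumption).
            stats = col.get("statistics", {})
            if "min" in stats and value < stats["min"]:
                problems.append((i, name, "below min"))
            if "max" in stats and value > stats["max"]:
                problems.append((i, name, "above max"))
    return problems
```

A check of this kind is one way a profile could gate generated rows before they are written to a new version; whether Synthex treats profiled statistics as hard bounds is not stated in this guide.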

Data Recipes

A Data Recipe is a reusable sequence of transformations that can be applied to datasets. Recipes ensure reproducibility and consistency.

Recipe Structure:

recipe:
  name: fraud_data_prep
  steps:
    - type: clean
      action: drop_nulls
      columns: [customer_id, transaction_amount]

    - type: filter
      condition: "transaction_amount > 0"

    - type: transform
      action: normalize
      column: transaction_amount
      method: min_max

    - type: encode
      column: category
      method: one_hot
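
The four step types in this recipe can be sketched in plain Python to show what each one does to the rows. This is a rough illustration under stated assumptions: rows are plain dictionaries, all function names are hypothetical, and the filter condition is written as a Python predicate instead of being parsed from the recipe string. The sketch filters on the raw amount before normalizing, because min-max scaling maps the smallest amount to 0 and a later `> 0` filter would drop that row.

```python
def drop_nulls(rows, columns):
    """Clean: drop rows with a missing value in any listed column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def keep_where(rows, predicate):
    """Filter: keep only rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def min_max_normalize(rows, column):
    """Transform: rescale a numeric column to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, column: (r[column] - lo) / span if span else 0.0}
            for r in rows]


def one_hot(rows, column):
    """Encode: replace a categorical column with one 0/1 column per value."""
    categories = sorted({r[column] for r in rows})
    encoded_rows = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            encoded[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded_rows.append(encoded)
    return encoded_rows


def apply_fraud_data_prep(rows):
    """Run the four steps; the filter sees raw, un-normalized amounts."""
    rows = drop_nulls(rows, ["customer_id", "transaction_amount"])
    rows = keep_where(rows, lambda r: r["transaction_amount"] > 0)
    if not rows:
        return []
    rows = min_max_normalize(rows, "transaction_amount")
    return one_hot(rows, "category")
```

Because each step takes rows and returns new rows, the same sequence gives the same result on the same input, which is the reproducibility property recipes are meant to provide.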

Generator Configurations

A Generator Configuration defines how synthetic data is created. It specifies:

Modalities

Synthex supports multiple data modalities:

Modality     Description                  Generation Methods
───────────  ───────────────────────────  ─────────────────────
Tabular      Structured rows and columns  Statistical, GAN, VAE
Text         Natural language data        LLM, GPT, T5
Time Series  Sequential temporal data     TimeGAN, Statistical
Image        Visual data                  Diffusion, GAN
Graph        Network/relationship data    GraphVAE

Getting Started

Accessing Synthex

  1. Log in to Inwire
  2. Click Synthex in the sidebar
  3. You'll see the Synthex dashboard:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Synthex                                                    [+ New Dataset] │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Datasets   │  │  Profiles   │  │   Recipes   │  │    Jobs     │        │
│  │     47      │  │     23      │  │     15      │  │  3 Running  │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                                              │
│  Recent Datasets                                              [View All]    │
│  ─────────────────────────────────────────────────────────────────────     │
│  │ Name                    │ Type      │ Rows    │ Status   │ Updated     │ │
│  ├─────────────────────────┼───────────┼─────────┼──────────┼─────────────┤ │
│  │ customer_transactions   │ Tabular   │ 100,000 │ Ready    │ 2 hours ago │ │
│  │ product_reviews         │ Text      │ 50,000  │ Ready    │ 1 day ago   │ │
│  │ sensor_readings         │ TimeSeries│ 1M      │ Profiling│ 5 min ago   │ │
│  └─────────────────────────┴───────────┴─────────┴──────────┴─────────────┘ │
│                                                                              │
│  Active Jobs                                                  [View All]    │
│  ─────────────────────────────────────────────────────────────────────     │
│  │ fraud_synthetic_gen    │ Generation │ ████████░░ 80%  │ ETA: 5 min   │ │
│  │ customer_profile       │ Profiling  │ ██████████ Done │ Complete     │ │
│  └────────────────────────┴────────────┴─────────────────┴──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

Section     Purpose
──────────  ───────────────────────────────────────
Datasets    Browse, import, and manage datasets
Profiles    Create and edit data profiles
Recipes     Build and manage transformation recipes
Generators  Configure synthetic data generators
Jobs        Monitor running and completed jobs
Settings    Configure Synthex preferences

Working with Datasets

Importing a Dataset

Step 1: Start Import

  1. Go to Synthex → Datasets
  2. Click Import Dataset (or + New Dataset)

Step 2: Choose Source

Select your data source:

Source         Description                 Best For
─────────────  ──────────────────────────  ──────────────────────────────
File Upload    Upload from local machine   Small datasets, quick testing
Cloud Storage  Import from S3, GCS, Azure  Large datasets, production data
Database       Direct database query       Live data, scheduled imports
URL            Fetch from HTTP endpoint    Public datasets, APIs

File Upload Example:

┌─────────────────────────────────────────────────────────────────┐
│  Import Dataset                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Source: [●] File Upload  [ ] Cloud  [ ] Database  [ ] URL     │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │           Drag and drop files here                      │   │
│  │                 or click to browse                      │   │
│  │                                                          │   │
│  │           Supported: CSV, Parquet, JSON, JSONL          │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│  Selected: transactions_2024.csv (45.2 MB)                      │
│                                                                  │
│                                              [Cancel] [Next →]  │
└─────────────────────────────────────────────────────────────────┘

Cloud Storage Example (S3):

┌─────────────────────────────────────────────────────────────────┐
│  Import from Cloud Storage                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Integration: [Production S3            ▼]                      │
│                                                                  │
│  Path: s3://acme-ml-data/datasets/                              │
│                                                                  │
│  ├── transactions/                                               │
│  │   ├── 2023/                                                   │
│  │   └── 2024/                                                   │
│  │       ├── q1_transactions.parquet    [✓]                     │
│  │       ├── q2_transactions.parquet    [✓]                     │
│  │       └── q3_transactions.parquet    [✓]                     │
│  └── customers/                                                  │
│                                                                  │
│  Selected: 3 files (2.1 GB total)                               │
│                                                                  │
│                                              [Cancel] [Next →]  │
└─────────────────────────────────────────────────────────────────┘

Step 3: Configure Import

Set import options:

Option        Description                    Default
────────────  ─────────────────────────────  ─────────────
Name          Dataset identifier             Filename
Description   What this data represents
Type          Data modality                  Auto-detected
Tags          Organization labels
Auto-profile  Run profiling after import     Enabled
Sampling      Import subset for large files  Disabled

Configuration Form:

┌─────────────────────────────────────────────────────────────────┐
│  Configure Dataset                                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Name:        [customer_transactions_2024        ]              │
│                                                                  │
│  Description: [Transaction data for Q1-Q3 2024   ]              │
│               [including fraud labels            ]              │
│                                                                  │
│  Type:        [Tabular                          ▼]              │
│                                                                  │
│  Tags:        [transactions] [fraud] [2024] [+]                 │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Options                                                  │   │
│  ├─────────────────────────────────────────────────────────┤   │
│  │ [✓] Auto-profile after import                           │   │
│  │ [ ] Sample data (for large files)                       │   │
│  │     Sample size: [10000] rows                           │   │
│  │ [✓] Infer data types                                    │   │
│  │ [ ] First row is header                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│                                            [← Back] [Import]    │
└─────────────────────────────────────────────────────────────────┘

Step 4: Review and Import

Click Import to start the process. You'll see:

  1. Upload Progress — File transfer status
  2. Schema Detection — Automatic column type inference
  3. Profiling — Statistical analysis (if enabled)
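
Schema detection is described above only as automatic column type inference. As a rough illustration of the idea, and not a description of how Synthex does it, the sketch below guesses a column type from string values such as those read from a CSV file. The function names are hypothetical, and the `integer` type is an assumption (the schema examples in this guide show only string, datetime, float, and boolean).

```python
from datetime import datetime


def _parses(parser, text):
    """Return True if parser(text) succeeds without a ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Guess a column type from raw strings, trying the strictest type first."""
    non_empty = [v.strip() for v in values if v is not None and v.strip()]
    if not non_empty:
        return "string"
    if all(v.lower() in ("true", "false") for v in non_empty):
        return "boolean"
    if all(_parses(int, v) for v in non_empty):
        return "integer"
    if all(_parses(float, v) for v in non_empty):
        return "float"
    if all(_parses(datetime.fromisoformat, v) for v in non_empty):
        return "datetime"
    return "string"


print(infer_type(["127.50", "45.99"]))
print(infer_type(["2024-01-15 14:32"]))
print(infer_type(["TXN-2024-00001"]))
```

Empty strings are ignored when guessing, so a nullable column can still be typed from the values it does contain. Inferred types are worth reviewing on the Schema tab after import, since any heuristic of this kind can misread identifiers that happen to look numeric.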

Viewing Dataset Details

Click on any dataset to see its details:

┌─────────────────────────────────────────────────────────────────────────────┐
│  customer_transactions_2024                                    [⚙ Actions ▼]│
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Overview │ Schema │ Profile │ Versions │ Lineage │ Access                  │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  Status: Ready                    Created: Jan 15, 2024                     │
│  Type: Tabular                    Updated: Jan 15, 2024                     │
│  Rows: 100,000                    Size: 45.2 MB                             │
│  Columns: 12                      Version: v1.0                             │
│                                                                              │
│  Description:                                                                │
│  Transaction data for Q1-Q3 2024 including fraud labels for                 │
│  training fraud detection models.                                            │
│                                                                              │
│  Tags: [transactions] [fraud] [2024] [production]                           │
│                                                                              │
│  Quick Actions:                                                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐           │
│  │  Generate  │  │   Apply    │  │   Export   │  │   Clone    │           │
│  │  Synthetic │  │   Recipe   │  │            │  │            │           │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘           │
└─────────────────────────────────────────────────────────────────────────────┘

Dataset Schema Tab

View and edit column definitions:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Schema                                                        [Edit Schema]│
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  │ Column              │ Type     │ Nullable │ Unique │ Sample Values      ││
│  ├─────────────────────┼──────────┼──────────┼────────┼────────────────────┤│
│  │ transaction_id      │ string   │ No       │ Yes    │ TXN-2024-00001    ││
│  │ customer_id         │ string   │ No       │ No     │ CUST-10042        ││
│  │ timestamp           │ datetime │ No       │ No     │ 2024-01-15 14:32  ││
│  │ amount              │ float    │ No       │ No     │ 127.50, 45.99     ││
│  │ currency            │ string   │ No       │ No     │ USD, EUR, GBP     ││
│  │ merchant_id         │ string   │ No       │ No     │ MERCH-5521        ││
│  │ merchant_category   │ string   │ No       │ No     │ retail, food      ││
│  │ card_type           │ string   │ Yes      │ No     │ credit, debit     ││
│  │ is_international    │ boolean  │ No       │ No     │ true, false       ││
│  │ is_fraud            │ boolean  │ No       │ No     │ true, false       ││
│  │ fraud_type          │ string   │ Yes      │ No     │ card_theft, null  ││
│  │ risk_score          │ float    │ Yes      │ No     │ 0.15, 0.89        ││
│  └─────────────────────┴──────────┴──────────┴────────┴────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘

Dataset Actions

Action              Description
──────────────────  ─────────────────────────
Generate Synthetic  Create synthetic version
Apply Recipe        Transform with a recipe
Export              Download or save to cloud
Clone               Create a copy
Archive             Move to archive
Delete              Remove permanently

Data Profiles

Understanding Data Profiles

A Data Profile captures the statistical "fingerprint" of your data. Profiles are automatically created during import but can also be manually defined.

Viewing Profile Results

After profiling completes:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Profile: customer_transactions_2024                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Summary │ Columns │ Correlations │ Anomalies │ Quality Score               │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  Overall Quality Score: 87/100 ████████▓░                                   │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Metric                      │ Value          │ Status              │   │
│  ├─────────────────────────────┼────────────────┼─────────────────────┤   │
│  │ Total Rows                  │ 100,000        │ ✓                   │   │
│  │ Total Columns               │ 12             │ ✓                   │   │
│  │ Missing Values              │ 2.3%           │ ⚠ Moderate          │   │
│  │ Duplicate Rows              │ 0.1%           │ ✓ Low               │   │
│  │ Outliers Detected           │ 1.2%           │ ✓ Low               │   │
│  │ Type Consistency            │ 99.8%          │ ✓ High              │   │
│  └─────────────────────────────┴────────────────┴─────────────────────┘   │
│                                                                              │
│  Class Distribution (is_fraud):                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ False (99%)  ████████████████████████████████████████████████      │   │
│  │ True (1%)    █                                                       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│  ⚠ Warning: Highly imbalanced classes detected                              │
└─────────────────────────────────────────────────────────────────────────────┘
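
The imbalance warning raises a practical question: how many synthetic minority rows are needed to reach a target class rate? If the dataset has N rows, of which m are in the minority class, then adding x minority rows gives a rate of (m + x) / (N + x). Solving for a target rate r gives x = (r·N − m) / (1 − r). The sketch below works this out with exact fractions; the function name is hypothetical and the 10% target is only an example.

```python
import math
from fractions import Fraction


def synthetic_minority_rows_needed(total_rows, minority_rows, target_rate):
    """Rows of the minority class to add so it makes up target_rate of the data."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be between 0 and 1")
    # Exact arithmetic avoids float rounding pushing the result up by one row.
    rate = Fraction(target_rate).limit_denominator(1_000_000)
    if Fraction(minority_rows, total_rows) >= rate:
        return 0
    needed = (rate * total_rows - minority_rows) / (1 - rate)
    return math.ceil(needed)


# 100,000 rows with 1% fraud (1,000 rows), aiming for a 10% fraud rate.
extra = synthetic_minority_rows_needed(100_000, 1_000, 0.10)
print(extra)  # 10000
```

For the profile shown above, reaching a 10% fraud rate means adding 10,000 synthetic fraud rows, giving 11,000 fraud rows out of 110,000.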

Column-Level Statistics

Each column has detailed statistics:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Column: amount                                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Type: float                                                                 │
│  Missing: 0 (0.0%)                                                          │
│  Unique: 45,231 (45.2%)                                                     │
│                                                                              │
│  ┌───────────────────────┐    Statistics:                                   │
│  │      Distribution     │    ─────────────────                              │
│  │                       │    Min:     0.01                                 │
│  │    ▂▅▇█▇▅▃▂▁         │    Max:     15,420.50                            │
│  │                       │    Mean:    127.43                               │
│  │  0    500   1000+     │    Median:  78.25                                │
│  └───────────────────────┘    Std Dev:  245.67                              │
│                               Skewness: 2.34 (right-skewed)                 │
│                                                                              │
│  Distribution: Log-normal (best fit)                                         │
│                                                                              │
│  Outliers: 1,234 values > 3 std deviations                                  │
└─────────────────────────────────────────────────────────────────────────────┘
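
The statistics in this panel can be reproduced for any numeric column with a few lines of standard-library Python. The sketch below is an illustration with hypothetical names; it assumes population (not sample) standard deviation and moment-based skewness, which may differ from the exact estimators Synthex uses.

```python
import statistics


def describe(values):
    """Summarize a numeric column: range, center, spread, skew, outliers."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std:
        # Third standardized moment: positive means a long right tail.
        skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
        outliers = [v for v in values if abs(v - mean) > 3 * std]
    else:
        skewness, outliers = 0.0, []
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "std_dev": std,
        "skewness": skewness,
        "outliers": outliers,
    }


summary = describe([10.0] * 20 + [1000.0])
print(summary["median"], summary["outliers"])
```

In the panel above, a mean (127.43) well above the median (78.25) together with positive skewness is the usual signature of a right-skewed column, which is consistent with the log-normal best fit.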

Creating Custom Profiles

For synthetic data generation, you may want to define profiles manually:

  1. Go to Synthex → Profiles
  2. Click Create Profile
  3. Define columns and constraints:

# Example profile definition
name: custom_transactions
description: Custom profile for transaction generation

columns:
  - name: customer_id
    type: string
    generator: uuid

  - name: amount
    type: float
    constraints:
      min: 0.01
      max: 10000
    distribution:
      type: log_normal
      mean: 100
      std: 200

  - name: is_fraud
    type: boolean
    distribution:
      true: 0.02  # 2% fraud rate
      false: 0.98

  - name: timestamp
    type: datetime
    constraints:
      min: "2024-01-01"
      max: "2024-12-31"
    distribution:
      type: uniform

correlations:
  - columns: [amount, is_fraud]
    type: positive
    strength: 0.3  # Higher amounts slightly more likely to be fraud
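
To show what the `amount` column definition implies, the sketch below draws values from a log-normal distribution with the configured mean and standard deviation and rejects draws outside the min/max constraints. It is a sketch under assumptions, not Synthex's generator: it assumes `mean` and `std` describe the amounts themselves (not their logarithms), the function names are hypothetical, and it ignores the `correlations` block. For a log-normal variable with mean m and standard deviation s, the underlying normal has σ² = ln(1 + s²/m²) and μ = ln(m) − σ²/2.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert the mean/std of the values into (mu, sigma) of the log-values."""
    sigma_sq = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2.0
    return mu, math.sqrt(sigma_sq)


def sample_amounts(n, mean=100.0, std=200.0, low=0.01, high=10000.0, seed=42):
    """Draw n amounts, discarding draws that violate the min/max constraints."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    mu, sigma = lognormal_params(mean, std)
    samples = []
    while len(samples) < n:
        value = rng.lognormvariate(mu, sigma)
        if low <= value <= high:
            samples.append(round(value, 2))
    return samples


def sample_is_fraud(n, fraud_rate=0.02, seed=42):
    """Draw n booleans that are True with probability fraud_rate."""
    rng = random.Random(seed)
    return [rng.random() < fraud_rate for _ in range(n)]


amounts = sample_amounts(1000)
print(min(amounts), max(amounts))
```

Rejecting out-of-range draws truncates the distribution, so the mean of the generated amounts will sit slightly away from the configured value; how much depends on how tight the constraints are relative to the spread.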

Data Recipes

What are Recipes?

Data Recipes are reusable transformation pipelines that you can apply to datasets. They ensure reproducibility and consistency.

Creating a Recipe

  1. Go to Synthex → Recipes
  2. Click Create Recipe
  3. Add transformation steps:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Create Recipe                                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Name:        [fraud_data_preparation        ]                              │
│  Description: [Prepare transaction data for fraud detection training]       │
│                                                                              │
│  Steps:                                                        [+ Add Step] │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  1. ┌─────────────────────────────────────────────────────────────────┐    │
│     │ Clean: Drop Null Values                                [✎] [×] │    │
│     │ Columns: customer_id, amount, timestamp                        │    │
│     │ Action: Drop rows with null values in specified columns        │    │
│     └─────────────────────────────────────────────────────────────────┘    │
│                               ↓                                             │
│  2. ┌─────────────────────────────────────────────────────────────────┐    │
│     │ Filter: Remove Invalid Transactions                    [✎] [×] │    │
│     │ Condition: amount > 0 AND amount < 100000                      │    │
│     │ Action: Keep only rows matching condition                       │    │
│     └─────────────────────────────────────────────────────────────────┘    │
│                               ↓                                             │
│  3. ┌─────────────────────────────────────────────────────────────────┐    │
│     │ Transform: Normalize Amount                            [✎] [×] │    │
│     │ Column: amount                                                  │    │
│     │ Method: Log transformation                                      │    │
│     └─────────────────────────────────────────────────────────────────┘    │
│                               ↓                                             │
│  4. ┌─────────────────────────────────────────────────────────────────┐    │
│     │ Encode: One-Hot Encoding                               [✎] [×] │    │
│     │ Column: merchant_category                                       │    │
│     │ Output: merchant_category_retail, merchant_category_food, ...   │    │
│     └─────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│                                              [Cancel] [Save Recipe]         │
└─────────────────────────────────────────────────────────────────────────────┘

Recipe Step Types

Step Type  Description                  Use Case
─────────  ───────────────────────────  ───────────────────
Clean      Handle missing/invalid data  Data quality
Filter     Remove rows by condition     Data selection
Transform  Modify column values         Feature engineering
Encode     Convert categorical data     ML preparation
Aggregate  Group and summarize          Feature creation
Join       Combine with other datasets  Data enrichment
Sample     Random subset                Testing, balancing
Split      Divide into subsets          Train/test split

Applying a Recipe

To apply a recipe to a dataset:

  1. Go to the dataset detail page
  2. Click Apply Recipe
  3. Select the recipe
  4. Choose output options:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Apply Recipe                                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Dataset: customer_transactions_2024                                         │
│  Recipe:  [fraud_data_preparation                 ▼]                        │
│                                                                              │
│  Output Options:                                                             │
│  ──────────────                                                             │
│  [●] Create new version (recommended)                                       │
│      Version name: [v1.1-cleaned                  ]                         │
│                                                                              │
│  [ ] Create new dataset                                                     │
│      Dataset name: [                              ]                         │
│                                                                              │
│  [ ] Replace current version (destructive)                                  │
│                                                                              │
│  Preview Changes:                                                            │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Step 1: Drop Null Values                                            │   │
│  │   → Rows affected: 2,312 (will be removed)                          │   │
│  │                                                                      │   │
│  │ Step 2: Filter Invalid Transactions                                 │   │
│  │   → Rows affected: 45 (will be removed)                             │   │
│  │                                                                      │   │
│  │ Step 3: Normalize Amount                                            │   │
│  │   → Column 'amount' will be log-transformed                         │   │
│  │                                                                      │   │
│  │ Step 4: One-Hot Encoding                                            │   │
│  │   → 8 new columns will be created                                   │   │
│  │                                                                      │   │
│  │ Final: 97,643 rows, 19 columns                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│                                              [Cancel] [Apply Recipe]        │
└─────────────────────────────────────────────────────────────────────────────┘

Synthetic Data Generation

Generation Methods

Synthex supports multiple synthetic data generation methods:

Method       Description                               Best For                         Speed
───────────  ────────────────────────────────────────  ───────────────────────────────  ─────────
Statistical  Preserves distributions and correlations  Tabular data, quick generation   Fast
GAN          Generative Adversarial Networks           Complex patterns, high fidelity  Slow
VAE          Variational Autoencoders                  Balanced quality/speed           Medium
CTGAN        Conditional Tabular GAN                   Mixed-type tabular data          Medium
CopulaGAN    Copula-based GAN                          Preserving correlations          Medium
LLM          Large Language Models                     Text data, complex semantics     Slow
TimeGAN      Temporal GAN                              Time series data                 Slow
Diffusion    Diffusion models                          High-quality images              Very Slow

Starting a Generation Job

  1. Go to Synthex → Datasets
  2. Select your source dataset
  3. Click Generate Synthetic
  4. Configure generation:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Generate Synthetic Data                                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Source: customer_transactions_2024 (v1.0)                                  │
│                                                                              │
│  Generation Method                                                           │
│  ──────────────────                                                         │
│  [●] Statistical (Copula)     - Fast, good for tabular data                │
│  [ ] CTGAN                    - Better for mixed types, slower              │
│  [ ] CopulaGAN               - Best correlation preservation                │
│  [ ] GaussianCopula          - Fastest, simple distributions                │
│                                                                              │
│  Configuration                                                               │
│  ──────────────                                                             │
│  Number of records:    [100000        ]                                     │
│                                                                              │
│  Privacy Settings:                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ [✓] Enable differential privacy                                     │   │
│  │     Epsilon (ε): [1.0        ] (lower = more private)              │   │
│  │                                                                      │   │
│  │ [✓] Anonymize identifiers                                           │   │
│  │     Columns: customer_id, merchant_id                               │   │
│  │                                                                      │   │
│  │ [ ] Add noise to numerical columns                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Advanced Options:                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Batch size:        [1000       ]                                    │   │
│  │ Random seed:       [42         ] (for reproducibility)             │   │
│  │ Constraint handling: [Reject invalid ▼]                            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Output                                                                      │
│  ──────                                                                     │
│  [●] Create new synthetic dataset                                           │
│      Name: [customer_transactions_2024_synthetic  ]                         │
│                                                                              │
│  [ ] Augment existing dataset (combine with original)                       │
│      Augmentation ratio: [50%         ]                                     │
│                                                                              │
│                                         [Cancel] [Start Generation]         │
└─────────────────────────────────────────────────────────────────────────────┘
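
The "lower = more private" note on the epsilon field can be illustrated with the textbook Laplace mechanism, where the noise scale equals sensitivity divided by epsilon. This is a minimal sketch for intuition only. The guide does not say which mechanism Synthex uses internally, and the function names here are hypothetical.

```python
import random

def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise (a count query has sensitivity 1)."""
    b = laplace_scale(1.0, epsilon)
    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return true_count + noise

rng = random.Random(42)
for eps in (1.0, 0.5, 0.1):
    # Halving epsilon doubles the noise scale.
    print(f"epsilon={eps}: scale={laplace_scale(1.0, eps)}, "
          f"noisy count={noisy_count(1000, eps, rng):.1f}")
```

Smaller epsilon values add more noise to anything learned from the source data, which is why they trade some statistical fidelity for stronger privacy.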

Monitoring Generation Jobs

Track job progress in the Jobs view:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Job: fraud_synthetic_generation                              [Cancel Job]  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Status: Running                                                             │
│  Progress: ████████████████████░░░░░░░░░░ 65%                               │
│                                                                              │
│  Started: Jan 15, 2024 14:32:15                                             │
│  Elapsed: 12 minutes                                                         │
│  ETA: ~6 minutes remaining                                                   │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Stage                        │ Status    │ Progress  │ Duration    │   │
│  ├──────────────────────────────┼───────────┼───────────┼─────────────┤   │
│  │ 1. Load source data          │ Complete  │ 100%      │ 0:45        │   │
│  │ 2. Fit generator model       │ Complete  │ 100%      │ 8:23        │   │
│  │ 3. Generate synthetic rows   │ Running   │ 65%       │ 3:12        │   │
│  │ 4. Validate output           │ Pending   │ —         │ —           │   │
│  │ 5. Save results              │ Pending   │ —         │ —           │   │
│  └──────────────────────────────┴───────────┴───────────┴─────────────┘   │
│                                                                              │
│  Logs:                                                                       │
│  ───────────────────────────────────────────────────────────────────────── │
│  [14:32:15] Starting generation job...                                       │
│  [14:33:00] Source data loaded: 100,000 rows                                │
│  [14:33:02] Fitting CTGAN model...                                          │
│  [14:41:25] Model training complete                                          │
│  [14:41:26] Generating synthetic samples: batch 1/100                       │
│  [14:44:38] Progress: 65,000/100,000 samples generated                      │
└─────────────────────────────────────────────────────────────────────────────┘
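
The ETA shown in the Jobs view is consistent with a simple linear extrapolation from elapsed time and overall progress. The sketch below reproduces that arithmetic. How Synthex actually computes its ETA is not documented here, so treat the formula and function name as assumptions.

```python
def estimate_remaining_minutes(elapsed_minutes: float, progress: float) -> float:
    """Linear extrapolation: remaining = elapsed * (1 - progress) / progress."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_minutes * (1 - progress) / progress

# Values from the job view above: 12 minutes elapsed at 65% progress.
eta = estimate_remaining_minutes(12, 0.65)
print(f"~{eta:.1f} minutes remaining")  # about 6.5, shown as "~6 minutes" in the UI
```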

Evaluating Synthetic Data Quality

After generation, Synthex automatically evaluates quality:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Quality Report: customer_transactions_2024_synthetic                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Overall Quality Score: 92/100 █████████▓                                   │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Metric                           │ Score  │ Status                  │   │
│  ├──────────────────────────────────┼────────┼─────────────────────────┤   │
│  │ Statistical Similarity           │ 94%    │ ✓ Excellent             │   │
│  │ Distribution Match (KL Div)      │ 0.02   │ ✓ Very Low              │   │
│  │ Correlation Preservation         │ 91%    │ ✓ Good                  │   │
│  │ Constraint Satisfaction          │ 100%   │ ✓ Perfect               │   │
│  │ Privacy Score (k-anonymity)      │ k=15   │ ✓ Strong                │   │
│  │ ML Efficacy                      │ 89%    │ ✓ Good                  │   │
│  └──────────────────────────────────┴────────┴─────────────────────────┘   │
│                                                                              │
│  Column-by-Column Comparison:                                                │
│  ──────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  amount:                                                                     │
│  ┌────────────────────────┐  ┌────────────────────────┐                    │
│  │ Original Distribution  │  │ Synthetic Distribution │                    │
│  │     ▂▅▇█▇▅▃▂▁         │  │     ▂▄▇█▇▅▃▂▁         │                    │
│  │  0    500   1000+      │  │  0    500   1000+      │                    │
│  └────────────────────────┘  └────────────────────────┘                    │
│  Similarity: 96% ████████████████████░                                      │
│                                                                              │
│  is_fraud:                                                                   │
│  Original:  True: 1.02%  False: 98.98%                                      │
│  Synthetic: True: 1.01%  False: 98.99%                                      │
│  Similarity: 99% ████████████████████                                       │
│                                                                              │
│                                               [Export Report] [Download PDF] │
└─────────────────────────────────────────────────────────────────────────────┘
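
To make the "Distribution Match (KL Div)" metric concrete, the sketch below computes the Kullback-Leibler divergence for the is_fraud column using the percentages from the report above. The guide does not state the direction, log base, or binning Synthex uses, so those choices are assumptions.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same categories."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # a zero-probability term contributes nothing
        if q_i == 0:
            raise ValueError("Q assigns zero probability where P does not")
        total += p_i * math.log(p_i / q_i)
    return total

original = [0.0102, 0.9898]   # True, False shares in the source data
synthetic = [0.0101, 0.9899]  # True, False shares in the synthetic data
print(f"KL divergence for is_fraud: {kl_divergence(original, synthetic):.2e}")
```

A value near zero means the synthetic column's distribution is close to the original. Identical distributions give exactly zero.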

Generator Configurations

Saving Generator Configs

For repeatable generation, save your configurations:

# Example saved configuration
name: fraud_augmentation_config
description: Generate fraud cases for training data augmentation

source:
  type: dataset
  name: customer_transactions_2024
  version: v1.0
  filter: "is_fraud = true"  # Learn only from fraud cases

method: ctgan
parameters:
  epochs: 300
  batch_size: 500
  discriminator_steps: 1
  log_frequency: true

privacy:
  differential_privacy:
    enabled: true
    epsilon: 1.0
  anonymize_columns:
    - customer_id
    - merchant_id

output:
  records: 50000
  format: parquet
  destination: s3://ml-data/synthetic/

Managing Configurations

View and manage saved configs:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Generator Configurations                                    [+ New Config] │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  │ Name                      │ Method  │ Source Dataset     │ Last Used    ││
│  ├───────────────────────────┼─────────┼────────────────────┼──────────────┤│
│  │ fraud_augmentation        │ CTGAN   │ customer_trans...  │ 2 days ago   ││
│  │ review_generation         │ GPT-4   │ product_reviews    │ 1 week ago   ││
│  │ sensor_timeseries         │ TimeGAN │ sensor_readings    │ 3 days ago   ││
│  │ privacy_safe_customers    │ Copula  │ customer_data      │ Today        ││
│  └───────────────────────────┴─────────┴────────────────────┴──────────────┘│
│                                                                              │
│  Selected: fraud_augmentation                                                │
│  [Run Now] [Edit] [Clone] [Delete] [View History]                           │
└─────────────────────────────────────────────────────────────────────────────┘

Integration with Model Training

The Synthex-Training Connection

Datasets, recipe-produced versions, and synthetic datasets created in Synthex are all selectable as inputs in Model Training:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Data-to-Model Workflow                                │
└─────────────────────────────────────────────────────────────────────────────┘

                    Synthex                              Model Training
    ┌────────────────────────────────────┐    ┌────────────────────────────────┐
    │                                    │    │                                │
    │  ┌──────────┐                      │    │      ┌──────────────────┐     │
    │  │ Datasets │──────────────────────│───>│──────│ Dataset Selector │     │
    │  └──────────┘                      │    │      └────────┬─────────┘     │
    │       │                            │    │               │               │
    │       ▼                            │    │               ▼               │
    │  ┌──────────┐                      │    │      ┌──────────────────┐     │
    │  │ Profiles │                      │    │      │   Experiment     │     │
    │  └──────────┘                      │    │      │   Configuration  │     │
    │       │                            │    │      └────────┬─────────┘     │
    │       ▼                            │    │               │               │
    │  ┌──────────┐      ┌──────────┐   │    │               ▼               │
    │  │ Recipes  │─────>│ Versions │───│───>│      ┌──────────────────┐     │
    │  └──────────┘      └──────────┘   │    │      │  Training Run    │     │
    │                                    │    │      └──────────────────┘     │
    │  ┌──────────┐                      │    │                                │
    │  │Synthetic │──────────────────────│───>│                                │
    │  └──────────┘                      │    │                                │
    │                                    │    │                                │
    └────────────────────────────────────┘    └────────────────────────────────┘

Selecting Data in Training Wizard

When creating a training experiment:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Training Wizard - Step 2: Select Dataset                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Select data source for training:                                            │
│                                                                              │
│  Data Source: [From Synthex               ▼]                                │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Available Datasets                                   [Search...]    │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │                                                                      │   │
│  │ □ customer_transactions_2024                                         │   │
│  │   │── v1.0 (original)           100,000 rows                        │   │
│  │   │── v1.1 (cleaned)            97,643 rows                         │   │
│  │   └── v2.0 (augmented)          147,643 rows   ← Recommended        │   │
│  │                                                                      │   │
│  │ ☑ customer_transactions_2024_synthetic                              │   │
│  │   └── v1.0 (generated)          100,000 rows                        │   │
│  │                                                                      │   │
│  │ □ product_reviews                                                    │   │
│  │   └── v1.0 (original)           50,000 rows                         │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Apply Recipe: [fraud_data_preparation    ▼] (optional)                     │
│                                                                              │
│  Data Split:                                                                 │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Training:   [70%  ] ████████████████████████████░░░░░░░░░░░░░░░░   │   │
│  │ Validation: [15%  ] ░░░░░░░░░░░░░░░░░░░░░░░░░░░░█████░░░░░░░░░░░   │   │
│  │ Test:       [15%  ] ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█████░░░░░░   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Stratify by: [is_fraud               ▼] (maintain class distribution)     │
│                                                                              │
│                                               [← Back] [Next: Configure →]  │
└─────────────────────────────────────────────────────────────────────────────┘
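
The Data Split and "Stratify by" settings can be pictured as splitting each class separately so that every subset keeps the same class mix. This is a minimal standard-library sketch of a 70/15/15 stratified split, not the wizard's actual implementation. The function name and row format are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test while preserving the label distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy data with a 1% positive class, similar to the fraud example.
rows = [{"id": i, "is_fraud": i % 100 == 0} for i in range(10_000)]
train, val, test = stratified_split(rows, "is_fraud")
for name, part in (("train", train), ("val", val), ("test", test)):
    fraud_share = sum(r["is_fraud"] for r in part) / len(part)
    print(f"{name}: {len(part)} rows, fraud share {fraud_share:.2%}")
```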

Lineage Tracking

Model Training records exactly which data was used:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Experiment: fraud-detector-v3                                               │
│  Run: run-2024-01-15-001                                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Data Lineage:                                                               │
│  ─────────────                                                              │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  ┌───────────────────┐     ┌───────────────────┐                    │   │
│  │  │ customer_trans... │────>│ fraud_data_prep   │                    │   │
│  │  │ v1.0 (original)   │     │ (recipe applied)  │                    │   │
│  │  └───────────────────┘     └─────────┬─────────┘                    │   │
│  │                                      │                               │   │
│  │  ┌───────────────────┐               │                               │   │
│  │  │ synthetic_fraud   │               │                               │   │
│  │  │ v1.0 (generated)  │───────────────┤                               │   │
│  │  └───────────────────┘               │                               │   │
│  │                                      ▼                               │   │
│  │                            ┌───────────────────┐                     │   │
│  │                            │ Training Dataset  │                     │   │
│  │                            │ 147,643 rows      │                     │   │
│  │                            │ 70% train/15% val │                     │   │
│  │                            └─────────┬─────────┘                     │   │
│  │                                      │                               │   │
│  │                                      ▼                               │   │
│  │                            ┌───────────────────┐                     │   │
│  │                            │ fraud-detector-v3 │                     │   │
│  │                            │ (trained model)   │                     │   │
│  │                            └───────────────────┘                     │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  [View Full Lineage] [Export Lineage Report]                                │
└─────────────────────────────────────────────────────────────────────────────┘
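
A lineage view like the one above is a directed graph from each artifact to its inputs. The sketch below encodes the boxes shown in the diagram and walks back to the root data sources. The dictionary layout and function name are illustrative assumptions, not the format of the exported lineage report.

```python
# Each artifact maps to its direct inputs, as read from the lineage view above.
# Names are copied as displayed, including the truncated dataset name.
LINEAGE = {
    "fraud-detector-v3": ["Training Dataset"],
    "Training Dataset": ["fraud_data_prep", "synthetic_fraud v1.0"],
    "fraud_data_prep": ["customer_trans... v1.0"],
    "synthetic_fraud v1.0": [],
    "customer_trans... v1.0": [],
}

def root_sources(node, graph=LINEAGE):
    """Return the set of artifacts with no recorded inputs that feed into node."""
    inputs = graph.get(node, [])
    if not inputs:
        return {node}
    roots = set()
    for parent in inputs:
        roots |= root_sources(parent, graph)
    return roots

print(sorted(root_sources("fraud-detector-v3")))
```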

Example Scenarios

Scenario 1: Augmenting Imbalanced Fraud Data

Problem: You have transaction data with only 1% fraud cases, leading to poor model performance on the minority class.

Solution: Generate synthetic fraud cases to balance the dataset.

Step 1: Analyze the Imbalance

  1. Go to Synthex → Datasets
  2. Open customer_transactions
  3. View the Profile tab

Class Distribution (is_fraud):
├── False: 99,000 (99%)
└── True:   1,000 (1%)

⚠ Warning: Severe class imbalance detected
   Recommendation: Consider synthetic augmentation

Step 2: Create Fraud-Only Generator

  1. Click Generate Synthetic
  2. Configure:
     - Filter source: is_fraud = true (learn only from fraud patterns)
     - Method: CTGAN
     - Records: 49,000 (raises the fraud share from 1% to about one-third after augmentation)

# Generation config
source_filter: "is_fraud = true"
method: ctgan
parameters:
  epochs: 500
  batch_size: 100

output:
  records: 49000
  mode: augment  # Combine with original

Step 3: Run Generation and Validate

After generation completes:

Augmented Dataset Summary:
├── Total rows: 149,000
├── Original (real): 100,000
├── Synthetic fraud: 49,000
└── Class distribution:
    ├── False: 99,000 (66.4%)
    └── True:  50,000 (33.6%)

Quality Score: 91/100
Fraud pattern preservation: 94%
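
The class shares in the summary above follow from simple arithmetic, and the same formula can be inverted to size a generation job for a target share. The helper names below are hypothetical.

```python
import math

def minority_share(minority: int, total: int, synthetic_added: int) -> float:
    """Minority-class share after adding synthetic minority rows."""
    return (minority + synthetic_added) / (total + synthetic_added)

def synthetic_needed(minority: int, total: int, target_share: float) -> int:
    """Synthetic minority rows needed for the minority class to reach target_share.

    Solves (minority + s) / (total + s) = target_share for s.
    """
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    needed = (target_share * total - minority) / (1 - target_share)
    return max(0, math.ceil(needed))

# 1,000 fraud rows out of 100,000, plus 49,000 synthetic fraud rows.
print(f"{minority_share(1_000, 100_000, 49_000):.1%}")  # 33.6%

# Rows required for an even 50/50 split instead.
print(synthetic_needed(1_000, 100_000, 0.5))  # 98000
```

A fully balanced dataset would therefore need 98,000 synthetic fraud rows, twice the number generated in this scenario.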

Step 4: Use in Training

  1. Go to Model Training → New Experiment
  2. Select the augmented dataset
  3. Enable stratified sampling
  4. Train your model

Result: Model recall on fraud cases improves from 45% to 82%.


Scenario 2: Creating Privacy-Safe Test Data

Problem: Your QA team needs realistic test data but cannot access production data due to privacy regulations.

Solution: Generate synthetic data that preserves statistical properties without containing real customer information.

Step 1: Import and Profile Production Data

Work with your data team to import a sample of production data into Synthex. Enable automatic profiling.

Step 2: Configure Privacy-Safe Generation

  1. Select the dataset
  2. Click Generate Synthetic
  3. Enable privacy features:

# Privacy-focused configuration
method: gaussian_copula
parameters:
  default_distribution: parametric

privacy:
  differential_privacy:
    enabled: true
    epsilon: 0.5  # Strong privacy guarantee

  anonymize_columns:
    - customer_id
    - email
    - phone
    - address

  pii_handling:
    names: synthetic  # Generate fake names
    dates: shift      # Randomly shift dates
    amounts: noise    # Add controlled noise

output:
  records: 10000
  format: csv

Step 3: Validate Privacy

Review the privacy report:

Privacy Assessment:
├── k-anonymity: k=50 (Strong)
├── l-diversity: l=10 (Strong)
├── Re-identification risk: <0.1% (Minimal)
└── PII detection: None found

All identifiers successfully anonymized.
No real customer data exposed.
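
For readers unfamiliar with the k-anonymity figure in the report, the sketch below shows the standard definition: k is the size of the smallest group of rows that share the same combination of quasi-identifier values. The column names and sample rows are hypothetical, and the guide does not specify which columns Synthex treats as quasi-identifiers.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over all quasi-identifier value combinations."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical sample: three rows share one combination, two share another.
sample = [
    {"zip_prefix": "941", "age_band": "30-39", "amount": 120.0},
    {"zip_prefix": "941", "age_band": "30-39", "amount": 75.5},
    {"zip_prefix": "941", "age_band": "30-39", "amount": 310.2},
    {"zip_prefix": "100", "age_band": "40-49", "amount": 42.0},
    {"zip_prefix": "100", "age_band": "40-49", "amount": 18.9},
]
print(k_anonymity(sample, ["zip_prefix", "age_band"]))  # 2
```

A report value of k=50 means every such combination appears in at least 50 rows, so no record stands out on those columns alone.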

Step 4: Export for QA

  1. Click Export
  2. Choose format (CSV, JSON)
  3. Download or send to cloud storage

Result: The QA team has realistic test data that contains no real customer records.


Scenario 3: Generating Training Data from Schema

Problem: You're building a new feature but have no historical data yet. You need realistic data to develop and test your model.

Solution: Define a data profile from scratch and generate synthetic data matching your expected schema.

Step 1: Create a Custom Profile

  1. Go to Synthex → Profiles
  2. Click Create Profile
  3. Define your expected schema:

name: subscription_churn_profile
description: Expected data for subscription churn prediction

columns:
  - name: user_id
    type: string
    generator: uuid

  - name: signup_date
    type: datetime
    constraints:
      min: "2022-01-01"
      max: "2024-01-01"
    distribution:
      type: uniform

  - name: subscription_tier
    type: category
    values: [free, basic, premium, enterprise]
    distribution:
      free: 0.40
      basic: 0.35
      premium: 0.20
      enterprise: 0.05

  - name: monthly_usage_hours
    type: float
    constraints:
      min: 0
      max: 200
    distribution:
      type: beta
      alpha: 2
      beta: 5
      scale: 200

  - name: support_tickets
    type: integer
    constraints:
      min: 0
      max: 50
    distribution:
      type: poisson
      lambda: 2

  - name: churned
    type: boolean
    distribution:
      true: 0.15
      false: 0.85

correlations:
  - columns: [monthly_usage_hours, churned]
    type: negative
    strength: 0.4  # Less usage → more likely to churn

  - columns: [support_tickets, churned]
    type: positive
    strength: 0.3  # More tickets → more likely to churn
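
To show what the column definitions in this profile mean in practice, the sketch below samples rows from the same marginal distributions using only the Python standard library. It is an illustration, not Synthex's generator: the function names are hypothetical, and the correlations block is deliberately not applied, so each column is drawn independently.

```python
import math
import random
import uuid
from datetime import date, timedelta

TIERS = {"free": 0.40, "basic": 0.35, "premium": 0.20, "enterprise": 0.05}
SIGNUP_START, SIGNUP_END = date(2022, 1, 1), date(2024, 1, 1)

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for small lambda."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_row(rng: random.Random) -> dict:
    span_days = (SIGNUP_END - SIGNUP_START).days
    return {
        "user_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "signup_date": SIGNUP_START + timedelta(days=rng.randint(0, span_days)),
        "subscription_tier": rng.choices(list(TIERS), weights=list(TIERS.values()))[0],
        "monthly_usage_hours": rng.betavariate(2, 5) * 200,   # beta(2, 5) scaled to 0-200
        "support_tickets": min(sample_poisson(2, rng), 50),   # enforce max: 50
        "churned": rng.random() < 0.15,
    }

rng = random.Random(42)
rows = [sample_row(rng) for _ in range(5000)]
churn_rate = sum(r["churned"] for r in rows) / len(rows)
print(f"churn rate: {churn_rate:.1%}")
```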

Step 2: Generate from Profile

  1. Go to Synthex → Generate Data
  2. Select From Profile
  3. Choose your custom profile
  4. Set record count (e.g., 50,000)

Step 3: Validate Generated Data

Review that distributions match expectations:

Generated Dataset Validation:
├── subscription_tier distribution: ✓ Matches profile
├── churned rate: 14.8% (expected 15%) ✓
├── usage-churn correlation: -0.38 (expected -0.4) ✓
└── All constraints satisfied: ✓
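
The checks in this validation summary compare a measured statistic with the value declared in the profile. The sketch below shows one way to run such checks yourself. The tolerance values and helper names are assumptions, since the guide does not state the thresholds Synthex applies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def within(actual: float, expected: float, tolerance: float) -> bool:
    """True when the measured value is within tolerance of the expected one."""
    return abs(actual - expected) <= tolerance

# Figures from the validation summary above, with assumed tolerances.
print(within(0.148, 0.15, tolerance=0.01))   # churned rate
print(within(-0.38, -0.40, tolerance=0.05))  # usage-churn correlation
```

With a boolean column such as churned encoded as 0 and 1, pearson gives the point-biserial correlation between usage and churn.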

Step 4: Iterate and Refine

As you develop, refine your profile based on insights:

# Updated profile based on domain feedback
columns:
  - name: monthly_usage_hours
    type: float
    # Refined: usage differs by tier
    conditional_distribution:
      on: subscription_tier
      distributions:
        free:
          type: exponential
          scale: 10
        premium:
          type: normal
          mean: 80
          std: 30
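
A conditional distribution means the sampler picks a different distribution depending on another column's value. The sketch below mirrors the two tiers defined above. Two details are assumptions not stated in the profile: tiers without an entry fall back to the original beta distribution, and values are clipped to the 0-200 range from the earlier constraints.

```python
import random

def sample_usage_hours(tier: str, rng: random.Random) -> float:
    """Draw monthly_usage_hours conditioned on subscription_tier."""
    if tier == "free":
        value = rng.expovariate(1 / 10)      # exponential with scale (mean) 10
    elif tier == "premium":
        value = rng.gauss(80, 30)            # normal with mean 80, std 30
    else:
        value = rng.betavariate(2, 5) * 200  # assumed fallback: original beta
    return min(max(value, 0.0), 200.0)       # assumed clipping to [0, 200]

rng = random.Random(42)
free_mean = sum(sample_usage_hours("free", rng) for _ in range(5000)) / 5000
premium_mean = sum(sample_usage_hours("premium", rng) for _ in range(5000)) / 5000
print(f"free: {free_mean:.1f} h, premium: {premium_mean:.1f} h")
```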

Result: Realistic development data aligned with business expectations.


Scenario 4: Text Data Generation for NLP

Problem: You need training data for a customer service classifier but have limited labeled examples.

Solution: Use LLM-based generation to create diverse training examples.

Step 1: Prepare Seed Examples

Upload a small dataset of labeled examples:

Category: billing_question
Examples:
- "Why was I charged twice this month?"
- "Can you explain this fee on my invoice?"
- "I need a refund for the overcharge"

Category: technical_support
Examples:
- "The app keeps crashing when I open it"
- "I can't log in to my account"
- "The feature isn't working as expected"

Category: general_inquiry
Examples:
- "What are your business hours?"
- "Do you ship internationally?"
- "How do I contact support?"

Step 2: Configure LLM Generation

  1. Select the seed dataset
  2. Choose LLM method
  3. Configure:

method: llm
parameters:
  model: gpt-4
  temperature: 0.8  # Some creativity

  prompt_template: |
    Generate a diverse customer service message for category: {category}

    Examples of this category:
    {examples}

    Generate a new, unique message that fits this category.
    Be diverse in tone, length, and specific issues.

  preserve_label: true
  variations_per_example: 10

constraints:
  min_length: 20
  max_length: 200
  language: english
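
The {category} and {examples} placeholders in prompt_template are filled once per generation request. The sketch below shows that substitution with Python string formatting, using the seed examples from Step 1. How Synthex renders the examples (here, as a dash-prefixed list) is an assumption, and no LLM call is made.

```python
PROMPT_TEMPLATE = """Generate a diverse customer service message for category: {category}

Examples of this category:
{examples}

Generate a new, unique message that fits this category.
Be diverse in tone, length, and specific issues.
"""

def render_prompt(category: str, examples: list[str]) -> str:
    """Fill the template placeholders for one category."""
    example_lines = "\n".join(f"- {text}" for text in examples)
    return PROMPT_TEMPLATE.format(category=category, examples=example_lines)

prompt = render_prompt(
    "billing_question",
    [
        "Why was I charged twice this month?",
        "Can you explain this fee on my invoice?",
        "I need a refund for the overcharge",
    ],
)
print(prompt)
```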

Step 3: Review and Filter

Generated examples are scored for quality:

Generated Examples (billing_question):

[Score: 0.95] "I noticed an unexpected $15 charge on my
              statement dated March 3rd - could you help
              me understand what this is for?"

[Score: 0.91] "hey there, pretty confused about my bill.
              shows i paid but also have a balance due??"

[Score: 0.87] "Looking at my invoice #INV-2024-001, the
              subtotal doesn't seem to match the itemized
              charges. Please advise."

[Score: 0.45] "Payment question" [Filtered - too short]
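
The filtering step can be approximated by applying the min_length and max_length constraints from the configuration together with a quality-score cutoff. The 0.5 cutoff and the function name below are assumptions. The guide only shows that a 0.45-scored, too-short message was removed.

```python
def keep_example(text: str, score: float,
                 min_length: int = 20, max_length: int = 200,
                 min_score: float = 0.5) -> bool:
    """Keep a generated message only if it meets the length and score limits."""
    return min_length <= len(text) <= max_length and score >= min_score

candidates = [
    ("hey there, pretty confused about my bill. "
     "shows i paid but also have a balance due??", 0.91),
    ("Payment question", 0.45),  # 16 characters, below min_length
]
kept = [text for text, score in candidates if keep_example(text, score)]
print(len(kept))  # 1
```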

Step 4: Export for Training

Export the quality-filtered dataset for NLP model training.

Result: Expanded training set from 100 to 1,000+ labeled examples.


Best Practices

Data Quality

Practice                    Description
Profile before generating   Always understand your source data's statistics
Validate outputs            Check that generated data matches expected distributions
Preserve correlations       Use methods that maintain relationships between columns
Test with real models       Validate that synthetic data improves actual model performance

Privacy & Security

Practice                      Description
Enable differential privacy   For sensitive data, always use DP guarantees
Anonymize identifiers         Never generate data that could identify real individuals
Audit before sharing          Review generated data before distributing
Document data lineage         Track which real data influenced synthetic outputs

Performance

Practice                         Description
Start with statistical methods   Fastest and often sufficient
Use GANs for complex patterns    When statistical methods aren't capturing nuances
Batch large generations          Split very large generation jobs
Cache generator models           Reuse trained generators for multiple outputs

Reproducibility

Practice                     Description
Set random seeds             Enable exact reproduction of results
Version your configs         Save and version generator configurations
Document generation params   Record all parameters used
Link to experiments          Track which generated data trained which models
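
The effect of setting a random seed is easy to demonstrate: the same seed reproduces the same pseudo-random draws, and a different seed does not. This is a generic illustration with Python's random module, not Synthex's generator.

```python
import random

def generate(seed: int, n: int = 5) -> list[float]:
    """Draw n pseudo-random values from a generator seeded with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

print(generate(42) == generate(42))  # True: same seed, same output
print(generate(42) == generate(43))  # False: different seed
```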

Limitations & Future Directions

Current Limitations

Limitation               Description                                                     Workaround
Complex dependencies     Very intricate column relationships may not be fully captured   Use domain-specific post-processing
Rare events              Extremely rare patterns (< 0.1%) are hard to learn              Oversample rare cases in source data
Sequential consistency   Time series generation may have temporal artifacts              Use TimeGAN or domain validation
Image quality            High-resolution images require significant compute              Start with lower resolution, upscale
Multi-table              Cross-table relationships require manual configuration          Use relational synthesis features

Planned Features

Method Selection Guide

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Choosing a Generation Method                              │
└─────────────────────────────────────────────────────────────────────────────┘

                             What type of data?
                                      │
            ┌─────────────────────────┼───────────────────────┐
            ▼                         ▼                       ▼
         Tabular                    Text                 Time Series
            │                         │                       │
      ┌─────┴─────┐             ┌─────┴─────┐                 │
      ▼           ▼             ▼           ▼                 ▼
   Simple?    Complex?       Short/       Long/           Regular?
      │           │          Labels?    Context?              │
      ▼           ▼             │           │                 ▼
 Statistical   CTGAN/           ▼           ▼              TimeGAN
  (Copula)    CopulaGAN     Template     LLM/GPT
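
One reading of the decision tree above can be written as a small lookup function. The function and parameter names are hypothetical, and the mapping reflects an interpretation of the diagram rather than a documented Synthex rule.

```python
def choose_method(data_type: str, *, complex_patterns: bool = False,
                  long_context: bool = False) -> str:
    """Suggest a generation method following the selection guide."""
    if data_type == "tabular":
        return "CTGAN/CopulaGAN" if complex_patterns else "Statistical (Copula)"
    if data_type == "text":
        return "LLM/GPT" if long_context else "Template"
    if data_type == "time_series":
        return "TimeGAN"
    raise ValueError(f"unsupported data type: {data_type!r}")

print(choose_method("tabular"))
print(choose_method("tabular", complex_patterns=True))
print(choose_method("text", long_context=True))
```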

API Reference

Datasets API

Base URL: /api/v1/datasets

GET    /                    List all datasets
POST   /                    Create/import dataset
GET    /{id}                Get dataset details
PUT    /{id}                Update dataset
DELETE /{id}                Delete dataset
GET    /{id}/versions       List versions
GET    /{id}/profile        Get profile results
POST   /{id}/profile        Trigger profiling
GET    /{id}/sample         Get data sample
POST   /{id}/export         Export dataset

Profiles API

Base URL: /api/v1/profiles

GET    /                    List all profiles
POST   /                    Create profile
GET    /{id}                Get profile details
PUT    /{id}                Update profile
DELETE /{id}                Delete profile
POST   /{id}/validate       Validate data against profile

Recipes API

Base URL: /api/v1/recipes

GET    /                    List all recipes
POST   /                    Create recipe
GET    /{id}                Get recipe details
PUT    /{id}                Update recipe
DELETE /{id}                Delete recipe
POST   /{id}/apply          Apply recipe to dataset
POST   /{id}/preview        Preview recipe results

Generators API

Base URL: /api/v1/generators

GET    /                    List generator configs
POST   /                    Create generator config
GET    /{id}                Get config details
PUT    /{id}                Update config
DELETE /{id}                Delete config
POST   /{id}/run            Start generation job
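
As an illustration of the two POST endpoints above, the sketch below builds (but does not send) requests to create a generator config and to start a job from it. The host name, the JSON body shape, the config ID, and the absence of authentication headers are all assumptions. Only the paths and HTTP methods come from this reference.

```python
import json
import urllib.request

BASE_URL = "https://inwire.example.com"  # hypothetical host
GENERATORS = f"{BASE_URL}/api/v1/generators"

def create_config_request(config: dict) -> urllib.request.Request:
    """Build POST /api/v1/generators with the config as a JSON body (assumed shape)."""
    return urllib.request.Request(
        GENERATORS,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_config_request(config_id: str) -> urllib.request.Request:
    """Build POST /api/v1/generators/{id}/run to start a generation job."""
    return urllib.request.Request(f"{GENERATORS}/{config_id}/run", data=b"", method="POST")

config = {
    "name": "fraud_augmentation_config",
    "method": "ctgan",
    "parameters": {"epochs": 300, "batch_size": 500},
    "output": {"records": 50000, "format": "parquet"},
}
create_req = create_config_request(config)
run_req = run_config_request("cfg-123")  # hypothetical config ID
print(create_req.get_method(), create_req.full_url)
print(run_req.get_method(), run_req.full_url)
```

Passing either request to urllib.request.urlopen would send it. Check the response format with your Inwire administrator before relying on specific fields.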

Jobs API

Base URL: /api/v1/jobs

GET    /                    List all jobs
GET    /{id}                Get job details
POST   /{id}/cancel         Cancel running job
GET    /{id}/logs           Get job logs
GET    /{id}/results        Get job results
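
A client typically polls GET /{id} until the job reaches a terminal state. The sketch below keeps the HTTP call behind a fetch_job callable so the loop can be tested without a server. The "status" field name and the "Failed" and "Cancelled" values are assumptions. "Running" and "Complete" appear in the Jobs view earlier in this guide.

```python
import time

TERMINAL_STATUSES = {"Complete", "Failed", "Cancelled"}  # assumed status values

def wait_for_job(job_id, fetch_job, poll_seconds=5.0, max_polls=1000):
    """Poll fetch_job(job_id) until the job reports a terminal status."""
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Stand-in for a real GET /api/v1/jobs/{id} call: Running twice, then Complete.
_responses = iter([
    {"status": "Running", "progress": 0.40},
    {"status": "Running", "progress": 0.65},
    {"status": "Complete", "progress": 1.0},
])

def fake_fetch(job_id):
    return next(_responses)

final = wait_for_job("job-001", fake_fetch, poll_seconds=0)
print(final["status"])  # Complete
```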

Conclusion

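Synthex is a powerful tool for managing your ML data lifecycle. Whether you're augmenting imbalanced datasets, creating privacy-safe test data, generating data from a schema, or expanding text datasets for NLP, Synthex provides the capabilities you need.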

For questions or feedback, consult your Inwire administrator or visit the User Guide.


See Also