Synthex User Guide
Welcome to the Synthex User Guide — your comprehensive resource for understanding and using Inwire's synthetic data generation and data management capabilities. Whether you're generating privacy-preserving training data, augmenting imbalanced datasets, or creating test fixtures, this guide will help you get the most out of Synthex.
Table of Contents
- Introduction to Synthex
- Core Concepts
- Getting Started
- Working with Datasets
- Data Profiles
- Data Recipes
- Synthetic Data Generation
- Generator Configurations
- Integration with Model Training
- Example Scenarios
- Best Practices
- Limitations & Future Directions
- API Reference
Introduction to Synthex
What is Synthex?
Synthex is Inwire's data management and synthetic data generation service. Think of it as the "data twin" of the Model Training service — while Model Training handles experiments and model development, Synthex handles everything related to data: ingestion, profiling, transformation, and generation.
Why Synthetic Data?
Synthetic data addresses several critical challenges in ML development:
| Challenge | How Synthex Helps |
|---|---|
| Privacy Compliance | Generate data that preserves statistical properties without exposing real individuals |
| Data Scarcity | Create additional training samples when real data is limited |
| Class Imbalance | Generate samples for underrepresented classes |
| Testing & Development | Create realistic test data without production data access |
| Edge Cases | Generate specific scenarios that rarely occur in real data |
| Data Sharing | Share synthetic versions of sensitive datasets |
Synthex in the ML Lifecycle
┌─────────────────────────────────────────────────────────────────────────────┐
│ Data Flow in Inwire │
└─────────────────────────────────────────────────────────────────────────────┘
External Sources Synthex Model Training
─────────────── ───────── ──────────────
┌─────────┐ ┌───────────┐ ┌───────────────┐
│ CSV │───┐ │ │ │ │
└─────────┘ │ │ Dataset │ │ Experiment │
┌─────────┐ │ Import │ Catalog │ Select │ Setup │
│ Parquet │───┼───────────>│ │──────────────>│ │
└─────────┘ │ │ │ │ - Dataset │
┌─────────┐ │ └─────┬─────┘ │ - Version │
│ S3 │───┘ │ │ - Recipe │
└─────────┘ │ └───────────────┘
│
┌──────┴──────┐
│ │
▼ ▼
┌───────────┐ ┌───────────┐
│ Profile │ │ Generate │
│ & Clean │ │ Synthetic │
└───────────┘ └───────────┘
Core Concepts
Before diving into workflows, let's understand the key concepts in Synthex:
Datasets
A Dataset is a collection of data that you've imported into Synthex. Datasets are the foundation of all data operations.
Dataset Properties:
- Name — Human-readable identifier
- Type — Tabular, Text, Time Series, Image, etc.
- Source — Where the data came from (upload, cloud, database)
- Schema — Column definitions and data types
- Statistics — Profiling results (distributions, nulls, outliers)
- Tags — For organization and filtering
Dataset States:
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Pending │───>│Profiling │───>│ Ready │───>│ Archived │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│
▼
┌──────────┐
│Processing│ (when recipe applied)
└──────────┘
Versions
Every dataset maintains a version history. By default, each transformation or generation creates a new version and preserves the original.
Dataset: customer_transactions
├── v1.0 (original import)
├── v1.1 (cleaned nulls)
├── v1.2 (normalized amounts)
└── v2.0 (synthetic augmentation)
Version Types:
| Type | Description |
|---|---|
| Original | Raw imported data |
| Cleaned | After data cleaning recipes |
| Transformed | After transformation recipes |
| Synthetic | Generated synthetic data |
| Augmented | Original + synthetic combined |
Data Profiles
A Data Profile defines the schema and statistical properties of a dataset. Profiles are used for:
- Validating data quality
- Guiding synthetic generation
- Ensuring consistency across versions
Profile Components:
profile:
  name: customer_transactions
  columns:
    - name: customer_id
      type: string
      constraints:
        - unique
        - not_null
    - name: transaction_amount
      type: float
      statistics:
        min: 0.01
        max: 99999.99
        mean: 150.75
      distribution: log_normal
    - name: is_fraud
      type: boolean
      distribution:
        true: 0.01
        false: 0.99
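To make the role of a profile in validating data quality concrete, here is a minimal sketch that checks rows against the profile above. It is illustrative only: the `PROFILE` dict, the `validate` function, and the choice to treat the `min`/`max` statistics as hard bounds are assumptions made for this example, not part of Synthex.

```python
from typing import Any

# Hypothetical in-memory mirror of the profile above; the real Synthex
# profile format and validation rules may differ.
PROFILE = {
    "name": "customer_transactions",
    "columns": [
        {"name": "customer_id", "type": "string",
         "constraints": ["unique", "not_null"]},
        {"name": "transaction_amount", "type": "float",
         "statistics": {"min": 0.01, "max": 99999.99}},
        {"name": "is_fraud", "type": "boolean"},
    ],
}

PY_TYPES = {"string": str, "float": (int, float), "boolean": bool}


def validate(rows: list[dict[str, Any]], profile: dict) -> list[str]:
    """Return human-readable violations; an empty list means the rows conform."""
    errors = []
    for col in profile["columns"]:
        name = col["name"]
        constraints = col.get("constraints", [])
        stats = col.get("statistics", {})
        seen = set()  # values already observed, for the uniqueness check
        for i, row in enumerate(rows):
            value = row.get(name)
            if value is None:
                if "not_null" in constraints:
                    errors.append(f"row {i}: {name} is null")
                continue
            if not isinstance(value, PY_TYPES[col["type"]]):
                errors.append(f"row {i}: {name} is not {col['type']}")
                continue
            if "unique" in constraints:
                if value in seen:
                    errors.append(f"row {i}: {name} duplicates an earlier row")
                seen.add(value)
            # Assumption: observed min/max are treated as allowed bounds.
            if "min" in stats and value < stats["min"]:
                errors.append(f"row {i}: {name} below min {stats['min']}")
            if "max" in stats and value > stats["max"]:
                errors.append(f"row {i}: {name} above max {stats['max']}")
    return errors


rows = [
    {"customer_id": "CUST-1", "transaction_amount": 150.75, "is_fraud": False},
    {"customer_id": "CUST-1", "transaction_amount": 0.0, "is_fraud": False},
    {"customer_id": None, "transaction_amount": 20.0, "is_fraud": True},
]
problems = validate(rows, PROFILE)
for problem in problems:
    print(problem)
```

The second row is flagged twice (duplicate ID and amount below the minimum) and the third row once (null ID).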
Data Recipes
A Data Recipe is a reusable sequence of transformations that can be applied to datasets. Recipes ensure reproducibility and consistency.
Recipe Structure:
recipe:
  name: fraud_data_prep
  steps:
    - type: clean
      action: drop_nulls
      columns: [customer_id, transaction_amount]
    # Filter before normalizing: min-max maps the smallest amount to 0,
    # which a later "> 0" check would drop.
    - type: filter
      condition: "transaction_amount > 0"
    - type: transform
      action: normalize
      column: transaction_amount
      method: min_max
    - type: encode
      column: category
      method: one_hot
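One way to picture how a recipe runs is as a loop over its steps. The sketch below is a minimal, hypothetical interpreter for the four step types shown, operating on plain Python dicts. `apply_step`, `apply_recipe`, and the tiny `<column> <op> <number>` condition parser are illustrative assumptions, not Synthex APIs. The filter runs before the min-max step, because min-max scaling maps the smallest amount to exactly 0 and a later `> 0` check would drop that row.

```python
import operator

# Hypothetical mirror of the recipe as plain Python data.
RECIPE = {
    "name": "fraud_data_prep",
    "steps": [
        {"type": "clean", "action": "drop_nulls",
         "columns": ["customer_id", "transaction_amount"]},
        {"type": "filter", "condition": "transaction_amount > 0"},
        {"type": "transform", "action": "normalize",
         "column": "transaction_amount", "method": "min_max"},
        {"type": "encode", "column": "category", "method": "one_hot"},
    ],
}

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def apply_step(rows, step):
    """Apply one recipe step to a list of row dicts and return new rows."""
    kind = step["type"]
    if kind == "clean":  # drop rows with nulls in the listed columns
        return [r for r in rows
                if all(r.get(c) is not None for c in step["columns"])]
    if kind == "filter":  # only "<column> <op> <number>" is supported here
        col, op, literal = step["condition"].split()
        return [r for r in rows if OPS[op](r[col], float(literal))]
    if kind == "transform":  # min-max scale a column into [0, 1]
        col = step["column"]
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        return [{**r, col: (r[col] - lo) / span} for r in rows]
    if kind == "encode":  # one-hot: one 0/1 column per distinct category
        col = step["column"]
        categories = sorted({r[col] for r in rows})
        return [
            {**{k: v for k, v in r.items() if k != col},
             **{f"{col}_{c}": int(r[col] == c) for c in categories}}
            for r in rows
        ]
    raise ValueError(f"unsupported step type: {kind}")


def apply_recipe(rows, recipe):
    """Run every step in order; the input list is never mutated."""
    for step in recipe["steps"]:
        rows = apply_step(rows, step)
    return rows


rows = [
    {"customer_id": "C1", "transaction_amount": 10.0, "category": "retail"},
    {"customer_id": "C2", "transaction_amount": 30.0, "category": "food"},
    {"customer_id": None, "transaction_amount": 50.0, "category": "food"},
    {"customer_id": "C4", "transaction_amount": -5.0, "category": "retail"},
]
prepared = apply_recipe(rows, RECIPE)
print(prepared)
```

Because each step returns new rows instead of editing in place, the original data stays intact, which matches the idea of a recipe producing a new version.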
Generator Configurations
A Generator Configuration defines how synthetic data is created. It specifies:
- Method — Statistical, GAN, LLM, etc.
- Source — Profile or existing dataset to learn from
- Parameters — Method-specific settings
- Output — Where to store generated data
Modalities
Synthex supports multiple data modalities:
| Modality | Description | Generation Methods |
|---|---|---|
| Tabular | Structured rows and columns | Statistical, GAN, VAE |
| Text | Natural language data | LLM, GPT, T5 |
| Time Series | Sequential temporal data | TimeGAN, Statistical |
| Image | Visual data | Diffusion, GAN |
| Graph | Network/relationship data | GraphVAE |
Getting Started
Accessing Synthex
- Log in to Inwire
- Click Synthex in the sidebar
- You'll see the Synthex dashboard:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Synthex [+ New Dataset] │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Datasets │ │ Profiles │ │ Recipes │ │ Jobs │ │
│ │ 47 │ │ 23 │ │ 15 │ │ 3 Running │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Recent Datasets [View All] │
│ ───────────────────────────────────────────────────────────────────── │
│ │ Name │ Type │ Rows │ Status │ Updated │ │
│ ├─────────────────────────┼───────────┼─────────┼──────────┼─────────────┤ │
│ │ customer_transactions │ Tabular │ 100,000 │ Ready │ 2 hours ago │ │
│ │ product_reviews │ Text │ 50,000 │ Ready │ 1 day ago │ │
│ │ sensor_readings │ TimeSeries│ 1M │ Profiling│ 5 min ago │ │
│ └─────────────────────────┴───────────┴─────────┴──────────┴─────────────┘ │
│ │
│ Active Jobs [View All] │
│ ───────────────────────────────────────────────────────────────────── │
│ │ fraud_synthetic_gen │ Generation │ ████████░░ 80% │ ETA: 5 min │ │
│ │ customer_profile │ Profiling │ ██████████ Done │ Complete │ │
│ └────────────────────────┴────────────┴─────────────────┴──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Navigation Overview
| Section | Purpose |
|---|---|
| Datasets | Browse, import, and manage datasets |
| Profiles | Create and edit data profiles |
| Recipes | Build and manage transformation recipes |
| Generators | Configure synthetic data generators |
| Jobs | Monitor running and completed jobs |
| Settings | Configure Synthex preferences |
Working with Datasets
Importing a Dataset
Step 1: Start Import
- Go to Synthex → Datasets
- Click Import Dataset (or + New Dataset)
Step 2: Choose Source
Select your data source:
| Source | Description | Best For |
|---|---|---|
| File Upload | Upload from local machine | Small datasets, quick testing |
| Cloud Storage | Import from S3, GCS, Azure | Large datasets, production data |
| Database | Direct database query | Live data, scheduled imports |
| URL | Fetch from HTTP endpoint | Public datasets, APIs |
File Upload Example:
┌─────────────────────────────────────────────────────────────────┐
│ Import Dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Source: [●] File Upload [ ] Cloud [ ] Database [ ] URL │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Drag and drop files here │ │
│ │ or click to browse │ │
│ │ │ │
│ │ Supported: CSV, Parquet, JSON, JSONL │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Selected: transactions_2024.csv (45.2 MB) │
│ │
│ [Cancel] [Next →] │
└─────────────────────────────────────────────────────────────────┘
Cloud Storage Example (S3):
┌─────────────────────────────────────────────────────────────────┐
│ Import from Cloud Storage │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Integration: [Production S3 ▼] │
│ │
│ Path: s3://acme-ml-data/datasets/ │
│ │
│ ├── transactions/ │
│ │ ├── 2023/ │
│ │ └── 2024/ │
│ │ ├── q1_transactions.parquet [✓] │
│ │ ├── q2_transactions.parquet [✓] │
│ │ └── q3_transactions.parquet [✓] │
│ └── customers/ │
│ │
│ Selected: 3 files (2.1 GB total) │
│ │
│ [Cancel] [Next →] │
└─────────────────────────────────────────────────────────────────┘
Step 3: Configure Import
Set import options:
| Option | Description | Default |
|---|---|---|
| Name | Dataset identifier | Filename |
| Description | What this data represents | — |
| Type | Data modality | Auto-detected |
| Tags | Organization labels | — |
| Auto-profile | Run profiling after import | Enabled |
| Sampling | Import subset for large files | Disabled |
Configuration Form:
┌─────────────────────────────────────────────────────────────────┐
│ Configure Dataset │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Name: [customer_transactions_2024 ] │
│ │
│ Description: [Transaction data for Q1-Q3 2024 ] │
│ [including fraud labels ] │
│ │
│ Type: [Tabular ▼] │
│ │
│ Tags: [transactions] [fraud] [2024] [+] │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Options │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ [✓] Auto-profile after import │ │
│ │ [ ] Sample data (for large files) │ │
│ │ Sample size: [10000] rows │ │
│ │ [✓] Infer data types │ │
│ │ [ ] First row is header │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ [← Back] [Import] │
└─────────────────────────────────────────────────────────────────┘
Step 4: Review and Import
Click Import to start the process. You'll see:
- Upload Progress — File transfer status
- Schema Detection — Automatic column type inference
- Profiling — Statistical analysis (if enabled)
Viewing Dataset Details
Click on any dataset to see its details:
┌─────────────────────────────────────────────────────────────────────────────┐
│ customer_transactions_2024 [⚙ Actions ▼]│
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Overview │ Schema │ Profile │ Versions │ Lineage │ Access │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Status: Ready Created: Jan 15, 2024 │
│ Type: Tabular Updated: Jan 15, 2024 │
│ Rows: 100,000 Size: 45.2 MB │
│ Columns: 12 Version: v1.0 │
│ │
│ Description: │
│ Transaction data for Q1-Q3 2024 including fraud labels for │
│ training fraud detection models. │
│ │
│ Tags: [transactions] [fraud] [2024] [production] │
│ │
│ Quick Actions: │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Generate │ │ Apply │ │ Export │ │ Clone │ │
│ │ Synthetic │ │ Recipe │ │ │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Dataset Schema Tab
View and edit column definitions:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Schema [Edit Schema]│
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ │ Column │ Type │ Nullable │ Unique │ Sample Values ││
│ ├─────────────────────┼──────────┼──────────┼────────┼────────────────────┤│
│ │ transaction_id │ string │ No │ Yes │ TXN-2024-00001 ││
│ │ customer_id │ string │ No │ No │ CUST-10042 ││
│ │ timestamp │ datetime │ No │ No │ 2024-01-15 14:32 ││
│ │ amount │ float │ No │ No │ 127.50, 45.99 ││
│ │ currency │ string │ No │ No │ USD, EUR, GBP ││
│ │ merchant_id │ string │ No │ No │ MERCH-5521 ││
│ │ merchant_category │ string │ No │ No │ retail, food ││
│ │ card_type │ string │ Yes │ No │ credit, debit ││
│ │ is_international │ boolean │ No │ No │ true, false ││
│ │ is_fraud │ boolean │ No │ No │ true, false ││
│ │ fraud_type │ string │ Yes │ No │ card_theft, null ││
│ │ risk_score │ float │ Yes │ No │ 0.15, 0.89 ││
│ └─────────────────────┴──────────┴──────────┴────────┴────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
Dataset Actions
| Action | Description |
|---|---|
| Generate Synthetic | Create synthetic version |
| Apply Recipe | Transform with a recipe |
| Export | Download or save to cloud |
| Clone | Create a copy |
| Archive | Move to archive |
| Delete | Remove permanently |
Data Profiles
Understanding Data Profiles
A Data Profile captures the statistical "fingerprint" of your data. Synthex creates a profile automatically during import when Auto-profile is enabled, and you can also define one manually.
Viewing Profile Results
After profiling completes:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Profile: customer_transactions_2024 │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Summary │ Columns │ Correlations │ Anomalies │ Quality Score │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ Overall Quality Score: 87/100 ████████▓░ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Metric │ Value │ Status │ │
│ ├─────────────────────────────┼────────────────┼─────────────────────┤ │
│ │ Total Rows │ 100,000 │ ✓ │ │
│ │ Total Columns │ 12 │ ✓ │ │
│ │ Missing Values │ 2.3% │ ⚠ Moderate │ │
│ │ Duplicate Rows │ 0.1% │ ✓ Low │ │
│ │ Outliers Detected │ 1.2% │ ✓ Low │ │
│ │ Type Consistency │ 99.8% │ ✓ High │ │
│ └─────────────────────────────┴────────────────┴─────────────────────┘ │
│ │
│ Class Distribution (is_fraud): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ False (99%) ████████████████████████████████████████████████ │ │
│ │ True (1%) █ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ ⚠ Warning: Highly imbalanced classes detected │
└─────────────────────────────────────────────────────────────────────────────┘
Column-Level Statistics
Each column has detailed statistics:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Column: amount │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Type: float │
│ Missing: 0 (0.0%) │
│ Unique: 45,231 (45.2%) │
│ │
│ ┌───────────────────────┐ Statistics: │
│ │ Distribution │ ───────────────── │
│ │ │ Min: 0.01 │
│ │ ▂▅▇█▇▅▃▂▁ │ Max: 15,420.50 │
│ │ │ Mean: 127.43 │
│ │ 0 500 1000+ │ Median: 78.25 │
│ └───────────────────────┘ Std Dev: 245.67 │
│ Skewness: 2.34 (right-skewed) │
│ │
│ Distribution: Log-normal (best fit) │
│ │
│ Outliers: 1,234 values > 3 std deviations │
└─────────────────────────────────────────────────────────────────────────────┘
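The column statistics shown above can be reproduced with a few lines of standard-library Python. This is a sketch under stated assumptions: it uses the population standard deviation and the moment-based skewness, and flags outliers as values more than 3 standard deviations from the mean. Synthex may use different estimators, and the `describe` function and sample values are hypothetical.

```python
import statistics


def describe(values: list[float]) -> dict[str, float]:
    """Summary statistics similar to those in the column view."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Moment-based skewness: positive means a long right tail.
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3) if std else 0.0
    outliers = sum(1 for v in values if abs(v - mean) > 3 * std)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skew,
        "outliers": outliers,
    }


# Made-up sample: many small amounts and one very large one.
amounts = [10.0] * 19 + [1000.0]
stats = describe(amounts)
for key, value in stats.items():
    print(f"{key}: {value}")
```

A mean well above the median together with positive skewness is the same right-skewed pattern the `amount` column shows.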
Creating Custom Profiles
For synthetic data generation, you may want to define profiles manually:
- Go to Synthex → Profiles
- Click Create Profile
- Define columns and constraints:
# Example profile definition
name: custom_transactions
description: Custom profile for transaction generation
columns:
  - name: customer_id
    type: string
    generator: uuid
  - name: amount
    type: float
    constraints:
      min: 0.01
      max: 10000
    distribution:
      type: log_normal
      mean: 100
      std: 200
  - name: is_fraud
    type: boolean
    distribution:
      true: 0.02  # 2% fraud rate
      false: 0.98
  - name: timestamp
    type: datetime
    constraints:
      min: "2024-01-01"
      max: "2024-12-31"
    distribution:
      type: uniform
correlations:
  - columns: [amount, is_fraud]
    type: positive
    strength: 0.3  # Higher amounts slightly more likely to be fraud
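To show what generating from such a profile could look like, here is a minimal sampling sketch. Several things are assumptions: `mean` and `std` are read as the log-normal's own mean and standard deviation (hence the conversion to the underlying normal's parameters), out-of-range amounts are rejected and redrawn, and the `correlations` entry is not modeled at all. `lognormal_params` and `sample_row` are hypothetical names, not Synthex functions.

```python
import math
import random
import uuid
from datetime import datetime, timedelta


def lognormal_params(mean: float, std: float) -> tuple[float, float]:
    """Convert a log-normal's mean/std into mu/sigma of the underlying normal."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def sample_row(rng: random.Random) -> dict:
    """Draw one synthetic row; columns are sampled independently."""
    mu, sigma = lognormal_params(mean=100, std=200)
    # Rejection sampling keeps the amount inside the profile's [0.01, 10000] range.
    while True:
        amount = rng.lognormvariate(mu, sigma)
        if 0.01 <= amount <= 10000:
            break
    start = datetime(2024, 1, 1)
    end = datetime(2024, 12, 31)
    offset = rng.uniform(0, (end - start).total_seconds())
    return {
        # Seeded UUID so the whole sample is reproducible.
        "customer_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "amount": round(amount, 2),
        "is_fraud": rng.random() < 0.02,  # 2% fraud rate
        "timestamp": start + timedelta(seconds=offset),
    }


rng = random.Random(42)
rows = [sample_row(rng) for _ in range(5000)]
fraud_rate = sum(r["is_fraud"] for r in rows) / len(rows)
print(rows[0])
print(f"fraud rate: {fraud_rate:.3f}")
```

Independent sampling like this preserves each column's distribution but not relationships between columns, which is why methods that model correlations exist.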
Data Recipes
What are Recipes?
Data Recipes are reusable transformation pipelines that you can apply to datasets. They ensure:
- Reproducibility — Same transformations every time
- Consistency — Apply to multiple datasets
- Auditability — Track what was done to data
Creating a Recipe
- Go to Synthex → Recipes
- Click Create Recipe
- Add transformation steps:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Create Recipe │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Name: [fraud_data_preparation ] │
│ Description: [Prepare transaction data for fraud detection training] │
│ │
│ Steps: [+ Add Step] │
│ ───────────────────────────────────────────────────────────────────────── │
│ │
│ 1. ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Clean: Drop Null Values [✎] [×] │ │
│ │ Columns: customer_id, amount, timestamp │ │
│ │ Action: Drop rows with null values in specified columns │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ 2. ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Filter: Remove Invalid Transactions [✎] [×] │ │
│ │ Condition: amount > 0 AND amount < 100000 │ │
│ │ Action: Keep only rows matching condition │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ 3. ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Transform: Normalize Amount [✎] [×] │ │
│ │ Column: amount │ │
│ │ Method: Log transformation │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ 4. ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Encode: One-Hot Encoding [✎] [×] │ │
│ │ Column: merchant_category │ │
│ │ Output: merchant_category_retail, merchant_category_food, ... │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ [Cancel] [Save Recipe] │
└─────────────────────────────────────────────────────────────────────────────┘
Recipe Step Types
| Step Type | Description | Use Case |
|---|---|---|
| Clean | Handle missing/invalid data | Data quality |
| Filter | Remove rows by condition | Data selection |
| Transform | Modify column values | Feature engineering |
| Encode | Convert categorical data | ML preparation |
| Aggregate | Group and summarize | Feature creation |
| Join | Combine with other datasets | Data enrichment |
| Sample | Random subset | Testing, balancing |
| Split | Divide into subsets | Train/test split |
Applying a Recipe
To apply a recipe to a dataset:
- Go to the dataset detail page
- Click Apply Recipe
- Select the recipe
- Choose output options:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Apply Recipe │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Dataset: customer_transactions_2024 │
│ Recipe: [fraud_data_preparation ▼] │
│ │
│ Output Options: │
│ ────────────── │
│ [●] Create new version (recommended) │
│ Version name: [v1.1-cleaned ] │
│ │
│ [ ] Create new dataset │
│ Dataset name: [ ] │
│ │
│ [ ] Replace current version (destructive) │
│ │
│ Preview Changes: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Drop Null Values │ │
│ │ → Rows affected: 2,312 (will be removed) │ │
│ │ │ │
│ │ Step 2: Filter Invalid Transactions │ │
│ │ → Rows affected: 45 (will be removed) │ │
│ │ │ │
│ │ Step 3: Normalize Amount │ │
│ │ → Column 'amount' will be log-transformed │ │
│ │ │ │
│ │ Step 4: One-Hot Encoding │ │
│ │ → 8 new columns will be created │ │
│ │ │ │
│ │ Final: 97,643 rows, 19 columns │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ [Cancel] [Apply Recipe] │
└─────────────────────────────────────────────────────────────────────────────┘
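The final shape in the preview can be checked by hand. A minimal sketch, assuming the one-hot step replaces `merchant_category` with the 8 new columns and the log transform leaves the shape unchanged:

```python
rows_in, cols_in = 100_000, 12

rows_after_nulls = rows_in - 2_312        # Step 1: drop rows with nulls
rows_after_filter = rows_after_nulls - 45  # Step 2: drop invalid transactions
cols_out = cols_in - 1 + 8                 # Step 4: 1 column replaced by 8

print(rows_after_filter, cols_out)
```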
Synthetic Data Generation
Generation Methods
Synthex supports multiple synthetic data generation methods:
| Method | Description | Best For | Speed |
|---|---|---|---|
| Statistical | Preserves distributions and correlations | Tabular data, quick generation | Fast |
| GAN | Generative Adversarial Networks | Complex patterns, high fidelity | Slow |
| VAE | Variational Autoencoders | Balanced quality/speed | Medium |
| CTGAN | Conditional Tabular GAN | Mixed-type tabular data | Medium |
| CopulaGAN | Copula-based GAN | Preserving correlations | Medium |
| LLM | Large Language Models | Text data, complex semantics | Slow |
| TimeGAN | Temporal GAN | Time series data | Slow |
| Diffusion | Diffusion models | High-quality images | Very Slow |
Starting a Generation Job
- Go to Synthex → Datasets
- Select your source dataset
- Click Generate Synthetic
- Configure generation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Generate Synthetic Data │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Source: customer_transactions_2024 (v1.0) │
│ │
│ Generation Method │
│ ────────────────── │
│ [●] Statistical (Copula) - Fast, good for tabular data │
│ [ ] CTGAN - Better for mixed types, slower │
│ [ ] CopulaGAN - Best correlation preservation │
│ [ ] GaussianCopula - Fastest, simple distributions │
│ │
│ Configuration │
│ ────────────── │
│ Number of records: [100000 ] │
│ │
│ Privacy Settings: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ [✓] Enable differential privacy │ │
│ │ Epsilon (ε): [1.0 ] (lower = more private) │ │
│ │ │ │
│ │ [✓] Anonymize identifiers │ │
│ │ Columns: customer_id, merchant_id │ │
│ │ │ │
│ │ [ ] Add noise to numerical columns │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Advanced Options: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Batch size: [1000 ] │ │
│ │ Random seed: [42 ] (for reproducibility) │ │
│ │ Constraint handling: [Reject invalid ▼] │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Output │
│ ────── │
│ [●] Create new synthetic dataset │
│ Name: [customer_transactions_2024_synthetic ] │
│ │
│ [ ] Augment existing dataset (combine with original) │
│ Augmentation ratio: [50% ] │
│ │
│ [Cancel] [Start Generation] │
└─────────────────────────────────────────────────────────────────────────────┘
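The "lower = more private" note on epsilon can be illustrated with the textbook Laplace mechanism. This guide does not say which mechanism Synthex uses internally, so treat the sketch purely as an illustration of the privacy/accuracy trade-off: noise with scale `sensitivity / epsilon` is added to a count, so a smaller epsilon means more noise. All function names here are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 20_000, seed: int = 42) -> float:
    """Average absolute error of the noisy count over many releases."""
    rng = random.Random(seed)
    total = sum(abs(private_count(1000, epsilon, rng) - 1000) for _ in range(trials))
    return total / trials


error_eps_1 = mean_abs_error(1.0)
error_eps_half = mean_abs_error(0.5)
print(f"epsilon=1.0 -> mean abs error {error_eps_1:.2f}")
print(f"epsilon=0.5 -> mean abs error {error_eps_half:.2f}")
```

Halving epsilon roughly doubles the expected error, which is the cost paid for the stronger guarantee.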
Monitoring Generation Jobs
Track job progress in the Jobs view:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Job: fraud_synthetic_generation [Cancel Job] │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Status: Running │
│ Progress: ████████████████████░░░░░░░░░░ 65% │
│ │
│ Started: Jan 15, 2024 14:32:15 │
│ Elapsed: 12 minutes │
│ ETA: ~6 minutes remaining │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Stage │ Status │ Progress │ Duration │ │
│ ├──────────────────────────────┼───────────┼───────────┼─────────────┤ │
│ │ 1. Load source data │ Complete │ 100% │ 0:45 │ │
│ │ 2. Fit generator model │ Complete │ 100% │ 8:23 │ │
│ │ 3. Generate synthetic rows │ Running │ 65% │ 3:12 │ │
│ │ 4. Validate output │ Pending │ — │ — │ │
│ │ 5. Save results │ Pending │ — │ — │ │
│ └──────────────────────────────┴───────────┴───────────┴─────────────┘ │
│ │
│ Logs: │
│ ───────────────────────────────────────────────────────────────────────── │
│ [14:32:15] Starting generation job... │
│ [14:33:00] Source data loaded: 100,000 rows │
│ [14:33:02] Fitting CTGAN model... │
│ [14:41:25] Model training complete │
│ [14:41:26] Generating synthetic samples: batch 1/100 │
│ [14:44:38] Progress: 65,000/100,000 samples generated │
└─────────────────────────────────────────────────────────────────────────────┘
Evaluating Synthetic Data Quality
After generation, Synthex automatically evaluates quality:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Quality Report: customer_transactions_2024_synthetic │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Overall Quality Score: 92/100 █████████▓ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Metric │ Score │ Status │ │
│ ├──────────────────────────────────┼────────┼─────────────────────────┤ │
│ │ Statistical Similarity │ 94% │ ✓ Excellent │ │
│ │ Distribution Match (KL Div) │ 0.02 │ ✓ Very Low │ │
│ │ Correlation Preservation │ 91% │ ✓ Good │ │
│ │ Constraint Satisfaction │ 100% │ ✓ Perfect │ │
│ │ Privacy Score (k-anonymity) │ k=15 │ ✓ Strong │ │
│ │ ML Efficacy │ 89% │ ✓ Good │ │
│ └──────────────────────────────────┴────────┴─────────────────────────┘ │
│ │
│ Column-by-Column Comparison: │
│ ──────────────────────────────────────────────────────────────────────────│
│ │
│ amount: │
│ ┌────────────────────────┐ ┌────────────────────────┐ │
│ │ Original Distribution │ │ Synthetic Distribution │ │
│ │ ▂▅▇█▇▅▃▂▁ │ │ ▂▄▇█▇▅▃▂▁ │ │
│ │ 0 500 1000+ │ │ 0 500 1000+ │ │
│ └────────────────────────┘ └────────────────────────┘ │
│ Similarity: 96% ████████████████████░ │
│ │
│ is_fraud: │
│ Original: True: 1.02% False: 98.98% │
│ Synthetic: True: 1.01% False: 98.99% │
│ Similarity: 99% ████████████████████ │
│ │
│ [Export Report] [Download PDF] │
└─────────────────────────────────────────────────────────────────────────────┘
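As an illustration of the distribution-match metric, the sketch below computes the KL divergence between the original and synthetic `is_fraud` distributions from the report. How Synthex aggregates per-column values into the single reported figure is not described here, so this covers one column only. `kl_divergence` is a hypothetical helper.

```python
import math


def kl_divergence(p: dict, q: dict) -> float:
    """D_KL(P || Q) in nats for two discrete distributions over the same keys.

    Assumes q[k] > 0 wherever p[k] > 0; otherwise the divergence is infinite.
    """
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


original = {True: 0.0102, False: 0.9898}
synthetic = {True: 0.0101, False: 0.9899}

kl = kl_divergence(original, synthetic)
print(f"KL divergence for is_fraud: {kl:.2e}")
```

A value of 0 means the two distributions are identical, and larger values mean the synthetic column has drifted further from the original.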
Generator Configurations
Saving Generator Configs
For repeatable generation, save your configurations:
# Example saved configuration
name: fraud_augmentation_config
description: Generate fraud cases for training data augmentation
source:
  type: dataset
  name: customer_transactions_2024
  version: v1.0
  filter: "is_fraud = true"  # Learn only from fraud cases
method: ctgan
parameters:
  epochs: 300
  batch_size: 500
  discriminator_steps: 1
  log_frequency: true
privacy:
  differential_privacy:
    enabled: true
    epsilon: 1.0
  anonymize_columns:
    - customer_id
    - merchant_id
output:
  records: 50000
  format: parquet
  destination: s3://ml-data/synthetic/
Managing Configurations
View and manage saved configs:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Generator Configurations [+ New Config] │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ │ Name │ Method │ Source Dataset │ Last Used ││
│ ├───────────────────────────┼─────────┼────────────────────┼──────────────┤│
│ │ fraud_augmentation │ CTGAN │ customer_trans... │ 2 days ago ││
│ │ review_generation │ GPT-4 │ product_reviews │ 1 week ago ││
│ │ sensor_timeseries │ TimeGAN │ sensor_readings │ 3 days ago ││
│ │ privacy_safe_customers │ Copula │ customer_data │ Today ││
│ └───────────────────────────┴─────────┴────────────────────┴──────────────┘│
│ │
│ Selected: fraud_augmentation │
│ [Run Now] [Edit] [Clone] [Delete] [View History] │
└─────────────────────────────────────────────────────────────────────────────┘
Integration with Model Training
The Synthex-Training Connection
Model Training reads datasets, versions, and synthetic outputs directly from Synthex:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Data-to-Model Workflow │
└─────────────────────────────────────────────────────────────────────────────┘
Synthex Model Training
┌────────────────────────────────────┐ ┌────────────────────────────────┐
│ │ │ │
│ ┌──────────┐ │ │ ┌──────────────────┐ │
│ │ Datasets │──────────────────────│───>│──────│ Dataset Selector │ │
│ └──────────┘ │ │ └────────┬─────────┘ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌──────────┐ │ │ ┌──────────────────┐ │
│ │ Profiles │ │ │ │ Experiment │ │
│ └──────────┘ │ │ │ Configuration │ │
│ │ │ │ └────────┬─────────┘ │
│ ▼ │ │ │ │
│ ┌──────────┐ ┌──────────┐ │ │ ▼ │
│ │ Recipes │─────>│ Versions │───│───>│ ┌──────────────────┐ │
│ └──────────┘ └──────────┘ │ │ │ Training Run │ │
│ │ │ └──────────────────┘ │
│ ┌──────────┐ │ │ │
│ │Synthetic │──────────────────────│───>│ │
│ └──────────┘ │ │ │
│ │ │ │
└────────────────────────────────────┘ └────────────────────────────────┘
Selecting Data in Training Wizard
When creating a training experiment:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Training Wizard - Step 2: Select Dataset │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Select data source for training: │
│ │
│ Data Source: [From Synthex ▼] │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Available Datasets [Search...] │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ □ customer_transactions_2024 │ │
│ │ │── v1.0 (original) 100,000 rows │ │
│ │ │── v1.1 (cleaned) 97,643 rows │ │
│ │ └── v2.0 (augmented) 147,643 rows ← Recommended │ │
│ │ │ │
│ │ ☑ customer_transactions_2024_synthetic │ │
│ │ └── v1.0 (generated) 100,000 rows │ │
│ │ │ │
│ │ □ product_reviews │ │
│ │ └── v1.0 (original) 50,000 rows │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Apply Recipe: [fraud_data_preparation ▼] (optional) │
│ │
│ Data Split: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Training: [70% ] ████████████████████████████░░░░░░░░░░░░░░░░ │ │
│ │ Validation: [15% ] ░░░░░░░░░░░░░░░░░░░░░░░░░░░░█████░░░░░░░░░░░ │ │
│ │ Test: [15% ] ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█████░░░░░░ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Stratify by: [is_fraud ▼] (maintain class distribution) │
│ │
│ [← Back] [Next: Configure →] │
└─────────────────────────────────────────────────────────────────────────────┘
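The "Stratify by" option keeps the class distribution the same in every split. A minimal sketch of the idea, assuming a 70/15/15 split applied within each class separately; `stratified_split` is a hypothetical helper, not the wizard's implementation:

```python
import random
from collections import defaultdict


def stratified_split(rows, label, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split rows into train/val/test while preserving the label distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)

    train, val, test = [], [], []
    for group in by_class.values():
        rng.shuffle(group)  # shuffle within each class before slicing
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 10,000 made-up rows with a 1% fraud rate.
rows = [{"id": i, "is_fraud": i < 100} for i in range(10_000)]
train, val, test = stratified_split(rows, "is_fraud")

for name, part in (("train", train), ("val", val), ("test", test)):
    fraud = sum(r["is_fraud"] for r in part)
    print(f"{name}: {len(part)} rows, {fraud} fraud")
```

Without stratification, a rare class can end up nearly absent from the validation or test split purely by chance.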
Lineage Tracking
Model Training records exactly which data was used:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Experiment: fraud-detector-v3 │
│ Run: run-2024-01-15-001 │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Data Lineage: │
│ ───────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌───────────────────┐ ┌───────────────────┐ │ │
│ │ │ customer_trans... │────>│ fraud_data_prep │ │ │
│ │ │ v1.0 (original) │ │ (recipe applied) │ │ │
│ │ └───────────────────┘ └─────────┬─────────┘ │ │
│ │ │ │ │
│ │ ┌───────────────────┐ │ │ │
│ │ │ synthetic_fraud │ │ │ │
│ │ │ v1.0 (generated) │───────────────┤ │ │
│ │ └───────────────────┘ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────────┐ │ │
│ │ │ Training Dataset │ │ │
│ │ │ 147,643 rows │ │ │
│ │ │ 70% train/15% val │ │ │
│ │ └─────────┬─────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────────┐ │ │
│ │ │ fraud-detector-v3 │ │ │
│ │ │ (trained model) │ │ │
│ │ └───────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ [View Full Lineage] [Export Lineage Report] │
└─────────────────────────────────────────────────────────────────────────────┘
Example Scenarios
Scenario 1: Augmenting Imbalanced Fraud Data
Problem: You have transaction data with only 1% fraud cases, leading to poor model performance on the minority class.
Solution: Generate synthetic fraud cases to balance the dataset.
Step 1: Analyze the Imbalance
- Go to Synthex → Datasets
- Open customer_transactions
- View the Profile tab
Class Distribution (is_fraud):
├── False: 99,000 (99%)
└── True: 1,000 (1%)
⚠ Warning: Severe class imbalance detected
Recommendation: Consider synthetic augmentation
Step 2: Create Fraud-Only Generator
- Click Generate Synthetic
- Configure:
- Filter source: is_fraud = true (learn only from fraud patterns)
- Method: CTGAN
- Records: 49,000 (brings fraud to ~34% of the augmented dataset)
# Generation config
source_filter: "is_fraud = true"
method: ctgan
parameters:
  epochs: 500
  batch_size: 100
output:
  records: 49000
  mode: augment  # Combine with original
Step 3: Run Generation and Validate
After generation completes:
Augmented Dataset Summary:
├── Total rows: 149,000
├── Original (real): 100,000
├── Synthetic fraud: 49,000
└── Class distribution:
    ├── False: 99,000 (66.4%)
    └── True: 50,000 (33.6%)
Quality Score: 91/100
Fraud pattern preservation: 94%
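The class distribution in the summary follows from a short calculation, and the same formula gives the number of synthetic rows needed for any other target. With N real rows, m real fraud rows, and s synthetic fraud rows, the fraud share is (m + s) / (N + s). Solving for s at a target share t gives s = (t·N − m) / (1 − t). The helper names below are hypothetical.

```python
def fraud_share(real_total: int, real_fraud: int, synthetic_fraud: int) -> float:
    """Fraction of fraud rows after adding synthetic fraud rows."""
    return (real_fraud + synthetic_fraud) / (real_total + synthetic_fraud)


def synthetic_needed(real_total: int, real_fraud: int, target_share: float) -> int:
    """Synthetic fraud rows needed so fraud makes up target_share of the result."""
    return round((target_share * real_total - real_fraud) / (1 - target_share))


share = fraud_share(100_000, 1_000, 49_000)
needed_for_half = synthetic_needed(100_000, 1_000, 0.5)

print(f"49,000 synthetic rows -> {share:.1%} fraud")
print(f"rows needed for a 50/50 balance: {needed_for_half:,}")
```

Adding 49,000 synthetic rows yields roughly one fraud row for every two legitimate ones; a fully balanced dataset would need 98,000.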
Step 4: Use in Training
- Go to Model Training → New Experiment
- Select the augmented dataset
- Enable stratified sampling
- Train your model
Result: In this example, model recall on fraud cases improves from 45% to 82%.
Scenario 2: Creating Privacy-Safe Test Data
Problem: Your QA team needs realistic test data but cannot access production data due to privacy regulations.
Solution: Generate synthetic data that preserves statistical properties without containing real customer information.
Step 1: Import and Profile Production Data
Work with your data team to import a sample of production data into Synthex. Enable automatic profiling.
Step 2: Configure Privacy-Safe Generation
- Select the dataset
- Click Generate Synthetic
- Enable privacy features:
# Privacy-focused configuration
method: gaussian_copula
parameters:
  default_distribution: parametric
privacy:
  differential_privacy:
    enabled: true
    epsilon: 0.5  # Strong privacy guarantee
  anonymize_columns:
    - customer_id
    - email
    - phone
    - address
  pii_handling:
    names: synthetic  # Generate fake names
    dates: shift      # Randomly shift dates
    amounts: noise    # Add controlled noise
output:
  records: 10000
  format: csv
Step 3: Validate Privacy
Review the privacy report:
Privacy Assessment:
├── k-anonymity: k=50 (Strong)
├── l-diversity: l=10 (Strong)
├── Re-identification risk: <0.1% (Minimal)
└── PII detection: None found
All identifiers successfully anonymized.
No real customer data exposed.
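To make the k-anonymity figure concrete: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The guide does not describe how Synthex computes its report, so the following is only an independent sketch of the definition; the column names and rows are hypothetical.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values. The dataset is k-anonymous for that k."""
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0

# Hypothetical rows; age_band and region play the role of quasi-identifiers.
rows = [
    {"age_band": "30-39", "region": "north", "amount": 120.0},
    {"age_band": "30-39", "region": "north", "amount": 87.5},
    {"age_band": "40-49", "region": "south", "amount": 42.0},
    {"age_band": "40-49", "region": "south", "amount": 310.0},
    {"age_band": "40-49", "region": "south", "amount": 15.0},
]

print(k_anonymity(rows, ["age_band", "region"]))  # 2: the smallest group has two rows
```

A check like this can be run on an exported sample as an extra audit before sharing data outside the team.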
Step 4: Export for QA
- Click Export
- Choose format (CSV, JSON)
- Download or send to cloud storage
Result: The QA team has realistic test data that contains no real customer records and carries minimal re-identification risk.
Scenario 3: Generating Training Data from Schema
Problem: You're building a new feature but have no historical data yet. You need realistic data to develop and test your model.
Solution: Define a data profile from scratch and generate synthetic data matching your expected schema.
Step 1: Create a Custom Profile
- Go to Synthex → Profiles
- Click Create Profile
- Define your expected schema:
name: subscription_churn_profile
description: Expected data for subscription churn prediction
columns:
  - name: user_id
    type: string
    generator: uuid
  - name: signup_date
    type: datetime
    constraints:
      min: "2022-01-01"
      max: "2024-01-01"
    distribution:
      type: uniform
  - name: subscription_tier
    type: category
    values: [free, basic, premium, enterprise]
    distribution:
      free: 0.40
      basic: 0.35
      premium: 0.20
      enterprise: 0.05
  - name: monthly_usage_hours
    type: float
    constraints:
      min: 0
      max: 200
    distribution:
      type: beta
      alpha: 2
      beta: 5
      scale: 200
  - name: support_tickets
    type: integer
    constraints:
      min: 0
      max: 50
    distribution:
      type: poisson
      lambda: 2
  - name: churned
    type: boolean
    distribution:
      true: 0.15
      false: 0.85
correlations:
  - columns: [monthly_usage_hours, churned]
    type: negative
    strength: 0.4  # Less usage → more likely to churn
  - columns: [support_tickets, churned]
    type: positive
    strength: 0.3  # More tickets → more likely to churn
Step 2: Generate from Profile
- Go to Synthex → Generate Data
- Select From Profile
- Choose your custom profile
- Set record count (e.g., 50,000)
Step 3: Validate Generated Data
Review that distributions match expectations:
Generated Dataset Validation:
├── subscription_tier distribution: ✓ Matches profile
├── churned rate: 14.8% (expected 15%) ✓
├── usage-churn correlation: -0.38 (expected -0.4) ✓
└── All constraints satisfied: ✓
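A check like "Matches profile" can be approximated offline by comparing observed category shares with the profile's weights. The sketch below assumes a simple absolute tolerance of 0.02 per category; that threshold, and the helper names, are illustrative choices and not Synthex's actual validation rule. The sample is drawn locally with Python's standard library to stand in for generated data.

```python
import random
from collections import Counter

# Weights copied from the subscription_tier column of the profile.
TIER_WEIGHTS = {"free": 0.40, "basic": 0.35, "premium": 0.20, "enterprise": 0.05}

def sample_tiers(n, rng):
    """Draw n tier values according to TIER_WEIGHTS."""
    tiers = list(TIER_WEIGHTS)
    return rng.choices(tiers, weights=[TIER_WEIGHTS[t] for t in tiers], k=n)

def matches_profile(values, expected, tolerance=0.02):
    """True if every category's observed share is within `tolerance`
    (absolute) of its expected share."""
    counts = Counter(values)
    total = len(values)
    return all(
        abs(counts[category] / total - share) <= tolerance
        for category, share in expected.items()
    )

rng = random.Random(42)  # fixed seed so the check is repeatable
tiers = sample_tiers(50_000, rng)
print(matches_profile(tiers, TIER_WEIGHTS))
```

With 50,000 records the sampling noise on each share is a fraction of a percentage point, so a 0.02 tolerance is comfortably loose; smaller datasets may need a wider one.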
Step 4: Iterate and Refine
As you develop, refine your profile based on insights:
# Updated profile based on domain feedback
columns:
  - name: monthly_usage_hours
    type: float
    # Refined: usage differs by tier
    conditional_distribution:
      on: subscription_tier
      distributions:
        free:
          type: exponential
          scale: 10
        premium:
          type: normal
          mean: 80
          std: 30
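To illustrate what a conditional distribution means in practice, the following sketch draws monthly_usage_hours differently depending on subscription_tier. It is not Synthex's sampler. Two behaviors are assumptions made for the example: tiers without their own entry fall back to the original Beta(2, 5) distribution scaled to 200, and the min/max constraints are enforced by clipping.

```python
import random

def sample_usage_hours(tier, rng):
    """Draw monthly_usage_hours conditional on subscription_tier."""
    if tier == "free":
        hours = rng.expovariate(1 / 10)      # exponential with scale (mean) 10
    elif tier == "premium":
        hours = rng.gauss(80, 30)            # normal with mean 80, std 30
    else:
        # Assumption: tiers with no conditional entry use the base profile
        # distribution, Beta(2, 5) scaled to the 0..200 range.
        hours = rng.betavariate(2, 5) * 200
    # Assumption: constraints (min 0, max 200) are applied by clipping.
    return min(max(hours, 0.0), 200.0)

rng = random.Random(7)
for tier in ("free", "basic", "premium"):
    draws = [sample_usage_hours(tier, rng) for _ in range(10_000)]
    print(tier, round(sum(draws) / len(draws), 1))
```

Printing the per-tier means is a quick sanity check that free users cluster near low usage while premium users center around 80 hours.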
Result: Realistic development data aligned with business expectations.
Scenario 4: Text Data Generation for NLP
Problem: You need training data for a customer service classifier but have limited labeled examples.
Solution: Use LLM-based generation to create diverse training examples.
Step 1: Prepare Seed Examples
Upload a small dataset of labeled examples:
Category: billing_question
Examples:
- "Why was I charged twice this month?"
- "Can you explain this fee on my invoice?"
- "I need a refund for the overcharge"
Category: technical_support
Examples:
- "The app keeps crashing when I open it"
- "I can't log in to my account"
- "The feature isn't working as expected"
Category: general_inquiry
Examples:
- "What are your business hours?"
- "Do you ship internationally?"
- "How do I contact support?"
Step 2: Configure LLM Generation
- Select the seed dataset
- Choose LLM method
- Configure:
method: llm
parameters:
  model: gpt-4
  temperature: 0.8  # Some creativity
  prompt_template: |
    Generate a diverse customer service message for category: {category}
    Examples of this category:
    {examples}
    Generate a new, unique message that fits this category.
    Be diverse in tone, length, and specific issues.
  preserve_label: true
  variations_per_example: 10
constraints:
  min_length: 20
  max_length: 200
  language: english
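The guide does not specify how Synthex fills the {category} and {examples} placeholders. Assuming plain string substitution with the examples rendered as a bulleted list (both assumptions), the prompt sent for one category would look like the output of this sketch:

```python
# Template text copied from the prompt_template setting.
PROMPT_TEMPLATE = (
    "Generate a diverse customer service message for category: {category}\n"
    "Examples of this category:\n"
    "{examples}\n"
    "Generate a new, unique message that fits this category.\n"
    "Be diverse in tone, length, and specific issues."
)

def render_prompt(category, examples):
    """Fill the template for one category (hypothetical helper)."""
    bullet_list = "\n".join(f"- {text}" for text in examples)
    return PROMPT_TEMPLATE.format(category=category, examples=bullet_list)

prompt = render_prompt(
    "billing_question",
    [
        "Why was I charged twice this month?",
        "Can you explain this fee on my invoice?",
        "I need a refund for the overcharge",
    ],
)
print(prompt)
```

Rendering a prompt by hand like this is a cheap way to review wording before spending generation budget.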
Step 3: Review and Filter
Generated examples are scored for quality:
Generated Examples (billing_question):
[Score: 0.95] "I noticed an unexpected $15 charge on my
statement dated March 3rd - could you help
me understand what this is for?"
[Score: 0.91] "hey there, pretty confused about my bill.
shows i paid but also have a balance due??"
[Score: 0.87] "Looking at my invoice #INV-2024-001, the
subtotal doesn't seem to match the itemized
charges. Please advise."
[Score: 0.45] "Payment question" [Filtered - too short]
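The filtering shown above can be reproduced on an exported file with a few lines of code. This sketch combines the min_length and max_length constraints from Step 2 with a score cutoff; the 0.8 cutoff is an assumed value, since the guide does not state the threshold Synthex applies.

```python
def keep_example(text, score, min_score=0.8, min_length=20, max_length=200):
    """Keep an example only if it scores well enough and its character
    length falls within the configured bounds."""
    return score >= min_score and min_length <= len(text) <= max_length

candidates = [
    ("hey there, pretty confused about my bill. "
     "shows i paid but also have a balance due??", 0.91),
    ("Payment question", 0.45),  # 16 characters, below min_length
]

kept = [text for text, score in candidates if keep_example(text, score)]
print(len(kept))  # 1
```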
Step 4: Export for Training
Export the quality-filtered dataset for NLP model training.
Result: Expanded training set from 100 to 1,000+ labeled examples.
Best Practices
Data Quality
| Practice | Description |
|---|---|
| Profile before generating | Always understand your source data's statistics |
| Validate outputs | Check that generated data matches expected distributions |
| Preserve correlations | Use methods that maintain relationships between columns |
| Test with real models | Validate synthetic data improves actual model performance |
Privacy & Security
| Practice | Description |
|---|---|
| Enable differential privacy | For sensitive data, always use DP guarantees |
| Anonymize identifiers | Never generate data that could identify real individuals |
| Audit before sharing | Review generated data before distributing |
| Document data lineage | Track which real data influenced synthetic outputs |
Performance
| Practice | Description |
|---|---|
| Start with statistical methods | Fastest and often sufficient |
| Use GANs for complex patterns | When statistical methods aren't capturing nuances |
| Batch large generations | Split very large generation jobs |
| Cache generator models | Reuse trained generators for multiple outputs |
Reproducibility
| Practice | Description |
|---|---|
| Set random seeds | Enable exact reproduction of results |
| Version your configs | Save and version generator configurations |
| Document generation params | Record all parameters used |
| Link to experiments | Track which generated data trained which models |
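As one possible way to apply these practices outside the UI, the sketch below fingerprints a generator configuration so it can be versioned and linked to experiments, and shows that a fixed seed reproduces the same random draws. The seed key and the fingerprint helper are hypothetical; the guide does not show where Synthex stores a seed.

```python
import hashlib
import json
import random

def config_fingerprint(config):
    """Short, stable hash of a config dict; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Assumption: "seed" is an illustrative key, not a documented Synthex option.
config = {
    "method": "ctgan",
    "parameters": {"epochs": 500, "batch_size": 100},
    "seed": 42,
}
reordered = {
    "seed": 42,
    "parameters": {"batch_size": 100, "epochs": 500},
    "method": "ctgan",
}

# Same settings give the same fingerprint regardless of key order.
print(config_fingerprint(config) == config_fingerprint(reordered))  # True

# The same seed reproduces the same sequence of draws.
first = [random.Random(config["seed"]).random() for _ in range(3)]
second = [random.Random(config["seed"]).random() for _ in range(3)]
print(first == second)  # True
```

Storing the fingerprint alongside each generated dataset makes it easy to tell later which exact configuration produced it.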
Limitations & Future Directions
Current Limitations
| Limitation | Description | Workaround |
|---|---|---|
| Complex dependencies | Very intricate column relationships may not be fully captured | Use domain-specific post-processing |
| Rare events | Extremely rare patterns (< 0.1%) are hard to learn | Oversample rare cases in source data |
| Sequential consistency | Time series generation may have temporal artifacts | Use TimeGAN or domain validation |
| Image quality | High-resolution images require significant compute | Start with lower resolution, upscale |
| Multi-table | Cross-table relationships require manual configuration | Use relational synthesis features |
Planned Features
- Federated synthesis — Generate from distributed data without centralizing
- Active learning integration — Prioritize generation of samples that help models most
- Real-time generation — Stream synthetic data on-demand
- AutoML for generation — Automatically select best method and parameters
- Multi-modal synthesis — Generate coherent text + tabular + image combinations
Method Selection Guide
┌─────────────────────────────────────────────────────────────────────────────┐
│ Choosing a Generation Method │
└─────────────────────────────────────────────────────────────────────────────┘
                         What type of data?
                                  │
                ┌─────────────────┼─────────────────┐
                ▼                 ▼                 ▼
             Tabular            Text           Time Series
                │                 │                 │
          ┌─────┴─────┐           │          ┌──────┴──────┐
          ▼           ▼           ▼          ▼             ▼
       Simple?    Complex?     Short/      Long/       Regular?
          │           │        Labels?   Context?          │
          ▼           ▼           │          │             ▼
     Statistical   CTGAN/         ▼          ▼          TimeGAN
      (Copula)    CopulaGAN   Template    LLM/GPT
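One reading of this decision tree, treating both the Short/Labels and Long/Context questions as text branches, can be written as a small helper. The function and its parameters are illustrative. The identifiers ctgan, gaussian_copula, and llm appear in the configs earlier in this guide, while template and timegan are assumed spellings.

```python
def choose_method(data_type, complex_patterns=False, long_form=False):
    """Suggest a generation method following the selection guide."""
    if data_type == "tabular":
        # Simple tables: statistical copula. Complex patterns: a GAN.
        return "ctgan" if complex_patterns else "gaussian_copula"
    if data_type == "text":
        # Short, labeled snippets: templates. Long, contextual text: an LLM.
        return "llm" if long_form else "template"
    if data_type == "time_series":
        return "timegan"
    raise ValueError(f"unsupported data type: {data_type!r}")

print(choose_method("tabular"))                         # gaussian_copula
print(choose_method("tabular", complex_patterns=True))  # ctgan
print(choose_method("text", long_form=True))            # llm
```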
API Reference
Datasets API
Base URL: /api/v1/datasets
GET / List all datasets
POST / Create/import dataset
GET /{id} Get dataset details
PUT /{id} Update dataset
DELETE /{id} Delete dataset
GET /{id}/versions List versions
GET /{id}/profile Get profile results
POST /{id}/profile Trigger profiling
GET /{id}/sample Get data sample
POST /{id}/export Export dataset
Profiles API
Base URL: /api/v1/profiles
GET / List all profiles
POST / Create profile
GET /{id} Get profile details
PUT /{id} Update profile
DELETE /{id} Delete profile
POST /{id}/validate Validate data against profile
Recipes API
Base URL: /api/v1/recipes
GET / List all recipes
POST / Create recipe
GET /{id} Get recipe details
PUT /{id} Update recipe
DELETE /{id} Delete recipe
POST /{id}/apply Apply recipe to dataset
POST /{id}/preview Preview recipe results
Generators API
Base URL: /api/v1/generators
GET / List generator configs
POST / Create generator config
GET /{id} Get config details
PUT /{id} Update config
DELETE /{id} Delete config
POST /{id}/run Start generation job
Jobs API
Base URL: /api/v1/jobs
GET / List all jobs
GET /{id} Get job details
POST /{id}/cancel Cancel running job
GET /{id}/logs Get job logs
GET /{id}/results Get job results
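The endpoints above suggest a typical flow: start a run with POST /api/v1/generators/{id}/run, then poll GET /api/v1/jobs/{id} until the job finishes. The sketch below builds those paths and polls with a caller-supplied fetch function. The response shape is an assumption: this guide does not document the job payload, so the status field and its values are hypothetical and should be checked against your deployment.

```python
import time

API_ROOT = "/api/v1"
# Assumption: hypothetical status values; verify against the real job payload.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def generator_run_path(generator_id):
    """Path for starting a generation job."""
    return f"{API_ROOT}/generators/{generator_id}/run"

def job_path(job_id, suffix=""):
    """Path for a job, optionally with a suffix such as /logs or /results."""
    return f"{API_ROOT}/jobs/{job_id}{suffix}"

def wait_for_job(job_id, get_json, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll the job until it reaches a terminal status.

    `get_json` is supplied by the caller: it takes a URL path and returns
    the decoded JSON body, so any HTTP client can be plugged in.
    """
    for _ in range(max_polls):
        job = get_json(job_path(job_id))
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Demo with canned responses instead of a live server.
responses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
final = wait_for_job("job-123", lambda path: next(responses), sleep=lambda s: None)
print(final["status"])  # completed
```

Passing the fetch function in keeps the polling logic independent of authentication and HTTP client details, which vary between installations.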
Conclusion
Synthex manages the data side of your ML lifecycle. Its capabilities cover four common needs:
- Augmenting imbalanced datasets to improve model performance
- Generating privacy-safe data for development and testing
- Creating training data from scratch when real data doesn't exist
- Preparing data with reproducible recipes for consistent pipelines
For questions or feedback, consult your Inwire administrator or visit the User Guide Home.
See Also
- Backend Services Overview
- Using Inwire — Day-to-day workflows
- Setting up Inwire — Configuration guide
- User Guide Home — Main documentation index