Running the Pipeline

This pipeline converts dietary data (e.g., FFQs) and microbial gene information into a graph structure for analyzing metabolic interactions between food and gut microbes.

Overview

graph LR
  A[FFQ / Food List] --> B[Compound Source]
  B --> C1[FooDB Workflow]
  B --> C2[Whole Genome Workflow]
  C1 --> D[Graph Data]
  C2 --> D
  D --> E[Pattern Extraction]
  E --> F[Visualization]

Step	Description	Required
1	Launch Streamlit app to create machine-readable FFQ	Optional
2	Choose compound source (FooDB or KEGG)	✅
3A/3B	Generate nodes and edges	✅
4	Microbial compound report	Optional
5	Build graph and extract patterns	✅
6	Visualize results	Optional

Step 1 — Create a Machine-Readable FFQ

Food Frequency Questionnaires (FFQs) capture how often participants consume specific foods, but they aren't directly usable in computational workflows due to format heterogeneity, lack of molecular resolution, and inconsistent structure.

A Streamlit app is provided to generate standardized, machine-readable FFQ datasets that map food items to compounds via FooDB or KEGG.

streamlit run src/get_foods.py

In the app: 1. Search for and select foods 2. Assign a consumption frequency (1–100%) 3. Download the generated dataset 4. Shut down the application

No FFQ? No problem.

You can run the pipeline using all foods available in FooDB. Skip food metadata creation in Step 3A, or pass the foodb and all-foods flags in the workflow runner. Food compound reports are skipped due to dataset size.

Step 2 — Choose a Compound Source

Option A: FooDB (Experimental)

FooDB contains compounds identified in foods via LC-MS experiments, many of which link to KEGG. Core metabolic compounds (amino acids, sugars, fatty acids, nucleotides) are well-represented, but specialized plant compounds (flavonoids, alkaloids, terpenes) may be missing.

Limitations: Incomplete compound coverage · Limited food representation · U.S.-centric food data

Option B: KEGG Whole Genomes (Genome-Based Prediction)

Compounds are inferred from an organism's genome based on its metabolic capabilities using KEGG organism data.

Limitations: Requires decomposing complex foods into components · Doesn't account for ripeness or cooking · Predictions may not reflect actual composition

Graph Creation Methods

Note

AMON takes a list of KOs and finds producible compounds via KEGG reactions, assigning their origin (dietary vs. microbial).

Step 3A — FooDB Workflow

1. Generate Food–Compound Metadata

Skip this step if using all FooDB foods — Data/AllFood/food_meta.csv is pre-built.

Rscript src/Metabolome_proc/comp_FoodDB.R \
  --diet_file  "Data/test_sample/foodb_foods_dataframe.csv" \
  --content_file "Data/Content.csv" \
  --ExDes_file "Data/CompoundExternalDescriptor.csv" \
  --meta_o_file "food_meta.csv"

Output: Food items mapped to KEGG compound IDs with aggregated consumption frequencies.

2. Generate Food Compound Report (optional)

Skip if using all foods — the dataset is too large.

python src/Metabolome_proc/RenderCompoundAnalysis.py \
  --food_file "food_meta.csv" \
  --output "food_compound_report.html"

3. Run AMON

amon.py \
  -i "Data/test_sample/noquote_ko.txt" \
  -o "AMON_output/" \
  --save_entries

4. Create Graph Data

python src/Metabolome_proc/main_metab.py \
  --f "food_meta.csv" \
  --r "AMON_output/rn_dict.json" \
  --m_meta "Data/test_sample/ko_taxonomy_abundance.csv" \
  --e-weights \
  --n-weights \
  --org \
  --a "Abundance_RPKs" \
  --o "graph/"

Step 3B — Whole Genome Workflow

1. Map Organisms to KOs

python src/WholeGenome_proc/comp_KEGG.py \
  -i "Data/test_sample/kegg_organisms_dataframe.csv" \
  -k "org_KO/" \
  -o "food_item_kos.csv"

2. Run AMON

amon.py \
  -i "Data/test_sample/noquote_ag_sample.txt" \
  -o "AMON_output/" \
  --other_gene_set "org_KO/joined.txt" \
  --save_entries

3. Create Graph Data

python src/WholeGenome_proc/main_geno.py \
  --f_meta "food_item_kos.csv" \
  --m_meta "Data/test_sample/ko_taxonomy_abundance.csv" \
  --mapper "AMON_output/kegg_mapper.tsv" \
  --rn_json "AMON_output/rn_dict.json" \
  --e-weights \
  --n-weights \
  --org \
  --a "Abundance_RPKs" \
  --o "graph/"

4. Generate Food Compound Report

python src/WholeGenome_proc/RenderCompoundAnalysis.py \
  --node_file "graph/WG_nodes_df.csv" \
  --output "food_compound_report.html"

Step 4 — Microbial Compound Report (optional)

Requires microbial taxonomy and abundance data.

python src/RenderCompoundAnalysis_Microbe.py \
  --node_file "graph/nodes.csv" \
  --edge_file "graph/edges.csv" \
  --output "microbe_compound_report.html"

Step 5 — Build Graph and Extract Patterns

python src/run_graph.py \
  --n "graph/nodes.csv" \
  --e "graph/edges.csv" \
  --o "graph_results.csv"

Patterns identified:

Pattern	Description
Food → Microbe	Compound produced by diet, consumed by microbe
Food → Both	Compound shared between diet and microbial production
Both → Both	Compound produced and consumed across both sources

Step 6 — Visualize Graph Results

python src/RenderGraphResults_Report.py \
  --patterns "graph_results.csv" \
  --rxn_json "AMON_output/rn_dict.json" \
  --output "graph_results_report.html"

Note

Steps 2–4 correspond to the general workflow described on the home page.

Tip

Use run_workflow.py to run Steps 3–4 automatically. See the Quick Start guide for a complete example.