# Quickstart

## Prerequisites

- python>=3.10
- r-base>=4.4.2
- conda

This program has been tested on Mac M1 and Ubuntu/Linux.
## Installation

```bash
git clone https://github.com/acolorado1/DietMicrobeNet.git # clone repo
cd DietMicrobeNet # move into this project directory
conda env create -f DMnet_env.yaml # create environment
conda activate DietMicrobeNet # activate environment
pip install -e . # set up directory structure
```

Then download the required data files into `Data/`:

```bash
wget 'https://olucdenver-my.sharepoint.com/:x:/g/personal/angelasofia_burkhartcolorado_cuanschutz_edu/ESXx7vpypQFOt4iVv6x-ErkBykpAVS1fppQjYZkrxkDnAA?download=1' -O Data/CompoundExternalDescriptor.csv
wget 'https://olucdenver-my.sharepoint.com/:x:/g/personal/angelasofia_burkhartcolorado_cuanschutz_edu/EYJUYQWmY9VDlYZIAXpzpvEBzhrnViFZQjrikXIla_aPPg?download=1' -O Data/Content.csv
wget 'https://olucdenver-my.sharepoint.com/:x:/g/personal/angelasofia_burkhartcolorado_cuanschutz_edu/EXyRAlYs1htNlcwz5T67BxQBGO7HfOjmfIBlkOydM0BIAw?download=1' -O Data/Food.csv
wget 'https://olucdenver-my.sharepoint.com/:x:/g/personal/angelasofia_burkhartcolorado_cuanschutz_edu/EbY2fD3JTcNLomKFqQhY5jABAXN-60A80PmkngRynazocg?download=1' -O Data/hmdb.csv
wget 'https://olucdenver-my.sharepoint.com/:x:/g/personal/angelasofia_burkhartcolorado_cuanschutz_edu/EZ1pyHd616RFkR9zG6kenuoBhZDroHYTbaGmEfwpxFOHLg?download=1' -O Data/AllFood/food_meta.csv
```
## Inputs

Your sample directory must contain the following files:

| File | Description |
|---|---|
| foodb_foods_dataframe.csv | Diet data for FooDB-based analysis |
| kegg_organisms_dataframe.csv | Diet data for genome-based analysis |
| ko_taxonomy_abundance.csv | Microbiome KO abundances |
| noquote_ko.txt | KO list without quotes |
> **Note**
> foodb_foods_dataframe.csv and kegg_organisms_dataframe.csv are only required
> for their respective analysis modes. If running with --all-food, the diet input
> file is not required, as all foods from FooDB will be used automatically.
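Before launching a run, it can save time to confirm that a sample directory is complete. Below is a minimal sketch of such a check; the helper name `missing_inputs` is our own (not part of DMnet), and the assumption that `--all-food` simply waives the mode's diet file follows our reading of the note above.

```python
from pathlib import Path

# Inputs required in every sample directory, plus the mode-specific diet file.
COMMON = ["ko_taxonomy_abundance.csv", "noquote_ko.txt"]
MODE_FILES = {
    "foodb": "foodb_foods_dataframe.csv",
    "genome": "kegg_organisms_dataframe.csv",
}

def missing_inputs(sample_dir, mode, all_food=False):
    """Return the required input files absent from sample_dir.

    With all_food=True the diet file is skipped, mirroring --all-food.
    """
    required = list(COMMON)
    if not all_food:
        required.append(MODE_FILES[mode])
    return [f for f in required if not (Path(sample_dir) / f).is_file()]
```

Running `missing_inputs("/abs/path/to/sample", "foodb")` returns an empty list when the directory is ready for a FooDB-based run.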
## Example

The easiest way to run DMnet is through the included run_workflow.py wrapper script.
### Arguments

| Argument | Required | Description |
|---|---|---|
| --directories | ✅ | One or more absolute paths to sample directories, space-separated and quoted |
| --foodb | ❌ | Enable FooDB-based analysis |
| --genome | ❌ | Enable genome-based analysis |
| --e-weights | ❌ | Weight edges by read abundance |
| --n-weights | ❌ | Weight nodes by food frequency |
| --include-orgs | ❌ | Include organism-level information |
| --abundance-col | ❌ | Column name for abundance values (default: Abundance_RPKs) |
| --all-food | ❌ | Use all foods from FooDB instead of a sample-specific diet file |
| --cores | ❌ | Number of cores (default: 1) |
| --profile | ❌ | Snakemake profile to use |
| --dry-run / -n | ❌ | Preview jobs without executing |
> **Tip**
> --cores, --profile, and --dry-run are Snakemake-specific arguments.
> See Snakemake's documentation for details.
### Run with test data

Test data is included in Data/test_sample/. To run the full pipeline on it:

```bash
python run_workflow.py \
  --directories "/absolute/path/to/Data/test_sample" \
  --foodb \
  --genome \
  --e-weights \
  --n-weights \
  --include-orgs \
  --abundance-col "Abundance_RPKs"
```
> **Note**
> Always use absolute paths for --directories. We recommend a dry run
> first with --dry-run to verify the pipeline is configured correctly before executing.
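When processing many samples, the same invocation can be assembled in a short script. The sketch below builds the command list for several sample directories (the paths are hypothetical placeholders) and previews it with --dry-run; all flags come from the arguments table above.

```python
import shlex
import subprocess

# Hypothetical sample directories -- replace with your own absolute paths.
samples = ["/absolute/path/to/sampleA", "/absolute/path/to/sampleB"]

# Assemble the run_workflow.py call; --dry-run previews the Snakemake
# jobs without executing them.
cmd = ["python", "run_workflow.py", "--directories", *samples,
       "--foodb", "--genome", "--e-weights", "--n-weights", "--dry-run"]

print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually launch
```

Dropping --dry-run from `cmd` (and uncommenting the `subprocess.run` line) executes the pipeline for real.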
## Outputs

All outputs are written into your sample directory under output_fdb/ and/or output_gen/,
depending on which analysis modes were enabled.
### Directory Structure

```text
my_directory/
├── ko_taxonomy_abundance.csv
├── noquote_ko.txt
├── foodb_foods_dataframe.csv
├── kegg_organisms_dataframe.csv
├── run_info.txt                       # pipeline metadata for this run
├── output_fdb/                        # FooDB-based analysis outputs
│   ├── food_meta.csv
│   ├── food_compound_report.html
│   ├── microbe_compound_report.html   # only if --include-orgs and --n-weights
│   ├── AMON_output/
│   │   ├── AMON_log.txt
│   │   ├── gene_set_1_enrichment.tsv
│   │   ├── kegg_mapper.tsv
│   │   ├── origin_table.tsv
│   │   ├── enrichment_heatmap.png
│   │   ├── co_dict.json
│   │   ├── ko_dict.json
│   │   └── rn_dict.json
│   └── graph/
│       ├── M_nodes_df.csv
│       ├── M_edges_df.csv
│       ├── M_AbundanceDistribution.png
│       ├── M_FoodFrequencyDistribution.png
│       ├── network_summary.txt
│       ├── graph_results.csv
│       └── graph_results_report.html
└── output_gen/                        # Genome-based analysis outputs
    ├── food_item_kos.csv
    ├── food_compound_report.html
    ├── microbe_compound_report.html   # only if --include-orgs and --n-weights
    ├── org_KO/
    │   ├── <one .txt file per food item>
    │   └── joined.txt
    ├── AMON_output/
    │   ├── AMON_log.txt
    │   ├── gene_set_1_enrichment.tsv
    │   ├── gene_set_2_enrichment.tsv
    │   ├── kegg_mapper.tsv
    │   ├── origin_table.tsv
    │   ├── enrichment_heatmap.png
    │   ├── venn.png
    │   ├── co_dict.json
    │   ├── ko_dict.json
    │   └── rn_dict.json
    └── graph/
        ├── WG_nodes_df.csv
        ├── WG_edges_df.csv
        ├── WG_AbundanceDistribution.png
        ├── WG_FoodFrequencyDistribution.png
        ├── network_summary.txt
        ├── graph_results.csv
        └── graph_results_report.html
```
### Key Output Files

| File | Description |
|---|---|
| run_info.txt | Pipeline version, run date, and full config used for this run |
| food_compound_report.html | Compounds identified in each food item |
| microbe_compound_report.html | Compounds predicted to be produced by microbes |
| graph_results_report.html | Network pattern analysis report with Neo4j query results |
| network_summary.txt | Summary statistics of the constructed network |
| graph_results.csv | Raw graph analysis results |
| *_nodes_df.csv | Node dataframe with optional frequency weights |
| *_edges_df.csv | Edge dataframe with optional abundance weights |
| *_AbundanceDistribution.png | Histogram of edge weights (requires --e-weights) |
| *_FoodFrequencyDistribution.png | Histogram of node weights (requires --n-weights) |
> **Note**
> microbe_compound_report.html is only generated when both --include-orgs and
> --n-weights are specified.
## Next Steps

Once networks and patterns have been generated for each sample, you can continue to: