Usage
Usage as standalone application
The script main.py can be used directly in the command line after
entering the virtual environment with pipenv shell.
The config.yaml file contains defaults for all variables that need
to be set by the user.
Warning
Using lab_strain=True has the following two requirements:
- The model already contains GeneProduct identifiers containing valid NCBI Protein/RefSeq identifiers.
If there is no available data for the modeled organism in any database these identifiers can be added with the pipeline described in Pipeline: From genome sequence to draft model before draft model creation.
- Input of a FASTA file containing header lines similar to:
>lcl|CP035291.1_prot_QCY37216.1_1 [gene=dnaA] [locus_tag=EQ029_00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=QCY37216.1] [location=1..1356] [gbkey=CDS]
Of the description part in the header line, only locus_tag, protein and protein_id are important for polish.
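To make the header requirement concrete, the following minimal sketch extracts the three relevant fields from a header line of the form shown above. It is not the refineGEMs implementation (see refinegems.io.parse_fasta_headers for that); the function name parse_header_fields and the regular expression are assumptions made for illustration only.

```python
import re

# Matches "[key=value]" pairs in the description part of a FASTA header.
FIELD_PATTERN = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_header_fields(header: str) -> dict[str, str]:
    """Return locus_tag, protein and protein_id from one FASTA header line.

    Hypothetical helper for illustration; missing fields map to ''.
    """
    fields = dict(FIELD_PATTERN.findall(header))
    wanted = ("locus_tag", "protein", "protein_id")
    return {key: fields.get(key, "") for key in wanted}


header = (
    ">lcl|CP035291.1_prot_QCY37216.1_1 [gene=dnaA] [locus_tag=EQ029_00005] "
    "[protein=chromosomal replication initiator protein DnaA] "
    "[protein_id=QCY37216.1] [location=1..1356] [gbkey=CDS]"
)
print(parse_header_fields(header))
```

A header in which one of the three fields comes back empty is a hint that the FASTA file may not be suitable for lab_strain=True.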
Description: >
  This file can be adapted to choose what refineGEMs should do.
  Note: On Windows, use \ instead of / for the paths
General Setting: >
  Path to GEM to be investigated
model: 'data/e_coli_core.xml'
# Set the out path for all analysis files
out_path: ''
Settings for scripts that investigate the model: >
  These are only necessary if none of the scripts to manipulate the model are used.
# Set to TRUE if you want pngs that aid in model investigation, will be saved to a folder called 'visualization'
visualize: TRUE
# Set the basis medium to simulate growth from
growth_basis: 'minimal_uptake' # 'default_uptake' or 'minimal_uptake'
# Set to TRUE if you want to simulate anaerobic growth
anaerobic_growth: FALSE
# Settings if you want to compare multiple models
multiple: FALSE
multiple_paths: # enter as many paths as you need below
  - 'data/e_coli_core.xml'
  - ''
  - ''
single: TRUE # set to False if you only want to work with the multiple models
# media to simulate growth from, just comment out the media you do not want with a #
media:
  - 'LB'
  - 'RPMI'
  - 'M9'
  - 'SNM3'
  - 'CGXII'
  - 'CasA'
  - 'Blood'
  - 'dGMM'
  - 'MP-AU'
# Determine whether the biomass function should be checked & normalised
biomass: TRUE
# determine whether the memote score should be calculated, default: FALSE
memote: FALSE
# compare metabolites to the ModelSEED database
modelseed: FALSE # set to False if not needed
Settings for scripts that manipulate the model: >
  They are all split into the ON / OFF switch (TRUE / FALSE) and additional settings like a path to where the new model should be saved.
model_out: '' # path and filename to where to save the modified model
entrez_email: '' # necessary to access NCBI API
### Addition of KEGG Pathways as Groups ###
keggpathways: FALSE
### SBO-Term Annotation ###
sboterms: FALSE
### Model polishing ###
# The database of the model identifiers needs to be specified with 'id_db'
polish: FALSE
id_db: 'BIGG' # Required!
# Possible identifiers, currently: BiGG & VMH
# For other IDs the `polish` function in `polish.py` might need adjustment
lab_strain: FALSE # Needs to be set to ensure that protein IDs get the 'bqbiol:isHomologTo' qualifier
# & to set the locus_tag to the ones obtained by the annotation
protein_fasta: '' # Path to used CarveMe input file, if exists; Needs to be set for lab_strain: True
### Charge correction ###
charge_corr: FALSE
### Manual Curation ###
man_cur: FALSE
man_cur_type: 'gapfill' # either 'gapfill' or 'metabs'
man_cur_table: 'data/manual_curation.xlsx'
### Automatic gap filling ###
# All parameters are required for all db_to_compare choices except:
# - organismid which is only required for db_to_compare: 'KEGG'/'KEGG+BioCyc'
# - and biocyc_files which is not required for 'KEGG'
gap_analysis: FALSE
gap_analysis_params:
  db_to_compare: 'KEGG' # One of the choices KEGG|BioCyc|KEGG+BioCyc
  organismid: 'T05059' # Needs to be specified for KEGG
  gff_file: 'data/cstr.gff' # Path to RefSeq GFF file
  biocyc_files:
    - 'Path0' # Path to TXT file containing a SmartTable from BioCyc with the columns 'Accession-2' 'Reaction of gene' (-)
    - 'Path1' # Path to TXT file containing a SmartTable with all reaction relevant information (*)
    - 'Path2' # Path to TXT file containing a SmartTable with all metabolite relevant information (+)
    - 'Path3' # Path to protein FASTA file used as input for CarveMe (Needed to get the protein IDs from the locus tags)
  # (-) If the organism is not in BioCyc retrieve a table mapping all reactions in BioCyc to the corresponding sequence
  # (*) 'Reaction' 'Reactants of reaction' 'Products of reaction' 'EC-Number' 'Reaction-Direction' 'Spontaneous?'
  # (+) 'Compound' 'Object ID' 'Chemical Formula' 'InChI-Key' 'ChEBI'
gapfill_model: FALSE
gap_analysis_file: 'Path to Excel file with which gaps in model should be filled'
# Either obtained by running gapfill_analysis/Created by hand with the same structure as the result file from gapfill_analysis
# Example Excel file to fill in by hand: data/modelName_gapfill_analysis_date_example.xlsx
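Several settings in the configuration depend on one another (for example, lab_strain needs protein_fasta, and organismid is only needed for the KEGG choices). The sketch below shows one way such dependencies could be checked on an already loaded configuration dictionary. It is an illustration under stated assumptions, not part of refineGEMs: the function name check_config and the wording of the messages are hypothetical, and the rules are taken only from the comments in the configuration above.

```python
def check_config(config: dict) -> list[str]:
    """Collect human-readable problems found in a loaded config dictionary.

    Hypothetical helper; the rules mirror the comments in config.yaml.
    """
    problems = []

    # polish: id_db is marked as required; lab_strain needs a FASTA file.
    if config.get("polish"):
        if not config.get("id_db"):
            problems.append("polish requires id_db")
        if config.get("lab_strain") and not config.get("protein_fasta"):
            problems.append("lab_strain requires protein_fasta")

    # gap analysis: organismid and biocyc_files depend on db_to_compare.
    if config.get("gap_analysis"):
        params = config.get("gap_analysis_params") or {}
        db = params.get("db_to_compare")
        if db not in ("KEGG", "BioCyc", "KEGG+BioCyc"):
            problems.append("db_to_compare must be KEGG, BioCyc or KEGG+BioCyc")
        else:
            if not params.get("gff_file"):
                problems.append("gff_file is required")
            if "KEGG" in db and not params.get("organismid"):
                problems.append("organismid is required for " + db)
            if "BioCyc" in db and not params.get("biocyc_files"):
                problems.append("biocyc_files is required for " + db)
    return problems


example = {
    "polish": True,
    "id_db": "BIGG",
    "lab_strain": True,
    "protein_fasta": "",
    "gap_analysis": True,
    "gap_analysis_params": {
        "db_to_compare": "KEGG",
        "organismid": "T05059",
        "gff_file": "data/cstr.gff",
        "biocyc_files": [],
    },
}
print(check_config(example))
```

Returning a list of messages rather than raising on the first problem lets a user fix all inconsistent settings in one pass.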
The repository structure has the following intention:
- refineGEMs/ contains all the functions needed in main.py.
- data/ contains all tables that are used by different parts of the script as well as a toy model e_coli_core.xml. Instead of using the files given in data/, you can use your own files and just change the paths in config.yaml. Please be aware that some functions rely on input in a certain format, so make sure to check the files given in the data/ folder and use the same formatting.
- databases/ contains the sql file as well as the db file necessary for the SBOAnn script by Elisabeth Fritze as well as the modules gapfill, growth and modelseed.
- The setup.py and pyproject.toml enable creating a PyPI package called refineGEMs.
Usage as python module
See examples to learn how to use refineGEMs. Note that at this time most of the modules only make sense when you use the respective main functions:
Warning
Using lab_strain=True has the following two requirements:
- The model already contains GeneProduct identifiers containing valid NCBI Protein/RefSeq identifiers.
If there is no available data for the modeled organism in any database these identifiers can be added with the pipeline described in Pipeline: From genome sequence to draft model before draft model creation.
- Input of a FASTA file containing header lines similar to:
>lcl|CP035291.1_prot_QCY37216.1_1 [gene=dnaA] [locus_tag=EQ029_00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=QCY37216.1] [location=1..1356] [gbkey=CDS]
Of the description part in the header line, only locus_tag, protein and protein_id are important for polish.
- refinegems.polish.polish(model: libsbml.Model, email: str, id_db: str, protein_fasta: str, lab_strain: bool, path: str) libsbml.Model
Completes all steps to polish a model (tested for models having either BiGG or VMH identifiers)
- Args:
model (libModel): model loaded with libSBML
email (str): E-mail for Entrez
id_db (str): Main database where identifiers in model come from
protein_fasta (str): File used as input for CarveMe
lab_strain (bool): True if the strain was sequenced in a local lab
path (str): Output path for incorrect annotations file(s)
- Returns:
libModel: Polished libSBML model
- refinegems.modelseed.compare_to_modelseed(model: cobra.Model) tuple[pandas.DataFrame, pandas.DataFrame]
Executes all steps to compare model metabolites to ModelSEED metabolites
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
- tuple: Tables with charge (1) & formula (2) mismatches
pd.DataFrame: Table with charge mismatches
pd.DataFrame: Table with formula mismatches
- refinegems.pathways.kegg_pathways(modelpath: str) tuple[libsbml.Model, list[str]]
Executes all steps to add KEGG pathways as groups
- Args:
modelpath (str): Path to GEM
- Returns:
- tuple: libSBML model (1) & List of reactions without KEGG Id (2)
libModel: Modified model with Pathways as groups
list: Ids of reactions without KEGG annotation
- refinegems.sboann.sbo_annotation(model_libsbml: libsbml.Model) libsbml.Model
Executes all steps to annotate SBO terms to a given model (former main function of original script by Elisabeth Fritze)
- Args:
model_libsbml (libModel): Model loaded with libsbml
- Returns:
libModel: Modified model with SBO terms
- refinegems.biomass.check_normalise_biomass(model: cobra.Model) cobra.Model | None
Checks if at least one biomass reaction is present
For each found biomass reaction checks if it sums up to 1g[CDW]
Normalises the coefficients of each biomass reaction where the sum is not 1g[CDW] until the sum is 1g[CDW]
Returns model with adjusted biomass function(s)
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
cobraModel: COBRApy model with adjusted biomass functions
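The idea behind the normalisation can be sketched in a few lines. This is not the refineGEMs implementation: it assumes coefficients given in mmol per gram cell dry weight, molar masses in g/mol, and a single rescaling step, and the metabolite ids and numbers are made up for illustration.

```python
def biomass_mass(coefficients: dict[str, float], molar_masses: dict[str, float]) -> float:
    """Net mass in g consumed by a biomass reaction.

    Consumed metabolites carry negative coefficients, so the sign is flipped.
    Coefficients are assumed to be in mmol, molar masses in g/mol.
    """
    total_mg = sum(-coeff * molar_masses[met] for met, coeff in coefficients.items())
    return total_mg / 1000.0


def normalise(coefficients: dict[str, float], molar_masses: dict[str, float]) -> dict[str, float]:
    """Scale all coefficients by one factor so that the net mass becomes 1 g."""
    total = biomass_mass(coefficients, molar_masses)
    if total <= 0:
        raise ValueError("biomass reaction does not consume any net mass")
    return {met: coeff / total for met, coeff in coefficients.items()}


# Illustrative values only (hypothetical ids, rounded molar masses).
MOLAR_MASSES = {"ala": 89.09, "glc": 180.16, "h2o": 18.02}
biomass = {"ala": -2.0, "glc": -3.0, "h2o": 1.0}

normalised = normalise(biomass, MOLAR_MASSES)
print(biomass_mass(biomass, MOLAR_MASSES), biomass_mass(normalised, MOLAR_MASSES))
```

Because every coefficient is divided by the same factor, the relative composition of the biomass is preserved while its total mass changes.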
The modules io, cvterms and investigate provide functions that can be used by themselves.
io
Provides a couple of helper functions to load models, databases and parse GFF files
Provides functions to load and write models, media definitions and the manual annotation table
Depending on the application the model needs to be loaded with cobra (memote) or with libSBML (activation of groups). The media definitions are denoted in a csv within the data folder of this repository, thus the functions will only work if the user clones the repository. The manual_annotations table has to follow the specific layout given in the data folder in order to work with this module.
- refinegems.io.load_a_table_from_database(table_name_or_query: str) pandas.DataFrame
Loads the table for which the name is provided, or a table containing all rows for which the query evaluates to true, from the refineGEMs database (‘data/database/data.db’)
- Args:
table_name_or_query (str): Name of a table contained in the database ‘data.db’/ a SQL query
- Returns:
pd.DataFrame: Containing the table for which the name was provided from the database ‘data.db’
- refinegems.io.load_all_media_from_db(mediumpath: str) pandas.DataFrame
Helper function to extract media definitions from media_db.csv
- Args:
mediumpath (str): Path to csv file with medium database
- Returns:
pd.DataFrame: Table from csv with metabs added as BiGG_EX exchange reactions
- refinegems.io.load_document_libsbml(modelpath: str) libsbml.SBMLDocument
Loads model document using libSBML
- Args:
modelpath (str): Path to GEM
- Returns:
SBMLDocument: Loaded document by libSBML
- refinegems.io.load_manual_annotations(tablepath: str = 'data/manual_curation.xlsx', sheet_name: str = 'metab') pandas.DataFrame
Loads metabolite sheet from manual curation table
- Args:
tablepath (str): Path to manual curation table. Defaults to ‘data/manual_curation.xlsx’.
sheet_name (str): Sheet name for metabolite annotations. Defaults to ‘metab’.
- Returns:
pd.DataFrame: Table containing specified sheet from Excel file
- refinegems.io.load_manual_gapfill(tablepath: str = 'data/manual_curation.xlsx', sheet_name: str = 'gapfill') pandas.DataFrame
Loads gapfill sheet from manual curation table
- Args:
tablepath (str): Path to manual curation table. Defaults to ‘data/manual_curation.xlsx’.
sheet_name (str): Sheet name for reaction gapfilling. Defaults to ‘gapfill’.
- Returns:
pd.DataFrame: Table containing sheet with name ‘gapfill’|specified sheet_name from Excel file
- refinegems.io.load_medium_custom(mediumpath: str) pandas.DataFrame
Helper function to read medium csv
- Args:
mediumpath (str): path to csv file with medium
- Returns:
pd.DataFrame: Table of csv
- refinegems.io.load_medium_from_db(mediumname: str) pandas.DataFrame
Wrapper function to extract subtable for the requested medium from the database ‘data.db’
- Args:
mediumname (str): Name of medium to test growth on
- Returns:
pd.DataFrame: Table containing composition for one medium with metabs added as BiGG_EX exchange reactions
- refinegems.io.load_model_cobra(modelpath: str) cobra.Model
Loads model using COBRApy
- Args:
modelpath (str): Path to GEM
- Returns:
cobraModel: Loaded model by COBRApy
- refinegems.io.load_model_libsbml(modelpath: str) libsbml.Model
Loads model using libSBML
- Args:
modelpath (str): Path to GEM
- Returns:
libModel: loaded model by libSBML
- refinegems.io.load_multiple_models(models: list[str], package: str) list
Loads multiple models into a list
- Args:
models (list): List of paths to models
package (str): COBRApy|libSBML
- Returns:
list: List of model objects loaded with COBRApy|libSBML
- refinegems.io.parse_dict_to_dataframe(str2list: dict) pandas.DataFrame
Parses a dictionary of the form {str: list} and transforms it into a table with a column containing the strings and a column containing the lists
- Args:
str2list (dict): Dictionary mapping strings to lists
- Returns:
pd.DataFrame: Table with column containing the strings and column containing the lists
- refinegems.io.parse_fasta_headers(filepath: str, id_for_model: bool = False) pandas.DataFrame
Parses FASTA file headers to obtain:
the protein_id
and the model_id (like it is obtained from CarveMe)
corresponding to the locus_tag
- Args:
filepath (str): Path to FASTA file
id_for_model (bool): True if model_id similar to autogenerated GeneProduct ID should be contained in resulting table
- Returns:
pd.DataFrame: Table containing the columns locus_tag, Protein_id & Model_id
- refinegems.io.parse_gff_for_gp_info(gff_file: str) pandas.DataFrame
Parses gff file of organism to find gene protein reactions based on locus tags
- Args:
gff_file (str): Path to gff file of organism of interest
- Returns:
pd.DataFrame: Table containing mapping from locus tag to GPR
- refinegems.io.save_user_input(configpath: str) dict[str, str]
Collects user input from the command line to create a config file. If no config was given, the user input is also saved to a new config file.
- Args:
configpath (str): Path to config file if present
- Returns:
dict: Either loaded config file or created from user input
- refinegems.io.search_ncbi_for_gpr(locus: str) str
Fetches protein name from NCBI
- Args:
locus (str): NCBI compatible locus_tag
- Returns:
str: Protein name|description
- refinegems.io.search_sbo_label(sbo_number: str) str
Looks up the SBO label corresponding to a given SBO Term number
- Args:
sbo_number (str): Last three digits of SBO-Term as str
- Returns:
str: Denoted label for given SBO Term
- refinegems.io.validate_libsbml_model(model: libsbml.Model) int
Debug method: Validates a libSBML model with the libSBML validator
- Args:
model (libModel): A libSBML model
- Returns:
int: Integer specifying whether the validation was successful or not
- refinegems.io.write_report(dataframe: pandas.DataFrame, filepath: str)
Writes reports stored in dataframes to xlsx file
- Args:
dataframe (pd.DataFrame): Table containing output
filepath (str): Path to file with filename
- refinegems.io.write_to_file(model: libsbml.Model, new_filename: str)
Writes modified model to new file
- Args:
model (libModel): Model loaded with libSBML
new_filename (str): Filename|Path for modified model
investigate
Provides a couple of functions to get parameters of your model
Provides functions to investigate the model and test with MEMOTE
These functions enable simple testing of any model using MEMOTE and access to its number of reactions, metabolites and genes.
- refinegems.investigate.get_egc(model: cobra.Model) pandas.DataFrame
Energy-generating cycles represent thermodynamically infeasible states. Charging of energy metabolites without any energy source causes such cycles. Detection method is based on (Fritzemeier et al., 2017)
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
pd.DataFrame: Table with possible EGCs
- refinegems.investigate.get_mass_charge_unbalanced(model: cobra.Model) tuple[list[str], list[str]]
Creates lists of mass and charge unbalanced reactions, without exchange reactions since they are unbalanced by definition
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
- tuple: Lists of reactions that might cause errors (1) & (2)
list: List of mass unbalanced reactions
list: List of charge unbalanced reactions
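What "mass unbalanced" and "charge unbalanced" mean can be illustrated without any modelling library. The sketch below is not how refineGEMs or COBRApy implement the check. It assumes simple formulas without brackets, stoichiometries given as plain dictionaries, and that exchange reactions can be recognised by having a single metabolite. All function names and the example data are hypothetical.

```python
import re

# One element symbol followed by an optional count, e.g. "C10", "H", "O4".
ELEMENT_PATTERN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> dict[str, int]:
    """Count the atoms per element in a bracket-free chemical formula."""
    counts: dict[str, int] = {}
    for element, number in ELEMENT_PATTERN.findall(formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def imbalance(stoichiometry, formulas, charges) -> dict[str, float]:
    """Return the non-zero element and charge sums of one reaction."""
    totals: dict[str, float] = {}
    for met, coeff in stoichiometry.items():
        for element, count in parse_formula(formulas[met]).items():
            totals[element] = totals.get(element, 0.0) + coeff * count
        totals["charge"] = totals.get("charge", 0.0) + coeff * charges[met]
    return {key: value for key, value in totals.items() if abs(value) > 1e-9}


def find_unbalanced(reactions, formulas, charges) -> tuple[list[str], list[str]]:
    """Return ids of mass unbalanced and of charge unbalanced reactions."""
    mass, charge = [], []
    for rid, stoichiometry in reactions.items():
        if len(stoichiometry) == 1:  # assumed to be an exchange reaction
            continue
        result = imbalance(stoichiometry, formulas, charges)
        if any(key != "charge" for key in result):
            mass.append(rid)
        if "charge" in result:
            charge.append(rid)
    return mass, charge


# Illustrative data for ATP hydrolysis (hypothetical ids).
FORMULAS = {"atp": "C10H12N5O13P3", "h2o": "H2O", "adp": "C10H12N5O10P2", "pi": "HO4P", "h": "H"}
CHARGES = {"atp": -4, "h2o": 0, "adp": -3, "pi": -2, "h": 1}
REACTIONS = {
    "ATPM": {"atp": -1, "h2o": -1, "adp": 1, "pi": 1, "h": 1},
    "ATPM_no_proton": {"atp": -1, "h2o": -1, "adp": 1, "pi": 1},
    "EX_h2o": {"h2o": -1},
}
print(find_unbalanced(REACTIONS, FORMULAS, CHARGES))
```

The second reaction lacks the released proton and is therefore flagged for both mass and charge, while the exchange reaction is skipped.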
- refinegems.investigate.get_memote_score(memote_report: dict) float
Extracts MEMOTE score from report
- Args:
memote_report (dict): Output from run_memote.
- Returns:
float: MEMOTE score
- refinegems.investigate.get_metabs_with_one_cvterm(model: libsbml.Model) list[str]
Reports metabolites which have only one annotation, can be used as basis for further annotation research
- Args:
model (libModel): Model loaded with libSBML
- Returns:
list: Metabolite Ids with only one annotation
- refinegems.investigate.get_model_info(modelpath: str) pandas.DataFrame
Reports core information of given model
- Args:
modelpath (str): Path to model file
- Returns:
pd.DataFrame: Overview on model parameters
- refinegems.investigate.get_orphans_deadends_disconnected(model: cobra.Model) tuple[list[str], list[str], list[str]]
Uses MEMOTE functions to extract orphans, deadends and disconnected metabolites
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
- tuple: Lists of metabolites that might cause errors (1) - (3)
list: List of orphans
list: List of deadends
list: List of disconnected metabolites
- refinegems.investigate.get_reactions_per_sbo(model: libsbml.Model) dict
Counts number of reactions of all SBO Terms present
- Args:
model (libModel): Model loaded with libSBML
- Returns:
dict: SBO Term as keys and number of reactions as values
- refinegems.investigate.initial_analysis(model: libsbml.Model) tuple[str, int, int, int]
Extracts most important numbers of GEM
- Args:
model (libModel): Model loaded with libSBML
- Returns:
- tuple: Model name (1) & corresponding amounts of entities (2) - (4)
str: Name of model
int: Number of reactions
int: Number of metabolites
int: Number of genes
- refinegems.investigate.parse_reaction(eq: str, model: cobra.Model) dict
Parses a reaction equation string to dictionary
- Args:
eq (str): Equation of a reaction
model (cobraModel): Model loaded with COBRApy
- Returns:
dict: Metabolite Ids as keys and their coefficients as values (negative = educts, positive = products)
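The sign convention of the returned dictionary (negative for educts, positive for products) can be shown with a small stand-alone parser. This is a sketch only: the equation syntax accepted by refineGEMs is not specified here, so the format "2 a + b --> c" with the arrows --> and <=> is an assumption, and the function name parse_equation is hypothetical. Unlike the documented function, it does not look up metabolites in a model.

```python
import re

# Assumed arrow notations separating educts from products.
ARROW_PATTERN = re.compile(r"\s*(?:<=>|-->)\s*")


def _add_side(side: str, sign: float, result: dict[str, float]) -> None:
    """Add all terms of one equation side to result with the given sign."""
    for term in side.split(" + "):
        parts = term.split()
        if not parts:
            continue  # empty side, e.g. in "--> a"
        if len(parts) == 1:
            coeff, met = 1.0, parts[0]
        elif len(parts) == 2:
            coeff, met = float(parts[0]), parts[1]
        else:
            raise ValueError(f"cannot parse term: {term!r}")
        result[met] = result.get(met, 0.0) + sign * coeff


def parse_equation(eq: str) -> dict[str, float]:
    """Map metabolite ids to coefficients (educts negative, products positive)."""
    sides = ARROW_PATTERN.split(eq.strip(), maxsplit=1)
    if len(sides) != 2:
        raise ValueError(f"no reaction arrow found in: {eq!r}")
    result: dict[str, float] = {}
    _add_side(sides[0], -1.0, result)
    _add_side(sides[1], 1.0, result)
    return result


print(parse_equation("atp_c + h2o_c --> adp_c + pi_c + h_c"))
print(parse_equation("2 h2_c + o2_c <=> 2 h2o_c"))
```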
- refinegems.investigate.plot_rea_sbo_single(model: libsbml.Model)
Plots reactions per SBO Term in horizontal bar chart
- Args:
model (libModel): Model loaded with libSBML
- Returns:
plot: Pandas Barchart
- refinegems.investigate.run_memote(model: cobra.Model) dict
Runs MEMOTE to obtain report as dict
- Args:
model (cobraModel): Model loaded with COBRApy
- Returns:
dict: MEMOTE report as json in dict format
- refinegems.investigate.run_memote_sys(model: cobra.Model)
Runs MEMOTE on the local Linux machine
- Args:
model (cobraModel): Model loaded with COBRApy
cvterms
Provides functions to work with cvterms
Helper module to work with annotations (CVTerms)
Stores dictionaries which hold information on the identifiers.org syntax, and provides functions to add CVTerms to different entities and to parse CVTerms.
- refinegems.cvterms.add_cv_term_genes(entry: str, db_id: str, gene: libsbml.GeneProduct, lab_strain: bool = False)
Adds CVTerm to a gene
- Args:
entry (str): Id to add as annotation
db_id (str): Database to which entry belongs. Must be in gene_db_dict.keys().
gene (GeneProduct): Gene to add CVTerm to
lab_strain (bool, optional): For locally sequenced strains the qualifiers are always HOMOLOG_TO. Defaults to False.
- refinegems.cvterms.add_cv_term_metabolites(entry: str, db_id: str, metab: libsbml.Species)
Adds CVTerm to a metabolite
- Args:
entry (str): Id to add as annotation
db_id (str): Database to which entry belongs. Must be in metabol_db_dict.keys().
metab (Species): Metabolite to add CVTerm to
- refinegems.cvterms.add_cv_term_pathways(entry: str, db_id: str, path: libsbml.Group)
Add CVTerm to a groups pathway
- Args:
entry (str): Id to add as annotation
db_id (str): Database to which entry belongs. Must be in pathway_db_dict.keys().
path (Group): Pathway to add CVTerm to
- refinegems.cvterms.add_cv_term_pathways_to_entity(entry: str, db_id: str, reac: libsbml.Reaction)
Add CVTerm to a reaction as OCCURS IN pathway
- Args:
entry (str): Id to add as annotation
db_id (str): Database to which entry belongs
reac (Reaction): Reaction to add CVTerm to
- refinegems.cvterms.add_cv_term_reactions(entry: str, db_id: str, reac: libsbml.Reaction)
Adds CVTerm to a reaction
- Args:
entry (str): Id to add as annotation
db_id (str): Database to which entry belongs. Must be in reaction_db_dict.keys().
reac (Reaction): Reaction to add CVTerm to
- refinegems.cvterms.add_cv_term_units(unit_id: str, unit: libsbml.Unit, relation: int)
Adds CVTerm to a unit
- Args:
unit_id (str): ID to add as URI to annotation
unit (Unit): Unit to add CVTerm to
relation (int): Provides model qualifier to be added
- refinegems.cvterms.generate_cvterm(qt, b_m_qt) libsbml.CVTerm
Generates a CVTerm with the provided qualifier & biological or model qualifier types
- Args:
qt (libSBML qualifier type): BIOLOGICAL_QUALIFIER or MODEL_QUALIFIER
b_m_qt (libSBML qualifier): BQM_IS, BQM_IS_HOMOLOG_TO, etc.
- Returns:
CVTerm: With provided qualifier & biological or model qualifier types
- refinegems.cvterms.get_id_from_cv_term(entity: libsbml.SBase, db_id: str) list[str]
Extract Id for a specific database from CVTerm
- Args:
entity (SBase): Species, Reaction, Gene, Pathway
db_id (str): Database of interest
- Returns:
list[str]: Ids of entity belonging to db_id
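Conceptually, extracting an Id for one database means filtering the resource URIs of an entity's CVTerms by their identifiers.org namespace. The sketch below works on a plain list of URI strings instead of a libSBML entity, so it is not the refineGEMs function. The name ids_for_database is hypothetical, and using the identifiers.org namespace prefix (for example kegg.compound) as db_id is an assumption; the values accepted by the real function may differ.

```python
def ids_for_database(uris: list[str], db_id: str) -> list[str]:
    """Return the identifiers of all URIs that belong to the given namespace.

    Assumes URIs of the form '<scheme>://identifiers.org/<db_id>/<identifier>'.
    """
    marker = "identifiers.org/" + db_id + "/"
    ids = []
    for uri in uris:
        # partition returns (before, marker, after); marker is '' if absent.
        _, found, identifier = uri.partition(marker)
        if found and identifier:
            ids.append(identifier)
    return ids


# Example resource URIs as they might appear on a metabolite.
uris = [
    "https://identifiers.org/bigg.metabolite/atp",
    "https://identifiers.org/kegg.compound/C00002",
    "https://identifiers.org/chebi/CHEBI:15422",
]
print(ids_for_database(uris, "kegg.compound"))
```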
- refinegems.cvterms.print_cvterm(cvterm: libsbml.CVTerm)
Debug function: Prints the URIs contained in the provided CVTerm along with the provided qualifier & biological/model qualifier types
- Args:
cvterm (CVTerm): A libSBML CVTerm