Usage

Usage as standalone application

The script main.py can be run directly from the command line after entering the virtual environment with pipenv shell.

The config.yaml file contains defaults for all variables that need to be set by the user.

Warning

Using lab_strain=True has the following two requirements:

  1. The model already contains GeneProduct identifiers containing valid NCBI Protein/RefSeq identifiers.

    If there is no available data for the modeled organism in any database these identifiers can be added with the pipeline described in Pipeline: From genome sequence to draft model before draft model creation.

  2. Input of a FASTA file containing header lines similar to:

    >lcl|CP035291.1_prot_QCY37216.1_1 [gene=dnaA] [locus_tag=EQ029_00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=QCY37216.1] [location=1..1356] [gbkey=CDS]

    Of the description part in the header line, only locus_tag, protein and protein_id are important for polish.
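
The bracketed fields of such a header line can be pulled out with a regular expression. The following is a minimal sketch and not the parser used by refineGEMs; the function name parse_header_fields is hypothetical, and the pattern assumes that field values do not themselves contain square brackets.

```python
import re

# Matches "[key=value]" pairs in the description part of a FASTA header line.
FIELD_PATTERN = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_header_fields(header: str) -> dict[str, str]:
    """Return locus_tag, protein and protein_id from one FASTA header line."""
    fields = dict(FIELD_PATTERN.findall(header))
    wanted = ("locus_tag", "protein", "protein_id")
    missing = [key for key in wanted if key not in fields]
    if missing:
        raise ValueError(f"header lacks required field(s): {', '.join(missing)}")
    return {key: fields[key] for key in wanted}


header = (
    ">lcl|CP035291.1_prot_QCY37216.1_1 [gene=dnaA] [locus_tag=EQ029_00005] "
    "[protein=chromosomal replication initiator protein DnaA] "
    "[protein_id=QCY37216.1] [location=1..1356] [gbkey=CDS]"
)
print(parse_header_fields(header))
```

A header that lacks one of the three fields raises a ValueError, which makes an unsuitable FASTA file visible before lab_strain is switched on.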

Description: >
  This file can be adapted to choose what refineGEMs should do.
  Note: On Windows, use \ instead of / in the paths


General Setting: >
  Path to GEM to be investigated

model: 'data/e_coli_core.xml'
# Set the out path for all analysis files
out_path: ''

Settings for scripts that investigate the model: >
  These are only necessary if none of the scripts to manipulate the model are used.

# Set to TRUE if you want pngs that aid in model investigation, will be saved to a folder called 'visualization'
visualize: TRUE

# Set the basis medium to simulate growth from
growth_basis: 'minimal_uptake' # 'default_uptake' or 'minimal_uptake'

# Set to TRUE if you want to simulate anaerobic growth
anaerobic_growth: FALSE

# Settings if you want to compare multiple models
multiple: FALSE
multiple_paths: # enter as many paths as you need below
  - 'data/e_coli_core.xml'
  - ''
  - ''
single: TRUE # set to False if you only want to work with the multiple models

# media to simulate growth from, just comment the media you do not want with a #
media:
  - 'LB'
  - 'RPMI'
  - 'M9'
  - 'SNM3'
  - 'CGXII'
  - 'CasA'
  - 'Blood'
  - 'dGMM'
  - 'MP-AU'

# Determine whether the biomass function should be checked & normalised
biomass: TRUE

# determine whether the memote score should be calculated, default: FALSE
memote: FALSE

# compare metabolites to the ModelSEED database
modelseed: FALSE # set to False if not needed


Settings for scripts that manipulate the model: >
  They are all split into the ON / OFF switch (TRUE / FALSE) and additional settings like a path to where the new model should be saved.

model_out: '' # path and filename to where to save the modified model
entrez_email: '' # necessary to access NCBI API

### Addition of KEGG Pathways as Groups ###
keggpathways: FALSE

### SBO-Term Annotation ###
sboterms: FALSE

### Model polishing ### The database of the model identifiers needs to be specified with 'id_db'
polish: FALSE
id_db: 'BIGG' # Required!
# Possible identifiers, currently: BiGG & VMH
# For other IDs the `polish` function in `polish.py` might need adjustment
lab_strain: FALSE # Needs to be set to ensure that protein IDs get the 'bqbiol:isHomologTo' qualifier
                  # & to set the locus_tag to the ones obtained by the annotation
protein_fasta: '' # Path to used CarveMe input file, if exists; Needs to be set for lab_strain: True

### Charge correction ###
charge_corr: FALSE

### Manual Curation ###
man_cur: FALSE
man_cur_type: 'gapfill' # either 'gapfill' or 'metabs'
man_cur_table: 'data/manual_curation.xlsx'

### Automatic gap filling ###
# All parameters are required for all db_to_compare choices except:
# - organismid which is only required for db_to_compare: 'KEGG'/'KEGG+BioCyc'
# - and biocyc_files which is not required for 'KEGG'
gap_analysis: FALSE
gap_analysis_params:
  db_to_compare: 'KEGG'  # One of the choices KEGG|BioCyc|KEGG+BioCyc
  organismid: 'T05059'  # Needs to be specified for KEGG
  gff_file: 'data/cstr.gff'  # Path to RefSeq GFF file
  biocyc_files:
    - 'Path0'  # Path to TXT file containing a SmartTable from BioCyc with the columns 'Accession-2' 'Reaction of gene' (-)
    - 'Path1'  # Path to TXT file containing a SmartTable with all reaction relevant information (*)
    - 'Path2'  # Path to TXT file containing a SmartTable with all metabolite relevant information (+)
    - 'Path3'  # Path to protein FASTA file used as input for CarveMe (Needed to get the protein IDs from the locus tags)
# (-) If the organism is not in BioCyc retrieve a table mapping all reactions in BioCyc to the corresponding sequence
# (*) 'Reaction' 'Reactants of reaction' 'Products of reaction' 'EC-Number' 'Reaction-Direction' 'Spontaneous?'
# (+) 'Compound' 'Object ID' 'Chemical Formula' 'InChI-Key' 'ChEBI'
gapfill_model: FALSE
gap_analysis_file: 'Path to Excel file with which gaps in model should be filled'
# Either obtained by running gapfill_analysis/Created by hand with the same structure as the result file from gapfill_analysis
# Example Excel file to fill in by hand: data/modelName_gapfill_analysis_date_example.xlsx
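
The rules stated in the comments above gap_analysis_params can be checked before a run. The sketch below is only an illustration of those rules and is not part of refineGEMs; the function name missing_gap_analysis_params is hypothetical, and it assumes the parameters have already been loaded from config.yaml into a plain dictionary.

```python
def missing_gap_analysis_params(params: dict) -> list[str]:
    """List the gap_analysis_params keys that are required but empty."""
    choice = params.get("db_to_compare")
    if choice not in ("KEGG", "BioCyc", "KEGG+BioCyc"):
        raise ValueError(f"unknown db_to_compare: {choice!r}")

    required = ["gff_file"]
    # organismid is only required for 'KEGG' and 'KEGG+BioCyc'.
    if choice in ("KEGG", "KEGG+BioCyc"):
        required.append("organismid")
    # biocyc_files is not required for 'KEGG'.
    if choice in ("BioCyc", "KEGG+BioCyc"):
        required.append("biocyc_files")

    # A key counts as missing when it is absent or holds an empty value.
    return [key for key in required if not params.get(key)]


params = {
    "db_to_compare": "KEGG",
    "organismid": "T05059",
    "gff_file": "data/cstr.gff",
    "biocyc_files": [],
}
print(missing_gap_analysis_params(params))  # → []
print(missing_gap_analysis_params({**params, "db_to_compare": "KEGG+BioCyc"}))  # → ['biocyc_files']
```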

The repository is structured as follows:

  • refineGEMs/ contains all the functions needed in main.py

  • data/ contains all tables that are used by different parts of the script as well as a toy model e_coli_core.xml

  • Instead of using the files given in data/, you can use your own files and just change the paths in config.yaml. Please be aware that some functions rely on input in a certain format so make sure to check the files given in the data/ folder and use the same formatting.

  • databases/ contains the SQL file and the db file required by the SBOAnn script by Elisabeth Fritze and by the modules gapfill, growth and modelseed.

  • The setup.py and pyproject.toml enable creating a PyPI package called refineGEMs.

Usage as Python module

See examples to learn how to use refineGEMs. Note that at this time most of the modules only make sense when you use the respective main functions:

refinegems.modelseed.compare_to_modelseed(model: cobra.Model) -> tuple[pandas.DataFrame, pandas.DataFrame]

Executes all steps to compare model metabolites to ModelSEED metabolites

Args:
  • model (cobraModel): Model loaded with COBRApy

Returns:
tuple: Tables with charge (1) & formula (2) mismatches
  1. pd.DataFrame: Table with charge mismatches

  2. pd.DataFrame: Table with formula mismatches

refinegems.pathways.kegg_pathways(modelpath: str) -> tuple[libsbml.Model, list[str]]

Executes all steps to add KEGG pathways as groups

Args:
  • modelpath (str): Path to GEM

Returns:
tuple: libSBML model (1) & List of reactions without KEGG Id (2)
  1. libModel: Modified model with Pathways as groups

  2. list: Ids of reactions without KEGG annotation

refinegems.sboann.sbo_annotation(model_libsbml: libsbml.Model) -> libsbml.Model

Executes all steps to annotate SBO terms to a given model (former main function of original script by Elisabeth Fritze)

Args:
  • model_libsbml (libModel): Model loaded with libsbml

Returns:

libModel: Modified model with SBO terms

refinegems.biomass.check_normalise_biomass(model: cobra.Model) -> cobra.Model | None

  1. Checks if at least one biomass reaction is present

  2. For each found biomass reaction checks if it sums up to 1g[CDW]

  3. Normalises the coefficients of each biomass reaction where the sum is not 1g[CDW] until the sum is 1g[CDW]

  4. Returns model with adjusted biomass function(s)

Args:
  • model (cobraModel): Model loaded with COBRApy

Returns:

cobraModel: COBRApy model with adjusted biomass functions
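
The idea behind steps 2 and 3 can be illustrated without COBRApy. The following is a simplified sketch and not the implementation in refinegems.biomass: it assumes coefficients in mmol per gram dry weight, hand-supplied molar masses in g/mol, and a single rescaling step. The function names and the toy metabolites met_a and met_b are hypothetical.

```python
def biomass_mass_g(coefficients: dict[str, float], molar_masses: dict[str, float]) -> float:
    """Net mass in g that one unit of the biomass reaction consumes.

    coefficients: stoichiometric coefficients in mmol (negative = consumed)
    molar_masses: molar masses in g/mol
    """
    # mmol * g/mol gives mg, hence the division by 1000 to obtain g.
    return sum(-coeff * molar_masses[met] for met, coeff in coefficients.items()) / 1000.0


def normalise_biomass(coefficients: dict[str, float], molar_masses: dict[str, float]) -> dict[str, float]:
    """Rescale all coefficients so that the reaction sums up to 1 g."""
    total = biomass_mass_g(coefficients, molar_masses)
    if total <= 0:
        raise ValueError("biomass reaction consumes no net mass")
    return {met: coeff / total for met, coeff in coefficients.items()}


molar_masses = {"met_a": 100.0, "met_b": 200.0}
biomass = {"met_a": -1.0, "met_b": -2.0}

print(biomass_mass_g(biomass, molar_masses))  # → 0.5
normalised = normalise_biomass(biomass, molar_masses)
print(normalised)  # → {'met_a': -2.0, 'met_b': -4.0}
print(biomass_mass_g(normalised, molar_masses))  # → 1.0
```

Dividing every coefficient by the current total keeps the ratios between the biomass components unchanged and only scales the reaction as a whole.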

The modules io, cvterms and investigate provide functions that can be used by themselves.

io

Provides a couple of helper functions to load models and databases and to parse GFF files

Provides functions to load and write models, media definitions and the manual annotation table

Depending on the application, the model needs to be loaded with cobra (memote) or with libSBML (activation of groups). The media definitions are stored in a CSV file within the data folder of this repository, so the functions will only work if the user clones the repository. The manual_annotations table has to follow the specific layout given in the data folder in order to work with this module.

refinegems.io.load_a_table_from_database(table_name_or_query: str) -> pandas.DataFrame

Loads the table for which the name is provided, or a table containing all rows for which the query evaluates to true, from the refineGEMs database (‘data/database/data.db’)

Args:
  • table_name_or_query (str): Name of a table contained in the database ‘data.db’ or an SQL query

Returns:

pd.DataFrame: Containing the table for which the name was provided from the database ‘data.db’

refinegems.io.load_all_media_from_db(mediumpath: str) -> pandas.DataFrame

Helper function to extract media definitions from media_db.csv

Args:
  • mediumpath (str): Path to csv file with medium database

Returns:

pd.DataFrame: Table from csv with metabs added as BiGG_EX exchange reactions

refinegems.io.load_document_libsbml(modelpath: str) -> libsbml.SBMLDocument

Loads model document using libSBML

Args:
  • modelpath (str): Path to GEM

Returns:

SBMLDocument: Loaded document by libSBML

refinegems.io.load_manual_annotations(tablepath: str = 'data/manual_curation.xlsx', sheet_name: str = 'metab') -> pandas.DataFrame

Loads metabolite sheet from manual curation table

Args:
  • tablepath (str): Path to manual curation table. Defaults to ‘data/manual_curation.xlsx’.

  • sheet_name (str): Sheet name for metabolite annotations. Defaults to ‘metab’.

Returns:

pd.DataFrame: Table containing specified sheet from Excel file

refinegems.io.load_manual_gapfill(tablepath: str = 'data/manual_curation.xlsx', sheet_name: str = 'gapfill') -> pandas.DataFrame

Loads gapfill sheet from manual curation table

Args:
  • tablepath (str): Path to manual curation table. Defaults to ‘data/manual_curation.xlsx’.

  • sheet_name (str): Sheet name for reaction gapfilling. Defaults to ‘gapfill’.

Returns:

pd.DataFrame: Table containing sheet with name ‘gapfill’|specified sheet_name from Excel file

refinegems.io.load_medium_custom(mediumpath: str) -> pandas.DataFrame

Helper function to read medium csv

Args:
  • mediumpath (str): path to csv file with medium

Returns:

pd.DataFrame: Table of csv

refinegems.io.load_medium_from_db(mediumname: str) -> pandas.DataFrame

Wrapper function to extract subtable for the requested medium from the database ‘data.db’

Args:
  • mediumname (str): Name of medium to test growth on

Returns:

pd.DataFrame: Table containing composition for one medium with metabs added as BiGG_EX exchange reactions
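
What "metabs added as BiGG_EX exchange reactions" could look like is sketched below on plain dictionaries instead of a DataFrame. This is an assumption-laden illustration, not the code of load_medium_from_db: the input column name BiGG, the helper name add_exchange_ids and the EX_<id>_e naming pattern (the usual BiGG convention for extracellular exchange reactions) are all assumptions.

```python
def add_exchange_ids(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Add a BiGG_EX entry holding the exchange reaction ID for each metabolite."""
    # Assumed convention: exchange reaction of metabolite X in compartment e is EX_X_e.
    return [{**row, "BiGG_EX": f"EX_{row['BiGG']}_e"} for row in rows]


medium = [{"BiGG": "glc__D"}, {"BiGG": "o2"}]
for row in add_exchange_ids(medium):
    print(row["BiGG"], "->", row["BiGG_EX"])
```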

refinegems.io.load_model_cobra(modelpath: str) -> cobra.Model

Loads model using COBRApy

Args:
  • modelpath (str): Path to GEM

Returns:

cobraModel: Loaded model by COBRApy

refinegems.io.load_model_libsbml(modelpath: str) -> libsbml.Model

Loads model using libSBML

Args:
  • modelpath (str): Path to GEM

Returns:

libModel: loaded model by libSBML

refinegems.io.load_multiple_models(models: list[str], package: str) -> list

Loads multiple models into a list

Args:
  • models (list): List of paths to models

  • package (str): COBRApy|libSBML

Returns:

list: List of model objects loaded with COBRApy|libSBML

refinegems.io.parse_dict_to_dataframe(str2list: dict) -> pandas.DataFrame

Parses a dictionary of form {str: list} & transforms it into a table with a column containing the strings and a column containing the lists

Args:
  • str2list (dict): Dictionary mapping strings to lists

Returns:

pd.DataFrame: Table with column containing the strings and column containing the lists

refinegems.io.parse_fasta_headers(filepath: str, id_for_model: bool = False) -> pandas.DataFrame

Parses FASTA file headers to obtain:

  • the protein_id

  • and the model_id (like it is obtained from CarveMe)

corresponding to the locus_tag

Args:
  • filepath (str): Path to FASTA file

  • id_for_model (bool): True if model_id similar to autogenerated GeneProduct ID should be contained in resulting table

Returns:

pd.DataFrame: Table containing the columns locus_tag, Protein_id & Model_id

refinegems.io.parse_gff_for_gp_info(gff_file: str) -> pandas.DataFrame

Parses gff file of organism to find gene protein reactions based on locus tags

Args:
  • gff_file (str): Path to gff file of organism of interest

Returns:

pd.DataFrame: Table containing mapping from locus tag to GPR

refinegems.io.save_user_input(configpath: str) -> dict[str, str]

Collects user input from the command line to create a config file. If no config file was given, the user input is also saved to a new config file.

Args:
  • configpath (str): Path to config file if present

Returns:

dict: Either loaded config file or created from user input

refinegems.io.search_ncbi_for_gpr(locus: str) -> str

Fetches protein name from NCBI

Args:
  • locus (str): NCBI compatible locus_tag

Returns:

str: Protein name|description

refinegems.io.search_sbo_label(sbo_number: str) -> str

Looks up the SBO label corresponding to a given SBO Term number

Args:
  • sbo_number (str): Last three digits of SBO-Term as str

Returns:

str: Denoted label for given SBO Term

refinegems.io.validate_libsbml_model(model: libsbml.Model) -> int

Debug method: Validates a libSBML model with the libSBML validator

Args:
  • model (libModel): A libSBML model

Returns:

int: Integer specifying whether the validation was successful

refinegems.io.write_report(dataframe: pandas.DataFrame, filepath: str)

Writes reports stored in dataframes to xlsx file

Args:
  • dataframe (pd.DataFrame): Table containing output

  • filepath (str): Path to file with filename

refinegems.io.write_to_file(model: libsbml.Model, new_filename: str)

Writes modified model to new file

Args:
  • model (libModel): Model loaded with libSBML

  • new_filename (str): Filename|Path for modified model

investigate

Provides a couple of functions to get parameters of your model

cvterms

Provides functions to work with cvterms

Helper module to work with annotations (CVTerms)

Stores dictionaries that hold information on the identifiers.org syntax, and provides functions to add CVTerms to different entities and to parse CVTerms.

refinegems.cvterms.add_cv_term_genes(entry: str, db_id: str, gene: libsbml.GeneProduct, lab_strain: bool = False)

Adds CVTerm to a gene

Args:
  • entry (str): Id to add as annotation

  • db_id (str): Database to which entry belongs. Must be in gene_db_dict.keys().

  • gene (GeneProduct): Gene to add CVTerm to

  • lab_strain (bool, optional): For locally sequenced strains the qualifiers are always HOMOLOG_TO. Defaults to False.

refinegems.cvterms.add_cv_term_metabolites(entry: str, db_id: str, metab: libsbml.Species)

Adds CVTerm to a metabolite

Args:
  • entry (str): Id to add as annotation

  • db_id (str): Database to which entry belongs. Must be in metabol_db_dict.keys().

  • metab (Species): Metabolite to add CVTerm to

refinegems.cvterms.add_cv_term_pathways(entry: str, db_id: str, path: libsbml.Group)

Adds CVTerm to a pathway group

Args:
  • entry (str): Id to add as annotation

  • db_id (str): Database to which entry belongs. Must be in pathway_db_dict.keys().

  • path (Group): Pathway to add CVTerm to

refinegems.cvterms.add_cv_term_pathways_to_entity(entry: str, db_id: str, reac: libsbml.Reaction)

Adds CVTerm to a reaction as OCCURS IN pathway

Args:
  • entry (str): Id to add as annotation

  • db_id (str): Database to which entry belongs

  • reac (Reaction): Reaction to add CVTerm to

refinegems.cvterms.add_cv_term_reactions(entry: str, db_id: str, reac: libsbml.Reaction)

Adds CVTerm to a reaction

Args:
  • entry (str): Id to add as annotation

  • db_id (str): Database to which entry belongs. Must be in reaction_db_dict.keys().

  • reac (Reaction): Reaction to add CVTerm to

refinegems.cvterms.add_cv_term_units(unit_id: str, unit: libsbml.Unit, relation: int)

Adds CVTerm to a unit

Args:
  • unit_id (str): ID to add as URI to annotation

  • unit (Unit): Unit to add CVTerm to

  • relation (int): Provides model qualifier to be added

refinegems.cvterms.generate_cvterm(qt, b_m_qt) -> libsbml.CVTerm

Generates a CVTerm with the provided qualifier & biological or model qualifier types

Args:
  • qt (libSBML qualifier type): BIOLOGICAL_QUALIFIER or MODEL_QUALIFIER

  • b_m_qt (libSBML qualifier): BQM_IS, BQB_IS_HOMOLOG_TO, etc.

Returns:

CVTerm: With provided qualifier & biological or model qualifier types

refinegems.cvterms.get_id_from_cv_term(entity: libsbml.SBase, db_id: str) -> list[str]

Extract Id for a specific database from CVTerm

Args:
  • entity (SBase): Species, Reaction, Gene, Pathway

  • db_id (str): Database of interest

Returns:

list[str]: Ids of entity belonging to db_id
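
The extraction step can be illustrated on plain identifiers.org URIs, without libSBML. The sketch below is not the implementation of get_id_from_cv_term: it assumes that the resource URIs of the CVTerms have already been collected into a list, and that db_id is the identifiers.org namespace prefix (for example kegg.compound). The function name ids_for_database is hypothetical.

```python
def ids_for_database(uris: list[str], db_id: str) -> list[str]:
    """Return the identifiers of all URIs that belong to the namespace db_id."""
    # identifiers.org separates namespace and identifier with '/' or with ':'.
    markers = (f"identifiers.org/{db_id}/", f"identifiers.org/{db_id}:")
    ids = []
    for uri in uris:
        for marker in markers:
            position = uri.find(marker)
            if position != -1:
                ids.append(uri[position + len(marker):])
                break
    return ids


uris = [
    "https://identifiers.org/bigg.metabolite/glc__D",
    "https://identifiers.org/kegg.compound/C00031",
    "https://identifiers.org/chebi/CHEBI:4167",
]
print(ids_for_database(uris, "kegg.compound"))  # → ['C00031']
print(ids_for_database(uris, "chebi"))  # → ['CHEBI:4167']
```

A namespace without any matching URI yields an empty list, which mirrors the list[str] return type given above.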

refinegems.cvterms.print_cvterm(cvterm: libsbml.CVTerm)

Debug function: Prints the URIs contained in the provided CVTerm along with the provided qualifier & biological/model qualifier types

Args:
  • cvterm (CVTerm): A libSBML CVTerm