Filling gaps with refineGEMs

refineGEMs offers two ways to fill gaps: manually from a curated table, or automatically with the gapfill module.

Manual gap filling

refinegems.curate.add_reactions_from_table(model: libsbml.Model, table: pandas.DataFrame, email: str) → libsbml.Model

Wrapper function for the table format given in data/manual_curation.xlsx, sheet gapfill. Adds every reaction listed in the table, together with the information given for it, to the given model.

Args:
  • model (libModel): Model loaded with libSBML

  • table (pd.DataFrame): Table in the format of the sheet gapfill from manual_curation.xlsx, located in the data folder

  • email (str): User email address used to access the NCBI Entrez database

Returns:

libModel: Modified model with new reactions
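
A minimal usage sketch, assuming the model is read with python-libsbml and the table is loaded from the gapfill sheet with pandas. The helper name fill_gaps_from_table, the file paths and the email address are placeholders for illustration and are not part of refineGEMs.

```python
def fill_gaps_from_table(model_path, table_path, email, out_path):
    """Add the reactions listed in the 'gapfill' sheet to an SBML model (sketch)."""
    # Imported lazily so the sketch can be read without the packages installed.
    import libsbml
    import pandas as pd
    from refinegems.curate import add_reactions_from_table

    # Keep a reference to the document: the model object belongs to it.
    document = libsbml.readSBML(model_path)
    model = document.getModel()

    # Sheet name taken from the description above.
    table = pd.read_excel(table_path, sheet_name="gapfill")

    updated_model = add_reactions_from_table(model, table, email)

    # Write the document that owns the returned model.
    libsbml.writeSBMLToFile(updated_model.getSBMLDocument(), out_path)


# Example call with placeholder values:
# fill_gaps_from_table("model.xml", "data/manual_curation.xlsx",
#                      "you@example.org", "model_curated.xml")
```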

Automated gap filling

The gapfill module provides an automated way to fill gaps in a model based on its genes.

Warning

Current restrictions:
  • Only bacteria are supported, because missing reactions are filtered by the compartments ‘c’, ‘e’ and ‘p’.

  • Only models of organisms that have entries in KEGG and/or BioCyc can be gap filled.

Run times:
  • ‘KEGG’: ~ 2h

  • ‘BioCyc’: ~ 45mins - 1h

  • ‘KEGG+BioCyc’: ~ 3 - 4h

The module gapfill can be used to:

a. perform gap analysis

refinegems.gapfill.gap_analysis(model_libsbml: libsbml.Model, gapfill_params: dict[str, str], filename: str) → pandas.DataFrame | tuple
Main function to infer gaps in a model by comparing the locus tags of the GeneProducts to KEGG, BioCyc or both
Args:
  • model_libsbml (libModel): Model loaded with libSBML

  • gapfill_params (dict): Dictionary obtained from YAML file containing the parameter mappings

  • filename (str): Path to output file for gapfill analysis result

Returns:
  • Case ‘KEGG’

    pd.DataFrame: Table containing the columns ‘bigg_id’ ‘locus_tag’ ‘EC’ ‘KEGG’ ‘name’ ‘GPR’

  • Case ‘BioCyc’
    tuple: Four tables (1) - (4)
    1. pd.DataFrame: Gap fill statistics with the columns

      ‘Missing entity’ ‘Total’ ‘Have BiGG ID’ ‘Can be added’ ‘Notes’

    2. pd.DataFrame: Genes with the columns

      ‘locus_tag’ ‘protein_id’ ‘model_id’ ‘name’

    3. pd.DataFrame: Metabolites with the columns

      ‘bigg_id’ ‘name’ ‘BioCyc’ ‘compartment’ ‘Chemical Formula’ ‘InChI-Key’ ‘ChEBI’ ‘charge’

    4. pd.DataFrame: Reactions with the columns

      ‘bigg_id’ ‘name’ ‘BioCyc’ ‘locus_tag’ ‘Reactants’ ‘Products’ ‘EC’ ‘Fluxes’ ‘Spontaneous?’ ‘bigg_reaction’

  • Case ‘KEGG+BioCyc’:
    tuple: Five tables, (1) - (4) from the output of ‘BioCyc’ and (5) from the output of ‘KEGG’

    -> The reactions table additionally contains the column ‘KEGG’
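
Because the return type depends on the chosen database, it can help to give the tables names before working with them. The following is a small sketch under that assumption; the helper label_gap_analysis_result and the dictionary keys are illustrative names and not part of refineGEMs.

```python
def label_gap_analysis_result(result, db_to_compare):
    """Map the documented return value of gap_analysis to named tables."""
    biocyc_names = ("statistics", "genes", "metabolites", "reactions")

    if db_to_compare == "KEGG":
        # A single table is returned for 'KEGG'.
        return {"kegg": result}

    if db_to_compare == "BioCyc":
        expected = biocyc_names
    elif db_to_compare == "KEGG+BioCyc":
        # Tables (1) - (4) from 'BioCyc' followed by the 'KEGG' table.
        expected = biocyc_names + ("kegg",)
    else:
        raise ValueError(f"Unknown db_to_compare: {db_to_compare!r}")

    if not isinstance(result, tuple) or len(result) != len(expected):
        raise ValueError(
            f"Expected a tuple of {len(expected)} tables for {db_to_compare!r}"
        )
    return dict(zip(expected, result))


# Example (placeholder variable names):
# tables = label_gap_analysis_result(result, "KEGG+BioCyc")
# reactions = tables["reactions"]
```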

b. add genes, metabolites and reactions from an Excel table to a model

refinegems.gapfill.gapfill_model(model_libsbml: libsbml.Model, gap_analysis_result: str | tuple) → libsbml.Model

Main function to fill gaps in a model from a table

Args:
  • model_libsbml (libModel): Model loaded with libSBML

  • gap_analysis_result (str|tuple): Path to the Excel file written by gap_analysis, or the tuple of pd.DataFrames returned by gap_analysis

Returns:

libModel: Gap filled model

c. or perform gap analysis and add the result directly to a model.

Warning

To run the gap analysis and add the result directly to a model, one of the options ‘BioCyc’ or ‘KEGG+BioCyc’ currently has to be selected. With any other option, combining gap_analysis with gapfill_model results in an error.

refinegems.gapfill.gapfill(model_libsbml: libsbml.Model, gapfill_params: dict[str, str], filename: str) → tuple[pandas.DataFrame, libsbml.Model] | tuple[tuple, libsbml.Model]
Main function to fill gaps in a model by comparing the locus tags of the GeneProducts to KEGG/BioCyc/(Genbank) GFF file
Args:
  • model_libsbml (libModel): Model loaded with libSBML

  • gapfill_params (dict): Dictionary obtained from YAML file containing the parameter mappings

  • filename (str): Path to output file for gapfill analysis result

  • gapfill_model_out (str): Path where gapfilled model should be written to

Returns:
tuple: gap_analysis() table(s) (1) & libSBML model (2)
  1. pd.DataFrame|tuple(pd.DataFrame): Result from function gap_analysis()

  2. libModel: Gap filled model
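
Given the restriction on the database choice for option c, a small guard before the call can turn the error into a clear message. This is a sketch only; check_direct_gapfill is a hypothetical helper and not part of refineGEMs, and the commented call follows the signature shown above with placeholder arguments.

```python
# Choices for which gap analysis and gap filling can be combined directly,
# as stated in the warning for option c.
DIRECT_GAPFILL_CHOICES = ("BioCyc", "KEGG+BioCyc")


def check_direct_gapfill(gapfill_params):
    """Return the database choice or raise if direct gap filling is unsupported."""
    db = gapfill_params.get("db_to_compare")
    if db not in DIRECT_GAPFILL_CHOICES:
        raise ValueError(
            f"db_to_compare={db!r} cannot be used for direct gap filling; "
            f"choose one of {DIRECT_GAPFILL_CHOICES}"
        )
    return db


# Example (placeholders; requires refineGEMs and a loaded libSBML model):
# check_direct_gapfill(params)
# result, filled_model = refinegems.gapfill.gapfill(model, params, "gap_analysis_out")
```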

Relevant parameters

To perform the gap analysis, set the following parameters in the config.yaml file (see Data acquisition from BioCyc for how to obtain files 1 to 3):

gap_analysis: TRUE
gap_analysis_params:
  db_to_compare: 'One of the choices KEGG|BioCyc|KEGG+BioCyc'
  organismid: 'KEGG Organism ID' # Needs to be specified for KEGG
  gff_file: 'Path to RefSeq GFF file' # Needs to be specified for KEGG
  biocyc_files:
    - 'File 1: Path to gene to reaction mapping table'
    - 'File 2: Path to reaction table'
    - 'File 3: Path to compounds table'
    - 'File 4: Path to protein FASTA file used as input for CarveMe'
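
Since a full run can take hours, it may be worth checking the parameters before starting. The sketch below works on the gap_analysis_params mapping after it has been loaded from config.yaml into a Python dict. The function name is hypothetical, and it assumes that organismid and gff_file are also needed for ‘KEGG+BioCyc’ and that biocyc_files must list exactly the four files shown above.

```python
def missing_gap_analysis_params(params):
    """Return a list of problems found in the gap_analysis_params mapping."""
    problems = []
    db = params.get("db_to_compare")
    if db not in ("KEGG", "BioCyc", "KEGG+BioCyc"):
        problems.append("db_to_compare must be one of KEGG, BioCyc, KEGG+BioCyc")
        return problems

    # Assumption: the KEGG-specific entries are needed whenever KEGG is involved.
    if "KEGG" in db:
        for key in ("organismid", "gff_file"):
            if not params.get(key):
                problems.append(f"{key} is required for {db}")

    # Assumption: all four BioCyc files are needed whenever BioCyc is involved.
    if "BioCyc" in db:
        files = params.get("biocyc_files") or []
        if len(files) != 4:
            problems.append(f"biocyc_files needs 4 entries, found {len(files)}")

    return problems


# Example (placeholder values):
# problems = missing_gap_analysis_params(config["gap_analysis_params"])
# if problems:
#     raise SystemExit("\n".join(problems))
```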

To add genes, metabolites and reactions from an Excel table to a model, set the following parameters. The Excel file is either produced by running gap_analysis or created by hand with the same structure as that result file. An example Excel file to fill in by hand can be found in the cloned repository under ‘data/modelName_gapfill_analysis_date_example.xlsx’.

gapfill_model: TRUE
gap_analysis_file: 'Path to Excel file with which gaps in model should be filled'

Data acquisition from BioCyc

  1. If you do not have a BioCyc account, create one at BioCyc > Create Free Account <https://biocyc.org/new-account.shtml>.

  2. Then you need to search for the strain of your organism.

  3. Within the database of your organism you need to click on Tools in the menu bar and select Special SmartTables under SmartTables. There you need to make an adjustable copy of each of the tables “All genes of <organism>” and “All reactions of <organism>”.

  4. For the gene to reaction mapping table:

    1. Remove all columns except ‘Gene Name’ from the “All genes of <organism>” table,

    2. then click choose a transform and select ‘Reactions of gene’,

    3. then add the property ‘Accession-2’

    4. and delete the ‘Gene Name’ column.

    5. After that select the column ‘Accession-2’ and use the filter function in the box on the right side of the page to delete all empty rows.

    6. Finally, click Export to Spreadsheet File from the box on the right side and choose Frame IDs.

  5. For the reactions table:

    1. Remove all columns except ‘Reaction’ from the “All reactions of <organism>” table,

    2. then click choose a transform:

      1. select ‘Reactants of reaction’,

      2. then select ‘Products of reaction’

    3. and then choose the property:

      1. ‘EC-Number’,

      2. then ‘Reaction-Direction’,

      3. and then ‘Spontaneous?’.

    4. Finally, click Export to Spreadsheet File in the box on the right side and choose Frame IDs.

  6. For the metabolites table:

    1. Use the MetaCyc database to get the table “All compounds of MetaCyc”.

    2. Remove all columns except ‘Compound’,

    3. then choose the property:

      1. ‘Object ID’,

      2. then ‘Chemical Formula’,

      3. then ‘InChI-Key’,

      4. and then ‘database links’ > ‘ChEBI’.

    4. Finally, click Export to Spreadsheet File in the box on the right side and choose common names.
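
Before passing the exported tables to the gap analysis, a quick header check can catch a skipped step. The sketch below assumes the exports are tab-separated text files whose header row uses the column names from the steps above; both the delimiter and the exact header spelling are assumptions and may need adjusting. The names EXPECTED_COLUMNS and missing_columns are illustrative.

```python
import csv

# Column names taken from the steps above (assumed to match the export headers).
EXPECTED_COLUMNS = {
    "gene_to_reaction": ["Reactions of gene", "Accession-2"],
    "reactions": [
        "Reaction", "Reactants of reaction", "Products of reaction",
        "EC-Number", "Reaction-Direction", "Spontaneous?",
    ],
    "metabolites": [
        "Compound", "Object ID", "Chemical Formula", "InChI-Key", "ChEBI",
    ],
}


def missing_columns(lines, table, delimiter="\t"):
    """Return the expected columns that are absent from the header row.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    header = next(csv.reader(lines, delimiter=delimiter), [])
    present = {name.strip() for name in header}
    return [col for col in EXPECTED_COLUMNS[table] if col not in present]


# Example (placeholder file name):
# with open("gene_to_reaction.tsv", newline="", encoding="utf-8") as handle:
#     print(missing_columns(handle, "gene_to_reaction"))
```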