Large Matrices creation

Along with the network and the geneset, some analyses require an additional large matrix to be passed as input. In particular, the analysis of shortest path and diffusion are evaluating a matrix of shape \(NxN\) (N being the number of nodes), since those are invariant to the geneset analysed, they must be evaluated and saved only once.

Shortest Paths matrix

Build a shortest path distance matrix for a given network. Matrix can be saved as a .txt file or a .hdf5 one.

usage: pygna build-distance-matrix [-h] [-g] network-file output-file

positional arguments:
  network-file          network file
  output-file           distance matrix output file, use .hdf5

optional arguments:
  -h, --help            show this help message and exit
  -g, --giant-component-only
                        compute the shortest paths only for nodes in the giant component (default: True)

Diffusion matrix

To evaluate the diffusion matrix we have the below function that implements a Randowm Walk with Restart algorithm. The \(beta\) parameter is set to 0.80 as default, but can be given by the user.

usage: pygna build-rwr-diffusion [-h] [-b BETA] [-o OUTPUT_FILE] network-file

positional arguments:
  network-file          network file

optional arguments:
  -h, --help            show this help message and exit
  -b BETA, --beta BETA  0.85
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        distance matrix output file (use .hdf5) (default: -)

Converting tables and names

Dataset from table

Converts a csv file to a GMT allowing to filter the elements using the values of one of the columns. The user can specify the column used to retrieve the name of the objects and the filter condition. The output can be either a GMT with the names of the genes that pass the filter or a csv with the whole filtered table, otherwise both can be created.

usage: pygna geneset-from-table [-h] [--output-gmt OUTPUT_GMT] [--output-csv OUTPUT_CSV] [-n NAME_COLUMN] [-f FILTER_COLUMN] [-a ALTERNATIVE] [-t THRESHOLD]
                            [-d DESCRIPTOR]
                            input-file setname

positional arguments:
  input-file            input csv file
  setname               name of the set

optional arguments:
  -h, --help            show this help message and exit
  --output-gmt OUTPUT_GMT
                        output gmt name (default: -)
  --output-csv OUTPUT_CSV
                        output csv name (default: -)
  -n NAME_COLUMN, --name-column NAME_COLUMN
                        column with the names (default: 'Unnamed: 0')
  -f FILTER_COLUMN, --filter-column FILTER_COLUMN
                        column with the values to be filtered (default: 'padj')
  -a ALTERNATIVE, --alternative ALTERNATIVE
                        alternative to use for the filter, with less the filter is applied <threshold, otherwise >= threshold (default: 'less')
  -t THRESHOLD, --threshold THRESHOLD
                        threshold for the filter (default: 0.01)
  -d DESCRIPTOR, --descriptor DESCRIPTOR
                        descriptor for the gmt file (default: -)

Convert gene names

convert-gmt is used to convert a GMT file, adding information about the Entrez ID or the symbol

usage: pygna convert-gmt [-h] [-e ENTREZ_COL] [-s SYMBOL_COL] gmt-file output-gmt-file conversion converter-map-filename

positional arguments:
  gmt-file              gmt file to be converted
  output-gmt-file       output file
  conversion            e2s or s2e
  converter-map-filename
                        tsv table used to convert gene names

optional arguments:
  -h, --help            show this help message and exit
  -e ENTREZ_COL, --entrez-col ENTREZ_COL
                        name of the entrez column (default: 'NCBI Gene ID')
  -s SYMBOL_COL, --symbol-col SYMBOL_COL
                        name of the symbol column (default: 'Approved symbol')

generate-group-gmt generates a GMT file of multiple setnames. From the table file, it groups the names in the group_col (the column you want to use to group them) and prints the genes in the name_col. Set the descriptor according to your needs

usage: pygna generate-group-gmt [-h] [-n NAME_COL] [-g GROUP_COL] [-d DESCRIPTOR] input-table output-gmt

positional arguments:
  input-table           table to get the geneset from
  output-gmt            output GMT file

optional arguments:
  -h, --help            show this help message and exit
  -n NAME_COL, --name-col NAME_COL
                        'Gene'
  -g GROUP_COL, --group-col GROUP_COL
                        'Cancer'
  -d DESCRIPTOR, --descriptor DESCRIPTOR
                        'cancer_genes'

convert-csv is used to add a column with the entrezID or Symbols to a CSV file.

usage: pygna convert-csv [-h] [--converter-map-filename CONVERTER_MAP_FILENAME] [--output-file OUTPUT_FILE] [-e ENTREZ_COL] [-s SYMBOL_COL]
                     csv-file conversion original-name-col new-name-col geneset

positional arguments:
  csv-file              csv file where to add a name column
  conversion            e2s or s2e
  original-name-col     column name to be converted
  new-name-col          name of the new column with the converted names
  geneset               the geneset to convert

optional arguments:
  -h, --help            show this help message and exit
  --converter-map-filename CONVERTER_MAP_FILENAME
                        tsv table used to convert gene names (default: 'entrez_name.tsv')
  --output-file OUTPUT_FILE
                        if none, table is saved in the same input file (default: -)
  -e ENTREZ_COL, --entrez-col ENTREZ_COL
                        name of the entrez column (default: 'NCBI Gene ID')
  -s SYMBOL_COL, --symbol-col SYMBOL_COL
                        name of the symbol column (default: 'Approved symbol')