The intersection of deep learning and genomics is rapidly evolving, giving us powerful tools to understand complex diseases like cancer. One such breakthrough was EMOGI (Explainable Multi-Omics Graph Integration), published in Nature Machine Intelligence by Schulte-Sasse et al. EMOGI demonstrated how Graph Convolutional Networks (GCNs) could integrate multi-omics data (mutations, methylation, expression) with protein-protein interaction (PPI) networks to predict cancer genes.
Today, we’re exploring proEMOGI (Pytorch Representation of EMOGI), a modern implementation of this algorithm designed to leverage the flexibility and dynamic computation graph of PyTorch.
Why proEMOGI?
The original EMOGI repository was built using TensorFlow 1.x. The method itself was groundbreaking, but the older framework can be challenging to maintain and debug in modern workflows.
proEMOGI brings this powerful method to the PyTorch ecosystem. Key advantages include:
- Dynamic Computation Graph: PyTorch’s eager execution makes debugging model behavior and dimensions significantly easier.
- Modern GNN Libraries: By utilizing torch_geometric, proEMOGI benefits from optimized graph operations and a cleaner API.
- Maintainability: A streamlined codebase (around 300 lines for the core logic vs. the extensive TF boilerplate) makes it easier for researchers to adapt and extend.
Technical Deep Dive
Let’s look at how proEMOGI implements the core concepts of the original paper.
3D Graph Convolutions
The heart of EMOGI is its ability to handle multiple omics features as “channels,” similar to how a CNN handles RGB images. In proemogi.py, the model is defined as a torch.nn.Module that accepts these multi-dimensional features.
import torch
from torch_geometric.nn import GCNConv

class proEMOGI(torch.nn.Module):
    def __init__(self, input_dim, output_dim, num_hidden_layers=2, ...):
        super(proEMOGI, self).__init__()
        # ... initialization code ...
        # ModuleList (rather than a plain Python list) registers each layer's
        # parameters, so model.parameters() and .to(device) can see them.
        self.layers = torch.nn.ModuleList()
        # Stacking GCN layers
        inp_dim = self.input_dim
        for l in range(self.num_hidden_layers):
            self.layers.append(GCNConv(inp_dim, self.hidden_dims[l]))
            inp_dim = self.hidden_dims[l]
        self.layers.append(GCNConv(self.hidden_dims[-1], self.output_dim))
The forward pass is clean and intuitive, applying graph convolutions followed by ReLU activations and dropout for regularization:
    def forward(self, x, edge_index, edge_weight=None):
        for layer in self.layers[:-1]:
            x = layer(x, edge_index, edge_weight)
            x = F.relu(x)
            if self.dropout_rate is not None:
                x = F.dropout(x, self.dropout_rate, training=self.training)
        # Final classification layer
        x = self.layers[-1](x, edge_index, edge_weight)
        return F.log_softmax(x, dim=1)
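To make the forward pass more concrete, here is a minimal NumPy sketch of the propagation rule that a GCN layer applies: add self-loops, normalize the adjacency matrix symmetrically, then multiply by the features and a weight matrix. This is an illustration under simplifying assumptions (dense matrices, no bias, no dropout, random weights), not the proEMOGI code itself. The function names `normalize_adjacency` and `gcn_forward` are made up for this example.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the symmetric GCN normalization."""
    adj_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = adj_hat.sum(axis=1)              # degree of each node, self-loop included
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def gcn_forward(adj, x, weights):
    """Hidden layers use ReLU; the last layer feeds a log-softmax."""
    a_norm = normalize_adjacency(adj)
    h = x
    for w in weights[:-1]:
        h = np.maximum(a_norm @ h @ w, 0.0)  # propagate, transform, ReLU
    return log_softmax(a_norm @ h @ weights[-1])

# Toy path graph with 4 nodes, 3 input features, 5 hidden units, 2 classes
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 5)), rng.normal(size=(5, 2))]
log_probs = gcn_forward(adj, x, weights)
print(log_probs.shape)
```

Each row of `log_probs` holds per-class log-probabilities for one node, which is the shape of output the `forward` method above hands to the loss function.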
Data Handling with HDF5
Genomic datasets can be massive. proEMOGI respects the data format established by the original EMOGI project, using HDF5 containers for efficient storage and access. The gcnIO.py module handles this integration.
The load_hdf_data function acts as the bridge, extracting the network topology (adjacency matrix) and node features from the HDF5 file:
import h5py
import networkx as nx

def load_hdf_data(path, network_name='network', feature_name='features'):
    with h5py.File(path, 'r') as f:
        network = f[network_name][:]
        features = f[feature_name][:]
        # ... extracting labels and masks ...
    # Convert adjacency matrix to NetworkX graph for edge extraction
    G = nx.from_numpy_array(network)
    # Result is a 2 x num_edges array: row 0 = source nodes, row 1 = target nodes
    edge_pd = nx.to_pandas_edgelist(G)[['source', 'target']].T.values
    return network, edge_pd, features, ...
This ensures that any dataset prepared for the original EMOGI implementation can be dropped directly into proEMOGI.
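One detail is worth keeping in mind when turning an adjacency matrix into an edge list. NetworkX lists each edge of an undirected graph once, whereas PyTorch Geometric's `edge_index` convention represents an undirected edge as two directed entries. Whether the repository symmetrizes the edges at a later step is not visible in the excerpt above, so the following is only a sketch of one alternative, using a hypothetical helper `adjacency_to_edge_index`, that reads both directions straight from a symmetric matrix with NumPy:

```python
import numpy as np

def adjacency_to_edge_index(adj):
    """Return a (2, E) array of [source; target] indices, one column per nonzero entry."""
    src, dst = np.nonzero(adj)
    return np.stack([src, dst])

# Symmetric adjacency matrix of the path graph 0 - 1 - 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
edge_index = adjacency_to_edge_index(adj)
print(edge_index)
```

Because the matrix is symmetric, every undirected edge shows up as two columns, once in each direction.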
Training the Model
The training loop in train_proemogi.py is a standard PyTorch training cycle, but it includes specific handling for the graph structure.
- Data Loading: It reads the HDF5 file.
- Model Initialization: Instantiates the proEMOGI class with dimensions inferred from the data.
- Optimization: Uses the Adam optimizer. Note the selective weight decay application: in GCNs, weight decay is often applied only to certain layers (the original GCN reference implementation, for example, regularizes only the first layer) to limit overfitting.
# From train_proemogi.py
model = proEMOGI(input_dim=features.shape[1], ...).to(device)

# Training loop logic
def train(y, train_mask):
    model.train()
    optimizer.zero_grad()
    # Run the model on all nodes so that row indices still match edge_list,
    # then restrict the loss to the training nodes.
    out = model(features, edge_list)
    loss = F.nll_loss(out[train_mask], y[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
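In full-graph GCN training, the model typically sees every node, so that the indices in the edge list stay valid, and the mask only decides which nodes contribute to the loss. The sketch below shows that masking step in plain NumPy. The helper `masked_nll_loss` and the toy numbers are made up for illustration; the averaging mirrors the default mean reduction of `F.nll_loss`.

```python
import numpy as np

def masked_nll_loss(log_probs, labels, mask):
    """Mean negative log-likelihood over the nodes selected by a boolean mask."""
    idx = np.flatnonzero(mask)             # indices of the training nodes
    picked = log_probs[idx, labels[idx]]   # log-probability of each true class
    return -picked.mean()

# Log-probabilities for 4 nodes and 2 classes (each row sums to 1 in probability space)
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.5, 0.5],
                             [0.3, 0.7]]))
labels = np.array([0, 1, 0, 0])
train_mask = np.array([True, True, False, False])

loss = masked_nll_loss(log_probs, labels, train_mask)
print(loss)
```

Only the first two nodes enter the loss here; the other two still take part in message passing but produce no gradient signal of their own.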
Getting Started
To try proEMOGI yourself, you’ll need the repository and your data in the correct HDF5 format.
- Clone the Repo:
git clone https://github.com/Bibyutatsu/proEMOGI.git
cd proEMOGI
- Install Dependencies: You’ll need PyTorch, PyTorch Geometric, h5py, and other standard data science libraries.
- Run Training:
python proEMOGI/train_proemogi.py -d path/to/your/data.h5
Conclusion
Re-implementing complex biological models like EMOGI in modern frameworks ensures they remain accessible and usable for the community. proEMOGI serves as both a functional tool for cancer gene prediction and a great example of how to structure a GNN project in PyTorch.
Check out the proEMOGI repository and the original EMOGI paper to learn more.