
HIV/AIDS Subtype Prediction Based on env Gene Sequence (HIV-1-M-SPBEnv)

 

The Human Immunodeficiency Virus (HIV) is a retrovirus that attacks and gradually destroys the human immune system, leaving the host unprotected against other infections. People infected with HIV who die usually die from secondary infections or cancer. AIDS is the final stage of HIV infection.

There are two main types of HIV: HIV-1 and HIV-2. HIV-1 originated in the area around the Congo Basin in Africa and is the most prevalent type globally, responsible for about 95% of all HIV infections. HIV-2 is found primarily in West Africa, although it also affects a small number of people in Europe, India, and the United States.

HIV-1 is divided into four groups: M, N, O, and P. Group M is the most widely distributed worldwide and is further divided into 12 subtypes or sub-subtypes: A1, A2, B, C, D, F1, F2, G, H, J, K, and L. These subtypes and sub-subtypes are the result of continuous molecular evolution. Classifying them correctly is important for vaccine design, for the development of therapeutic drugs, and for effective prevention and control of AIDS worldwide.

Precise classification of group M subtypes and sub-subtypes relies on phylogenetic analysis of specific gene sequences. In the past, subtypes were determined approximately through homology searches against the NCBI database. The accuracy of that approach depended on the judgment of the person running the search, and in some cases no correct call could be made at all. Tools that classify HIV-1 subtypes with statistical models have also been developed, but the small sample sizes available for some subtypes severely limit them. We therefore developed a deep learning-based method, which we named HIV-1-M-SPBEnv.

To address the scarcity of samples for some subtypes, we used artificial genetic mutation to synthesize additional training samples for the under-represented subtypes.
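The page does not describe the mutation procedure itself, so the following is only a minimal sketch of one way such augmentation could work. It assumes random point substitutions at a fixed per-base rate; the function names `mutate_sequence` and `augment`, the `rate` value, and the toy sequences are all hypothetical and are not taken from the HIV-1-M-SPBEnv source code.

```python
import random

BASES = "ACGT"

def mutate_sequence(seq, rate, rng):
    """Return a copy of seq in which each A/C/G/T is replaced by a
    different base with probability `rate` (assumed mutation model)."""
    out = []
    for base in seq:
        if base in BASES and rng.random() < rate:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            # Keep the base (and any ambiguity codes) unchanged.
            out.append(base)
    return "".join(out)

def augment(seqs, target_size, rate=0.01, seed=0):
    """Pad a small class up to target_size with mutated copies of
    randomly chosen real sequences."""
    rng = random.Random(seed)
    synthetic = []
    while len(seqs) + len(synthetic) < target_size:
        parent = rng.choice(seqs)
        synthetic.append(mutate_sequence(parent, rate, rng))
    return list(seqs) + synthetic

# Toy example: grow a two-sequence class to ten samples.
samples = augment(["ACGTACGTAC", "TTGACCGATA"], target_size=10)
print(len(samples))
```

Point substitutions keep every synthetic sequence the same length as its parent. A real pipeline might also need insertions and deletions, which this sketch leaves out.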

Because the env gene evolves rapidly, we chose env gene sequences as the basis for classifying HIV-1 subtypes. We downloaded the sequences from the HIV Sequence Database supported by Los Alamos National Laboratory (https://www.hiv.lanl.gov/content/index). We used the 2021 release of the env gene sequences, the latest available at the time of writing, which contains 5,310 samples in total (Table 1).

Table 1. The original dataset of env DNA sequences for the 12 HIV-1 group M subtypes

Subtype    Sample Size    Subtype    Sample Size
A1         311            F2         16
A2         5              G          136
B          2,887          H          10
C          1,717          J          5
D          145            K          2
F1         73             L          3
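The counts in Table 1 are heavily skewed toward two subtypes. The short calculation below checks the total and shows how uneven the classes are. The cutoff of 20 samples used to flag rare subtypes is an arbitrary choice for illustration, not a threshold stated by the authors.

```python
# Sample sizes copied from Table 1.
counts = {
    "A1": 311, "A2": 5, "B": 2887, "C": 1717, "D": 145, "F1": 73,
    "F2": 16, "G": 136, "H": 10, "J": 5, "K": 2, "L": 3,
}

total = sum(counts.values())

# Illustrative cutoff only: subtypes with fewer than 20 sequences.
rare = sorted(s for s, n in counts.items() if n < 20)

# Share of the dataset taken up by subtypes B and C together.
bc_share = (counts["B"] + counts["C"]) / total

print(total, rare, f"{bc_share:.1%}")
# 5310 ['A2', 'F2', 'H', 'J', 'K', 'L'] 86.7%
```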

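In our deep learning framework, we use the k-mer method to vectorize DNA sequences: each sequence is broken into overlapping subsequences of length k, which are then converted into a numeric representation.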
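As a rough illustration of k-mer vectorization, the sketch below counts overlapping k-mers and returns their normalized frequencies in a fixed order. The value of k, the use of frequencies rather than token indices or embeddings, and the function name `kmer_vector` are assumptions made for this example; the page does not state which variant HIV-1-M-SPBEnv uses.

```python
from itertools import product

def kmer_vector(seq, k=3):
    """Return normalized frequencies of all 4**k possible k-mers,
    ordered lexicographically over the alphabet A, C, G, T."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = seq.upper()
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing ambiguous bases such as N
            counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocab)
    return [c / total for c in counts]

vec = kmer_vector("ACGTAC", k=2)
print(len(vec))  # 16 possible 2-mers
print(vec[1])    # frequency of "AC": 2 of the 5 windows, so 0.4
```

The vector length is 4**k regardless of sequence length, so env sequences of different lengths map to inputs of the same size.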

Our model training adopts a strategy that combines an Autoencoder with a classifier. The role of the Autoencoder is to extract feature information from the high-dimensional vectorized DNA sequences. There are two training cycles: an Autoencoder training cycle and a classifier training cycle. In the Autoencoder training cycle, we train the Autoencoder to drive its reconstruction loss toward zero, so that its output reconstructs its input as closely as possible. In the classifier training cycle, we train the classifier to minimize its training loss and thereby achieve higher classification accuracy.

The two loss values are added together to give a total loss, and driving this total loss toward zero is the training criterion for the entire model. As the total loss approaches zero, the Autoencoder's reconstruction loss approaches zero as well, so the Autoencoder's output closely reconstructs its input. The encoder's output therefore captures the information in the high-dimensional DNA sequence representation in a low-dimensional form. This output is then fed into a fully connected neural network block, which performs the classification task. The architecture of HIV-1-M-SPBEnv is shown in Figure 1.
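The combined training criterion can be sketched numerically as the sum of a reconstruction term and a classification term. The page names neither loss function, so the example below assumes mean squared error for reconstruction and softmax cross-entropy for classification. The function names and the toy values are hypothetical, and the sketch only evaluates the loss for one sample; it does not train a network.

```python
import math

def mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick
    for numerical stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def total_loss(x, x_hat, logits, label):
    """Total loss = reconstruction loss + classification loss."""
    return mse(x, x_hat) + cross_entropy(logits, label)

# Toy values: an imperfect reconstruction and a fairly confident, correct prediction.
x = [0.2, 0.8, 0.0]
x_hat = [0.1, 0.7, 0.1]
logits = [4.0, 0.5, -1.0]  # classifier scores for three hypothetical classes
print(total_loss(x, x_hat, logits, label=0))
```

Both terms are non-negative, so the sum can approach zero only when the reconstruction is accurate and the classifier is confident in the correct class. This is the reasoning behind using the total as a single criterion.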

On an independent validation dataset, HIV-1-M-SPBEnv reached an accuracy of 100%, which indicates strong generalization. The source code of HIV-1-M-SPBEnv is shared with academic users on GitHub at https://github.com/pengsihua2023/HIV-1-M-SPBEnv, and the trained model is deployed on this website.

Figure 1. An illustration of the HIV-1-M-SPBEnv architecture.