MHOLline 2.0: WORKFLOW FOR AUTOMATIC LARGE-SCALE MODELING AND ANALYSIS OF PROTEINS

Genomes and their proteins can be analyzed from different perspectives, with different goals and applicability of results in several areas of research. In bioinformatics and computational biology, it is common to use multiple combinations of programs and databases to extract information from raw data. Choosing the proper program, setting its parameters, filtering the input files, extracting information from output files, and creating scripts for automating tasks can be challenging. This work describes MHOLline, an online scientific workflow that provides a set of modules enabling broad, large-scale analysis of proteins (e.g., whole genomes) in a few hours. Version 1.0 (released in 2010) was wholly reformulated, and the new version 2.0 is available at www.mholline2.lncc.br. This version presents new features and modules, code optimization, runtime reduction, improved security, a new interface, a new results visualization interface, and an automatic and user-friendly refinement page.


INTRODUCTION
Genome analysis may provide information about each protein's role in organisms, reveal which proteins play similar or analogous biological functions, and identify proteins intrinsic to a particular species. It also enables the search for new molecular targets for the treatment of diseases and assists in structure-based drug design (SBDD).
The blind screening method used in drug development is responsible for creating the majority of drugs available. It consists of testing a library of compounds against in vivo and in vitro systems. The costs to purify, characterize, and synthesize these compounds, together with the need for many compounds in a library, make the process laborious and highly expensive, costing billions of dollars (ADAMS; BRATNER, 2006; KOLA; LANDIS, 2004; DIMASI; GRABOWSKI; HANSEN, 2016).
Therefore, computational methods have been proposed to decrease costs and speed up the drug design process (e.g., SBDD) (LOUNNAS et al., 2013). The more that is known about the three-dimensional (3D) structure of proteins, the faster the drug discovery process becomes (CONGREVE; MURRAY; BLUNDELL, 2005; MANDAL; MOUDGIL; MANDAL, 2009). In this context, the MHOLline workflow was created to automate (i) the analysis of structural properties of proteins, (ii) the construction of 3D models of proteins, aiming to obtain a large number of protein models easily and in less time and thereby accelerate SBDD studies, and (iii) the validation of 3D models (CAPRILES et al., 2010).
Over the past ten years, MHOLline has contributed to dozens of studies in different areas of Computational Modeling (JAISWAL et al., 2017; JAMAL et al., 2017; HASSAN et al., 2018; VASCONCELOS; CAMPOS; REZENDE, 2018; ALMEIDA et al., 2019; GONÇALVES et al., 2019). In this paper, we present the second version of this workflow, with the addition of new software, proprietary tools for protein modeling analysis and refinement, and an administration environment.
MHOLline is a scientific workflow developed to aid researchers from Bioinformatics, Biophysics, Computational Chemistry, and Computational Biology. In 2010, the partnership between Universidade Federal do Rio de Janeiro (UFRJ) and Laboratório Nacional de Computação Científica (LNCC) resulted in the first web version of MHOLline, available at www.mholline.lncc.br (CAPRILES et al., 2010; GUIMARÃES et al., 2013).
Revista Mundi, Engenharia e Gestão, Paranaguá, PR, v. 5, n. 6, p. 283-01, 283-14, 2020 DOI: 10.21575/25254782rmetg2020vol5n61325
This workflow has been defined as a computing environment divided into three parts (Figure 1): (i) MHOLcore, which generates, processes, and maintains all data; (ii) MHOLweb, the graphical user interface that displays the progress of jobs and their results; and (iii) MHOLdb, the database used to store and query all the system data. MHOLline was developed for Linux OS. MHOLcore uses Perl, ShellScript, C, and Python2/3; the MHOLdb database uses MySQL; and MHOLweb uses PHP, JavaScript, HTML, and CSS.
There are two primary access types: with or without login. A user without login can submit new jobs but is limited to 50 amino acid sequences per file in FASTA format. Logged users fall into three types: (i) Registered, who can submit and manage their jobs without the restriction on protein sequences per file; (ii) Manager, who has some administrative privileges beyond those of a Registered user to help with system management; and (iii) Administrator, who has all privileges for maintenance of the server and database, statistical analysis of MHOLline usage, and the query base tools that keep MHOLline updated and working.

1.1 What's New in MHOLline 2.0
The first version of MHOLline (CAPRILES et al., 2010) presented the following modules: (i) the BLAST program (ALTSCHUL et al., 1990), used to perform the local alignment of the submitted sequences against the Protein Data Bank (PDB) (BERMAN et al., 2000); (ii) the BATS tool, developed to classify the alignments into four groups (Table 1) based on the E-value and identity from BLAST and on the Length Variation Index (LVI), which is MHOLline's concept of coverage (LVI ≤ 0.1 is equivalent to coverage ≥ 90%); (iii) the FILTERS tool, created to sort G2 proteins (proteins that can be modeled by the comparative modeling technique) into seven quality groups (Table 2); (iv) the ECNGet tool, developed to capture at least one Enzyme Commission (EC) number for each sequence in the G2 group whose reference protein has at least one known enzymatic function; (v) the MODELLER software (WEBB; SALI, 2016), which constructs the 3D models; (vi) the PROCHECK software (LASKOWSKI et al., 1993), used to produce Ramachandran plots; and (vii) the HMMTOP software (TUSNÁDY; SIMON, 2001), used to identify transmembrane regions in proteins.
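The relation between LVI and coverage stated for the BATS tool can be illustrated with a minimal sketch. It assumes that LVI is the fraction of the query sequence left uncovered by the alignment, which is what the equivalence "LVI ≤ 0.1 is equivalent to coverage ≥ 90%" implies; the exact formula used by BATS is not given here, and the function names below are hypothetical.

```python
def length_variation_index(aligned_length: int, query_length: int) -> float:
    """Return the LVI, ASSUMED here to be the uncovered fraction of the query.

    This definition is inferred from "LVI <= 0.1 is equivalent to
    coverage >= 90%"; it is not taken from the BATS source code.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    if not 0 <= aligned_length <= query_length:
        raise ValueError("aligned_length must lie between 0 and query_length")
    return (query_length - aligned_length) / query_length


def meets_coverage_cutoff(aligned_length: int, query_length: int,
                          max_lvi: float = 0.1) -> bool:
    """True when the alignment covers enough of the query (LVI <= max_lvi)."""
    return length_variation_index(aligned_length, query_length) <= max_lvi


if __name__ == "__main__":
    # A 100-residue query aligned over 90 residues sits exactly at the cutoff.
    print(length_variation_index(90, 100), meets_coverage_cutoff(90, 100))
```

Under this assumption, an alignment with an LVI of 0.57 would cover less than half of the query sequence.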
In this new version, MHOLline has undergone the following modifications: (i) update of modules to their latest versions; (ii) addition of new software; (iii) code refactoring to make the workflow more readable and faster; (iv) security improvements; (v) creation of an online view mode where the user can analyze the job results; (vi) development of a refinement system where the logged user can edit the alignment and resubmit it to the workflow, also using some structural restrictions in the 3D model construction; (vii) development of a re-upload tool for data reuse, allowing data re-analysis and refinement without having to pass through the workflow processing again, saving the user time; and (viii) creation of a new interface, taking into account the layout and operational compatibility with the main browsers, according to the concept of human-computer interaction.
In the system administration area, we incorporated a tool that collects user and system activity data, which are used to generate graphs, tables, and reports, providing the administrator with information to design future modifications and keep records of workflow activities. All these data are saved in the database, making it possible to retrieve information from previous periods and compare it with the current scenario. Additionally, we developed functions to back up, restore, and update the modules and the database that run directly from the system interface via the administrator environment. These services facilitate remote system administration and maintenance without access via a remote network server.
Additionally, logged users can improve or reconstruct 3D models and download the original MHOLline results or the last two modified results through the refinement process. The allowed improvements are: (i) changing the template chosen by default or adding up to two more structures; (ii) cleaving the protein at positions predicted by the SignalP program; (iii) using secondary structure constraints during model construction via MODELLER; and (iv) optimizing loops with MODELLER. It is important to note that when the user changes or adds templates, the workflow executes the salign function from MODELLER if only one structure is selected; otherwise, it executes the CLUSTALΩ software (SIEVERS et al., 2011).
The logged user can also download the alignment, perform manual adjustments in this file, and then re-upload it to be used in the refinement tool. The interface provides information regarding the protein model and templates (Figure 3), such as the secondary structure of the sequence model predicted with PSIPRED, the secondary structure of the template files assigned by DSSP, and the transmembrane regions found in the sequence model using TMHMM.
Figure 3: Sample of information provided in the alignment section of the refinement interface. In the figure, the characters over the sequences mean (c) coil, (h) alpha-helix, and (e) beta structure.
During the re-submission, the logged user may choose either Manual or Automatic refinement. The Automatic mode selects the secondary structures, based on the PSIPRED prediction, that occur in template gap regions, using different combinations of size and position in an attempt to increase the variety of 3D models. This process may generate up to 50 models. If the number of possibilities exceeds 50 combinations, the system randomly chooses 50 and then models the selected ones. If there are no gaps, or no alpha/beta secondary structure is predicted in the gapped regions, the system optimizes loops with MODELLER, generating 20 models.
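The cap of 50 combinations in the Automatic mode can be sketched as follows. This is an illustration under stated assumptions, not the MHOLline implementation: the candidate restraints are simplified to a fixed list of labeled segments (the workflow also varies their size and position), every non-empty subset counts as one combination, and the function name `choose_restraint_sets` is hypothetical.

```python
import itertools
import random

MAX_MODELS = 50  # cap on Automatic refinement models stated in the text


def choose_restraint_sets(candidates, max_sets=MAX_MODELS, seed=None):
    """Return at most max_sets non-empty combinations of candidate restraints.

    All combinations are enumerated; if there are more than max_sets,
    a uniform random sample without replacement is drawn. Enumeration is
    exponential in len(candidates), so this is for illustration only.
    """
    combos = [
        combo
        for size in range(1, len(candidates) + 1)
        for combo in itertools.combinations(candidates, size)
    ]
    if len(combos) <= max_sets:
        return combos
    rng = random.Random(seed)
    return rng.sample(combos, max_sets)


if __name__ == "__main__":
    # Hypothetical predicted segments falling in template gap regions.
    segments = ["helix_A", "helix_B", "strand_A", "strand_B", "helix_C", "strand_C"]
    selected = choose_restraint_sets(segments, seed=1)
    print(len(selected), "restraint sets selected for modeling")
```

With six candidate segments there are 63 non-empty subsets, so the sketch falls back to random sampling; with three there are only seven and all of them are kept.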
Aiming to minimize the effort required to analyze the large amount of data generated in the Automatic refinement, we implemented a library written in Python 3, called the Quality-Model Clustering (QMC) Tool, to cluster protein models into groups based on their energy and stereochemical qualities (CARVALHO; ROSSI; GOLIATT, 2020). This tool takes into account the MODELLER scores (molpdf, DOPE, DOPE-HR, GA341, and N-DOPE) and the Molprobity results to group the structures by similarity, returning to the logged user the clusters with their representative structures. The QMC Tool is freely available for download from the git repository (github.com/ruanmedina/Quality-Model-Clustering).
The Manual refinement performs the re-modeling based on user inputs (Figure 4), generating 20 models. As in the Automatic process, the input data should also consider modifications in the alignment, secondary structure constraints, and template change. In this case, the MHOLline 2.0 refinement tool does not run the protein clustering library (QMC), since the user chooses the refinement.
To validate both the Manual and Automatic refinement processes, we performed three tests: the Automatic refinement, the Manual refinement, and modeling without any restriction. The protein used in this validation process was the ATP-diphosphohydrolase 2 (Uniprot code: A1BXT9) of Schistosoma mansoni, also called SmNTPDase2 (LEVANO-GARCIA et al., 2007). This sequence was submitted to the MHOLline 2.0 workflow, which selected as the template the structure of nucleoside triphosphate diphosphohydrolase 1 (NTPDase1) of Rattus norvegicus (PDB code: 3ZX3) (ZEBISCH et al., 2012), presenting 33% identity (68/204) and an LVI of 0.57. To evaluate the Automatic and the Manual refinements within gap regions, we replaced the template with the NTPDase1 of Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) (PDB code: 4BRP, chain A) (ZEBISCH et al., 2013), with 28% identity and an LVI of 0.53. The objective was to obtain an alignment with lower coverage and identity so that structural restrictions could be applied. The system chose the restrictions applied in the Automatic approach, and Table 3 presents the constraints used in the Manual procedure. In both tests, we decided to cleave the region detected as a signal peptide and did not use contact or beta-sheet restrictions, nor loop optimization.
The QMC Tool filtered each set (Automatic, Manual, and No Restriction), performing a Principal Component Analysis (PCA) over the MODELLER and Molprobity attributes. The dimensionality of the data was reduced from seven to two components, enough to explain more than 85% of the variance in the original data (86.8% for Automatic, 87.8% for Manual, and 92.0% for No Restriction). We selected the best model of each set by visual inspection of each cluster's representative model. Figure 5 presents the final three best models (one from each set) aligned with the control structure, the previously published 3D model of SmNTPDase2 (SOUZA et al., 2014). Table 4 presents the backbone Root Mean Square Deviation (RMSD) calculated using the RMSD plugin of PyMol (Schrödinger, LLC, 2015). As observed in Table 4, the Manual refinement achieved a smaller RMSD value, followed by the Automatic approach, indicating that the use of any refinement tool might lead to better results than not applying structural improvements. Figure 5 shows that the Automatic refinement (Fig. 6b), although not as good as the Manual one (Fig. 6c), was capable of optimizing the protein structure.
The DOPE profile was computed for each structure (models, template, and reference) and plotted in Figure 6 to illustrate the differences in DOPE energy for each residue. In regions where restraints were applied, the models' energy profile was similar to that of SmNTPDase2, for example, in the alpha-helix constraints between residues 97-106 and 109-112, and in the beta-strand constraints between residues 182-183 and 242-251. The C-terminal coils present different energy values due to their high flexibility, which allows the structure to assume different conformations in this region. This problem can be minimized by using contact restrictions.

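The reduction from seven quality attributes to two principal components, and the fraction of variance those two components explain, can be sketched with NumPy. This is a generic PCA under stated assumptions rather than the QMC Tool's code: the columns are standardized before decomposition (an assumption, since the scores have very different scales), the function name `pca_two_components` is hypothetical, and the input is random placeholder data, not the validation set.

```python
import numpy as np


def pca_two_components(scores):
    """Project an (n_models, n_attributes) score matrix onto its first two PCs.

    Returns the projected coordinates and the fraction of total variance
    explained by the two components.
    """
    X = np.asarray(scores, dtype=float)
    std = X.std(axis=0)
    if np.any(std == 0):
        raise ValueError("constant attribute columns cannot be standardized")
    # Standardize each attribute (assumed preprocessing step).
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardized data; rows of Vt are the principal axes.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    projected = Z @ Vt[:2].T
    return projected, float(explained[:2].sum())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Placeholder matrix: 50 models x 7 attributes (e.g. five MODELLER
    # scores plus Molprobity values); real scores would replace this.
    fake_scores = rng.normal(size=(50, 7))
    coords, fraction = pca_two_components(fake_scores)
    print(coords.shape, round(fraction, 3))
```

The two-dimensional coordinates returned here are what a clustering step would then group, and the explained fraction is the quantity reported above as exceeding 85% for each set.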
A single FASTA file submission with a large number of sequences (e.g., entire genomes) can take a long processing time and generate large result files. For this reason, jobs are automatically deleted 15 days after they finish processing in the main workflow, saving hard disk space. In case users need more than 15 days to process their results, a data-reuse tool was created, and logged users can re-upload their original .tar.gz file (previously downloaded) to the system (SIQUEIRA, 2018).
The user re-uploads the previous job file, and MHOLline checks the encrypted key to verify data integrity. The user can then automatically check whether the re-uploaded data are the same as they were before the deletion and whether the re-uploaded files represent the same data as finished jobs in the system (provided the user has not previously deleted the job). Finally, the tool inserts the files in their original folders, restores the data in MHOLdb, and sets the job as active again.
Thus, the user will have 15 more days to continue the analyses.
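The integrity check and the renewed 15-day retention window can be illustrated with a short sketch. The text does not say how the encrypted key is derived, so a SHA-256 digest of the archive stands in for it here as an assumption; the function names and the idea of storing the digest alongside the job record are likewise hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

RETENTION = timedelta(days=15)  # retention period stated in the text


def archive_digest(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks (stand-in for the key)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path, stored_digest):
    """Compare the re-uploaded archive with the digest kept for the job."""
    return hmac.compare_digest(archive_digest(path), stored_digest)


def expiry_date(reactivated_at):
    """Date on which a job reactivated at the given time is deleted again."""
    return reactivated_at + RETENTION


if __name__ == "__main__":
    print(expiry_date(datetime(2020, 6, 1)))
```

A job reactivated through re-upload would be checked with `is_unmodified` before its files are restored, and `expiry_date` gives the new deletion date.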

CONCLUSIONS
The new version of MHOLline features both system improvements and analysis tools. We refactored the system, adding features, standardizing the code and documentation, and removing vulnerabilities. Unified Modeling Language (UML) diagrams were generated to document each step of the workflow, from MHOLcore to MHOLweb. The web interface of MHOLline was completely rewritten, and a new layout for the result page was developed to improve the user experience. The main difference in the new MHOLweb is the refinement page, in which the user can manually or automatically modify and optimize the 3D model constructed by MHOLline.
The option to change or add more templates proved to be a significant improvement in the workflow. Sometimes, the best BLAST hit may not be the best modeling template due to its original function or organism. Now, users have the option to change the default template to best suit their needs. Additionally, the ability to indicate to MODELLER where to set secondary structure constraints based on PSIPRED predictions and user knowledge is also a significant improvement since it gives the user more autonomy to model the protein as desired.
The objective of creating new tools and enhancing existing ones in the workflow was to ensure the correct functioning of the process, renewing and innovating based on current programming practices, project management approaches, computational modeling, and information security.
The latest version of MHOLline can be accessed at www.mholline2.lncc.br.

PERSPECTIVES
As the next steps, we are planning to implement the second phase of MHOLline code optimization. The modules will have their own docker containers, allowing administrator users to have more control over all processes running on the MHOLline server. Other computational biology software will be incorporated to expand the capability of MHOLline 2.0 to provide information about proteins, such as residue-residue contact prediction, which can also be used to improve the refinement tool.