AI has inaugurated a new era in bioinformatics, to the point where contemporary language models can extract structural information from single protein sequences alone. Contributing to this field, we built TintiNet.jl, a 100% open-source, open-data, Julia-based, and portable language model for predicting 1D protein structural properties. Our model achieves top-tier computational and predictive performance compared with other modern algorithms, using only a fraction of their parameter count.
The objective of TintiNet.jl is to advance single-sequence prediction of 1D protein structural properties by drastically reducing model size while preserving or improving raw predictive power.
Our main design principles were to avoid layers that serialize computation along the sequence (such as recurrent neural networks) and to employ encoding layers that can grow deeper without a steep increase in computational complexity. Our solution was a hybrid convolutional-transformer architecture built with the Julia language, the Flux.jl framework, and the transformer layers that Transformers.jl contributes to Flux.jl, together with several BioJulia packages (BioSequences.jl, BioStructures.jl, BioAlignments.jl, and FASTX.jl). The project is 100% open-source and open-data; scripts and procedures implementing the methodology presented here are available at https://github.com/hugemiler/TintiNet.jl.
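As a rough illustration of this design principle, the minimal sketch below combines a padded 1D convolutional stem with a single self-attention block over per-residue features, using only Flux.jl primitives. The kernel sizes, feature widths, hand-rolled attention, and 8-class output head are illustrative assumptions, not the published TintiNet.jl hyperparameters; the actual model uses the Transformers.jl layers.

```julia
using Flux

# Hedged sketch of a hybrid convolutional-transformer encoder in Flux.jl.
# All sizes below are assumptions for illustration only.

const AAS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
d = 128                              # per-residue feature width (assumed)

# Convolutional stem: SamePad() preserves sequence length, so every
# residue keeps a feature column that later maps to a 1D label.
conv_stem = Chain(
    Conv((7,), 20 => 64, relu; pad = SamePad()),
    Conv((5,), 64 => d,  relu; pad = SamePad()),
)

# Minimal single-head self-attention over a (d, L) feature matrix,
# providing global context without any recurrent (serialized) step.
struct SelfAttention
    Wq::Dense; Wk::Dense; Wv::Dense; Wo::Dense
end
Flux.@functor SelfAttention
SelfAttention(d::Integer) =
    SelfAttention(Dense(d => d), Dense(d => d), Dense(d => d), Dense(d => d))

function (sa::SelfAttention)(x::AbstractMatrix)
    q, k, v = sa.Wq(x), sa.Wk(x), sa.Wv(x)
    α = softmax((k' * q) ./ sqrt(Float32(size(x, 1))); dims = 1)  # (L, L) weights
    sa.Wo(v * α) .+ x                                             # residual link
end

attn = SelfAttention(d)
head = Dense(d => 8)                 # e.g. 8 secondary-structure states (assumed)

function predict(onehot::AbstractMatrix)         # onehot :: (20, L)
    h = reshape(permutedims(onehot), :, 20, 1)   # (L, 20, 1) layout for Conv
    h = conv_stem(h)                             # (L, d, 1), length preserved
    h = permutedims(dropdims(h; dims = 3))       # back to (d, L)
    softmax(head(attn(h)); dims = 1)             # (8, L) class probabilities
end

x = Float32.(Flux.onehotbatch(rand(AAS, 50), AAS))  # toy 50-residue sequence
ŷ = predict(x)                                      # one 8-way distribution per residue
```

Note how neither stage serializes computation along the residue axis: the convolutions mix local neighborhoods in parallel, and the attention block mixes global context in a single matrix product, which is what lets the encoder grow deeper without the step-by-step cost of a recurrent layer.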
By training and evaluating our model on an extensive collection of over 30,000 protein sequences, we demonstrate that this architecture matches the figures of merit (classification accuracy and regression error) of three modern, state-of-the-art models. Because it has far fewer parameters than these alternatives, it occupies much less memory and generates predictions up to 10 times faster.
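For concreteness, both figures of merit reduce to simple per-residue computations. The hedged sketch below shows one plausible form, with toy arrays standing in for real outputs; the 8-state secondary-structure labels and the real-valued solvent-accessibility target are assumptions for illustration, not the paper's exact evaluation protocol.

```julia
# accuracy: fraction of residues whose predicted class label matches the truth
accuracy(pred, truth) = count(pred .== truth) / length(truth)
# mae: mean absolute error for a real-valued per-residue property
mae(pred, truth) = sum(abs.(pred .- truth)) / length(truth)

# Toy stand-ins: `ss_*` mimics secondary-structure classes, `rsa_*` a
# real-valued property such as relative solvent accessibility (assumed).
ss_pred  = ['H', 'H', 'E', 'C', 'C'];  ss_true  = ['H', 'E', 'E', 'C', 'C']
rsa_pred = [0.12, 0.55, 0.30];         rsa_true = [0.10, 0.60, 0.25]

accuracy(ss_pred, ss_true)  # 0.8  (4 of 5 residues correct)
mae(rsa_pred, rsa_true)     # 0.04
```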