Title:
Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data.
Authors:
Draizen EJ; Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA. edraizen@gmail.com.; School of Data Science, University of Virginia, Charlottesville, VA, USA. edraizen@gmail.com.
Readey J; The HDF Group, Bellevue, WA, USA.
Mura C; Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA. cmura@virginia.edu.; School of Data Science, University of Virginia, Charlottesville, VA, USA. cmura@virginia.edu.
Bourne PE; Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA.; School of Data Science, University of Virginia, Charlottesville, VA, USA.
Source:
BMC bioinformatics [BMC Bioinformatics] 2024 Jan 04; Vol. 25 (1), pp. 11. Date of Electronic Publication: 2024 Jan 04.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: BioMed Central Country of Publication: England NLM ID: 100965194 Publication Model: Electronic Cited Medium: Internet ISSN: 1471-2105 (Electronic) Linking ISSN: 14712105 NLM ISO Abbreviation: BMC Bioinformatics Subsets: MEDLINE
Imprint Name(s):
Original Publication: [London] : BioMed Central, 2000-
Grant Information:
Presidential Fellowship University of Virginia; MCB-1350957 National Science Foundation, United States
Contributed Indexing:
Keywords: Deep learning; Machine learning; Massively parallel workflows; Protein structure; Structural bioinformatics
Substance Nomenclature:
0 (Proteins)
Entry Date(s):
Date Created: 20240104 Date Completed: 20240108 Latest Revision: 20241019
Update Code:
20250114
PubMed Central ID:
PMC10768222
DOI:
10.1186/s12859-023-05586-5
PMID:
38177985
Database:
MEDLINE

Further Information

Background: Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.
Results: Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
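The distinction between atom-level and residue-level features can be illustrated with a minimal Python sketch. This is not Prop3D's code or data layout: the atom records, the partial-charge values, and the function names `broadcast_to_atoms` and `pool_to_residues` are assumptions made up for illustration. Only the Kyte-Doolittle hydropathy values are taken from the standard published scale.

```python
from collections import defaultdict

# Kyte-Doolittle hydropathy values for a few residues (a standard residue-level scale).
KD_HYDROPATHY = {
    "ALA": 1.8, "ARG": -4.5, "ASP": -3.5,
    "GLY": -0.4, "LEU": 3.8, "SER": -0.8,
}

# Hypothetical atom records: (residue index, residue name, atom name, partial charge).
# The charges are invented for this example and are not real force-field values.
atoms = [
    (1, "GLY", "N", -0.30), (1, "GLY", "CA", 0.10), (1, "GLY", "C", 0.50),
    (2, "ASP", "CA", 0.05), (2, "ASP", "OD1", -0.80), (2, "ASP", "OD2", -0.80),
]


def broadcast_to_atoms(atom_records, residue_scale):
    """Copy a residue-level value onto every atom of that residue."""
    return [residue_scale[resname] for (_, resname, _, _) in atom_records]


def pool_to_residues(atom_records):
    """Sum an atom-level quantity (here, charge) over the atoms of each residue."""
    totals = defaultdict(float)
    for res_index, _, _, charge in atom_records:
        totals[res_index] += charge
    return dict(totals)


per_atom_hydropathy = broadcast_to_atoms(atoms, KD_HYDROPATHY)
per_residue_charge = pool_to_residues(atoms)
```

Broadcasting lets a coarse feature sit alongside atomic ones in a single per-atom table, while pooling goes the other way when a model works at residue resolution.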
Conclusion: Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
(© 2024. The Author(s).)
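
The 'data leakage' concern raised in the Conclusion can be made concrete with a minimal sketch of a cluster-aware split. It assumes each domain already carries a cluster label (for example, from sequence clustering) and keeps every cluster wholly in either the training or the test set. This is a generic technique and not necessarily the procedure used to build Prop3D-20sf; the function name `split_by_cluster`, the domain identifiers, and the cluster labels are hypothetical.

```python
import hashlib


def split_by_cluster(domain_to_cluster, test_fraction=0.2):
    """Assign whole clusters to train or test so related domains never straddle the split.

    The assignment hashes the cluster label, so it is reproducible across runs
    and machines. The realized test fraction is only approximately
    `test_fraction`, especially when there are few clusters.
    """
    train, test = [], []
    for domain, cluster in sorted(domain_to_cluster.items()):
        digest = hashlib.sha256(cluster.encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(domain)
    return train, test


# Hypothetical domain identifiers mapped to hypothetical cluster labels.
example = {
    "dom_a1": "cluster_A", "dom_a2": "cluster_A",
    "dom_b1": "cluster_B",
    "dom_c1": "cluster_C", "dom_c2": "cluster_C", "dom_c3": "cluster_C",
}
train_ids, test_ids = split_by_cluster(example, test_fraction=0.3)
```

Because membership depends only on the cluster label, adding new domains to an existing cluster never moves that cluster across the split, which helps keep benchmarks comparable as a dataset grows.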