Chapter 02 : Accessing Databases
I am following a Bioinformatics repository on GitHub to learn, since it contains all the code and starts from the basics.
This code retrieves data from the NCBI databases using the Python module Bio (Biopython). I'm using Google Colab to run it.
Install the Biopython package (it provides the Bio module)
!pip install biopython
Once the package has installed successfully, import the Entrez, Medline and SeqIO modules.
from Bio import Entrez, Medline, SeqIO
We can then fetch data from a database with a query, but before we begin we have to set our email address, which NCBI uses to contact us if there is a problem with our requests.
Entrez.email = "rudrajoshi@gmail.com"
Listing the available databases with Entrez.read(Entrez.einfo()) prints output like this:
{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']}
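The list above can also be used to catch a misspelled database name before a search is sent. The following is a minimal sketch, not part of the original walkthrough: list_databases and check_db are hypothetical helper names, and entrez is assumed to be the Bio.Entrez module (or any object with the same einfo and read functions), passed in as an argument so the helpers can be tried without a network connection.

```python
import difflib


def list_databases(entrez):
    """Return the database names reported by einfo.

    `entrez` is assumed to be the Bio.Entrez module, passed in explicitly.
    """
    handle = entrez.einfo()
    try:
        record = entrez.read(handle)
    finally:
        handle.close()
    return [str(name) for name in record["DbList"]]


def check_db(name, databases):
    """Return `name` if it is a known database, else raise with a suggestion."""
    if name in databases:
        return name
    # Look for the single closest known name to include in the error message.
    close = difflib.get_close_matches(name, databases, n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise ValueError(f"Unknown database '{name}'.{hint}")
```

In Colab this could be called as check_db("nucleotide", list_databases(Entrez)) after setting Entrez.email.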
Select a database from the list above and then insert your query.
handle = Entrez.esearch(db="nucleotide", term="your Query")
rec_list = Entrez.read(handle)  # Entrez.read parses the XML response into Python objects
print(rec_list)
By default, at most 20 records are returned, so we have to ask explicitly for all the records that match the query.
rec_list["Count"]
contains the total number of matching records.
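One detail worth checking here: the values that Entrez.read returns for fields such as Count and RetMax behave like strings, so comparing them directly compares text rather than numbers. A minimal sketch with a hand-written sample record (the values are made up for illustration, not real search results) shows the difference:

```python
# Hand-written stand-in for a parsed esearch result; the numbers are made up.
sample = {"Count": "1234", "RetMax": "20", "IdList": ["1", "2", "3"]}

# Text comparison goes character by character, so "20" sorts after "1234".
as_text = sample["RetMax"] < sample["Count"]

# Converting to int compares the actual numbers.
as_numbers = int(sample["RetMax"]) < int(sample["Count"])

print(as_text, as_numbers)  # prints: False True
```

Wrapping both values in int() before comparing avoids this pitfall.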
# The parsed values are strings, so convert them to int before comparing
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db="nucleotide", term='Your Query',
                            retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)
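Asking for every record in a single request can be heavy for large result sets, and NCBI limits how many IDs one esearch request returns. An alternative is to page through the results with the retstart and retmax parameters. The following is a sketch under assumptions, not the method used in the original repo: fetch_all_ids and entrez_search are hypothetical helpers, the batch size of 500 is arbitrary, and the search function is passed in as an argument so the paging logic can be tried offline.

```python
def entrez_search(**kwargs):
    """Run one esearch request and return the parsed record."""
    from Bio import Entrez  # imported here so the paging logic below works without it

    handle = Entrez.esearch(**kwargs)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def fetch_all_ids(search, term, db="nucleotide", batch_size=500):
    """Collect every ID for `term` by requesting one batch at a time.

    `search` is any function that accepts db, term, retstart and retmax
    and returns a record with "Count" and "IdList" (for example entrez_search).
    """
    ids = []
    total = None
    while total is None or len(ids) < total:
        record = search(db=db, term=term, retstart=len(ids), retmax=batch_size)
        total = int(record["Count"])  # Count arrives as a string
        batch = list(record["IdList"])
        if not batch:
            break  # stop if the server returns nothing, to avoid looping forever
        ids.extend(batch)
    return ids
```

With Entrez.email set, this could be called as fetch_all_ids(entrez_search, "your Query").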