Bioinformatics
Practice-Python
Chapter02

Chapter 02 : Accessing Databases

I am following the Bioinformatics repo on GitHub to learn, since it contains all the code and starts from the basics.

This code retrieves data from the NCBI databases using the Biopython package, which is imported under the name Bio. I'm running it in Google Colab.

Install the Biopython package

!pip install biopython

Once the package is installed successfully, import the Entrez, Medline, and SeqIO modules.

from Bio import Entrez, Medline, SeqIO

We can then fetch data from the database with a query, but before we begin we have to set our email address, which NCBI asks for so it can contact us if there is a problem with our requests.

Entrez.email = "rudrajoshi@gmail.com"

Listing the available databases with Entrez.read(Entrez.einfo()) gives:

Output: {'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']}
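Since the database name is passed as a plain string, a small check against the DbList can catch a misspelling before any request is sent. The sketch below is only an illustration: check_db_name is a hypothetical helper (not part of Biopython), and db_list is a short hand-copied subset of the output above.

```python
import difflib


def check_db_name(name, db_list):
    """Return name if it is in db_list, otherwise raise ValueError with suggestions."""
    if name in db_list:
        return name
    # Look for similarly spelled database names to suggest to the user.
    suggestions = difflib.get_close_matches(name, db_list, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown Entrez database {name!r}.{hint}")


# A few names copied by hand from the DbList output (not the full list).
db_list = ["pubmed", "protein", "nuccore", "nucleotide", "gene", "taxonomy"]

print(check_db_name("nucleotide", db_list))
try:
    check_db_name("nucletide", db_list)  # misspelled on purpose
except ValueError as err:
    print(err)
```

In a notebook, the full list could be taken from the 'DbList' entry of the einfo result instead of being typed by hand.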

Select a database from the list above and pass your query as the search term.

handle = Entrez.esearch(db="nucleotide", term="your Query")
rec_list = Entrez.read(handle)  # parse the XML response into a Python dictionary-like object
print(rec_list)
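The printed result is a dictionary-like object whose entries include 'Count', 'RetMax' and 'IdList'. The sketch below shows one way to pull out the parts that matter for the next step. summarize_search is a hypothetical helper, and sample_result is a made-up stand-in with placeholder IDs, used only so the snippet runs without a network connection. The int() conversions are there on the assumption that the counts may arrive as strings.

```python
def summarize_search(result):
    """Summarize an esearch-style result: total matches versus IDs actually returned."""
    total = int(result["Count"])     # total number of matching records
    ids = list(result["IdList"])     # IDs included in this response
    return {"total": total, "returned": len(ids), "complete": len(ids) >= total}


# Made-up stand-in for the output of Entrez.read(handle); the IDs are placeholders.
sample_result = {"Count": "45", "RetMax": "20", "RetStart": "0",
                 "IdList": [str(i) for i in range(1, 21)]}

print(summarize_search(sample_result))
```

With a real search, rec_list can be passed in place of sample_result.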

By default, a search returns at most 20 records. To retrieve all the matching records in the database, we have to repeat the search with retmax set to the total number of matches, which is stored in rec_list["Count"].

# Count and RetMax may come back as strings, so compare them as integers
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db="nucleotide", term='Your Query',
                            retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)
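Asking for every record in a single request works for small result sets. For larger ones, an alternative is to page through the results with the retstart and retmax parameters of esearch. The following is a minimal sketch of that idea, not the approach used in the repo: batch_offsets and search_all_ids are hypothetical helper names, the batch size of 500 is an arbitrary choice, and Entrez.email is assumed to have been set already.

```python
def batch_offsets(total, batch_size=500):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def search_all_ids(db, term, batch_size=500):
    """Collect all IDs for a query by paging through esearch results."""
    # Imported here so batch_offsets can be tried without Biopython installed.
    from Bio import Entrez

    # First request: only used to learn the total number of matches.
    handle = Entrez.esearch(db=db, term=term)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for retstart, retmax in batch_offsets(total, batch_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=retstart, retmax=retmax)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids


# The paging arithmetic on its own, for 45 matches in batches of 20:
print(list(batch_offsets(45, 20)))  # [(0, 20), (20, 20), (40, 5)]
```

A call such as search_all_ids("nucleotide", "Your Query") would then return the combined ID list, at the cost of one request per batch.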