The human genome holds an organism's genetic information as DNA sequences organised into 23 pairs of chromosomes. A single DNA molecule consists of two strands held together by pairs of four different bases (A, T, C, G).
The human genome consists of around 3 billion of these base pairs. If each base pair is encoded in 2 bits (enough to distinguish four bases), the two genome copies in a diploid cell amount to about 1.5 GB of data. And the human body contains tens of trillions of cells (100 trillion is a commonly quoted figure). The numbers are astounding.
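The 1.5 GB figure can be checked with a short back-of-the-envelope sketch. It assumes exactly 3 billion base pairs per genome copy, 2 bits per base pair and decimal gigabytes; the names `genome_bytes`, `haploid_gb` and `diploid_gb` are illustrative only.

```python
# Back-of-the-envelope genome storage estimate (illustrative assumptions only).
BASE_PAIRS_PER_COPY = 3_000_000_000  # ~3 billion base pairs per genome copy
BITS_PER_BASE_PAIR = 2               # 4 possible bases -> 2 bits
BITS_PER_BYTE = 8


def genome_bytes(base_pairs: int, copies: int = 1) -> float:
    """Bytes needed to store `copies` genome copies at 2 bits per base pair."""
    return base_pairs * copies * BITS_PER_BASE_PAIR / BITS_PER_BYTE


haploid_gb = genome_bytes(BASE_PAIRS_PER_COPY) / 1e9            # one copy
diploid_gb = genome_bytes(BASE_PAIRS_PER_COPY, copies=2) / 1e9  # two copies

print(f"haploid: {haploid_gb:.2f} GB")  # haploid: 0.75 GB
print(f"diploid: {diploid_gb:.2f} GB")  # diploid: 1.50 GB
```

A single genome copy comes to 0.75 GB under these assumptions, so the 1.5 GB figure corresponds to the two copies carried by a diploid cell.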
This leaves biomedical researchers handling data that is not only inherently large but also comes with a multitude of combinations and classifications.
On top of this, academia frequently discovers new drugs and proteins.
All this information is stored as vast amounts of text. Skimming through it for discoveries and deductions could take a lifetime. Computers have made it easy to find information such as a specific genome name, but only in a naive way, since the user has to know what to look for before searching.
Researchers at the Allen Institute for Artificial Intelligence have developed a library named scispaCy, built specifically for processing biomedical and scientific text.
Most of the tools available today deal with entity linking, abbreviation detection and negation detection. For traditional NLP tasks, there is GENIA. However, these tools do not use state-of-the-art word representations or neural networks.
Making Room For Biomedical Applications With scispaCy
In a paper titled scispaCy: Fast and Robust Models for Biomedical Natural Language Processing, the researchers introduce a specialised NLP library for processing biomedical texts, built on the spaCy library.
To emphasise the efficiency and practical utility of the end-to-end pipeline provided by the scispaCy packages, the researchers compared its speed against several other publicly available processing pipelines for biomedical text, using 10k randomly selected PubMed abstracts.
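A speed comparison of this kind can be approximated with a small timing harness. The sketch below is not the researchers' benchmark code: `seconds_per_document` and `toy_pipeline` are hypothetical names, and the toy pipeline is only a whitespace tokeniser standing in for a real model.

```python
import time
from typing import Callable, Iterable, List


def seconds_per_document(pipeline: Callable[[str], object],
                         documents: Iterable[str]) -> float:
    """Run `pipeline` on every document; return mean wall-clock seconds per document."""
    docs = list(documents)
    if not docs:
        raise ValueError("need at least one document to time")
    start = time.perf_counter()
    for text in docs:
        pipeline(text)
    elapsed = time.perf_counter() - start
    return elapsed / len(docs)


def toy_pipeline(text: str) -> List[str]:
    """Hypothetical stand-in for a real NLP pipeline: whitespace tokenisation."""
    return text.split()


# Repeated sample sentence standing in for a set of PubMed abstracts.
abstracts = ["Myeloid derived suppressor cells are immature myeloid cells."] * 1000
per_doc = seconds_per_document(toy_pipeline, abstracts)
print(f"{per_doc * 1000:.4f} ms per abstract")
```

Any callable that takes a string can be passed as `pipeline`, so a loaded spaCy or scispaCy `nlp` object could be timed the same way on a shared set of abstracts.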
For training, the researchers used the GENIA 1.0 corpus. This dataset is annotated with part-of-speech tags, which were used to train the part-of-speech tagger jointly with the dependency parser.
The researchers also restored the PubMed metadata for the abstracts, which had been discarded in the GENIA corpus.
The original metadata includes relevant named entities for chemicals and drugs linked to a variety of ontologies, along with citation statistics and journal metadata.
The named entity recognition (NER) models were trained on the following datasets:
- BC5CDR – for chemicals and diseases
- CRAFT – for cell types, chemicals, proteins, genes
- JNLPBA – for cell lines, cell types, DNAs, RNAs and proteins
- BioNLP13CG – for cancer genetics
Along with the datasets mentioned above, the researchers also covered five more datasets, including Linnaeus and AnatEM, for a variety of entity types and tasks such as cancer genetics, pathway analysis and trial population extraction.
Another key challenge with biomedical text is its frequent abbreviated names and noun compounds containing punctuation, which can lead to sentence boundaries being misidentified.
For this reason, sentence segmentation was evaluated using both sentence-level and full-abstract accuracy.
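The difference between the two scores can be illustrated with a small sketch. The definitions used here are assumptions for illustration, not taken from the paper: sentence accuracy is the share of gold sentences whose exact character span was predicted, and abstract accuracy is the share of abstracts whose predicted spans match the gold spans exactly. The function name and the sample spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char) of one sentence


def segmentation_accuracy(gold: List[List[Span]],
                          pred: List[List[Span]]) -> Tuple[float, float]:
    """Return (sentence_accuracy, abstract_accuracy) under the assumed definitions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same abstracts")
    total_sents = sum(len(g) for g in gold)
    if total_sents == 0:
        raise ValueError("need at least one gold sentence")
    correct_sents = 0
    correct_abstracts = 0
    for g, p in zip(gold, pred):
        predicted = set(p)
        # A gold sentence counts as correct only if its exact span was predicted.
        correct_sents += sum(1 for span in g if span in predicted)
        # An abstract counts as correct only if every boundary matches.
        correct_abstracts += int(sorted(g) == sorted(p))
    return correct_sents / total_sents, correct_abstracts / len(gold)


# Two abstracts; in the second, a spurious split (e.g. after an abbreviation)
# breaks one gold sentence into two predicted ones.
gold = [[(0, 20), (21, 50)], [(0, 35), (36, 80)]]
pred = [[(0, 20), (21, 50)], [(0, 30), (31, 35), (36, 80)]]

sent_acc, abs_acc = segmentation_accuracy(gold, pred)
print(sent_acc, abs_acc)  # 0.75 0.5
```

A single wrong boundary costs one sentence but a whole abstract, which is why the full-abstract score is the stricter of the two.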
Read more about scispaCy here
To install the library:
pip install scispacy
Python code for sentence segmentation and entity recognition with scispaCy:
import scispacy
import spacy

# Load the small scispaCy pipeline; the model package is installed separately.
nlp = spacy.load("en_core_sci_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature
myeloid cells with immunosuppressive activity.
They accumulate in tumor-bearing mice and humans
with different types of cancer, including hepatocellular
carcinoma (HCC).
"""
doc = nlp(text)
# Sentence segmentation
print(list(doc.sents))
# >>> ["Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.",
#      "They accumulate in tumor-bearing mice and humans with different types of cancer, including hepatocellular carcinoma (HCC)."]
# Extracted entities (output truncated)
print(doc.ents)
# >>> (Myeloid derived suppressor cells,
Key Takeaways
- scispaCy's named entity recognition models provide a benchmark for more specific entity extraction applications and compare well with other tools.
- scispaCy demonstrates competitive performance with two fast and convenient released pipelines for biomedical text, which include tokenisation, part-of-speech tagging, dependency parsing and named entity recognition.