Geared Spacy: Building an NLP pipeline in RedisGears
Initially, I only planned to use Redis Cluster as one large in-memory store/database: keep all article ids in a set corresponding to the current pipeline step, then, for each article_id in that set, apply the next step in the pipeline and save the id into the next set. I was also tempted to explore Nomad to distribute the compute.
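Roughly, that plan looked like this (a minimal sketch with plain redis-py; the set names and the per-step function are made-up placeholders, not the real code):

# Sketch of the "sets as pipeline stages" idea with plain redis-py.
# The key names and the process_article step are hypothetical placeholders.
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def run_step(in_set, out_set, process_article):
    # Apply one pipeline step to every article id waiting in in_set,
    # then record the id in out_set so the next step can pick it up.
    for article_id in r.smembers(in_set):
        process_article(article_id)
        r.sadd(out_set, article_id)

# e.g. run_step('articles:raw', 'articles:lang_detected', detect_language)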
The next few steps are memory demanding, and I mean it, but the temptation to use RedisGears to distribute the computation over the Redis Cluster with no effort on my side was too great. I didn't even finish watching the tutorial before I started coding (hint: watch the tutorial first). Inspired by the quick success of language detection (sketched below, after the requirements), I decided to add spaCy and a small English model to the dependencies:
langdetect==1.0.8
spacy
https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz
It doesn't look like much: spaCy is a 10 MB download and the English model another 90 MB.
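For context, the language-detection gear mentioned above was along these lines; this is a from-memory sketch rather than the exact script, but the lang_article: prefix is the one the sentence-splitting gear further down reads back:

# From-memory sketch of the earlier language-detection gear. GB() and execute()
# are provided by the RedisGears runtime, so there is nothing to import for them.
from langdetect import detect

def detect_language(x):
    try:
        lang = detect(x['value'])
    except Exception:
        lang = 'unknown'
    # tag the paragraph key with its language so later steps can filter on it
    execute('SET', 'lang_article:' + x['key'], lang)

gb = GB()
gb.foreach(detect_language)
gb.count()
gb.run('paragraphs:*')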
The picture is different once you actually import spaCy and the model. spaCy consumes (reserves) about 1 GB of RAM just on loading the model; the plot above comes from the simplest possible spaCy test:
import spacy
nlp = spacy.load('en_core_web_md')
It only rises to about 1.2 GB once I start processing (with a single item). At this stage of the NLP pipeline I only need to turn paragraphs into sentences, and the only spaCy component I want is the dependency parser, so I disable the rest:
nlp = spacy.load('en_core_web_md', disable=['ner', 'tagger'])
Memory consumption didn't change: it's still 1.2 GB of RAM to process paragraphs. The question is, will RedisGears cope?
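If you want to reproduce the measurement, a quick psutil-based check of the process RSS before and after loading the model is enough. This is just a throwaway sketch, not part of the pipeline:

# Throwaway check: print the process RSS before and after loading the model.
import os
import psutil
import spacy

proc = psutil.Process(os.getpid())
print(f"before load: {proc.memory_info().rss / 1024 ** 2:.0f} MB")

nlp = spacy.load('en_core_web_md', disable=['ner', 'tagger'])
print(f"after load:  {proc.memory_info().rss / 1024 ** 2:.0f} MB")

doc = nlp("One short paragraph to make the parser do some work.")
print(f"after parse: {proc.memory_info().rss / 1024 ** 2:.0f} MB")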
Redis trivia: do you know how large the Redis source code and binary are?
It coped surprisingly well:
import spacy

# Load the medium English model; NER and the tagger are disabled because
# only the dependency parser is needed for sentence splitting.
nlp = spacy.load('en_core_web_md', disable=['ner', 'tagger'])
nlp.max_length = 2000000


def remove_prefix(text, prefix):
    return text[text.startswith(prefix) and len(prefix):]


def parse_paragraphs(x):
    key_prefix = "paragraphs:"
    # make sure we only process English articles
    lang = execute('GET', 'lang_article:' + x['key'])
    if lang == 'en':
        paragraphs = x['value']
        doc = nlp(paragraphs)
        idx = 1
        article_id = remove_prefix(x['key'], key_prefix)
        # store each detected sentence under its own key
        for each_sent in doc.sents:
            sentence_key = f"sentences:{article_id}:{idx}"
            execute('SET', sentence_key, str(each_sent))
            idx += 1
        execute('SADD', 'processed_docs_stage2_sentence', article_id)
    else:
        execute('SADD', 'screw_ups', x['key'])


gb = GB()
gb.foreach(parse_paragraphs)
gb.count()
gb.run('paragraphs:*')
The simple script above is submitted to the cluster via gears-cli:
SECONDS=0
gears-cli --host 10.144.83.129 --port 6379 spacy_sentences_geared.py --requirements requirements_gears_spacy.txt
echo "spacy_sentences_geared.py finished in $SECONDS seconds."
The original version didn't have gb.count() in it and was returning all processed records back to the client (all 11 GB); this is where I was running out of patience and/or bandwidth. Thanks to meirsh and the RedisLabs forums, processing time was reduced to a reasonable 30 minutes.
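The difference is easy to miss in the builder chain; paraphrasing the two variants (not meant to be run as-is, and both only make sense inside RedisGears where GB() exists):

# Original shape: run() ships every record the gear produced back to the client
# (in my case roughly the whole 11 GB of paragraphs).
GB().foreach(parse_paragraphs).run('paragraphs:*')

# With count() as the final step, only a single number travels back.
GB().foreach(parse_paragraphs).count().run('paragraphs:*')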
The next step, tokenisation, is also memory hungry, and Gavin D'mello had a crack at it, but the discussion on the forums sparked some thoughts. I was coming at this with my data science hat on: we have a data frame in, it's all batch. But if you think about the NLP pipeline, only the first step, the intake, is batch. The rest of the pipeline doesn't really care whether it's batch or event-driven; the steps would be the same, and I can leverage more of the good code and work the RedisLabs team has put inside RedisGears. Back to the drawing board.
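As a sketch of that event-driven shape (the exact register() arguments depend on the RedisGears version, so treat this as an assumption rather than tested code), the same gear could be registered to fire on writes to paragraphs:* keys instead of being run once over the existing ones:

# Sketch: the same parse_paragraphs gear, registered as an event-driven step.
gb = GB('KeysReader')
gb.foreach(parse_paragraphs)
gb.register('paragraphs:*')   # fire on writes to matching keys instead of a one-off run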
Written on May 19, 2020 by Alex Mikhalev.
Originally published on Medium