{"id":1709,"date":"2024-06-19T10:59:31","date_gmt":"2024-06-19T02:59:31","guid":{"rendered":"https:\/\/www.strongd.net\/?p=1709"},"modified":"2024-06-19T10:59:49","modified_gmt":"2024-06-19T02:59:49","slug":"a-tour-of-python-nlp-libraries","status":"publish","type":"post","link":"https:\/\/www.strongd.net\/?p=1709","title":{"rendered":"A Tour of Python NLP Libraries"},"content":{"rendered":"<p>NLP, or Natural Language Processing, is a field within Artificial Intelligence that focuses on the interaction between human language and computers. It tries to explore and apply text data so computers can understand the text meaningfully.<\/p>\n<p>As the NLP field research progresses, how we process text data in computers has evolved. Modern times, we have used Python to help explore and process data easily.<\/p>\n<p>With Python becoming the go-to language for exploring text data, many libraries have been developed specifically for the NLP field. In this article, we will explore various incredible and useful NLP libraries.<\/p>\n<div id=\"kdnug-2926ea349c05faa19b3d5886a23a271b\" class=\"kdnug-2926ea349c05faa19b3d5886a23a271b kdnug-ros-mobile-in-content\"><\/div>\n<p>So, let\u2019s get into it.<\/p>\n<h2>NLTK<\/h2>\n<p><a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener\">NLTK<\/a>, or Natural Language Tool Kit, is an NLP Python library with many text-processing APIs and industrial-grade wrappers. It\u2019s one of the biggest NLP Python libraries used by researchers, data scientists, engineers, and others. It\u2019s a standard NLP Python library for NLP tasks.<\/p>\n<p>Let\u2019s try to explore what NLTK could do. First, we would need to install the library with the following code.<\/p>\n<div>\n<pre><code>pip install -U nltk\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Next, we would see what NLTK could do. 
First, NLTK can perform tokenization using the following code:<\/p>\n<div>\n<pre><code>import nltk\r\nfrom nltk.tokenize import word_tokenize\r\n\r\n# Download the necessary resources\r\nnltk.download('punkt')\r\n\r\ntext = \"The fruit in the table is a banana\"\r\ntokens = word_tokenize(text)\r\n\r\nprint(tokens)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt; \r\n['The', 'fruit', 'in', 'the', 'table', 'is', 'a', 'banana']\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Tokenization divides a sentence into individual tokens, here one per word.<\/p>\n<p>With NLTK, we can also perform Part-of-Speech (POS) tagging on the text sample.<\/p>\n<div>\n<pre><code>from nltk.tag import pos_tag\r\n\r\nnltk.download('averaged_perceptron_tagger')\r\n\r\ntext = \"The fruit in the table is a banana\"\r\ntokens = word_tokenize(text)\r\npos_tags = pos_tag(tokens)\r\n\r\nprint(pos_tags)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\n[('The', 'DT'), ('fruit', 'NN'), ('in', 'IN'), ('the', 'DT'), ('table', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('banana', 'NN')]\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The POS tagger outputs each token together with its POS tag. For example, the word \u2018fruit\u2019 is a noun (NN), and the word \u2018a\u2019 is a determiner (DT).<\/p>\n<p>It\u2019s also possible to perform stemming and lemmatization with NLTK. 
Stemming reduces a word to its base form by chopping off prefixes and suffixes, while lemmatization maps a word to its dictionary form by taking its POS and morphology into account.<\/p>\n<div>\n<pre><code>from nltk.stem import PorterStemmer, WordNetLemmatizer\r\nnltk.download('wordnet')\r\nnltk.download('punkt')\r\n\r\ntext = \"The striped bats are hanging on their feet for best\"\r\ntokens = word_tokenize(text)\r\n\r\n# Stemming\r\nstemmer = PorterStemmer()\r\nstems = [stemmer.stem(token) for token in tokens]\r\nprint(\"Stems:\", stems)\r\n\r\n# Lemmatization\r\nlemmatizer = WordNetLemmatizer()\r\nlemmas = [lemmatizer.lemmatize(token) for token in tokens]\r\nprint(\"Lemmas:\", lemmas)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt; \r\nStems: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']\r\nLemmas: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>You can see that stemming and lemmatization give slightly different results: stemming truncates and lowercases (\u2018hanging\u2019 becomes \u2018hang\u2019), while lemmatization maps irregular forms to their dictionary form (\u2018feet\u2019 becomes \u2018foot\u2019).<\/p>\n<p>That\u2019s the simple usage of NLTK. 
You can still do many things with NLTK, but the APIs above are among the most commonly used.<\/p>\n<h2>SpaCy<\/h2>\n<p><a href=\"https:\/\/spacy.io\/\" target=\"_blank\" rel=\"noopener\">SpaCy<\/a>\u00a0is an NLP Python library designed specifically for production use. It is known for its performance and its ability to handle large amounts of text data, which makes it a preferred library for many industrial NLP use cases.<\/p>\n<p>To install SpaCy, see the\u00a0<a href=\"https:\/\/spacy.io\/usage\" target=\"_blank\" rel=\"noopener\">usage page<\/a>. 
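As one illustrative combination (the exact commands depend on your platform and needs), the examples below assume the small English pipeline has been installed:

```shell
pip install -U spacy
python -m spacy download en_core_web_sm
```
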
Depending on your requirements, there are many combinations to choose from.<\/p>\n<p>Let\u2019s try using SpaCy for some NLP tasks. First, we will perform Named Entity Recognition (NER) with the library. NER is the process of identifying and classifying named entities in text into predefined categories, such as person, organization, location, and more.<\/p>\n<div>\n<pre><code>import spacy\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\n\r\ntext = \"Brad is working in the U.K. Startup called AIForLife for 7 Months.\"\r\ndoc = nlp(text)\r\n\r\n# Perform the NER\r\nfor ent in doc.ents:\r\n    print(ent.text, ent.label_)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\nBrad PERSON\r\nthe U.K. Startup ORG\r\n7 Months DATE\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>As you can see, the SpaCy pre-trained model recognizes which words in the document can be classified as entities.<\/p>\n<p>Next, we can use SpaCy to perform dependency parsing and visualize the result. Dependency parsing is the process of understanding how each word relates to the others by forming a tree structure.<\/p>\n<div>\n<pre><code>import spacy\r\nfrom spacy import displacy\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\n\r\ntext = \"Brad is working in the U.K. Startup called AIForLife for 7 Months.\"\r\ndoc = nlp(text)\r\nfor token in doc:\r\n    print(f\"{token.text}: {token.dep_}, {token.head.text}\")\r\n\r\ndisplacy.render(doc, jupyter=True)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt; \r\nBrad: nsubj, working\r\nis: aux, working\r\nworking: ROOT, working\r\nin: prep, working\r\nthe: det, Startup\r\nU.K.: compound, Startup\r\nStartup: pobj, in\r\ncalled: advcl, working\r\nAIForLife: oprd, called\r\nfor: prep, called\r\n7: nummod, Months\r\nMonths: pobj, for\r\n.: punct, working\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The output lists each word with its dependency label and the head word it attaches to. 
The code above also renders a tree visualization in your Jupyter Notebook.<\/p>\n<p>Lastly, let\u2019s try performing text similarity with SpaCy. Text similarity measures how similar or related two pieces of text are. There are many techniques and measurements, but we will try the simplest one. (Note that the small en_core_web_sm model ships without static word vectors, so its similarity scores are rough; a model with vectors, such as en_core_web_md, gives more reliable results.)<\/p>\n<div>\n<pre><code>import spacy\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\n\r\ndoc1 = nlp(\"I like pizza\")\r\ndoc2 = nlp(\"I love hamburger\")\r\n\r\n# Calculate similarity\r\nsimilarity = doc1.similarity(doc2)\r\nprint(\"Similarity:\", similarity)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\nSimilarity: 0.6159097609586724\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The similarity method returns a score, usually between 0 and 1. The closer the score is to 1, the more similar the two texts are.<\/p>\n<p>There are still many things you can do with SpaCy. Explore the documentation to find something useful for your work.<\/p>\n<h2>TextBlob<\/h2>\n<p><a href=\"https:\/\/textblob.readthedocs.io\/en\/dev\/\" target=\"_blank\" rel=\"noopener\">TextBlob<\/a>\u00a0is an NLP Python library for processing textual data built on top of NLTK. It simplifies much of NLTK\u2019s API and streamlines common text-processing tasks.<\/p>\n<p>You can install TextBlob using the following code:<\/p>\n<div>\n<pre><code>pip install -U textblob\r\npython -m textblob.download_corpora\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Let\u2019s try TextBlob on some NLP tasks, starting with sentiment analysis. 
We can do that with the code below.<\/p>\n<div>\n<pre><code>from textblob import TextBlob\r\n\r\ntext = \"I am in the top of the world\"\r\nblob = TextBlob(text)\r\nsentiment = blob.sentiment\r\n\r\nprint(sentiment)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\nSentiment(polarity=0.5, subjectivity=0.5)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The output is a polarity score and a subjectivity score. Polarity is the sentiment of the text, ranging from -1 (negative) to 1 (positive), while subjectivity ranges from 0 (objective) to 1 (subjective).<\/p>\n<p>We can also use TextBlob for spelling correction, with the following code.<\/p>\n<div>\n<pre><code>from textblob import TextBlob\r\n\r\ntext = \"I havv goood speling.\"\r\nblob = TextBlob(text)\r\n\r\n# Spelling Correction\r\ncorrected_blob = blob.correct()\r\nprint(\"Corrected Text:\", corrected_blob)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\nCorrected Text: I have good spelling.\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Explore the TextBlob documentation to find the APIs for your text tasks.<\/p>\n<h2>Gensim<\/h2>\n<p><a href=\"https:\/\/radimrehurek.com\/gensim\/\" target=\"_blank\" rel=\"noopener\">Gensim<\/a>\u00a0is an open-source Python NLP library specializing in topic modeling and document similarity analysis, especially on large and streaming corpora. It focuses on industrial, real-time applications.<\/p>\n<p>Let\u2019s try the library. First, install it using the following code:<\/p>\n<div>\n<pre><code>pip install gensim\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>After the installation is finished, we can try out Gensim\u2019s capabilities. 
Let\u2019s try topic modeling with LDA using Gensim.<\/p>\n<div>\n<pre><code>import gensim\r\nfrom gensim import corpora\r\nfrom gensim.models import LdaModel\r\n\r\n# Sample documents\r\ndocuments = [\r\n    \"Tennis is my favorite sport to play.\",\r\n    \"Football is a popular competition in certain country.\",\r\n    \"There are many athletes currently training for the olympic.\"\r\n]\r\n\r\n# Preprocess documents: lowercase and split on whitespace\r\ntexts = [document.lower().split() for document in documents]\r\n\r\n# Build the dictionary and the bag-of-words corpus\r\ndictionary = corpora.Dictionary(texts)\r\ncorpus = [dictionary.doc2bow(text) for text in texts]\r\n\r\n# The LDA model\r\nlda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)\r\n\r\ntopics = lda_model.print_topics()\r\nfor topic in topics:\r\n    print(topic)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>Output&gt;&gt;\r\n(0, '0.073*\"there\" + 0.073*\"currently\" + 0.073*\"olympic.\" + 0.073*\"the\" + 0.073*\"athletes\" + 0.073*\"for\" + 0.073*\"training\" + 0.073*\"many\" + 0.073*\"are\" + 0.025*\"is\"')\r\n(1, '0.094*\"is\" + 0.057*\"football\" + 0.057*\"certain\" + 0.057*\"popular\" + 0.057*\"a\" + 0.057*\"competition\" + 0.057*\"country.\" + 0.057*\"in\" + 0.057*\"favorite\" + 0.057*\"tennis\"')\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Each topic in the output is a weighted combination of words from the sample documents. You can evaluate whether the grouping makes sense.<\/p>\n<p>Gensim also lets you embed content. 
For example, we can use Word2Vec to create embeddings from words.<\/p>\n<div>\n<pre><code>import gensim\r\nfrom gensim.models import Word2Vec\r\n\r\n# Sample sentences\r\nsentences = [\r\n    ['machine', 'learning'],\r\n    ['deep', 'learning', 'models'],\r\n    ['natural', 'language', 'processing']\r\n]\r\n\r\n# Train Word2Vec model\r\nmodel = Word2Vec(sentences, vector_size=20, window=5, min_count=1, workers=4)\r\n\r\nvector = model.wv['machine']\r\nprint(vector)\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<div>\n<pre><code>\r\nOutput&gt;&gt;\r\n[ 0.01174188 -0.02259516  0.04194366 -0.04929082  0.0338232   0.01457208\r\n -0.02466416  0.02199094 -0.00869787  0.03355692  0.04982425 -0.02181222\r\n -0.00299669 -0.02847819  0.01925411  0.01393313  0.03445538  0.03050548\r\n  0.04769249  0.04636709]\r\n<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>There are still many applications for Gensim. Check the documentation and evaluate your needs.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this article, we explored several Python NLP libraries essential for many text tasks, from tokenization to word embeddings. The libraries we discussed are:<\/p>\n<ol>\n<li>NLTK<\/li>\n<li>SpaCy<\/li>\n<li>TextBlob<\/li>\n<li>Gensim<\/li>\n<\/ol>\n<p>I hope it helps!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NLP, or Natural Language Processing, is a field within Artificial Intelligence that focuses on the interaction between human language and computers. It tries to explore and apply text data so computers can understand the text meaningfully. As the NLP field research progresses, how we process text data in computers has evolved. 
Modern times, we have &hellip; <a href=\"https:\/\/www.strongd.net\/?p=1709\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">A Tour of Python NLP Libraries<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"gallery","meta":{"footnotes":""},"categories":[35],"tags":[277,276],"class_list":["post-1709","post","type-post","status-publish","format-gallery","hentry","category-35","tag-nlp","tag-python","post_format-post-format-gallery"],"_links":{"self":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts\/1709","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1709"}],"version-history":[{"count":1,"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts\/1709\/revisions"}],"predecessor-version":[{"id":1710,"href":"https:\/\/www.strongd.net\/index.php?rest_route=\/wp\/v2\/posts\/1709\/revisions\/1710"}],"wp:attachment":[{"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1709"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1709"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.strongd.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1709"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}