
Tuesday, 25 March 2025

NLP BASICS

 

1. What is NLP?

NLP is a field of artificial intelligence (AI) that focuses on the interaction between computers and human languages. Its goal is to enable machines to understand, interpret, and generate human language in a way that is valuable and meaningful.

2. Text Preprocessing

Before we can start working with text data, it needs to be cleaned and prepared. This step is crucial for improving the quality of the analysis and results. Typical preprocessing steps include:

  • Tokenization: Splitting the text into smaller pieces like words or sentences.

  • Lowercasing: Converting all text to lowercase to ensure uniformity.

  • Removing punctuation: Punctuation is often removed since it does not carry much useful information in many tasks.

  • Removing stopwords: Stopwords are common words like "the," "is," "in," which don't carry significant meaning in most NLP tasks.

  • Stemming: Reducing words to their base or root form (e.g., "running" → "run").

  • Lemmatization: Similar to stemming but more advanced. It reduces words to their lemma or dictionary form (e.g., "better" → "good").

  • Removing special characters and digits: These might not be relevant for some NLP tasks.

3. Feature Extraction

Once the text is preprocessed, the next step is to extract features that machines can understand. Some common methods include:

  • Bag of Words (BoW): A representation of text where each unique word is treated as a feature. The frequency of each word in the document is counted.

  • Term Frequency-Inverse Document Frequency (TF-IDF): A more advanced technique that reflects how important a word is to a document in a collection.

  • Word Embeddings: Represent words in a dense, multi-dimensional vector space, where similar words are close to each other. Examples include Word2Vec, GloVe, and FastText.

4. Text Classification

This is one of the most common tasks in NLP. It involves categorizing text into predefined categories or labels. For example, you could classify movie reviews as "positive" or "negative."

  • Training a Model: Text features are used to train machine learning models (e.g., Logistic Regression, Naive Bayes, SVM, etc.).

  • Evaluation Metrics: Common metrics for classification tasks include accuracy, precision, recall, and F1-score.

5. Named Entity Recognition (NER)

NER is a process where the model identifies and classifies named entities (e.g., people, organizations, locations, dates) in text. For example:

  • "Apple is looking to buy a startup in London."

  • "Apple" → Organization

  • "London" → Location

6. Part-of-Speech Tagging (POS)

POS tagging involves identifying the parts of speech (e.g., noun, verb, adjective) for each word in a sentence. For example:

  • "The quick brown fox jumps over the lazy dog."

  • "The" → Determiner (DT)

  • "quick" → Adjective (JJ)

  • "fox" → Noun (NN)

POS tagging helps machines understand the grammatical structure of a sentence.

7. Sentiment Analysis

Sentiment analysis is the process of determining the sentiment or emotional tone behind a piece of text, typically classifying it as positive, negative, or neutral. For example:

  • "I love this phone!" → Positive

  • "This product is terrible." → Negative

8. Text Generation

In text generation, the goal is to create new, meaningful text based on an input. This can be applied in tasks like chatbots, story generation, and more.

  • Language Models: A model like GPT (Generative Pre-trained Transformer) generates human-like text based on a given prompt.

9. Machine Translation

This involves translating text from one language to another (e.g., English to Spanish). Modern approaches use neural networks (e.g., transformers) to achieve this.

10. Evaluation of NLP Models

Once you have built an NLP model, you need to evaluate its performance:

  • Accuracy: The percentage of correctly predicted results.

  • Precision, Recall, F1-score: Used in cases of imbalanced datasets (e.g., in sentiment analysis).

  • Confusion Matrix: A table used to describe the performance of classification models.

11. Advanced Techniques

Once you're comfortable with the basics, you can dive into more advanced techniques:

  • Transformer Models: These models (e.g., BERT, GPT) have revolutionized NLP. They use attention mechanisms to capture contextual relationships between words in a sentence.

  • Transfer Learning: Fine-tuning pre-trained models (like BERT) on a specific task to improve performance.

  • Sequence-to-Sequence Models: Used for tasks like machine translation and text summarization.

12. Real-World Applications

NLP is used in various real-world applications, including:

  • Chatbots: Automating customer support or assistance.

  • Search Engines: Understanding and ranking search queries.

  • Voice Assistants: Such as Siri, Alexa, and Google Assistant.

  • Content Recommendation: Analyzing user preferences and recommending articles, movies, etc.

Tools & Libraries for NLP

Some popular libraries that can help with NLP tasks are:

  • NLTK (Natural Language Toolkit): A comprehensive library for NLP in Python.

  • spaCy: A fast and modern NLP library.

  • Gensim: Used for topic modeling and document similarity.

  • Transformers (by Hugging Face): A powerful library for state-of-the-art transformer-based models.

  • TextBlob: A simpler NLP library for quick tasks like sentiment analysis and translation.

Conclusion

NLP is an exciting and rapidly evolving field that combines linguistics, computer science, and machine learning. By understanding the basic concepts, preprocessing steps, and key techniques, you'll be well-equipped to tackle various NLP challenges.


Code Examples

Below are simplified code examples for each of the basic NLP tasks and concepts outlined above, using Python with common libraries such as NLTK, spaCy, and scikit-learn.

1. Text Preprocessing

Tokenization

python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Hello, how are you doing?"
tokens = word_tokenize(text)
print(tokens)

Output:

['Hello', ',', 'how', 'are', 'you', 'doing', '?']

Lowercasing

python
text = "Hello, How Are You Doing?"
lowercased_text = text.lower()
print(lowercased_text)

Output:

hello, how are you doing?

Removing Punctuation

python
import string

text = "Hello, how are you doing?"
cleaned_text = ''.join([char for char in text if char not in string.punctuation])
print(cleaned_text)

Output:

Hello how are you doing

Removing Stopwords

python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a simple sentence.")
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

Output:

['simple', 'sentence', '.']
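
Stemming

Stemming was listed in the preprocessing steps above but not shown; here is a minimal sketch using NLTK's PorterStemmer (a rule-based suffix stripper, so it uses no dictionary lookup and can produce crude results):

python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)  # strips the "-ing" suffix
print(stemmed_word)

Output:

run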

Lemmatization

python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
word = "running"
lemmatized_word = lemmatizer.lemmatize(word, pos='v')  # 'v' for verb
print(lemmatized_word)

Output:

run

2. Feature Extraction: Bag of Words (BoW)

python
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = ["The dog barks", "The cat meows", "The dog chases the cat"]

# Create the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Display feature names and the transformed matrix
print("Feature names:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", X.toarray())

Output:

Feature names: ['barks' 'cat' 'chases' 'dog' 'meows' 'the']
BoW matrix:
 [[1 0 0 1 0 1]
 [0 1 0 0 1 1]
 [0 1 1 1 0 2]]

(Note that "the" appears twice in the third document, so its count there is 2.)
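
TF-IDF and Word Embeddings

TF-IDF and word embeddings were described in the feature extraction section above but not shown. A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on the same documents:

python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["The dog barks", "The cat meows", "The dog chases the cat"]

# Each cell weights a word's frequency in a document against its rarity across documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print("Feature names:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", X.toarray().round(2))

And a small word-embedding sketch with Gensim's Word2Vec, trained here on a toy corpus just to show the API (real applications use large corpora or pre-trained vectors such as GloVe):

python
from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"],
             ["the", "cat", "meows"],
             ["the", "dog", "chases", "the", "cat"]]

# vector_size is the dimensionality of the dense word vectors
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)
print(model.wv["dog"][:5])           # first 5 dimensions of the "dog" vector
print(model.wv.most_similar("dog"))  # nearest words in the vector space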

3. Text Classification: Using Naive Bayes

python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample data
documents = ["I love programming", "Python is amazing", "I hate bugs", "I love debugging", "Python is fun"]
labels = ['positive', 'positive', 'negative', 'positive', 'positive']

# Convert text to BoW features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict on the test set
predictions = clf.predict(X_test)
print("Predictions:", predictions)

Output (example):

Predictions: ['positive']

4. Named Entity Recognition (NER) using spaCy

python
import spacy

# Load the pre-trained model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = "Apple is looking to buy a startup in London."

# Process the text
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Output:

Apple ORG
London GPE

5. Part-of-Speech Tagging (POS) using spaCy

python
import spacy

# Load the pre-trained model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Process the text
doc = nlp(text)

# Display POS tags
for token in doc:
    print(token.text, token.pos_)

Output:

The DET
quick ADJ
brown ADJ
fox NOUN
jumps VERB
over ADP
the DET
lazy ADJ
dog NOUN
. PUNCT

6. Sentiment Analysis using TextBlob

python
from textblob import TextBlob

# Sample text
text = "I love this product!"

# Create a TextBlob object
blob = TextBlob(text)

# Perform sentiment analysis
sentiment = blob.sentiment.polarity
print("Sentiment polarity:", sentiment)

Output:

Sentiment polarity: 0.5

(Positive sentiment, as the value is above 0)


7. Text Generation using Hugging Face's Transformers

python
from transformers import pipeline

# Load a pre-trained model for text generation
generator = pipeline("text-generation", model="gpt2")

# Generate text based on a prompt
text = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(text[0]['generated_text'])

Output (example):

Once upon a time, there was a little girl named Alice who lived in a small cottage in the woods. She was always curious about the world around her, and she loved exploring the forests and fields near her home. One day, while wandering through the woods, she came across a strange door hidden among the trees.

8. Machine Translation using Hugging Face's Transformers

python
from transformers import pipeline

# Load a pre-trained model for English-to-French translation
translator = pipeline("translation_en_to_fr", model="t5-base")

# Translate English to French
translated_text = translator("How are you?", max_length=40)
print("Translated Text:", translated_text[0]['translation_text'])

Output:

Translated Text: Comment ça va ?
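
Summarization is another sequence-to-sequence task (see the Advanced Techniques section) and uses the same pipeline API. A minimal sketch; the choice of t5-small here is illustrative, and the generated summary will vary by model version:

python
from transformers import pipeline

# Load a pre-trained seq2seq model for summarization
summarizer = pipeline("summarization", model="t5-small")

text = ("Natural language processing is a field of artificial intelligence that "
        "focuses on the interaction between computers and human languages. It "
        "enables machines to understand, interpret, and generate human language.")

summary = summarizer(text, max_length=30, min_length=10)
print(summary[0]['summary_text'])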

9. Advanced Techniques: BERT for Sentiment Analysis

python
from transformers import pipeline

# Load a pre-trained model for sentiment analysis
classifier = pipeline("sentiment-analysis")

# Sample text
text = "I love learning about NLP!"

# Predict sentiment
result = classifier(text)
print(result)

Output:

[{'label': 'POSITIVE', 'score': 0.9998683333396912}]

10. Evaluation of NLP Models

For evaluation metrics, you can use common functions from scikit-learn. For example, to calculate accuracy, precision, recall, and F1-score:

python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# True labels and predicted labels
y_true = ['positive', 'positive', 'negative', 'positive', 'positive']
y_pred = ['positive', 'positive', 'negative', 'negative', 'positive']

# Calculate accuracy, precision, recall, and F1-score
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label='positive')
recall = recall_score(y_true, y_pred, pos_label='positive')
f1 = f1_score(y_true, y_pred, pos_label='positive')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

Output:

Accuracy: 0.8
Precision: 1.0
Recall: 0.75
F1-score: 0.8571428571428571
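
The confusion matrix mentioned in the evaluation section can be computed the same way. A minimal sketch reusing the labels above; rows are true labels and columns are predicted labels, in the order given by the labels argument:

python
from sklearn.metrics import confusion_matrix

y_true = ['positive', 'positive', 'negative', 'positive', 'positive']
y_pred = ['positive', 'positive', 'negative', 'negative', 'positive']

# Order the classes explicitly so the matrix layout is predictable
cm = confusion_matrix(y_true, y_pred, labels=['positive', 'negative'])
print(cm)

Output:

[[3 1]
 [0 1]]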

HUE DATA ANALYSIS

 https://docs.cloudera.com/runtime/7.3.1/using-hue/hue-using.pdf



Google Colab: Hadoop/PySpark Word Count

# Step 1: Install Java and download Spark (bundled with Hadoop)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xvf spark-3.1.2-bin-hadoop3.2.tgz
!pip install findspark

# Step 2: Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

# Step 3: Initialize PySpark
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create the Spark session
spark = SparkSession.builder.master("local").appName("WordCount").getOrCreate()

# Step 4: Create a sample text input (simulating a file)
input_data = [
    "Hello Spark",
    "This is a test for Word Count",
    "Spark is great for big data processing",
    "Word Count is a simple example"
]

# Parallelize the list to create an RDD
rdd = spark.sparkContext.parallelize(input_data)

# Step 5: Perform Word Count (MapReduce)
word_pairs = rdd.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
word_count = word_pairs.reduceByKey(lambda x, y: x + y)

# Step 6: Collect and print the results
result = word_count.collect()
for word, count in result:
    print(f"{word}: {count}")

# Step 7: Save the results to a local directory (simulating an HDFS save)
word_count.saveAsTextFile("/content/word_count_output")


Hadoop WordCount in the Spark Shell (Scala)

// Step 1: Read the input file from HDFS
val inputData = sc.textFile("/user/training/bgs/input/input.txt")

// Step 2: Split the input into words
val words = inputData.flatMap(line => line.split(" "))

// Step 3: Map words to (word, 1)
val wordPairs = words.map(word => (word, 1))

// Step 4: Reduce by key to count occurrences of each word
val wordCount = wordPairs.reduceByKey((x, y) => x + y)

// Step 5: Collect and print the results
val result = wordCount.collect()
result.foreach(println)

// Step 6: Save the result to HDFS
wordCount.saveAsTextFile("hdfs://namenode:9000/user/hadoop/output/word_count")

// Step 7: Stop the SparkContext (optional)
sc.stop()



MongoDB Lab

db.data.insertMany([
  { _id: 1, name: 'John Doe', age: 30, department: 'HR', salary: 60000 },
  { _id: 2, name: 'Jane Smith', age: 25, department: 'Engineering', salary: 80000 },
  { _id: 3, name: 'Sam Johnson', age: 45, department: 'Engineering', salary: 120000 },
  { _id: 4, name: 'Chris Lee', age: 35, department: 'Marketing', salary: 75000 },
  { _id: 5, name: 'Emma Brown', age: 29, department: 'HR', salary: 65000 },
  { _id: 6, name: 'Alex Taylor', age: 32, department: 'Engineering', salary: 95000 },
  { _id: 7, name: 'Sophia Williams', age: 28, department: 'Marketing', salary: 70000 },
  { _id: 8, name: 'James Davis', age: 50, department: 'Management', salary: 150000 }
])


db.data.find().sort({ age: 1 })


db.data.find().limit(3)


db.data.countDocuments({ department: "Engineering" })


db.data.aggregate([

  { $group: { _id: "$department", totalEmployees: { $sum: 1 } } }

])



db.data.aggregate([

  { $group: { _id: "$department", averageSalary: { $avg: "$salary" } } }

])




MongoDB Practice (OneCompiler)

 Insert the following documents into a movies collection.

title : Fight Club
writer : Chuck Palahniuk
year : 1999
actors : [
  Brad Pitt
  Edward Norton
]
db.movies.insert({title:"Fight Club", writer: "Chuck Palahniuk", year: "1999", actors:["Brad Pitt", "Edward Norton"]})
title : Pulp Fiction
writer : Quentin Tarantino
year : 1994
actors : [
  John Travolta
  Uma Thurman
]
 db.movies.insert({title:"Pulp Fiction", writer:"Quentin Tarantino", year:"2009", actors:["John Travolta", "Uma Thurman"]})
title : Inglorious Basterds
writer : Quentin Tarantino
year : 2009
actors : [
  Brad Pitt
  Diane Kruger
  Eli Roth
]
db.movies.insert({title:"Inglorious Basterds", writer:"Quentin Tarantino", year:"2009", actors:["Brad Pitt", "Diane Kruger", "Eli Roth"]})
title : The Hobbit: An Unexpected Journey
writer : J.R.R. Tolkein
year : 2012
franchise : The Hobbit
db.movies.insert({title:"The Hobbit: An unexpected Journey", writer:"J.R.R. Tolkein", year:"2012",franchise:"The Hobbit"})
title : The Hobbit: The Desolation of Smaug
writer : J.R.R. Tolkein
year : 2013
franchise : The Hobbit
db.movies.insert({title:"The Hobbit: The Desolation of Smaug", writer:"J.R.R Tolkien", year:"2013", franchise:"The Hobbit"})
title : The Hobbit: The Battle of the Five Armies
writer : J.R.R. Tolkein
year : 2012
franchise : The Hobbit
synopsis : Bilbo and Company are forced to engage in a war against an array of combatants and keep the Lonely Mountain from falling into the hands of a rising darkness.
db.movies.insert({title:"The Hobbit: The Battle of the Five Armies", writer:"J.R.R Tolkien", year:"2002", franchise:"The Hobbit", synopsis:"Bilbo and Company are forced to engage in a war against an array of combatants and keep the Lonely Mountain from falling into the hands of a rising darkness."})
title : Pee Wee Herman's Big Adventure
db.movies.insert({title:"Pee Wee Herman's Big Adventures"})
title : Avatar

db.movies.insert({title:"Avatar"})

Query / Find Documents

query the movies collection to

  1. get all documents
db.movies.find()
  2. get all documents with writer set to "Quentin Tarantino"
db.movies.find({writer:"Quentin Tarantino"})
  3. get all documents where actors include "Brad Pitt"
db.movies.find({actors:"Brad Pitt"})
  4. get all documents with franchise set to "The Hobbit"
db.movies.find({franchise:"The Hobbit"})
  5. get all movies released in the 90s
db.movies.find({year:{$gte:"1990", $lt:"2000"}})
  6. get all movies released before the year 2000 or after 2010
db.movies.find({$or:[{year:{$gt:"2010"}},{year:{$lt:"2000"}}]})

Update Documents

  1. add a synopsis to "The Hobbit: An Unexpected Journey" : "A reluctant hobbit, Bilbo Baggins, sets out to the Lonely Mountain with a spirited group of dwarves to reclaim their mountain home - and the gold within it - from the dragon Smaug."
db.movies.update({_id:ObjectId("5c9f98e5e5c2dfe9b3729bfe")}, {$set:{synopsis:"A reluctant hobbit, Bilbo Baggins, sets out to the Lonely Mountain with a spirited group of dwarves to reclaim their mountain home - and the gold within it - from the dragon Smaug."}})
  2. add a synopsis to "The Hobbit: The Desolation of Smaug" : "The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring."
db.movies.update({_id:ObjectId("5c9fa42ae5c2dfe9b3729c03")}, {$set:{synopsis:"The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring."}})
  3. add an actor named "Samuel L. Jackson" to the movie "Pulp Fiction"
db.movies.update({_id:ObjectId("5c9f983ce5c2dfe9b3729bfc")}, {$push:{actors:"Samuel L. Jackson"}})

Text Search

  1. find all movies that have a synopsis that contains the word "Bilbo"
db.movies.find({synopsis:{$regex:"Bilbo"}})
  2. find all movies that have a synopsis that contains the word "Gandalf"
db.movies.find({synopsis:{$regex:"Gandalf"}})
  3. find all movies that have a synopsis that contains the word "Bilbo" and not the word "Gandalf"
db.movies.find({$and:[{synopsis:{$regex:"Bilbo"}}, {synopsis:{$not:/Gandalf/}}]})
  4. find all movies that have a synopsis that contains the word "dwarves" or "hobbit"
db.movies.find({$or:[{synopsis:{$regex:"dwarves"}}, {synopsis:{$regex:"hobbit"}}]})
  5. find all movies that have a synopsis that contains the words "gold" and "dragon"
db.movies.find({$and:[{synopsis:{$regex:"gold"}}, {synopsis:{$regex:"dragon"}}]})

Delete Documents

  1. delete the movie "Pee Wee Herman's Big Adventure"
db.movies.remove({_id:ObjectId("5c9f992ae5c2dfe9b3729c00")})
  2. delete the movie "Avatar"
db.movies.remove({_id:ObjectId("5c9f9936e5c2dfe9b3729c01")})

Relationships

Insert the following documents into a users collection

username : GoodGuyGreg
first_name : "Good Guy"
last_name : "Greg"
db.users.insert({_id:1,username:"GoodGuyGreg", first_name:"Good Guy", last_name:"Greg"})
username : ScumbagSteve
full_name :
  first : "Scumbag"
  last : "Steve"

db.users.insert({_id:2, username:"ScumbagSteve", full_name:{first:"Scumbag", last:"Steve"}})

Insert the following documents into a posts collection

username : GoodGuyGreg
title : Passes out at party
body : Wakes up early and cleans house
db.posts.insert({username:"GoodGuyGreg", title:"Passes out at Party", body:"Raises your credit score"})
username : GoodGuyGreg
title : Steals your identity
body : Raises your credit score
db.posts.insert({ username:"GoodGuyGreg", title:"Steals your identity", body:"Raises your credit score"})
username : GoodGuyGreg
title : Reports a bug in your code
body : Sends you a Pull Request
db.posts.insert({username:"GoodGuyGreg", title:"Reports a bug in your code", body:"Sends you a pull request"})
username : ScumbagSteve
title : Borrows something
body : Sells it
db.posts.insert({ username:"ScumbagSteve", title:"Borrows something", body:"Sells it"})
username : ScumbagSteve
title : Borrows everything
body : The end
db.posts.insert({ username:"ScumbagSteve", title:"Borrows everything", body:"The end"})
username : ScumbagSteve
title : Forks your repo on github
body : Sets to private
db.posts.insert({username:"ScumbagSteve", title:"Forks your repo on github", body:"Sets to private"})

Insert the following documents into a comments collection

username : GoodGuyGreg
comment : Hope you got a good deal!
post : [post_obj_id]

where [post_obj_id] is the ObjectId of the posts document: "Borrows something"

db.comments.insert({ username:"GoodGuyGreg", comment:"Hope you got a good deal!", post:ObjectId("5ca0b7e96435f98b5901f463")})
username : GoodGuyGreg
comment : What's mine is yours!
post : [post_obj_id]

where [post_obj_id] is the ObjectId of the posts document: "Borrows everything"

db.comments.insert({username:"GoodGuyGreg", comment:"What's mine is yours!", post:ObjectId("5ca0b9706435f98b5901f46a")})
username : GoodGuyGreg
comment : Don't violate the licensing agreement!
post : [post_obj_id]

where [post_obj_id] is the ObjectId of the posts document: "Forks your repo on github"

db.comments.insert({username:"GoodGuyGreg", comment:"Don't violate the licensing agreement!", post:ObjectId("5ca0b8766435f98b5901f467")})
username : ScumbagSteve
comment : It still isn't clean
post : [post_obj_id]

where [post_obj_id] is the ObjectId of the posts document: "Passes out at party"

db.comments.insert({username:"ScumbagSteve", comment:"It still isn't clean", post:ObjectId("5ca0b8546435f98b5901f466")})
username : ScumbagSteve
comment : Denied your PR cause I found a hack
post : [post_obj_id]

where [post_obj_id] is the ObjectId of the posts document: "Reports a bug in your code"

db.comments.insert({username:"ScumbagSteve", comment:"Denied your PR cause I found a hack", post:ObjectId("5ca0b9256435f98b5901f469")})

Querying related collections

  1. find all users
db.users.find().pretty()
  2. find all posts
db.posts.find().pretty()
  3. find all posts that were authored by "GoodGuyGreg"
db.posts.find({username:"GoodGuyGreg"})
  4. find all posts that were authored by "ScumbagSteve"
db.posts.find({username:"ScumbagSteve"})
  5. find all comments
db.comments.find().pretty()
  6. find all comments that were authored by "GoodGuyGreg"
db.comments.find({username:"GoodGuyGreg"})
  7. find all comments that were authored by "ScumbagSteve"
db.comments.find({username:"ScumbagSteve"})
  8. find all comments belonging to the post "Reports a bug in your code"
db.comments.find({post:ObjectId("5ca0b9256435f98b5901f469")})

Import a JSON file into MongoDB

{ "_id" : "02906", "city" : "PROVIDENCE", "pop" : 31069, "state" : "RI", "capital" : { "name" : "Providence", "electoralCollege" : 4 } }
{ "_id" : "02108", "city" : "BOSTON", "pop" : 3697, "state" : "MA", "capital" : { "name" : "Boston", "electoralCollege" : 11 } }
{ "_id" : "10001", "city" : "NEW YORK", "pop" : 18913, "state" : "NY", "capital" : { "name" : "Albany", "electoralCollege" : 29 } }
{ "_id" : "01012", "city" : "CHESTERFIELD", "pop" : 177, "state" : "MA", "capital" : { "name" : "Boston", "electoralCollege" : 11 } }
{ "_id" : "32801", "city" : "ORLANDO", "pop" : 9275, "state" : "FL", "capital" : { "name" : "Tallahassee", "electoralCollege" : 29 } }
{ "_id" : "12966", "city" : "BANGOR", "pop" : 2867, "state" : "NY", "capital" : { "name" : "Albany", "electoralCollege" : 29 } }
{ "_id" : "32920", "city" : "CAPE CANAVERAL", "pop" : 7655, "state" : "FL", "capital" : { "name" : "Tallahassee", "electoralCollege" : 29 } }
{ "_id" : "NY", "name" : "New York", "pop" : 28300000, "state" : 1788 }
{ "_id" : "33125", "city" : "MIAMI", "pop" : 47761, "state" : "FL", "capital" : { "name" : "Tallahassee", "electoralCollege" : 29 } }
{ "_id" : "RI", "name" : "Rhode Island", "pop" : 1060000, "state" : 1790 }
{ "_id" : "MA", "name" : "Massachusetts", "pop" : 6868000, "state" : 1790 }
{ "_id" : "FL", "name" : "Florida", "pop" : 6800000, "state" : 1845 }
{ "_id" : "1", "name" : "Tom", "addresses" : [ "01001", "12997" ] }
{ "_id" : "02907", "city" : "CRANSTON", "pop" : 25668, "state" : "RI", "capital" : { "name" : "Providence", "electoralCollege" : 4 } }
{ "_id" : "2", "name" : "Bill", "addresses" : [ "01001", "12967", "32920" ] }
{ "_id" : "3", "name" : "Mary", "addresses" : [ "32801", "32920", "33125" ] }
{ "_id" : "12967", "city" : "NORTH LAWRENCE", "pop" : 943, "state" : "NY", "capital" : { "name" : "Albany", "electoralCollege" : 29 } }
{ "_id" : "01001", "city" : "AGAWAM", "pop" : 15338, "state" : "MA", "capital" : { "name" : "Boston", "electoralCollege" : 11 } }
{ "_id" : "12997", "city" : "WILMINGTON", "pop" : 958, "state" : "NY", "capital" : { "name" : "Albany", "electoralCollege" : 29 } }

mongoimport --db <database_name> --collection <collection_name> --file <file_name.json>

  1. Show the name and population of the cities where the population is over 10,000
db.docs.find({city:{$exists:true}, pop:{$gt:10000}}, {_id:0, city:1, pop:1})
  2. Show the name and population of the state based on the cities shown
db.docs.aggregate([{$match:{city:{$exists: true}}},{$group:{_id:"$state", "Total Pop": {$sum:"$pop"}}}])
  3. Show the total population of the cities in NY as 'Total Pop'
db.docs.aggregate([{$match:{state:"NY"}},{$group:{_id:"$state", "Total Pop": {$sum:"$pop"}}}])
  4. Show the _id, city, and capital-city name for each city with a population greater than 20,000
db.docs.find({city:{$exists:true}, pop:{$gt:20000}}, {city:1, "capital.name":1})
