NLP with Prompt Engineering#

Importing libraries#

Background:

  • LLMs are excellent tools for NLP and they work really well.

I will demonstrate:

  • Langchain Prompting

  • Using Pydantic to parse LLM output into a structured response

  • Prompt-poet for maintaining and writing prompts

  • Zero Shot Text Classfication

  • Few Shot Text Classification

  • Validation of the Models

from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from typing import List, Literal, Annotated
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools import tool
import getpass
from typing import List, Optional
import numpy as np
from datasets import load_dataset
from prompt_poet import Prompt
from tqdm import tqdm
import pandas as pd
from unicodedata import normalize
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/IPython/core/interactiveshell.py:3577: LangChainDeprecationWarning: As of langchain-core 0.3.0, LangChain uses pydantic v2 internally. The langchain_core.pydantic_v1 module was a compatibility shim for pydantic v1, and should no longer be used. Please update the code to import from Pydantic directly.

For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  exec(code_obj, self.user_global_ns, self.user_ns)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 9
      7 from typing import List, Optional
      8 import numpy as np
----> 9 from datasets import load_dataset
     10 from prompt_poet import Prompt
     11 from tqdm import tqdm

ModuleNotFoundError: No module named 'datasets'
def norm_text(input_text):
    return normalize('NFKD', input_text).encode('ascii','ignore').decode('ascii')

Groq Init#

groq_api_key = getpass.getpass()
llama_70b_llm = ChatGroq(api_key=groq_api_key, temperature=0, model_name="llama3-groq-70b-8192-tool-use-preview")

Classification Models#

  • Spam vs Ham

  • Ticket Categorization

  • Fake News Detection

  • Medical Records to Disease etc.

Traditional Models#

  • Data Collection and Labelling

  • Feature Engineering

  • Model Training & Validation

  • Once the model is built, inference is efficient because of small model sizes.

Approaches

  • TfIdf Vectorization + Classifier (Navie Bayes, Logistic Regression etc.)

  • Word2Vec + Classifier Head

  • BERT encoder + Classification

Drawbacks

  • Time to label datasets

  • Out of Vocabulary Words

  • Generalization Error

LLMs#

  • Zero Shot or few shot learners

  • No feature engineering required

  • Can be finetuned

  • Great out-of-sample performance

Drawbacks

  • Huge parameter LLMs at the backend

  • Proprietary models, data security and privacy issues

  • Inference is costly, takes time, needs high coumputing resources

  • Smaller LLMs with similar performance on the level of GPT are needed for finetuning.

Zero shot Example#

raw_template  = """
- name: system instructions
  role: system
  content: |
   You are an expert in classifying a given text into {{ text_classfication_classes }}

- name: user query
  role: user
  content: |
   Please extract label of the following text.
   {{ norm_text(text) }}
"""
template_data = {"text_classfication_classes": "Spam or Ham", "text": "Win $1000000 NOW!!!",
                "norm_text":norm_text}
prompt = Prompt(
    raw_template=raw_template,
    template_data=template_data
)
prompt.messages
[{'role': 'system',
  'content': 'You are an expert in classifying a given text into Spam or Ham'},
 {'role': 'user',
  'content': 'Please extract label of the following text.\nWin $1000000 NOW!!!'}]
response = llama_70b_llm.invoke(prompt.messages)
print(response.content)
The label for the given text is "Spam".

Adding Structured Outputs#

class Classification(BaseModel):
    """Function that Classifies the text into Spam or Ham"""
    classification_label: str= Field(default=None,enum=["Spam","Ham","spam","ham"])
    explanation: str = Field(default=None,description="Explain why you gave that label to this text. Keep your answers short and precise. I will tip you $20 for a good explanation. ")
llama_70b_llm_cls_head = llama_70b_llm.with_structured_output(Classification)
prompt.messages
[{'role': 'system',
  'content': 'You are an expert in classifying a given text into Spam or Ham'},
 {'role': 'user',
  'content': 'Please extract label of the following text.\nWin $1000000 NOW!!!'}]
result = llama_70b_llm_cls_head.invoke(prompt.messages)
result
Classification(classification_label='Spam', explanation='The text contains an exaggerated claim of winning a large sum of money, which is a common tactic used in spam messages.')
template_data = {"text_classfication_classes": "Spam and Ham", "text": "Change in TER Schemes of quant mutual fund",
                "norm_text": norm_text}

non_spam_prompt = Prompt(
    raw_template=raw_template,
    template_data=template_data
)
prompt = Prompt(
    raw_template=raw_template,
    template_data=template_data
)
result = llama_70b_llm_cls_head.invoke(prompt.messages)
result
Classification(classification_label='Ham', explanation='The text is about a change in a mutual fund scheme, which is a legitimate topic and not spam.')

Few Shot Examples#

  • In Zero shot learning, we are only relying on LLMs pretraining

  • In a few shot approach, we feed the LLM with few examples from the training set and their labels.

ds = load_dataset("ucirvine/sms_spam")['train'].train_test_split(test_size=0.01,stratify_by_column="label")

Let’s select 5 examples from each class to train few shot model.

ds['train'].features['label'].names
['ham', 'spam']
def generate_samples(dataset, num_samples_per_class=5, label_column=None, text_column=None):
    if label_column is None or text_column is None:
        raise ValueError("Both label_column and text_column must be provided.")

    # Get unique labels and shuffle the dataset
    unique_labels = dataset.unique(label_column)
    dataset = dataset.shuffle(seed=42)
    label_names = dataset.features[label_column].names
    # Initialize a dictionary to store samples per class name
    samples_per_class = {label_name: [] for label_name in label_names}

    # Collect samples for each class
    for example in dataset:
        label = example[label_column]
        label_name = label_names[label]
        if len(samples_per_class[label_name]) < num_samples_per_class:
            samples_per_class[label_name].append(example)

    # Create a list of {label, text} pairs
    label_text_pairs = []
    for label_name, samples in samples_per_class.items():
        for sample in samples:
            label_text_pairs.append({"label": label_name, "text": sample[text_column]})

    # Yield (text, label) pairs
    for each_sample in label_text_pairs:
        yield (norm_text(each_sample['text'].strip()), each_sample['label'].strip())
samples = generate_samples(ds['train'],text_column='sms',label_column='label')
samples = list(samples)
print(samples[0][0])
Been running but only managed 5 minutes and then needed oxygen! Might have to resort to the roller option!
few_shot_template = """
- name: system instructions
  role: system
  content: |
   You are an expert in classifying a given text into {{ text_classfication_classes }}.
   These are some of the examples that you can use to do this task.
   {% for each_example, each_label in samples %} 
   Text: {{ each_example }} Label: {{ each_label}}
   {% endfor %}

- name: user query
  role: user
  content: |
   Extract the properties listed in Classification function : {{ escape_special_characters(text) }} 
"""
template_data = {"text_classfication_classes": "Spam or Ham",
                "text": "No Deposit Required. Play for FREE and Win for Real!..-ettzhr.",
                "samples":samples}
few_shot_prompt = Prompt(
    raw_template=few_shot_template,
    template_data=template_data
)
few_shot_prompt.messages
[{'role': 'system',
  'content': "You are an expert in classifying a given text into Spam or Ham.\nThese are some of the examples that you can use to do this task.\n \nText: Been running but only managed 5 minutes and then needed oxygen! Might have to resort to the roller option! Label: ham\n \nText: Omg how did u know what I ate? Label: ham\n \nText: Hi here. have birth at on the  to  at 8lb 7oz. Mother and baby doing brilliantly. Label: ham\n \nText: Haha yeah, 2 oz is kind of a shitload Label: ham\n \nText: Aah! A cuddle would be lush! I'd need lots of tea and soup before any kind of fumbling! Label: ham\n \nText: Free video camera phones with Half Price line rental for 12 mths and 500 cross ntwk mins 100 txts. Call MobileUpd8 08001950382 or Call2OptOut/674 Label: spam\n \nText: U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094599 Label: spam\n \nText: If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy Label: spam\n \nText: As one of our registered subscribers u can enter the draw 4 a 100 G.B. gift voucher by replying with ENTER. To unsubscribe text STOP Label: spam\n \nText: **FREE MESSAGE**Thanks for using the Auction Subscription Service. 18 . 150p/MSGRCVD 2 Skip an Auction txt OUT. 2 Unsubscribe txt STOP CustomerCare 08718726270 Label: spam"},
 {'role': 'user',
  'content': 'Extract the properties listed in Classification function : No Deposit Required. Play for FREE and Win for Real!..-ettzhr.'}]
print(few_shot_prompt.messages[0]['content'])
You are an expert in classifying a given text into Spam or Ham.
These are some of the examples that you can use to do this task.
 
Text: Been running but only managed 5 minutes and then needed oxygen! Might have to resort to the roller option! Label: ham
 
Text: Omg how did u know what I ate? Label: ham
 
Text: Hi here. have birth at on the  to  at 8lb 7oz. Mother and baby doing brilliantly. Label: ham
 
Text: Haha yeah, 2 oz is kind of a shitload Label: ham
 
Text: Aah! A cuddle would be lush! I'd need lots of tea and soup before any kind of fumbling! Label: ham
 
Text: Free video camera phones with Half Price line rental for 12 mths and 500 cross ntwk mins 100 txts. Call MobileUpd8 08001950382 or Call2OptOut/674 Label: spam
 
Text: U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094599 Label: spam
 
Text: If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy Label: spam
 
Text: As one of our registered subscribers u can enter the draw 4 a 100 G.B. gift voucher by replying with ENTER. To unsubscribe text STOP Label: spam
 
Text: **FREE MESSAGE**Thanks for using the Auction Subscription Service. 18 . 150p/MSGRCVD 2 Skip an Auction txt OUT. 2 Unsubscribe txt STOP CustomerCare 08718726270 Label: spam
prompt.messages
[{'role': 'system',
  'content': 'You are an expert in classifying a given text into Spam and Ham'},
 {'role': 'user',
  'content': 'Please extract label of the following text.\nChange in TER Schemes of quant mutual fund'}]
llama_70b_llm_cls_head
RunnableBinding(bound=ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7fdb2ba48ec0>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7fdb2ba49820>, model_name='llama3-groq-70b-8192-tool-use-preview', temperature=1e-08, groq_api_key=SecretStr('**********')), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Classification', 'description': 'Function that Classifies the text into Spam or Ham', 'parameters': {'type': 'object', 'properties': {'classification_label': {'enum': ['Spam', 'Ham', 'spam', 'ham'], 'type': 'string'}, 'explanation': {'description': 'Explain why you gave that label to this text. Keep your answers short and precise. I will tip you $20 for a good explanation. ', 'type': 'string'}}}}}], 'tool_choice': {'type': 'function', 'function': {'name': 'Classification'}}})
| PydanticToolsParser(first_tool_only=True, tools=[<class '__main__.Classification'>])
few_shot_prompt.messages
[{'role': 'system',
  'content': "You are an expert in classifying a given text into Spam or Ham.\nThese are some of the examples that you can use to do this task.\n \nText: Been running but only managed 5 minutes and then needed oxygen! Might have to resort to the roller option! Label: ham\n \nText: Omg how did u know what I ate? Label: ham\n \nText: Hi here. have birth at on the  to  at 8lb 7oz. Mother and baby doing brilliantly. Label: ham\n \nText: Haha yeah, 2 oz is kind of a shitload Label: ham\n \nText: Aah! A cuddle would be lush! I'd need lots of tea and soup before any kind of fumbling! Label: ham\n \nText: Free video camera phones with Half Price line rental for 12 mths and 500 cross ntwk mins 100 txts. Call MobileUpd8 08001950382 or Call2OptOut/674 Label: spam\n \nText: U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094599 Label: spam\n \nText: If you don't, your prize will go to another customer. T&C at www.t-c.biz 18+ 150p/min Polo Ltd Suite 373 London W1J 6HL Please call back if busy Label: spam\n \nText: As one of our registered subscribers u can enter the draw 4 a 100 G.B. gift voucher by replying with ENTER. To unsubscribe text STOP Label: spam\n \nText: **FREE MESSAGE**Thanks for using the Auction Subscription Service. 18 . 150p/MSGRCVD 2 Skip an Auction txt OUT. 2 Unsubscribe txt STOP CustomerCare 08718726270 Label: spam"},
 {'role': 'user',
  'content': 'Extract the properties listed in Classification function : No Deposit Required. Play for FREE and Win for Real!..-ettzhr.'}]
llama_70b_llm.invoke(few_shot_prompt.messages)
AIMessage(content='The given text is a spam message. It contains several characteristics that are typical of spam messages, such as:\n\n1. Urgency: The message creates a sense of urgency by stating "No Deposit Required" and "Play for FREE and Win for Real!" which is a common tactic used by spammers to get the recipient\'s attention.\n\n2. Misleading information: The message is misleading as it claims that the recipient can win for real without making any deposit, which is likely not true.\n\n3. Use of abbreviations and symbols: The message uses abbreviations and symbols like "ettzhr" which is not a common practice in legitimate messages.\n\n4. Lack of personalization: The message does not address the recipient by name, indicating that it is a mass spam message.\n\n5. Suspicious link: The message contains a suspicious link "ettzhr" which may lead to a phishing or malware site.\n\nBased on these characteristics, the text can be classified as spam.', response_metadata={'token_usage': {'completion_tokens': 197, 'prompt_tokens': 421, 'total_tokens': 618, 'completion_time': 0.633444274, 'prompt_time': 0.030477088, 'queue_time': 0.0021507320000000024, 'total_time': 0.663921362}, 'model_name': 'llama3-groq-70b-8192-tool-use-preview', 'system_fingerprint': 'fp_ee4b521143', 'finish_reason': 'stop', 'logprobs': None}, id='run-2067f181-e53b-4ee7-bbcc-c52a038fcbf6-0', usage_metadata={'input_tokens': 421, 'output_tokens': 197, 'total_tokens': 618})
result = llama_70b_llm_cls_head.invoke(few_shot_prompt.messages)
result
Classification(classification_label='spam', explanation='The text contains promotional language and a call to action, which is typical of spam messages.')

Validation#

import time
def run_zero_shot_classification(ds, model, template, text_column= "sms", classes="Spam or Ham"):
    """
    Runs few-shot classification on a given dataset.

    Parameters:
    - ds: The dataset containing test samples.
    - model: The language model to invoke for classification.
    - template: The template string to generate prompts.
    - samples: The few-shot examples to include in the prompt.
    - classes: The classes for text classification. Default is "Spam or Ham".

    Returns:
    - A list of classified samples with labels and explanations.
    """
    zero_shot_results = []
    
    for each_sample in tqdm(ds['test']):
        template_data = {
            "text_classfication_classes": classes,
            "text": each_sample[text_column].strip(),
        }
        zero_shot_prompt = Prompt(
            raw_template=template,
            template_data=template_data
        )
        # print(zero_shot_prompt.messages)
        validation = model.invoke(zero_shot_prompt.messages)
        each_sample['classification_label'] = validation.classification_label
        each_sample['explanation'] = validation.explanation
        zero_shot_results.append(each_sample)
    
    return zero_shot_results
llama_70b_llm_cls_head
RunnableBinding(bound=ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7fdb2ba48ec0>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7fdb2ba49820>, model_name='llama3-groq-70b-8192-tool-use-preview', temperature=1e-08, groq_api_key=SecretStr('**********')), kwargs={'tools': [{'type': 'function', 'function': {'name': 'Classification', 'description': 'Function that Classifies the text into Spam or Ham', 'parameters': {'type': 'object', 'properties': {'classification_label': {'enum': ['Spam', 'Ham', 'spam', 'ham'], 'type': 'string'}, 'explanation': {'description': 'Explain why you gave that label to this text. Keep your answers short and precise. I will tip you $20 for a good explanation. ', 'type': 'string'}}}}}], 'tool_choice': {'type': 'function', 'function': {'name': 'Classification'}}})
| PydanticToolsParser(first_tool_only=True, tools=[<class '__main__.Classification'>])
zero_shot_template  = """
- name: system instructions
  role: system
  content: |
   You are an expert in classifying a given text into {{ text_classfication_classes }}

- name: user query
  role: user
  content: |
   Please extract properies defined in Classification function from the following text.
   {{ text }} 
"""
zero_shot_results = run_zero_shot_classification(ds, llama_70b_llm_cls_head, 
                            zero_shot_template, classes="Spam or Ham")
100%|███████████████████████████████████████████████████████████████████████████████████| 56/56 [01:27<00:00,  1.56s/it]
zero_shot_results = pd.DataFrame(zero_shot_results)
zero_shot_results[zero_shot_results['classification_label'].isna()]
sms label classification_label explanation
zero_shot_results['classification_id'] = zero_shot_results['classification_label'].apply(lambda x: 1 if x == "spam" or x=="Spam" else 0)
# Acuuracy
(zero_shot_results['label'] == zero_shot_results['classification_id']).sum() / zero_shot_results.shape[0]
0.8392857142857143
zero_shot_results['label'].value_counts()
label
0    48
1     8
Name: count, dtype: int64
zero_shot_results['classification_label'].value_counts()
classification_label
Ham     39
Spam    17
Name: count, dtype: int64
zero_shot_results[zero_shot_results['label'] != zero_shot_results['classification_id']]
sms label classification_label explanation classification_id
0 Hi! You just spoke to MANEESHA V. We'd like to... 0 Spam The text is a spam message as it is a generic ... 1
4 Perhaps * is much easy give your account ident... 0 Spam The text contains a request for personal infor... 1
14 I want to lick your pussy now...\n 0 Spam The text contains explicit content and is like... 1
17 HI BABE IM AT HOME NOW WANNA DO SOMETHING? XX\n 0 Spam The text contains overly casual language and a... 1
35 See the forwarding message for proof\n 0 Spam The text appears to be a spam message as it is... 1
45 Or better still can you catch her and let ask ... 0 Spam The text contains a request to sell a product,... 1
46 How are you with moneY...as in to you...money ... 0 Spam The text contains nonsensical and irrelevant c... 1
48 That's fine, I'll bitch at you about it later ... 0 Spam The text contains aggressive language and a ne... 1
54 "NOT ENUFCREDEIT TOCALL.SHALL ILEAVE UNI AT 6 ... 0 Spam The text contains abbreviations and lacks prop... 1
def run_few_shot_classification(ds, model, template, samples, text_column= "sms", classes="Spam or Ham"):
    """
    Runs few-shot classification on a given dataset.

    Parameters:
    - ds: The dataset containing test samples.
    - model: The language model to invoke for classification.
    - template: The template string to generate prompts.
    - samples: The few-shot examples to include in the prompt.
    - classes: The classes for text classification. Default is "Spam or Ham".

    Returns:
    - A list of classified samples with labels and explanations.
    """
    few_shot_results = []
    
    for each_sample in tqdm(ds['test']):
        template_data = {
            "text_classfication_classes": classes,
            "text": each_sample['sms'].strip(),
            "samples": samples
        }
        few_shot_prompt = Prompt(
            raw_template=template,
            template_data=template_data
        )
        validation = model.invoke(few_shot_prompt.messages)
        each_sample['classification_label'] = validation.classification_label
        each_sample['explanation'] = validation.explanation
        few_shot_results.append(each_sample)
    
    return few_shot_results
print(few_shot_template)
- name: system instructions
  role: system
  content: |
   You are an expert in classifying a given text into {{ text_classfication_classes }}.
   These are some of the examples that you can use to do this task.
   {% for each_example, each_label in samples %} 
   Text: {{ each_example }} Label: {{ each_label}}
   {% endfor %}

- name: user query
  role: user
  content: |
   Extract the properties listed in Classification function : {{ escape_special_characters(text) }} 
few_shot_results = run_few_shot_classification(ds, llama_70b_llm_cls_head, 
                            few_shot_template, samples=samples, text_column= "sms", classes="Spam or Ham")
100%|███████████████████████████████████████████████████████████████████████████████████| 56/56 [01:41<00:00,  1.82s/it]
few_shot_results = pd.DataFrame(few_shot_results)
few_shot_results[few_shot_results['classification_label'].isna()]
sms label classification_label explanation
1 Ok but tell me half an hr b4 u come i need 2 p... 0 None None
31 Hey so whats the plan this sat? \n 0 None None
few_shot_results = few_shot_results[~few_shot_results['classification_label'].isna()]
few_shot_results['classification_id'] = few_shot_results['classification_label'].apply(lambda x: 1 if x == "spam" else 0)
# Acuuracy
(few_shot_results['label'] == few_shot_results['classification_id']).sum() / zero_shot_results.shape[0]
0.9285714285714286
few_shot_results[few_shot_results['label'] != few_shot_results['classification_id']].head(1)
sms label classification_label explanation classification_id
0 Hi! You just spoke to MANEESHA V. We'd like to... 0 spam The text contains a request for feedback and a... 1
few_shot_results[few_shot_results['label'] != few_shot_results['classification_id']]['sms'].iloc[0]
"Hi! You just spoke to MANEESHA V. We'd like to know if you were satisfied with the experience. Reply Toll Free with Yes or No.\n"
few_shot_results[few_shot_results['label'] != few_shot_results['classification_id']]['explanation'].iloc[0]
'The text contains a request for feedback and a toll-free number, which is a common tactic used in spam messages.'

We developed a model with more than 92% accuracy by leveraging the capabilities of LLMs to generalize (it’s probably fine tuned on this dataset as well).

Things to consider:

  • LLM has been pretrained on lot of tasks, this performance could be misleading because this task would have been trained already.

  • Are LLMs learning how we are learning? Or is it just remembering / retrieving things?

  • LLMs can return good looking answers confidently even when uncertain

Some advancements to consider:

  • Which samples to show? Can we maintain diversity of the samples?

  • Number of samples

Token Classification#

from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from typing import List, Literal, Annotated
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools import tool
import getpass
import numpy as np
from datasets import load_dataset
from prompt_poet import Prompt
from tqdm import tqdm
import pandas as pd
from devtools import pprint
from typing import List, Optional
groq_api_key = getpass.getpass()
MaxLengthStr = Annotated[str, Field(max_length=20)]

class JobDescriptionExtraction(BaseModel):
    job_title: str = Field(description="The Job title of the text, keep the title short, do not include any locations or other details apart from title")
    tech_skills: List[MaxLengthStr] = Field(description="Technical Skills mentioned in the text")
    soft_skills: List[MaxLengthStr] = Field(description="Soft Skills mentioned in the text")
    certifications: List[MaxLengthStr] = Field(description="Certifications mentioned in the text")
    locations: List[str] = Field(description="Geographical Locations mentioned in Job Description if any. Otherwise return an empty list")

Let’s say we want to extract certain entities from raw text. The generic task of such type could be formulated as as token classification problem.

Eg: POS tagging, Named entity recognition etc.

Traditional Methods#

  • Creation of data labels at a token or word level

  • For multiword phrases, creating Begin-Inside-End tokens

  • Training sequential models HMM, LSTM etc.

  • Labelling is an exhaustive effort.

  • Feature engineering is necessary.

  • Generalization beyond the domain in which model is trained for is not possible.

  • Distributional learning methods like Word2vec help but not by a lot

  • Lot of applications in the industry:

  • Aspect level sentiment analysis: Product - Feature - Sentiment

  • Getting structured data from unstructured text

As an example, let’s see if we can use a pretrained LLM to directly extract entities from a Job description using Zero-shot approach.

raw_template = """
- name: system instructions
  role: system
  content: |
    You are an expert in classifying a given text into a job title and extracting properties defined in the JobDescriptionExtraction function. 
    Do not respond with anything other than the text mentioned in the text.

- name: user query
  role: user
  content: |
    Please extract properties defined in the JobDescriptionExtraction function: 
    {{ escape_special_characters(text) }}
"""
template_data = {"text" : '''
Data Scientist VP - Chief Data Office India, Bengaluruor Mumbai

Description

As a Data Scientist with the Chief Data Office, you will shape the future of the Chief Administrative Office and its businesses by applying world-class machine learning expertise. You will collaborate on a wide array of product and business problems with a diverse set of cross-functional partners across Finance, Supplier Services, Data Security intelligence program, Global Real Estate and Customer Experience. You will use data and analysis to identify and solve our divisions biggest challenges and develop state-of-the art machine learning models to solve real-world problems. We have evolved from our ‘startup’ roots to become a credible strategic partner trusted by division wide leadership and are expanding now. By joining JP Morgan Chief Data Office (CAO), you will become part of a world-class Data science community dedicated to problem solving and career growth in ML/AI discipline and beyond.

Product Owner: Develop and own ML products to drive business outcomes and influence your strategic partners, in a highly collaborative environment
Research & Learning: The candidate must also have a strong passion for machine learning and invest independent time towards learning, researching, and experimenting with new innovations in the field.
Problem-Solving: We want a strategic thinker with demonstrated problem-solving skills using Machine Learning Skills.

Technical Skills

Master’s in quantitative field (Computer Science, Mathematics, Statistics, or ML)
6-8 years industry experience in data science / applied ML model development (must have)
Strong knowledge and experience with Traditional ML, Deep Learning,LLM, NLP, time-series predictions, or recommendation systems (must have)
Excellent python coding and algorithm skills (must have)
Experience with data visualization techniques and software
Foundational Statistics knowledge

Additional Skills

Experience driving AI adoption
Experience with Data Querying (e.g., SQL, big data), A/B Testing
Experience with Cloud based deployment (e.g., aws, azure), Engineering background
Experience with python frameworks (e.g., pyspark, django, Flask, Bottle)
Experience in financial markets or services firm

ABOUT US

JPMorgan Chase & Co., one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation.

About The Team

Our professionals in our Corporate Functions cover a diverse range of areas from finance and risk to human resources and marketing. Our corporate teams are an essential part of our company, ensuring that we’re setting our businesses, clients, customers and employees up for success.

'''}
llm_job_extraction = ChatGroq(temperature=0, model_name="llama3-groq-70b-8192-tool-use-preview",
                              api_key=groq_api_key).with_structured_output(JobDescriptionExtraction)
prompt = Prompt(
    raw_template=raw_template,
    template_data=template_data
)
result = llm_job_extraction.invoke(prompt.messages)
pprint(result)
JobDescriptionExtraction(
    job_title='Data Scientist VP - Chief Data Office',
    tech_skills=[
        'Machine Learning',
        'Deep Learning',
        'LLM',
        'NLP',
        'time-series predictions',
        'recommendation systems',
        'Python',
        'Data Querying',
        'SQL',
        'big data',
        'Cloud based deployment',
        'aws',
        'azure',
        'Engineering background',
        'Python frameworks',
        'pyspark',
        'django',
        'Flask',
        'Bottle',
    ],
    soft_skills=[
        'Strategic thinker',
        'Problem-solving',
        'Collaboration',
        'Innovation',
        'Adoption',
        'Leadership',
    ],
    certifications=[],
    locations=[
        'India',
        'Bengaluru',
        'Mumbai',
    ],
)

Aspect level Sentiment#

raw_template = """
- name: system instructions
  role: system
  content: |
    You are an expert in identifying the sentiment of the review of a product into a positive, negative or neutral and extracting properties defined in the AspectLevelSentiments function. 
    Do not respond with anything other than the text mentioned in the text.

- name: user query
  role: user
  content: |
    Please extract properties defined in the AspectLevelSentiments function: 
    {{ escape_special_characters(text) }}
"""
template_data = {"text": '''
Pros
1. Very good looking, especially the Oasis green variant.
2. Very smooth without any stutters
3. No heating in normal use and I am not a gamer.
4. Longer software updates
5. Good display and charges on 32 mins.
Cons
1. Average cameras
2. Display should have been better in outdoor brightness.
3. Battery drains faster even in power saver mode. Lasts only a day with average normal usage.
4. Software experience in oxygen OS has been degraded and with some bugs.

Overall an above average experience with the Nord 4.
'''}
class AspectLevelSentiment(BaseModel):
    product_aspect: Optional[str] = Field(description="What aspect of the product is the user talking about")
    sentiment: Optional[str]= Field(enum=["positive","negative","neutral"])
    sentiment_term: Optional[str]= Field(description="term used to describe the sentiment on the aspect")

class AspectLevelSentiments(AspectLevelSentiment):
    Sentiments: List[AspectLevelSentiment]
prompt = Prompt(
    raw_template=raw_template,
    template_data=template_data
)
llm_aspect_sentiment = ChatGroq(temperature=0, model_name="llama3-groq-70b-8192-tool-use-preview"
                                ,api_key=groq_api_key).with_structured_output(AspectLevelSentiments)
result = llm_aspect_sentiment.invoke(prompt.messages)
pprint(result)
AspectLevelSentiments(
    product_aspect=None,
    sentiment=None,
    sentiment_term=None,
    Sentiments=[
        AspectLevelSentiment(
            product_aspect='appearance',
            sentiment='positive',
            sentiment_term='good looking',
        ),
        AspectLevelSentiment(
            product_aspect='performance',
            sentiment='positive',
            sentiment_term='smooth',
        ),
        AspectLevelSentiment(
            product_aspect='battery life',
            sentiment='positive',
            sentiment_term='no heating',
        ),
        AspectLevelSentiment(
            product_aspect='software updates',
            sentiment='positive',
            sentiment_term='longer',
        ),
        AspectLevelSentiment(
            product_aspect='display',
            sentiment='positive',
            sentiment_term='good',
        ),
        AspectLevelSentiment(
            product_aspect='battery life',
            sentiment='negative',
            sentiment_term='drains faster',
        ),
        AspectLevelSentiment(
            product_aspect='battery life',
            sentiment='negative',
            sentiment_term='lasts only a day',
        ),
        AspectLevelSentiment(
            product_aspect='software experience',
            sentiment='negative',
            sentiment_term='degraded',
        ),
        AspectLevelSentiment(
            product_aspect='software experience',
            sentiment='negative',
            sentiment_term='bugs',
        ),
        AspectLevelSentiment(
            product_aspect='cameras',
            sentiment='negative',
            sentiment_term='average',
        ),
        AspectLevelSentiment(
            product_aspect='display',
            sentiment='negative',
            sentiment_term='should have been better in outdoor brightness',
        ),
    ],
)