
Problem Statement:
The challenge is to develop a system that scrapes scheme data from myscheme.gov.in, extracts scheme names and eligibility criteria, and parses four key parameters: gender, age, income, and caste. Each parameter is recorded when mentioned (e.g., “Gender: Female”) and marked as “Not Specified” otherwise, enabling quick and accurate eligibility checks for citizens and administrators.
Solution To Problem:
Scheme Extraction:
We developed a script using Selenium to scrape scheme names and eligibility criteria from the MyScheme website. The workflow was fully automated:
- Automation Tool: We employed the Selenium library to control a web browser and navigate to the target pages.
- Data Scraping: A WebDriver was used to identify and extract the scheme name and eligibility criteria text via their unique XPath selectors.
- Data Export: The extracted information was then cleanly transferred and saved into a structured Excel sheet.
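A minimal sketch of the three steps above, assuming a Chrome WebDriver is available; the XPath selectors and output file name are placeholders rather than the project's actual values:

```python
import pandas as pd

def scrape_schemes(scheme_urls):
    """Visit each scheme page and pull its name and eligibility text."""
    # Imported here so the pure helper below works without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes a chromedriver on PATH
    rows = []
    try:
        for url in scheme_urls:
            driver.get(url)
            # Placeholder XPaths -- substitute the real selectors
            # inspected on myscheme.gov.in.
            name = driver.find_element(By.XPATH, "//h1").text
            criteria = driver.find_element(
                By.XPATH, "//div[@id='eligibility']").text
            rows.append({"Scheme Name": name,
                         "Eligibility Criteria": criteria})
    finally:
        driver.quit()
    return rows

def to_frame(rows):
    """Arrange the scraped rows into a structured table."""
    return pd.DataFrame(rows, columns=["Scheme Name", "Eligibility Criteria"])

# Final export step:
# to_frame(rows).to_excel("schemes.xlsx", index=False)
```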
Eligibility Parameters Extraction Approaches:
1) Natural Language Processing:
We used Natural Language Processing (NLP) to extract gender, age, income, and caste from the eligibility criteria text:
- Objective: To automatically identify and pull data related to gender, age, income, and caste.
- Methods Used:
1. Regular expressions (regex) were applied to accurately retrieve numerical values for age and income.
2. Named Entity Recognition (NER) was implemented as a broader strategy to detect and label all the required demographic entities within the text.
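The regex step might look like the following sketch; the patterns and the "Not Specified" defaults are illustrative, not the exact rules used in the project:

```python
import re

def parse_eligibility(text):
    """Pull gender, age range, income, and caste from free text."""
    out = {"Gender": "Not Specified", "Age": "Not Specified",
           "Income": "Not Specified", "Caste": "Not Specified"}
    m = re.search(r"\b(female|women|male|men)\b", text, re.I)
    if m:
        out["Gender"] = ("Female" if m.group(1).lower() in ("female", "women")
                         else "Male")
    # Age range such as "aged 18 to 40" / "age 18-40"
    m = re.search(r"age.*?(\d{1,3})\s*(?:to|-|and)\s*(\d{1,3})", text, re.I)
    if m:
        out["Age"] = f"{m.group(1)}-{m.group(2)}"
    # Income figure following "Rs." or the rupee sign
    m = re.search(r"income.*?(?:Rs\.?|₹)\s*([\d,]+)", text, re.I)
    if m:
        out["Income"] = m.group(1).replace(",", "")
    m = re.search(r"\b(SC|ST|OBC|General)\b", text)  # case-sensitive on purpose
    if m:
        out["Caste"] = m.group(1)
    return out
```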
Challenges with this approach:
Accuracy was low, and the rules were complex to implement and maintain.
2) Fine Tuning LLM:
We fine-tuned the Qwen2.5-0.5B-Instruct model to extract gender, age, income, and caste from text.
Here’s a breakdown of the project to fine-tune a model for demographic data extraction:
- Model: Qwen2.5-0.5B-Instruct
- Objective: To perform structured extraction of gender, age, income, and caste.
- Dataset: A custom-built, instruction-based dataset containing 1,000 text-output pairs.
- Technical Stack:
1. torch: The core deep learning framework.
2. transformers: For loading and integrating the base model.
3. peft: To implement parameter-efficient fine-tuning for memory and compute savings.
4. trl: For executing the supervised fine-tuning training loop.
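For illustration, one text-output pair in such an instruction-based dataset might look like this; the field names and wording are assumptions, not the project's actual schema:

```python
import json

# A single training example in instruction/input/output form, the
# shape commonly consumed by trl's supervised fine-tuning trainer.
pair = {
    "instruction": ("Extract gender, age, income, and caste from the "
                    "eligibility text. Use 'Not Specified' when a field "
                    "is absent."),
    "input": ("The applicant must be a woman aged 18 to 40 belonging "
              "to the SC/ST category."),
    "output": json.dumps({
        "Gender": "Female",
        "Age": "18-40",
        "Income": "Not Specified",
        "Caste": "SC/ST",
    }),
}
```

The dataset in the project contains 1,000 such pairs; storing the target as a JSON string keeps the supervised output format unambiguous.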
Challenges with this approach:
- We rejected this model because it did not extract the data accurately.
- Fine-tuning is a time-consuming process.
- More training data would have been needed to fine-tune the model effectively.
3) Mistral 7B LLM:
We used the Mistral 7B model, run locally via Ollama, to extract gender, age, and income from text.
Here’s a breakdown of the project to extract demographic data using a local LLM:
- Core Model: Mistral 7B.
- Local Deployment with Ollama: Employed Ollama to easily package and serve the Mistral 7B model on a local machine.
- Toolkit & Workflow:
1. Pandas: Utilized for structuring and managing the input text and the extracted output data.
2. Regular expressions (regex): Applied to the model’s raw output to perform final cleaning, validation, and formatting.
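The regex cleaning step (2) could be sketched as follows, assuming the model replies with "Field: value" lines; both the reply format and the patterns are illustrative:

```python
import re

def clean_output(raw):
    """Pull 'Field: value' lines out of the model's raw reply."""
    fields = {"Gender": "Not Specified",
              "Age": "Not Specified",
              "Income": "Not Specified"}
    for field in fields:
        # Accept "Gender: Female" or "Gender = Female"; stop at
        # a newline or comma so trailing chatter is dropped.
        m = re.search(rf"{field}\s*[:=]\s*([^\n,]+)", raw, re.I)
        if m:
            fields[field] = m.group(1).strip()
    return fields
```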
Challenges with this approach:
The model did not retrieve the data accurately.
4) Gemini 1.5 Flash:
We successfully used the Gemini 1.5 Flash model to accurately extract eligibility parameters from text.
Here is a summary of the project focused on automated data extraction:
- Core Model: Google’s Gemini 1.5 Flash.
- API & Integration:
1. Authenticated using an API key obtained from Google AI Studio.
2. Interfaced with the model via the official google.generativeai Python SDK.
- Data Handling: The pandas library was used to organize the extracted information into a structured format (e.g., a DataFrame) for analysis.
- Performance: The model extracted the eligibility parameters accurately.
- Limitations:
Free-tier access to the service is heavily rate-limited.
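A hedged sketch of the integration described above; the prompt wording and the JSON reply format (including the optional code fence the model sometimes adds) are assumptions:

```python
import json

PROMPT = ("From the eligibility text below, return a JSON object with keys "
          "Gender, Age, Income, Caste; use 'Not Specified' for missing "
          "fields.\n\n{text}")

def extract_with_gemini(text, api_key):
    """Send one eligibility text to Gemini 1.5 Flash and parse the reply."""
    import google.generativeai as genai  # official SDK
    genai.configure(api_key=api_key)     # key from Google AI Studio
    model = genai.GenerativeModel("gemini-1.5-flash")
    reply = model.generate_content(PROMPT.format(text=text))
    return parse_reply(reply.text)

def parse_reply(reply_text):
    """Parse the model's JSON reply, tolerating a ```json code fence."""
    cleaned = reply_text.strip().strip("`")
    if cleaned.startswith("json"):
        cleaned = cleaned[4:]
    return json.loads(cleaned)
```

The per-scheme dictionaries returned by parse_reply can then be collected into a pandas DataFrame for analysis.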
5) Llama-3.1-8B-Instant:
We used this model to extract the eligibility parameters from the eligibility criteria text stored in the Excel file.
Here is a summary of the project focused on automated data extraction:
- Inference Engine: Utilized the Groq API to access a state-of-the-art language model optimized for speed.
- Technical Stack:
1. groq: The official Python client was used to manage API requests for real-time processing.
2. pandas: Employed to structure the model’s JSON output into a clean and usable DataFrame.
- Performance: The solution was highly successful, with the model demonstrating excellent accuracy and completeness in its extractions. It also offers generous usage limits.
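A sketch of this workflow, assuming a groq.Groq client created with an API key; the prompt wording and the reply shape are illustrative, not the project's exact code:

```python
import json
import pandas as pd

def extract_row(client, text):
    """Ask the model for one scheme's parameters as strict JSON."""
    # client = groq.Groq(api_key=...) is created by the caller.
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content":
                   "Return only JSON with keys Gender, Age, Income, Caste "
                   "('Not Specified' if absent) for: " + text}],
        response_format={"type": "json_object"},  # force a JSON reply
    )
    return json.loads(resp.choices[0].message.content)

def to_frame(records):
    """Assemble the per-scheme JSON outputs into a clean DataFrame."""
    return pd.DataFrame(records, columns=["Gender", "Age", "Income", "Caste"])
```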
Conclusion:
We chose the Llama-3.1-8B-Instant model for its excellent accuracy, fast performance, and higher usage limits. It delivers reliable results across complex tasks while maintaining efficiency.
This balance of precision and scalability makes it ideal for our needs.