**The data extraction and classification framework for the Aileana application found here **
**Aileana** excels at identifying and correlating *jobs* , *skills* , *requirements* , *benefits* , *required experience* , and *responsibilities* . Utilizing agentic workflows from state-of-the-art LLMs and Retrieval-Augmented Generation (RAG), she enhances accuracy and reliability with data sourced from recent and relevant information. The Knowledge Graph database allows the models to perform more detailed queries and deeper analysis than traditional vector embeddings in SQL databases.
| Category | Technology |
|---|---|
| Frontend | |
| Backend | |
| Databases | |
| Web Scraping | |
| Data Processing | |
| DevOps | |
| Testing | |
| LLM Frameworks |
Note: Feel free to can use the requirements.txt to pip install all dependancies in the project environment.
Here is an overview of the main processes that take place to achieve the end result:
- The system scrapes job listing websites for (new) jobs.
- Translates all scraped listings into English (if not already in English 🇬🇧) .
- Stores all scraped listings in a
PostgreSQLdatabase. - The unstructured data is processed by the LLM to extract valuable information in a structured format.
- Using the structured data, predefined
nodes,attributes, and theirrelationshipsare stored in theNeo4jgraph database. - Data cleaning for the node
labelsandattributes. - Creates vector
embeddingsand anindexin the graph database for use in RAG. RAGis used to ground a conversational LLM (avoiding hallucinations) to assist users with related questions.- In addition to written responses, LLMs create charts/plots visualizing data based on user prompts.
To scrape job listings, I primarily used Beautiful Soup 4 and Selenium to navigate through websites and extract each job listing. Job listing websites typically have straightforward, repetitive layouts, making information extraction easy-peasy.
After gathering and translating all the job data, it was stored in a PostgreSQL database. This database serves as a single source of truth before further processing the unstructured data included in the job descriptions. The database schema is simple, with just one table and lots of columns 😜.
This was the most interesting part for a coding newbie like me.
Each job listing stored in the PostgreSQL database is parsed using an LLM to populate the Knowledge Graph database Neo4J based on the schema below.
After importing all jobs into the Knowledge Graph, another LLM process is used for rough data analysis and cleaning (e.g., removing duplicate entries or similar data in other nodes).
Finally, embeddings for each node and a vector index are created and updated after each new batch of scraped listings.
International Standards:
For more accurate data analysis, I decided to adopt a few International Standards:
- NACE (Nomenclature of Economic Activities) V2 - A European standard for classification of economic activities.
- ISCED (2011) - Levels of education - A framework for categorizing levels of education into seven levels.
- ISCO-88 Occupation Titles - An International Standard Classification of Occupations that groups jobs into four levels of aggregation.
This will help with the accuracy of data analysis of different types of jobs posted by companies in various industries in correlation with their level of education and related skills.
The Knowledge Graph schema
Nodes:
INDUSTRY CATEGORY
- Label:
INDUSTRY_CATEGORY - Property Keys:
industry_name: The industry under which the company posted the listing.standardized_industry_name: Standardized industry type based on NACE (Nomenclature of Economic Activities) V2.
JOB TITLE
- Label:
JOB_TITLE - Property Keys:
job_title: The job title as mentioned in the job listing.standardized_occupation: The standardized occupation based on ISCO-88.job_seniority: Internship, Entry, Junior, Mid, or Senior level (if mentioned).minimum_level_of_education: The minimum level of education required, based on ISCED (2011).external_id: The ID used by the job listing website (if applicable).employment_type[optional]: Full-time, Part-time, etc. (if mentioned).employment_model[optional]: On-site, Remote, Hybrid, or any other employment model (if mentioned).
SKILLS
- Label:
SKILL - Property Keys:
skill_category: Soft or Hard skill.skill_name: Name of the skill.skill_type[optional]: Academic Skill, Technical Skill, Knowledge of a Software tool, Professional Certification, Personality Attribute, Fluency in a Language, or any other skill.
BENEFITS
- Label:
BENEFIT - Property Keys:
benefit_name: Days of annual leave, Health Insurance (Private, Public, or both), Provident Fund, Amenities, or any other benefit.
EXPERIENCE
- Label:
EXPERIENCE - Property Keys:
years_required: Whether previous experience is required (boolean) (if mentioned).minimum_years[optional]: The minimum number of years needed (integer) (if mentioned).
RESPONSIBILITIES
- Label:
RESPONSIBILITY - Property Keys:
description: A minimal summary of each responsibility requested in the job listing.
Node Relationships:
INDUSTRY_CATEGORY|POSTS|JOB_TITLEJOB_TITLE|NEEDS|SKILLJOB_TITLE|REQUIRES|EXPERIENCEJOB_TITLE|OFFERS|BENEFITSJOB_TITLE|HAS|RESPONSIBILITYRESPONSIBILITY|RELATES_TO|SKILL
Imagine the dynamic duo of GROQ with its lightning-fast speeds 🚀💨 paired with the incredible reasoning prowess of LLama 3 70B. This powerhouse combination makes extracting key information from job listings a breeze! It was so fast and easy that I even used LLMs for simple tasks like text translations. After cross-checking the response from OpenAI's Chat-GPT4o, the results were surprisingly close, making it a no-brainer due to the cost ( and speed! ) difference.
While one database could suffice, I initially planned to use the PGVector add-on for PostgreSQL. However, after parsing a few thousand listings 😅, I realized leveraging LLMs (Large Language Models) for data extraction was a brilliant move for a data analysis project. This approach becomes even more exciting when combined with a Knowledge Graph database like Neo4J - it's a dream come true for my inner data geek 🔍!
After some research, I found that the embeddings model from OpenAI provided excellent results, though it came with a cost of vectorizing thousands of job listings ▿️.
This project is my cool experiment to see if we can really put Large Language Models (LLMs) and agentic frameworks like LangChain and CrewAI into production. Spoiler alert: it's a wild ride, flaws and all!
Sure, these tools are reliable...ish. They're consistent...ish. But are they perfect? Not quite. Sometimes, even with simple text, different models can give you wildly different results 🔍.
Designing an LLM-based solution isn't just plug-and-play. You've got to juggle tokens, output formats (think JSON), and the cost and reliability of parameters like temperature. And don't even get me started on prompt engineering – it's an art form, not a science 😎!
Oh, and did I mention tools? LLMs can now use them, but in reality, creating custom tools can sometimes lead to a bigger codebase. Sometimes, a simple script does the trick better.
Now, let's talk unit testing. Setting up self-checking for LLMs can seriously boost response accuracy. But here's the catch: writing these self-checking methods is a whole other ball game because every agent and every API use case is different.... be prepared for a cost hike and a flurry of API calls 💸.
On the other hand, when you need to go through hundreds or thousands of paragraphs to extract valuable information, you need brains... and lots of them 🧠🧠🧠.
LLMs are the only solution for such tasks at scale!
So, what's the takeaway? The fewer decisions a model has to make, the more reliable it will be. But that also means more API calls and, yep, higher costs.
if you went that far, I hope you enjoyed this tech adventure🎉!
Love, Dimitris
Resources and Inspiration
JohannesJolkkonen: Knowledge Graph + Pythoon GitHub)
Self-Reflective RAG with LangGraph (langchain.dev)
Whats next for AI agentic workflows ft. Andrew Ng of AI Fund
Going Meta - Ep 27: Building a Reflection Agent with LangGraph
Convert any Text Data into a Knowledge Graph (using LLAMA3 + GROQ)
Websraping alternative Scrapegraph.ai