Aileana - Data Extraction

**The data extraction and classification framework for the Aileana application.**

**Aileana** excels at identifying and correlating *jobs*, *skills*, *requirements*, *benefits*, *required experience*, and *responsibilities*. Utilizing agentic workflows from state-of-the-art LLMs and Retrieval-Augmented Generation (RAG), she enhances accuracy and reliability with data sourced from recent and relevant information. The Knowledge Graph database allows the models to perform more detailed queries and deeper analysis than traditional vector embeddings in SQL databases.

⚡️Tech Stack

Category	Technology
Frontend
Backend
Databases
Web Scraping
Data Processing
DevOps
Testing
LLM Frameworks

Note: Feel free to use requirements.txt to pip install all dependencies in the project environment.

📕 The Process

🔮 How the Magic Happens

Here is an overview of the main processes that take place to achieve the end result:

1. The system scrapes job listing websites for new jobs.
2. It translates all scraped listings into English (if not already in English 🇬🇧).
3. It stores all scraped listings in a PostgreSQL database.
4. The unstructured data is processed by the LLM to extract valuable information in a structured format.
5. Using the structured data, predefined nodes, attributes, and their relationships are stored in the Neo4j graph database.
6. The node labels and attributes are cleaned.
7. Vector embeddings and an index are created in the graph database for use in RAG.
8. RAG is used to ground a conversational LLM (reducing hallucinations) that assists users with related questions.
9. In addition to written responses, LLMs create charts/plots visualizing data based on user prompts.
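
To make the LLM extraction step in the list above a bit more concrete, here is a minimal sketch of validating a model's JSON reply before it goes anywhere near the graph. The `ExtractedJob` shape, the field names, and the sample reply are all assumptions made up for illustration, not the project's actual code.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedJob:
    """Structured fields pulled from one job description (hypothetical shape)."""
    job_title: str
    skills: list[str] = field(default_factory=list)
    benefits: list[str] = field(default_factory=list)
    responsibilities: list[str] = field(default_factory=list)
    minimum_years: Optional[int] = None


def parse_llm_response(raw: str) -> ExtractedJob:
    """Validate the model's JSON reply and coerce it into an ExtractedJob."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    title = data.get("job_title") if isinstance(data, dict) else None
    if not isinstance(title, str) or not title.strip():
        raise ValueError("LLM reply is missing the required 'job_title' field")

    def as_str_list(key: str) -> list[str]:
        # Tolerate a missing key, but drop empty or non-string items.
        items = data.get(key) or []
        return [s.strip() for s in items if isinstance(s, str) and s.strip()]

    years = data.get("minimum_years")
    if isinstance(years, bool) or not isinstance(years, int):
        years = None  # keep only genuine integers

    return ExtractedJob(
        job_title=title.strip(),
        skills=as_str_list("skills"),
        benefits=as_str_list("benefits"),
        responsibilities=as_str_list("responsibilities"),
        minimum_years=years,
    )


# A made-up reply standing in for what the LLM might return.
sample_reply = '{"job_title": " Data Analyst ", "skills": ["SQL", "", "Python"], "minimum_years": 2}'
job = parse_llm_response(sample_reply)
print(job)
```

Rejecting a bad reply early is cheaper than cleaning up a polluted graph later.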

🕷️ Web Scraping

To scrape job listings, I primarily used Beautiful Soup 4 and Selenium to navigate through websites and extract each job listing. Job listing websites typically have straightforward, repetitive layouts, making information extraction easy-peasy.
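
The repetitive layout is what makes extraction simple: every job card uses the same markup, so one rule covers the whole page. The sketch below shows the idea with Python's built-in `html.parser` as a dependency-free stand-in for Beautiful Soup 4, which is what the project uses. The `job-title` class and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class JobCardParser(HTMLParser):
    """Collects the title and link of each job card on a listing page."""

    def __init__(self) -> None:
        super().__init__()
        self.jobs: list[dict[str, str]] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Assumed markup: <a class="job-title" href="...">Title</a>
        if tag == "a" and attr.get("class") == "job-title":
            self.jobs.append({"url": attr.get("href") or "", "title": ""})
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.jobs[-1]["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


# Made-up page fragment with two identical-looking job cards.
SAMPLE_PAGE = """
<div class="job-card"><a class="job-title" href="/jobs/101">Data Analyst</a></div>
<div class="job-card"><a class="job-title" href="/jobs/102">Backend Developer</a></div>
"""

parser = JobCardParser()
parser.feed(SAMPLE_PAGE)
parser.close()
scraped = parser.jobs
print(scraped)
```

Selenium comes into play when the cards only appear after JavaScript runs or a "next page" button has to be clicked; the parsing rule itself stays the same.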

💽 SQL Database

After all the job data is gathered and translated, it is stored in a PostgreSQL database. This database serves as a single source of truth before the unstructured data in the job descriptions is processed further. The database schema is simple, with just one table and lots of columns 😜.
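
A rough sketch of what a one-table design can look like, including a uniqueness rule so re-scraping the same listing does not create duplicates. It uses the built-in `sqlite3` module purely so it runs anywhere; the project itself uses PostgreSQL, and the table name and columns here are assumptions, not the real schema.

```python
import sqlite3

# Hypothetical single-table layout; the real table has many more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_listings (
    id          INTEGER PRIMARY KEY,
    source_site TEXT NOT NULL,
    external_id TEXT NOT NULL,
    title       TEXT NOT NULL,
    description TEXT NOT NULL,
    UNIQUE (source_site, external_id)
)
"""

# SQLite spelling; in PostgreSQL the equivalent is INSERT ... ON CONFLICT DO NOTHING.
INSERT = """
INSERT OR IGNORE INTO job_listings (source_site, external_id, title, description)
VALUES (:source_site, :external_id, :title, :description)
"""


def save_listing(conn: sqlite3.Connection, listing: dict) -> bool:
    """Insert one listing; return False if it was already stored."""
    before = conn.total_changes
    conn.execute(INSERT, listing)
    conn.commit()
    return conn.total_changes > before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

listing = {
    "source_site": "example-jobs",
    "external_id": "101",
    "title": "Data Analyst",
    "description": "Translated job description text...",
}
first = save_listing(conn, listing)   # new row
second = save_listing(conn, listing)  # same listing scraped again
row_count = conn.execute("SELECT COUNT(*) FROM job_listings").fetchone()[0]
print(first, second, row_count)
```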

🤖 Knowledge Graph Database

This was the most interesting part for a coding newbie like me.

Each job listing stored in the PostgreSQL database is parsed using an LLM to populate the Neo4j Knowledge Graph database based on the schema below.

After importing all jobs into the Knowledge Graph, another LLM process is used for rough data analysis and cleaning (e.g., removing duplicate entries or similar data in other nodes).
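
The cleaning pass in the project is done by an LLM. As a cheap, deterministic illustration of the same goal (collapsing near-duplicate node names), here is a sketch using the standard library's `difflib`. The function names and the 0.9 threshold are assumptions picked for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def merge_similar(names: list[str], threshold: float = 0.9) -> dict[str, str]:
    """Map every raw name to a canonical name (the first similar one seen)."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for raw in names:
        norm = normalize(raw)
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, norm, normalize(c)).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(raw)
            match = raw
        mapping[raw] = match
    return mapping


skill_names = ["Python", "python ", "Problem-solving", "Problem solving", "SQL"]
skill_mapping = merge_similar(skill_names)
print(skill_mapping)
```

String similarity only catches spelling variants. Deciding that "MS Excel" and "spreadsheets" belong together still needs the kind of judgment an LLM pass provides.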

Finally, embeddings for each node and a vector index are created and updated after each new batch of scraped listings.
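
Conceptually, a vector index answers one question: which stored embeddings are closest to the query embedding? The toy sketch below does that with cosine similarity in plain Python. The three-dimensional vectors are made up; real embeddings have hundreds or thousands of dimensions, and in the project the lookup is handled by the graph database's index rather than by code like this.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up embeddings for three SKILL nodes.
toy_index = {
    "Python": [0.9, 0.1, 0.0],
    "SQL": [0.8, 0.3, 0.1],
    "Teamwork": [0.0, 0.2, 0.9],
}

nearest = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(nearest)
```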

International Standards:

For more accurate data analysis, I decided to adopt a few International Standards:

NACE (Nomenclature of Economic Activities) V2 - A European standard for classification of economic activities.
ISCED (2011) - Levels of education - A framework for categorizing levels of education into nine levels (0 to 8).
ISCO-88 Occupation Titles - An International Standard Classification of Occupations that groups jobs into four levels of aggregation.

This will help with the accuracy of data analysis of different types of jobs posted by companies in various industries in correlation with their level of education and related skills.
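
As an illustration of what standardizing against ISCED 2011 buys you, the sketch below maps free-text education requirements onto the ISCED 2011 levels (0 to 8). The level names follow the standard; the keyword hints and the "lowest level mentioned wins" rule are crude assumptions for the example, and the project relies on an LLM for this mapping instead.

```python
from typing import Optional

ISCED_2011 = {
    0: "Early childhood education",
    1: "Primary education",
    2: "Lower secondary education",
    3: "Upper secondary education",
    4: "Post-secondary non-tertiary education",
    5: "Short-cycle tertiary education",
    6: "Bachelor's or equivalent level",
    7: "Master's or equivalent level",
    8: "Doctoral or equivalent level",
}

# Hypothetical keyword hints; far from complete.
KEYWORD_HINTS = {
    3: ("high school", "secondary school"),
    6: ("bachelor", "bsc", "university degree"),
    7: ("master's", "msc", "mba"),
    8: ("phd", "doctorate", "doctoral"),
}


def guess_minimum_isced(text: str) -> Optional[int]:
    """Return the lowest ISCED level hinted at in the text, or None."""
    lowered = text.lower()
    matches = [
        level
        for level, keywords in KEYWORD_HINTS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return min(matches) if matches else None


level = guess_minimum_isced("BSc required, MSc preferred")
print(level, ISCED_2011[level])
```

Once every listing carries a numeric level, questions like "which industries ask for level 7 or above?" become a simple filter.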

The Knowledge Graph schema

Nodes:

INDUSTRY CATEGORY

Label: INDUSTRY_CATEGORY
Property Keys:
- industry_name: The industry under which the company posted the listing.
- standardized_industry_name: Standardized industry type based on NACE (Nomenclature of Economic Activities) V2.

JOB TITLE

Label: JOB_TITLE
Property Keys:
- job_title: The job title as mentioned in the job listing.
- standardized_occupation: The standardized occupation based on ISCO-88.
- job_seniority: Internship, Entry, Junior, Mid, or Senior level (if mentioned).
- minimum_level_of_education: The minimum level of education required, based on ISCED (2011).
- external_id: The ID used by the job listing website (if applicable).
- employment_type [optional]: Full-time, Part-time, etc. (if mentioned).
- employment_model [optional]: On-site, Remote, Hybrid, or any other employment model (if mentioned).

SKILLS

Label: SKILL
Property Keys:
- skill_category: Soft or Hard skill.
- skill_name: Name of the skill.
- skill_type [optional]: Academic Skill, Technical Skill, Knowledge of a Software tool, Professional Certification, Personality Attribute, Fluency in a Language, or any other skill.

BENEFITS

Label: BENEFIT
Property Keys:
- benefit_name: Days of annual leave, Health Insurance (Private, Public, or both), Provident Fund, Amenities, or any other benefit.

EXPERIENCE

Label: EXPERIENCE
Property Keys:
- years_required: Whether previous experience is required (boolean) (if mentioned).
- minimum_years [optional]: The minimum number of years needed (integer) (if mentioned).

RESPONSIBILITIES

Label: RESPONSIBILITY
Property Keys:
- description: A minimal summary of each responsibility requested in the job listing.

Node Relationships:

INDUSTRY_CATEGORY |POSTS| JOB_TITLE
JOB_TITLE |NEEDS| SKILL
JOB_TITLE |REQUIRES| EXPERIENCE
JOB_TITLE |OFFERS| BENEFIT
JOB_TITLE |HAS| RESPONSIBILITY
RESPONSIBILITY |RELATES_TO| SKILL
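
To show how the schema above can translate into writes, here is a sketch that builds one parameterized Cypher statement covering the POSTS and NEEDS relationships. It only constructs the query text and parameters, so it runs without a database. The function name, the choice of `external_id` as the merge key, and the sample job are assumptions.

```python
def build_job_query(job: dict) -> tuple[str, dict]:
    """Return a parameterized Cypher statement and its parameters for one job."""
    query = (
        "MERGE (i:INDUSTRY_CATEGORY {industry_name: $industry_name})\n"
        "MERGE (j:JOB_TITLE {external_id: $external_id})\n"
        "SET j.job_title = $job_title\n"
        "MERGE (i)-[:POSTS]->(j)\n"
        "WITH j\n"
        "UNWIND $skills AS skill\n"
        "MERGE (s:SKILL {skill_name: skill.skill_name})\n"
        "SET s.skill_category = skill.skill_category\n"
        "MERGE (j)-[:NEEDS]->(s)"
    )
    params = {
        "industry_name": job["industry_name"],
        "external_id": job["external_id"],
        "job_title": job["job_title"],
        "skills": job.get("skills", []),
    }
    return query, params


# Made-up structured output for one listing.
sample_job = {
    "industry_name": "Information Technology",
    "external_id": "101",
    "job_title": "Data Analyst",
    "skills": [
        {"skill_name": "SQL", "skill_category": "Hard"},
        {"skill_name": "Communication", "skill_category": "Soft"},
    ],
}

cypher, cypher_params = build_job_query(sample_job)
print(cypher)
```

Using MERGE rather than CREATE means the same skill mentioned in a thousand listings stays one node with a thousand incoming NEEDS relationships. With the official Neo4j Python driver, a statement like this would typically be passed to `session.run(cypher, cypher_params)`.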

🚀 Conclusions

⚡️The Tech

Imagine the dynamic duo of Groq with its lightning-fast speeds 🚀💨 paired with the incredible reasoning prowess of Llama 3 70B. This powerhouse combination makes extracting key information from job listings a breeze! It was so fast and easy that I even used LLMs for simple tasks like text translations. When I cross-checked the output against OpenAI's GPT-4o, the results were surprisingly close, making it a no-brainer given the cost (and speed!) difference.

One database could have sufficed, and I initially planned to use just PostgreSQL with the pgvector add-on. However, after parsing a few thousand listings 😅, I realized that leveraging LLMs (Large Language Models) for data extraction was a brilliant move for a data analysis project. This approach becomes even more exciting when combined with a Knowledge Graph database like Neo4j: it's a dream come true for my inner data geek 🔍!

After some research, I found that the embeddings model from OpenAI provided excellent results, though vectorizing thousands of job listings comes at a cost.

💭A Few Thoughts

This project is my cool experiment to see if we can really put Large Language Models (LLMs) and agentic frameworks like LangChain and CrewAI into production. Spoiler alert: it's a wild ride, flaws and all!

Sure, these tools are reliable...ish. They're consistent...ish. But are they perfect? Not quite. Sometimes, even with simple text, different models can give you wildly different results 🔍.

Designing an LLM-based solution isn't just plug-and-play. You've got to juggle tokens, output formats (think JSON), cost, and the way parameters like temperature affect reliability. And don't even get me started on prompt engineering: it's an art form, not a science 😎!

Oh, and did I mention tools? LLMs can now use them, but in reality, creating custom tools can sometimes lead to a bigger codebase. Sometimes, a simple script does the trick better.

Now, let's talk unit testing. Setting up self-checking for LLMs can seriously boost response accuracy. But here's the catch: writing these self-checking methods is a whole other ball game because every agent and every API use case is different. Be prepared for a cost hike and a flurry of API calls 💸.
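
A minimal sketch of what a self-check can look like, and why it multiplies API calls: the reply is validated, and every failure triggers another call with feedback appended. The function, the required keys, and the canned replies (which stand in for a real model) are all assumptions for illustration.

```python
import json
from typing import Callable


def extract_with_self_check(
    prompt: str,
    call_llm: Callable[[str], str],
    required_keys: list[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Call the model until its reply passes the check; return (data, calls used)."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        reply = call_llm(prompt + feedback)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError:
            data = None
        if not isinstance(data, dict):
            feedback = "\nYour previous reply was not a JSON object. Reply with JSON only."
            continue
        missing = [key for key in required_keys if key not in data]
        if not missing:
            return data, attempt
        feedback = f"\nYour previous reply was missing: {', '.join(missing)}."
    raise ValueError(f"No valid reply after {max_attempts} attempts")


# Canned replies standing in for a real model: two bad ones, then a good one.
canned_replies = iter([
    "Sure! Here is the information you asked for...",
    '{"job_title": "Data Analyst"}',
    '{"job_title": "Data Analyst", "skills": ["SQL"]}',
])


def fake_llm(prompt: str) -> str:
    return next(canned_replies)


result, calls_used = extract_with_self_check(
    "Extract job_title and skills as JSON.", fake_llm, ["job_title", "skills"]
)
print(result, calls_used)
```

One listing cost three calls here. Across thousands of listings, that is the cost hike in action.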

On the other hand, when you need to go through hundreds or thousands of paragraphs to extract valuable information, you need brains... and lots of them 🧠🧠🧠.

LLMs are the only solution for such tasks at scale!

The Takeaway

So, what's the takeaway? The fewer decisions a model has to make, the more reliable it will be. But that also means more API calls and, yep, higher costs.

If you made it this far, I hope you enjoyed this tech adventure 🎉!

Love, Dimitris

Resources and Inspiration

JohannesJolkkonen: Knowledge Graph + Python (GitHub)

Self-Reflective RAG with LangGraph (langchain.dev)

What's next for AI agentic workflows ft. Andrew Ng of AI Fund

Going Meta - Ep 27: Building a Reflection Agent with LangGraph

Convert any Text Data into a Knowledge Graph (using LLAMA3 + GROQ)

Web scraping alternative: Scrapegraph.ai
