Super Scraper

Super Scraper is a web scraping solution built with FastAPI, Next.js, and LangChain. It lets users scrape static and dynamic web pages, crawl multiple pages, generate scraping code using OpenAI, and store scraped data in Word (.docx) or Excel files. The frontend is built with Next.js and styled with Tailwind CSS.

Features

- Scrape static and dynamic web pages
- Crawl through multiple pages and follow links
- Generate scraping code using OpenAI's GPT-4
- Store scraped data in Word (.docx) or Excel files
- Perform language model tasks with LangChain
- Responsive frontend built with Next.js and Tailwind CSS

Prerequisites

- Python 3.7+
- Node.js 12+
- Chrome browser (for Selenium)

Installation

Backend Setup

Clone the repository:

```
git clone https://github.com/your-username/super-scraper.git
cd super-scraper
```

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate   # On Windows, use: venv\Scripts\activate
```

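Install the required Python packages. According to the project structure, `requirements.txt` lives in `backend/`, so change into that directory first:
```
cd backend
pip install -r requirements.txt
```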

Set up your OpenAI API key:
```
export OPENAI_API_KEY="your-openai-api-key"   # On Windows (cmd), use: set OPENAI_API_KEY=your-openai-api-key
```
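A missing key often shows up only when the first OpenAI request fails, so it can help to check the environment up front. The following is a minimal sketch, assuming the backend reads the key from `OPENAI_API_KEY`; the helper name `require_api_key` is hypothetical and not part of this repository.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of this repository.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the server")
    return key
```

Calling `require_api_key()` in the same shell where the variable was exported should return the key. A `RuntimeError` means the variable did not reach the process.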

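Run the FastAPI server from the directory that contains `main.py` (`backend/` in the project structure). By default, uvicorn serves the app at http://127.0.0.1:8000:
```
uvicorn main:app --reload
```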

Frontend Setup

In a second terminal, navigate to the frontend directory from the repository root:
```
cd frontend
```
Install the required Node.js packages:
```
npm install
```
Start the Next.js development server:
```
npm run dev
```

Usage

Open your browser and navigate to http://localhost:3000.
Use the forms on the main page to:
- Scrape a webpage
- Crawl a website
- Generate scraping code
- Perform a LangChain task
View the results in the Results section.

Project Structure

```
super-scraper/
│
├── backend/
│   ├── main.py              # FastAPI server
│   ├── requirements.txt     # Python dependencies
│   └── ...
│
├── frontend/
│   ├── components/
│   │   └── Form.js          # Form component for user input
│   ├── pages/
│   │   └── index.js         # Main page
│   ├── public/              # Public assets
│   ├── styles/
│   │   └── globals.css      # Global styles
│   ├── tailwind.config.js   # Tailwind CSS configuration
│   ├── postcss.config.js    # PostCSS configuration
│   ├── package.json         # Node.js dependencies
│   └── ...
│
└── README.md                # Project documentation
```

Detailed Explanation

Backend (`main.py`)

- FastAPI: A modern, high-performance web framework for building APIs with Python, based on standard type hints.
- Scrapy: An open-source web crawling framework for Python.
- Selenium: A browser automation framework, used here with Chrome to load dynamic pages.
- BeautifulSoup: A library for parsing HTML and XML documents.
- OpenAI: Integration for generating scraping code using GPT-4.
- LangChain: A framework for building applications on top of language models.
- python-docx: A library for creating and updating Microsoft Word (.docx) files.
- pandas: A data manipulation library, used here to create Excel files.
- BackgroundTasks: FastAPI's mechanism for running work after the response has been sent, used here to save scraped data.
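To illustrate the crawl-and-follow-links idea behind these components, here is a minimal sketch. It does not reproduce the code in `main.py`: it swaps Scrapy and BeautifulSoup for Python's standard-library `html.parser` so it runs without extra packages. The names `LinkExtractor` and `crawl`, and the injected `fetch` callable, are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host pages, returning URLs in visit order.

    `fetch` is any callable mapping a URL to HTML text (or None on failure),
    so the traversal can be tried without network access.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing a dictionary's `get` method as `fetch` is enough to try the traversal on a few in-memory pages. A real fetcher would download each URL, or render it in a browser when the page is dynamic.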

Frontend (`Next.js`)

Next.js: A React framework for production, which makes it easy to build server-rendered React applications.
Tailwind CSS: A utility-first CSS framework for rapid UI development.

Forms (`Form.js`)

A reusable form component that posts data to one of the backend endpoints: `scrape`, `crawl`, `generate_scraper`, or `langchain_task`.
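The same endpoints can also be exercised without the frontend. The sketch below builds a JSON POST request for one of them. It rests on several assumptions: that the backend listens on uvicorn's default port 8000, that each endpoint is mounted at `/<name>`, and that it accepts a JSON body. The `url` field in the usage note is hypothetical, so check `main.py` for the actual request schema.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: uvicorn's default port
ENDPOINTS = ("scrape", "crawl", "generate_scraper", "langchain_task")


def build_request(endpoint, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for one backend endpoint."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{base_url}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, a request such as `build_request("scrape", {"url": "https://example.com"})` could be sent with `urllib.request.urlopen`.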

Main Page (`index.js`)

Contains forms for scraping, crawling, generating scraper code, and performing LangChain tasks.
Displays results in a formatted JSON view.

Running the Application

Start the FastAPI server (from the directory that contains `main.py`):
```
uvicorn main:app --reload
```
Start the Next.js development server (from `frontend/`):
```
npm run dev
```
Access the frontend by opening your browser at http://localhost:3000.

Contributing

- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature-branch`).
- Open a pull request.

License

This project is licensed under the MIT License.

ranjeetds/super-scraper
