kilinkis/ai-html-crawler

AI Content Extractor

A Next.js application that extracts web page content and processes it with AI. It fetches a page, uses AI to isolate the meaningful content, and then applies your custom instructions to that content, all from a simple web interface.

Features

  • Extract clean, meaningful content from any URL
  • Automatically remove navigation bars, ads, footers, and other clutter
  • Process the extracted content with AI based on custom instructions
  • Simple, in-memory operations with no database dependencies
  • Two-stage AI processing for efficient and high-quality results

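Highlights:

  1. A copilot-instructions file (.github/copilot-instructions.md) that tells GitHub Copilot how to assist with this project
  2. A prompt variable in each API route file that holds the prompt sent to the AI
  3. A free-form instructions input, so the AI can, for example, translate the extracted content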

Tech Stack

  • Next.js
  • TypeScript
  • Tailwind CSS
  • ESLint
  • Google Generative AI

Getting Started

First, install the dependencies:

npm install
# or
yarn install
# or
pnpm install

Next, create a .env.local file in the root directory with your Google AI API key:

GOOGLE_AI_API_KEY=your-api-key-here

You can obtain an API key from Google AI Studio.
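
A server-side route could read this key and fail early when it is missing. The following is a minimal sketch under that assumption, not code from the repository; the helper name `requireApiKey` is hypothetical.

```typescript
// Hypothetical helper (not from the repository): returns the Google AI API key,
// or throws a descriptive error if it is missing or blank.
type Env = Record<string, string | undefined>;

function requireApiKey(env: Env): string {
  const key = env.GOOGLE_AI_API_KEY?.trim();
  if (!key) {
    throw new Error(
      "GOOGLE_AI_API_KEY is not set. Add it to .env.local and restart the dev server."
    );
  }
  return key;
}
```

In a route file this would be called as `requireApiKey(process.env)`. Next.js loads .env.local into `process.env` on the server, and variables without the `NEXT_PUBLIC_` prefix are not exposed to the browser.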

Then, run the development server:

npm run dev
# or
yarn dev
# or
pnpm dev

Open http://localhost:3000 in your browser to see the result.

How to Use

  1. Enter a valid URL in the input field.
  2. Enter instructions for the AI in the second input field (e.g., "Summarize this content", "Extract key facts").
  3. Click the "Analyze Content" button.
  4. The application will:
    • Fetch the webpage content
    • Use AI to extract the meaningful parts of the page
    • Process the extracted content according to your instructions
  5. Toggle between the extracted content and AI-processed results using the tabs.
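
Steps 1 and 2 above imply some input checking before the request is sent. The sketch below shows one way that check could look. The names `AnalyzeInput` and `validateInput` are hypothetical, and the rules (http/https only, non-empty instructions) are assumptions rather than the application's documented behavior.

```typescript
// Hypothetical client-side check (names and rules are assumptions, not from the repository).
interface AnalyzeInput {
  url: string;
  instructions: string;
}

// Returns an error message for the first problem found, or null if the input is usable.
function validateInput({ url, instructions }: AnalyzeInput): string | null {
  let parsed: URL;
  try {
    // The URL constructor throws on strings that are not absolute URLs.
    parsed = new URL(url.trim());
  } catch {
    return "Enter a valid URL, including http:// or https://.";
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return "Only http and https URLs are supported.";
  }
  if (instructions.trim() === "") {
    return "Enter instructions for the AI.";
  }
  return null;
}
```

Returning a message instead of throwing keeps the check easy to wire into a form: a non-null result can be shown next to the inputs, and the "Analyze Content" action proceeds only on null.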

How It Works

The application uses a two-stage AI process:

  1. Content Extraction: The first AI stage analyzes the raw HTML from the webpage and extracts only the meaningful content, removing navigation, ads, footers, and other clutter.

  2. Content Processing: The second AI stage takes the extracted content and processes it according to your specific instructions.

This approach provides cleaner, more focused results by allowing the AI to work with already filtered content.
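
The two stages can be sketched as a small pipeline in which the output of the first model call becomes the input of the second. This is an illustration only: `runTwoStage` and `GenerateText` are hypothetical names, the prompt wording is not the repository's, and the model call is passed in as a function so the sketch does not depend on any particular SDK.

```typescript
// A text-in, text-out model call. In the real app this would wrap the Google AI
// client; here it is injected (an assumption) to keep the sketch SDK-independent.
type GenerateText = (prompt: string) => Promise<string>;

interface PipelineResult {
  extracted: string; // output of stage 1
  processed: string; // output of stage 2
}

// Prompt wording below is illustrative, not the repository's actual prompts.
async function runTwoStage(
  html: string,
  instructions: string,
  generate: GenerateText
): Promise<PipelineResult> {
  // Stage 1: reduce the raw HTML to the meaningful content.
  const extracted = await generate(
    "Extract only the main content from this HTML. " +
      "Remove navigation, ads, footers, and other clutter.\n\n" +
      html
  );

  // Stage 2: apply the user's instructions to the filtered content only,
  // so the raw HTML never reaches the second prompt.
  const processed = await generate(
    "Instructions: " + instructions + "\n\nContent:\n" + extracted
  );

  return { extracted, processed };
}
```

Returning both strings matches the UI described above, where the extracted content and the AI-processed result are shown in separate tabs.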

Project Structure

  • src/app/page.tsx: The main page component with the UI for URL input, AI instructions, and content display.
  • src/app/api/scrape/route.ts: The API route that handles both HTML scraping and initial AI extraction.
  • src/app/api/process/route.ts: The API route that processes the extracted content with Google's Generative AI.
  • public/: Static assets.
  • .github/copilot-instructions.md: Instructions for GitHub Copilot on how to assist with this project.
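
For orientation, a route file such as src/app/api/process/route.ts exports a function named after the HTTP method it handles, which receives a standard `Request` and returns a `Response`. The sketch below shows that shape. The JSON field names (`content`, `instructions`, `result`, `error`), the status codes, and the `makePostHandler` factory are assumptions for illustration, not the repository's actual code.

```typescript
// Hypothetical sketch of a POST route handler; request/response shapes are assumptions.
type ModelCall = (prompt: string) => Promise<string>;

function jsonResponse(body: unknown, status = 200): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

// Builds the handler around an injected model call so it can be tested without an API key.
function makePostHandler(generate: ModelCall) {
  return async function POST(request: Request): Promise<Response> {
    let body: unknown;
    try {
      body = await request.json();
    } catch {
      return jsonResponse({ error: "Request body must be JSON." }, 400);
    }

    const { content, instructions } = (body ?? {}) as Record<string, unknown>;
    if (typeof content !== "string" || typeof instructions !== "string") {
      return jsonResponse(
        { error: "Expected string fields: content, instructions." },
        400
      );
    }

    try {
      const result = await generate(
        "Instructions: " + instructions + "\n\nContent:\n" + content
      );
      return jsonResponse({ result });
    } catch {
      // Upstream model failure, reported without leaking details to the client.
      return jsonResponse({ error: "AI processing failed." }, 502);
    }
  };
}
```

Under these assumptions, the route file would export the returned function as `POST`, passing in a function that wraps the Google AI client.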

Development

This project is set up with TypeScript, Tailwind CSS, and ESLint for a modern development experience.

To build the project:

npm run build
# or
yarn build
# or
pnpm build

To start the production server:

npm start
# or
yarn start
# or
pnpm start
