# comp-4800-web-crawler
This program generates an undirected graph similar to the web-Google dataset. Given a starting
website, it parses the links on that page and recursively discovers more
websites by visiting the parsed links.
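
The crawl described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the names `LinkParser`, `extract_links`, and `crawl` are hypothetical, `fetch` is an assumed callable that returns a page's HTML, and an explicit stack stands in for recursion so that deep link chains do not hit Python's recursion limit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).scheme in ("http", "https")]


def crawl(start_url, fetch, max_pages=50):
    """Visit pages reachable from start_url and return a set of undirected edges.

    fetch is a hypothetical callable (url -> HTML string); max_pages is an
    assumed safety cap so the crawl always terminates.
    """
    edges = set()
    visited = set()
    stack = [start_url]
    while stack and len(visited) < max_pages:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        for link in extract_links(url, fetch(url)):
            if link != url:
                # frozenset makes (a, b) and (b, a) the same undirected edge
                edges.add(frozenset((url, link)))
            if link not in visited:
                stack.append(link)
    return edges
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and lets the traversal be tried against canned pages without sending real requests.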
**NOTE: Be careful with this program: it sends GET requests to every parsed website. If you
send too many requests to the same website, it may block your IP address.**
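
One way to reduce the risk of being blocked is to enforce a minimum delay between requests to the same host. The sketch below is an assumption about how that could look, not something the program currently does; `HostThrottle` and its `min_interval` parameter are hypothetical names.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is acceptable to request url, then record the request."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request[host] = now
```

Calling `throttle.wait(url)` immediately before each GET request would space out requests per host while leaving requests to different hosts undelayed.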
# How to run
Make a virtual environment:
```bash
python -m venv venv
```
Activate:
```bash
source venv/bin/activate
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Run the program, giving it a starting website:
```bash
python main.py jagrajaulakh.com
```
View the generated graph:
```bash
cat graph.txt
```
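
To inspect the graph programmatically rather than with `cat`, something like the following could work. It assumes `graph.txt` is a plain edge list with two whitespace-separated node names per line and optional `#` comment lines, as in the web-Google dataset; the actual output format is not documented here, so treat that as an assumption. `load_edges` is a hypothetical helper.

```python
from collections import defaultdict


def load_edges(lines):
    """Build an undirected adjacency map from edge-list lines.

    Assumed format: "node_a node_b" per line; blank lines, lines starting
    with "#", and lines without exactly two fields are skipped.
    """
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        a, b = parts
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency
```

Because the function accepts any iterable of lines, an open file works directly: `with open("graph.txt") as f: adjacency = load_edges(f)`.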
# TODO
We can use `pyppeteer` or `playwright` to parse dynamically rendered websites.
[Link to article](https://scrapingant.com/blog/scrape-dynamic-website-with-python)