From 4597d8c775c0ae05f48e3522fa84943d1bd0a75e Mon Sep 17 00:00:00 2001
From: Jagraj Aulakh
Date: Sun, 12 Mar 2023 14:30:55 -0400
Subject: [PATCH] Update README with setup instructions and added requirements.txt

---
 .gitignore       |  1 +
 README.md        | 38 ++++++++++++++++++++++++++++++++++++++
 requirements.txt |  5 +++++
 3 files changed, 44 insertions(+)
 create mode 100644 requirements.txt

diff --git a/.gitignore b/.gitignore
index 5d381cc..6cec18d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -160,3 +160,4 @@ cython_debug/
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
 
+graph.txt
diff --git a/README.md b/README.md
index 3ac8516..eb4b5b0 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,40 @@
 # comp-4800-web-crawler
 
+This program generates an undirected graph similar to web-Google. Given any starting
+website, it parses the links on that website and recursively finds more
+websites by visiting the parsed links.
+
+**NOTE: Be careful with this program: it sends GET requests to the parsed websites. If you
+send too many requests to the same website, it may block your IP address.**
+
+# How to run
+
+Make a virtual environment:
+
+```bash
+python -m venv venv
+```
+
+Activate it:
+
+```bash
+source venv/bin/activate
+```
+
+Install dependencies:
+
+```bash
+pip install -r requirements.txt
+```
+
+Run the program, giving it a starting website:
+
+```bash
+python main.py jagrajaulakh.com
+```
+
+View the output graph:
+
+```bash
+cat graph.txt
+```
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..ea9d77c
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+certifi==2022.12.7
+charset-normalizer==3.1.0
+idna==3.4
+requests==2.28.2
+urllib3==1.26.15
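
main.py is not part of this patch, so the crawl that the README describes (parse the links on a page, then visit those links to find more) is not shown here. A minimal sketch of that idea might look like the following. It is an illustration only, not the project's code: `LinkParser`, `fetch_with_requests`, `crawl`, the `max_pages` limit, and the choice to store edges as unordered pairs are all assumptions, and it uses an explicit queue (breadth-first) rather than recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_with_requests(url):
    """Default fetcher: one GET per page via the pinned requests package."""
    import requests  # imported lazily so the sketch can be exercised offline

    return requests.get(url, timeout=10).text


def crawl(start_url, fetch=fetch_with_requests, max_pages=20):
    """Visit pages breadth-first and return a set of undirected edges."""
    edges = set()
    seen = {start_url}
    queue = deque([start_url])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).scheme not in ("http", "https"):
                continue  # ignore mailto:, javascript:, etc.
            if target != url:
                # frozenset makes (a, b) and (b, a) the same edge.
                edges.add(frozenset((url, target)))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return edges
```

Passing a custom `fetch` callable keeps the sketch testable without network access. In line with the README's warning about being blocked, a real run would probably also want a small `max_pages` and a delay between requests.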