multithreading #2

README.md
```diff
@@ -1,23 +1,11 @@
 # Surreal Crawler
 
-Mapping with a budget of 1000 (crawl 1000 sites, so many more links are actually discovered), on [my website](https://oliveratkinson.net) on 8/26/2024 took 1m9s.
-
-This includes the crawl, loading into the database, and linking the sites (locally hosted SurrealDB instance).
-
-This run created 4299 site links with 23286 links between the sites. (It found my git site, which really bolsters those numbers.)
-
-## Install / Build
-
-* You will need Rust to compile the crawler: [rustup.rs](https://rustup.rs)
-* You need python3 (it comes installed on most Linux distros) and Poetry for dependency management.
-    * Install `pipx` and `python3`
-    * Then: `pipx install poetry`
-    * Then: `poetry install` to install the project dependencies
-* You need to install [surrealdb](https://surrealdb.com)
-
-## Use
-
-Just run `./crawl.sh {url}` and it will start crawling. You can tweak the budget inside [crawl.sh](https://git.oliveratkinson.net/Oliver/internet_mapper/src/branch/main/crawl.sh) if you want.
-
-You can also prefix the command with `time` to benchmark the system, for example: `time ./crawl.sh https://discord.com`.
+Crawls sites, saving all the found links to a SurrealDB database. It then takes batches of 100 uncrawled links until the crawl budget is reached. It saves the data of each site to a MinIO object store.
+
+### TODO
+
+- [ ] Domain filtering - prevent the crawler from wandering onto alternate versions of Wikipedia.
+- [ ] Conditionally save content - based on filename or file contents.
+- [ ] GUI / TUI?
+- [ ] Better asynchronous fetching of sites. Currently it all happens serially.
 
```
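The new description above outlines the crawl loop: pull a batch of up to 100 uncrawled links, fetch them, store the results, and repeat until the budget is spent, while the last TODO item (and this PR's title) is about doing those fetches concurrently instead of serially. Below is a minimal, self-contained Rust sketch of that loop, not the actual internet_mapper code: an in-memory queue stands in for the SurrealDB table of uncrawled links, `println!` stands in for the MinIO upload, and the crate choices (tokio, reqwest, futures, scraper) are assumptions.

```rust
// Rough sketch of a budgeted batch-crawl loop. The in-memory queue stands in
// for the SurrealDB "uncrawled links" table and println! stands in for the
// MinIO upload; none of these names come from the real internet_mapper code.

use std::collections::{HashSet, VecDeque};

use futures::stream::{self, StreamExt};

const BATCH_SIZE: usize = 100; // links pulled per batch, per the README
const CONCURRENCY: usize = 16; // assumed number of requests in flight at once

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = std::env::args().nth(1).expect("usage: crawler <url>");
    let budget: usize = 1000; // crawl budget, like the one set in crawl.sh

    let client = reqwest::Client::new();
    let mut queue: VecDeque<String> = VecDeque::from([start.clone()]);
    let mut seen: HashSet<String> = HashSet::from([start]);
    let mut crawled = 0usize;

    while crawled < budget && !queue.is_empty() {
        // Take the next batch of uncrawled links, capped by the remaining budget.
        let take = BATCH_SIZE.min(budget - crawled).min(queue.len());
        let batch: Vec<String> = queue.drain(..take).collect();

        // Fetch the whole batch concurrently instead of one page at a time.
        let pages: Vec<(String, String)> = stream::iter(batch)
            .map(|url| {
                let client = client.clone();
                async move {
                    let body = client.get(url.as_str()).send().await.ok()?.text().await.ok()?;
                    Some((url, body))
                }
            })
            .buffer_unordered(CONCURRENCY)
            .filter_map(|page| async move { page })
            .collect()
            .await;

        for (url, body) in pages {
            crawled += 1;
            println!("crawled {url} ({} bytes)", body.len()); // real crawler: store in MinIO

            // Extract links and queue the ones not seen before
            // (real crawler: insert them as uncrawled rows in SurrealDB).
            let doc = scraper::Html::parse_document(&body);
            let selector = scraper::Selector::parse("a[href]").unwrap();
            for anchor in doc.select(&selector) {
                if let Some(href) = anchor.value().attr("href") {
                    if href.starts_with("http") && seen.insert(href.to_string()) {
                        queue.push_back(href.to_string());
                    }
                }
            }
        }
    }

    println!("done: crawled {crawled} pages");
    Ok(())
}
```

The `buffer_unordered(CONCURRENCY)` call is what replaces the serial fetch loop with concurrent requests; raising `CONCURRENCY` trades memory and politeness toward target sites for throughput.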