V's Infodump Blog

Creating A Webcrawler for Neocities

September 18th, 2022

I'm starting a new job soon, so I've been doing a shitton of reading to brush up on my Python and SQL. For fun, I decided to create a webcrawler to scrape data from Neocities. I've hit a temporary roadblock with writing the URLs to a txt file--I don't strictly need to, but I wanted to for debugging purposes. Then I got frustrated and decided to play video games instead, lol.
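For the curious, the txt-file step itself is nothing fancy; the goal is something along these lines (the filename and URLs here are placeholders, not my actual crawl data):

```python
# Minimal sketch: dump crawled URLs to a text file for debugging.
# "crawled_urls.txt" and the example URLs are made up.
urls = [
    "https://example.neocities.org/",
    "https://example.neocities.org/about.html",
]

with open("crawled_urls.txt", "w", encoding="utf-8") as f:
    for url in urls:
        f.write(url + "\n")
```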

Creating a bot isn't as difficult as you might think. This is my first time making one; before this, I had no idea how they worked. I drew the steps from Blueprints for Text Analytics Using Python. I got my copy from the local library, but I'll let you do with that link what you will.

Building the bot itself is the easy part. The issue is making sure it doesn't trip the bot detection that many modern websites have. Sending too many HTTP requests in a short amount of time (think several within a few milliseconds) will usually do it. Try to space your HTTP requests out, if not to avoid bot detection, then as a courtesy. Hosting is expensive, after all.
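Spacing requests out can be as simple as sleeping between them. A rough sketch using the requests library (the URLs and the two-second delay are arbitrary, pick whatever feels polite):

```python
import time
import requests

urls = [
    "https://example1.neocities.org/",
    "https://example2.neocities.org/",
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # wait a couple of seconds between requests to be courteous
```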

Something else that will camouflage your bot is including a "header": a set of key-value pairs that browsers pass along to the sites they interact with. Headers usually include some information about the user's browser and operating system. This doesn't have to be truthful, though. I just used 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE', which is not my real info.
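With requests, passing a header is just a dictionary. A sketch reusing that same borrowed User-Agent string:

```python
import requests

# The same borrowed User-Agent string from above; any plausible one works.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE"
    )
}

response = requests.get("https://neocities.org/", headers=headers, timeout=10)
print(response.status_code)
```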

The vast majority of websites have a document called "robots.txt" that gives the URL of a sitemap (an XML file) as well as the parts of the website the server doesn't want bots crawling. This makes creating a webcrawler much, much easier. It's really just a matter of scraping the URLs from the sitemap (which is easy, because XML, being a markup language, wraps each URL in tags) and then feeding them to your Python program to connect to.
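Pulling the URLs out of a sitemap might look something like this; the sitemap address is a guess, since the real one comes from whatever robots.txt points to:

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical sitemap URL -- the real one is listed in the site's robots.txt.
SITEMAP_URL = "https://example.neocities.org/sitemap.xml"

response = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(response.content)

# Sitemap XML uses a namespace, so match on the tag ending instead of the full name.
urls = [
    element.text.strip()
    for element in root.iter()
    if element.tag.endswith("}loc") and element.text
]

for url in urls:
    print(url)
```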

I'm probably just going to scrape a little text from the body of each page my bot crawls, as a kind of preview of the site. Besides being practice, it also gives me a chance to find some cool sites I would never have found otherwise.
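Grabbing a short text preview is mostly a BeautifulSoup one-liner. A sketch, assuming the page URL came out of the sitemap step above and capping the preview at an arbitrary 200 characters:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page; in practice the URL comes from the sitemap.
response = requests.get("https://example.neocities.org/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull the visible text out of <body> and keep the first ~200 characters as a preview.
body = soup.body
preview = body.get_text(separator=" ", strip=True)[:200] if body else ""
print(preview)
```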


CSS troubles & Indie Websites

September 16th, 2022

JFC I have not used CSS since I was a teen and the learning curve is there, lmao. I'm not as good at learning things as I was as a kid, it seems.

I've been looking up stuff lately about creating websites. My primary skills are in databases and object-oriented programming languages, so I don't have a lot of experience with HTML or CSS, although I did code Tumblr themes back in high school (I did a lot of roleplaying).

Still considering whether I should create a site on the clearnet or not. I've been looking at this guide (letsdecentralize.org) on how to set up a darknet site. Although, from my reading, it looks like Tor sites receive a lot more bot traffic than clearnet sites, so it might be in my best interest to either make a clearnet site or use a protocol besides HTTP. I'm considering Gemini, since it basically only allows text, and that's something I'm interested in.

From speaking to one of my friends, who's a web penetration tester, I've learned that static sites are more secure than dynamic sites. This is probably a "duh" to anyone who works in web security, but I don't, so the thought never crossed my mind. I have worked in software development, though, and letting user-entered data into... well, anything is a good way to end up with issues you never would have expected.
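The classic database example of that is SQL injection. With Python's sqlite3, the fix is just using a parameterized query instead of splicing strings together; a toy sketch with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "Robert'); DROP TABLE users;--"  # classic hostile input

# Bad idea: pasting user input straight into the SQL string.
# conn.execute(f"INSERT INTO users (name) VALUES ('{user_input}')")

# Better: let the driver handle escaping via a placeholder.
conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
print(conn.execute("SELECT name FROM users").fetchall())
```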