
If you were building a search engine, what URLs would you index?

Semi-serious question (rdrama is somehow both extremes of the bell curve in one environment and thus 100% of the opinions I want): what would you index?

I'm currently crawling all of serverfault, and I'm shocked at how well it's working.

https://i.rdrama.net/images/16995104268368113.webp

Pic related.

Anyways, give me your weird conspiracy blogs, your obscure technical forums, your good reads and everything else that piques your interest.

You can bully me with troll links because I miss being lastmeasured, but I'm sincere posting because the modern internet is boring and fricking useless if you're not trying to mindlessly scroll feeds or watch short form video. The greybeards are gone and now my generation are the greybeards and I'll be motherfricked if I'm going to be part of the reason shit gets fricked beyond all repair.

Anyone here who knows how to port forward, like you'd do for Xbox Live, go install https://yacy.net and leave it running. It takes about an hour of being online to git gud, and then you can go index all the weird shit you want to be able to find. Millions of URLs is a few gigs of memory, less resource usage than a bitcoin miner.

>2 Comments, +3

>"gonight never posts"

:marseyshrug:


Is there some way to index all the websites which are hosted on GitHub Pages? I feel like that's going to pick up a lot of obscure software libraries and nerd blogs while mostly avoiding SEO-spam blogspot blogs.


That's an interesting question. As far as I know, github doesn't publish a directory like that so you'd either have to randomwalk domains or know what you're looking for. The system does have an "autocrawl" feature but I'm still figuring out how ridiculous the resource/performance curve is, and given the nature of a randomwalk of the entire internet, you'd miss blogs like that more than you'd hit.

Then there's the fact that a shocking amount of major websites either flat out block non-google crawlers or have some insane drunken interpretation of web standards that's going to require basically hand-holding the crawler until you can find the magic regex string or script that unfricks crawling the site.

Currently the best method I'm aware of that doesn't instantly fill with a literal million malformed URLs (frick medium.com) is hand entering entire domains to walk.
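For anyone wondering what "hand-entered domains plus a magic regex" could look like, here's a rough sketch in Python. This is not how YaCy does it internally; the domain list, the reject patterns, the length cap, and the `should_crawl` name are all made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist: the domains you typed in by hand.
ALLOWED_DOMAINS = {"serverfault.com", "hackaday.com"}

# Hypothetical reject patterns; tune these per site.
REJECT_PATTERNS = [
    # Tracking parameters that spawn endless duplicate URLs.
    re.compile(r"[?&](utm_[a-z]+|source|ref)="),
    # The same path segment three or more times in a row (crawler loop).
    re.compile(r"(/[^/?#]+)\1{2,}(?=[/?#]|$)"),
]

MAX_URL_LENGTH = 200  # arbitrary cap, an assumption


def should_crawl(url: str) -> bool:
    """Return True if the URL is on an allowed domain and looks sane."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed URLs outright.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Accept the domain itself and any subdomain of it.
    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        return False
    if len(url) > MAX_URL_LENGTH:
        return False
    return not any(p.search(url) for p in REJECT_PATTERNS)
```

The allowlist does most of the work; the regexes only catch the junk that shows up inside a domain you already trust.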

I'm in the process of indexing all of Hackaday, and it's showing me how goddarn fricking much google is fricking with results.

:marseyraging:


https://github.com/robots.txt

Man you are not kidding, they really just want you to see code and nothing else on their webzone and good luck navigating between. No idea why /tree is disallowed, but that's never changing now that they're part of microsoft.
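If you want to see how a robots.txt treats a non-Google crawler before pointing anything at a site, Python's standard library can evaluate the rules. A minimal sketch: the rules below are invented for illustration (not copied from GitHub or any real site), and `examplebot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: Googlebot gets everything, everyone else
# gets /about and nothing more. Not any real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /about
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check whether `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "examplebot"):
        for path in ("/about", "/blog/post"):
            url = "https://example.com" + path
            print(agent, path, allowed(SAMPLE_ROBOTS, agent, url))
```

Heads up that `urllib.robotparser` only does plain prefix matching, so wildcard rules like the ones some big sites use won't be interpreted the way Google's crawler reads them.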

At that point you're almost better off trying to use quotes and colons on google or just resorting to grep or sourcegraph.

Also lmao at the random one that includes a link to an account with nothing but ransomware.

And how every involved stackoverflow, github, and orange site thread I've found has a github employee pop up to try and help to explain how to search their site.


Can you provide an example of what you're seeing with the Google search results? Between Google and DuckDuckGo I can't find shit anymore, it's like I'm living in 2002.

I just want a stainless steel half sheet pan that's not $200 :pepereeeeee:


Old shit, but it's somehow more necessary now than ever, especially if any keyword in your search is slightly related to anything marketable.

https://ahrefs.com/blog/google-advanced-search-operators

Also bing is really good for non-text content.
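Quick sketch of gluing those operators together so you're not retyping them every time. It only uses the well-known ones (quoted phrases, `site:`, `-site:`); the function names and the excluded site in the usage example are made up for illustration.

```python
from urllib.parse import urlencode


def build_query(terms=(), phrases=(), site=None, exclude_sites=()):
    """Combine plain terms, exact phrases, and site: operators."""
    parts = list(terms)
    parts += [f'"{p}"' for p in phrases]           # exact-match phrases
    if site:
        parts.append(f"site:{site}")               # restrict to one site
    parts += [f"-site:{s}" for s in exclude_sites]  # drop noisy sites
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Turn a query string into a URL you can paste into a browser."""
    return f"{base}?{urlencode({'q': query})}"
```

Usage, for the sheet pan problem: `build_query(terms=["stainless", "steel"], phrases=["half sheet pan"], exclude_sites=["pinterest.com"])`, then pass the result to `search_url`.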


this is why i'm building this system king, i'll try to find some boutique sites to add for purchases.


unironically, kiwifarms.net :marseykiwi:


YASSS

:#marseyxesright:


:marseynotes:


Your mums!


:marseyindignant:


It's the one gaming news site that isn't complete trash. Even if you don't use Linux, it's invaluable for its no-nonsense news about updates, indie titles, and bundles.


we exist

:#marseyindignantturn:


:marseysaluteusa:


Okay, one more: examine.com is a website with well-written articles about nutritional supplements. There was a kerfuffle where Google removed them from results for a while.

I'm not going to poke around in the thread, but I bet there are some other threads on Hacker News complaining about sites which have been unjustly removed or downranked. I'd bet any of those would be worth including.


:marseykingcrown:


Here's another suggestion, but one which probably isn't feasible to implement.

The best, most informative, webpages on the internet are pages made by some professor where they hand-typed the html in a text editor.

Example: https://minerals.gps.caltech.edu

No idea how to index all those, though. You wouldn't want to just include anything hosted by a university. That would pick up a lot of useless stuff, and besides, you can already do site:*.edu with other search engines.


this is unironically genius, i'm gonna think on this at work :marseyhmm:
