If you were building a search engine, what URLs would you index?

Semi-serious question (rdrama is somehow both extremes of the bell curve in one environment and thus 100% of the opinions i want), what would you index?

I'm currently crawling all of serverfault, and I'm shocked at how well it's working.

Pic related.

Anyways, give me your weird conspiracy blogs, your obscure technical forums, your good reads and everything else that piques your interest.

You can bully me with troll links because I miss being lastmeasured, but I'm sincere posting because the modern internet is boring and fricking useless if you're not trying to mindlessly scroll feeds or watch short form video. The greybeards are gone and now my generation are the greybeards and I'll be motherfricked if I'm going to be part of the reason shit gets fricked beyond all repair.

Anyone here who knows how to port forward, like what you'd do for Xbox Live, go install https://yacy.net and leave it running. It takes about an hour of being online to git gud, and then you can go index all the weird shit you want to be able to find. Millions of URLs is a few gigs of memory, less resource usage than a bitcoin miner.

>2 Comments, +3

>"gonight never posts"

:marseyshrug:

CHUDLORD ye/haw :marseycarlbrutananadilewski:

5mo ago #5345970

Is there some way to index all the websites which are hosted on github pages? I feel like that's going to pick up a lot of obscure software libraries and nerd blogs while mostly avoiding SEO-spam blogspot blogs.

gonight they/them CHUDLORD 5mo ago #5345997

That's an interesting question. As far as I know, github doesn't publish a directory like that so you'd either have to randomwalk domains or know what you're looking for. The system does have an "autocrawl" feature but I'm still figuring out how ridiculous the resource/performance curve is, and given the nature of a randomwalk of the entire internet, you'd miss blogs like that more than you'd hit.

Then there's the fact that a shocking amount of major websites either flat out block non-google crawlers or have some insane drunken interpretation of web standards that's going to require basically hand-holding the crawler until you can find the magic regex string or script that unfricks crawling the site.

Currently the best method I'm aware of that doesn't instantly fill with a literal million malformed URLs (frick medium.com) is hand entering entire domains to walk.
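For anyone curious what "hand entering entire domains to walk" boils down to, here's a rough sketch of a breadth-first walk that refuses to leave an allowlist of hosts. This is not how YaCy does it internally; `walk_domains`, `get_links` and the `.example` hosts are all made up for illustration, and the "web" is a fake in-memory dict so nothing touches the network.

```python
from collections import deque
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def walk_domains(seeds, get_links, allowed_hosts, max_pages=1000):
    """Breadth-first walk that only visits URLs on hand-picked hosts.

    get_links is a hypothetical callable (url -> list of linked URLs);
    a real crawler would fetch and parse the page there.
    """
    allowed = {h.lower() for h in allowed_hosts}
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if host_of(url) not in allowed:
            continue  # a seed outside the allowlist is skipped
        visited.append(url)
        for link in get_links(url):
            # Off-domain links are dropped here, so the frontier
            # never fills up with the rest of the internet.
            if link not in seen and host_of(link) in allowed:
                seen.add(link)
                queue.append(link)
    return visited


# Fake link graph standing in for real pages (illustrative only).
FAKE_WEB = {
    "https://blog.example/": ["https://blog.example/a", "https://spam.example/x"],
    "https://blog.example/a": ["https://blog.example/", "https://blog.example/b"],
    "https://blog.example/b": [],
}

pages = walk_domains(
    ["https://blog.example/"],
    lambda url: FAKE_WEB.get(url, []),
    ["blog.example"],
)
print(pages)
```

The `max_pages` cap is there because even a single domain can generate effectively endless URLs.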

I'm in the process of indexing all of Hackaday, and it's showing me how goddarn fricking much google is fricking with results.

:marseyraging:

ikitomi they/them gonight 5mo ago #5346606 Edited 5mo ago

https://github.com/robots.txt

Man you are not kidding, they really just want you to see code and nothing else on their webzone and good luck navigating between. No idea why /tree is disallowed, but that's never changing now that they're part of microsoft.

At that point you're almost better off trying to use quotes and colons on google or just resorting to grep or sourcegraph.

Also lmao at the random one that includes a link to an account with nothing but ransomware.

And how every involved stackoverflow, github, and orange site thread I've found has a github employee pop up to try and help to explain how to search their site.
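If you want to see how a polite crawler reads a robots.txt like that, here's a small sketch using Python's standard `urllib.robotparser`. The rules below are a made-up sample, not GitHub's actual file, and `weirdbot` and `site.example` are hypothetical names. As far as I know this parser does plain prefix matching in file order, so wildcard patterns in real-world files won't behave the way Google treats them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /tree/
Allow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network fetch is needed.
parser.parse(SAMPLE_ROBOTS.splitlines())

# "weirdbot" is a made-up user agent; it falls under the "*" group.
blocked = parser.can_fetch("weirdbot", "https://site.example/tree/main")
allowed = parser.can_fetch("weirdbot", "https://site.example/blog/post")

print("tree page allowed:", blocked)
print("blog page allowed:", allowed)
```

Against a live site you'd call `set_url(...)` and `read()` instead of feeding it a string.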

Smegma_Male these/those :marseyspinner:

gonight 5mo ago #5346567

Can you provide an example of what you're seeing with the Google search results? Between Google and duck duck go I can't find shit anymore, it's like I'm living in 2002.

I just want a stainless steel half sheet pan that's not $200 :pepereeeeee:

ikitomi they/them Smegma_Male 5mo ago #5346680 Edited 5mo ago

Old shit, but it's somehow more necessary now than ever, especially if any keyword in your search is slightly related to anything marketable.

https://ahrefs.com/blog/google-advanced-search-operators

Also bing is really good for non-text content.

gonight they/them Smegma_Male 5mo ago #5348050

this is why i'm building this system, king. i'll try to find some boutique sites to add for purchases.

Guzzy cute/twink :marseyfemboy:

rDrama’s resident femboy catgurl pro-ana twink superstar :!marseylgbtflag4:

5mo ago #5346349

unironically, kiwifarms.net :marseykiwi:

X ching/chong :marseymaidgeisha:

Anarcho-Syndicalist-Trotskyist-Stalinist Cuban Revolutionary :marseyrevolution:

Guzzy 5mo ago #5346438

YASSS

:#marseyxesright:

gonight they/them Guzzy 5mo ago #5348044

:marseynotes:

HokubsWorkshoper Ya/kub 5mo ago #5345986

Your mums!

gonight they/them HokubsWorkshoper 5mo ago #5346000

:marseyindignant:

Fabrico r/drama :marseyeldritch:

My profile and flair color is 28bca3 :!marseycatcus:

5mo ago #5345953

https://www.zerohedge.com

https://www.gamingonlinux.com

ukstubbs he/hole :marseyflirt:

Muslims are subhuman scum :marseyhomofascist:

Also no fats :marseyno:

Fabrico 5mo ago #5345969

:#marseyrofl:

ukstubbs 5mo ago #5346025

It's the one gaming news site that isn't complete trash. Even if you don't use Linux, it's invaluable for its no-nonsense news about updates, indie titles, and bundles.

ukstubbs 5mo ago #5346437

we exist

:#marseyindignantturn:

gonight they/them Fabrico 5mo ago #5345981

:marseysaluteusa:

Snappy beep/boop Join !friendsofsnappy :marseysnappynraged:

5mo ago #5345940

:#marseycosmopolitan:

Snapshots:

https://yacy.net:

5mo ago #5348999

Okay, one more: examine.com is a website with well written articles about nutritional supplements. There was a kerfuffle where google removed them from results for a while.

I'm not going to poke around in the thread, but I bet there's probably some other threads on hacker news complaining about sites which have been unjustly removed or downranked. I'd bet any of those would be worth including.

gonight they/them CHUDLORD 5mo ago #5350040

:marseykingcrown:

5mo ago #5348888

Here's another suggestion, but one which probably isn't feasible to implement.

The best, most informative webpages on the internet are pages made by some professor where they hand-typed the html in a text editor.

Example: https://minerals.gps.caltech.edu

No idea how to index all those, though. You wouldn't want to just include anything hosted by a university. That would pick up a lot of useless stuff, and besides, you can already do site:*.edu with other search engines.
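One crude way to guess at "hand-typed" is to look at what the page doesn't have: no scripts, no generator meta tag, at most one stylesheet. Here's a sketch with Python's standard `html.parser`. The thresholds and the names `HandTypedSniffer` and `looks_hand_typed` are pure assumptions on my part, and this would obviously misfire on plenty of pages, so treat it as a starting filter and not a classifier.

```python
from html.parser import HTMLParser


class HandTypedSniffer(HTMLParser):
    """Counts signals that usually mean a CMS or framework built the page."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.stylesheets = 0
        self.has_generator = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute values can be None
        if tag == "script":
            self.scripts += 1
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            self.stylesheets += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            # e.g. <meta name="generator" content="..."> left by site builders
            self.has_generator = True


def looks_hand_typed(html, max_scripts=0, max_stylesheets=1):
    """Heuristic guess only; the default thresholds are arbitrary."""
    sniffer = HandTypedSniffer()
    sniffer.feed(html)
    sniffer.close()
    return (
        not sniffer.has_generator
        and sniffer.scripts <= max_scripts
        and sniffer.stylesheets <= max_stylesheets
    )


plain_page = "<html><body><h1>Minerals</h1><p>Quartz notes.</p></body></html>"
built_page = (
    '<html><head><meta name="generator" content="SomeCMS">'
    '<script src="app.js"></script></head><body></body></html>'
)

print(looks_hand_typed(plain_page))
print(looks_hand_typed(built_page))
```

You could run something like this over pages already crawled from a list of university hosts and only keep the ones that pass.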

gonight they/them CHUDLORD 5mo ago #5348924

this is unironically genius, i'm gonna think on this at work :marseyhmm:
