In a non-indexed website, no one can hear you crawl
Skyrock.com is a social networking site based in France that offers its users free space on the web to create blogs, add profiles, and exchange messages with other registered members.
Between 2006 and 2009, almost every French teenager had a blog there.
Some are okay, some are creepy, but most of them are definitely cringe.
Can you feel it? The dense, awful perfume of cryptic messages between teenagers, reminiscent of the early Facebook odyssey.
My significant other had one of them. She challenged me to find it.
After a few months of fruitless manual labor, I came to several conclusions:
- Google refuses to index most of Skyrock.com (some of it is indexed, but less than X%).
- Most of the Skyrock userbase was young (around 16 years old), and the Internet was still deep and unknown at that time, so most content is gibberish.
- Usernames are at most 24 characters long and are used in the URL, which restricts the charset to 37 characters: [abcdefghijklmnopqrstuvwxyz0123456789-] (lowercase only, since the username ends up in the URL).
  E.g. if my username is ParriauxMaxime, my skyblog should exist at https://parriauxmaxime.skyrock.com.
- Being a social network before Facebook's mass adoption in France (~2009, give or take), Skyrock allowed profiles to connect to each other: Fan/Source.
  Fan and Source usually being 1:1, let's focus on Fans only.
  Those are located at https://parriauxmaxime.skyrock.com/fans.html, https://parriauxmaxime.skyrock.com/fans2.html, ...
- Most of the userbase (at least 80%) had filled in some essential information: age, location, postalCode, country, etc.
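As a rough illustration of the URL scheme described above, here is a minimal TypeScript sketch. The helper names (`toSlug`, `blogUrl`, `fansUrl`) are hypothetical and not taken from the repository, and the minimum length of 1 character is an assumption (only the 24-character maximum is stated).

```typescript
// Lowercase letters, digits and hyphen: the 37-character charset.
// ASSUMPTION: minimum length of 1; only the maximum of 24 is known.
const USERNAME_PATTERN = /^[a-z0-9-]{1,24}$/;

// Normalise a display name to the form used in the subdomain.
function toSlug(username: string): string {
  const slug = username.toLowerCase();
  if (!USERNAME_PATTERN.test(slug)) {
    throw new Error(`Not a valid Skyrock username: ${username}`);
  }
  return slug;
}

// Root URL of a user's skyblog.
function blogUrl(username: string): string {
  return `https://${toSlug(username)}.skyrock.com`;
}

// Fans pages are fans.html, fans2.html, fans3.html, ...
function fansUrl(username: string, page: number = 1): string {
  if (page < 1 || Math.floor(page) !== page) {
    throw new RangeError(`Page must be a positive integer, got ${page}`);
  }
  const suffix = page === 1 ? "" : String(page);
  return `${blogUrl(username)}/fans${suffix}.html`;
}
```

For instance, `fansUrl("ParriauxMaxime", 2)` builds the second fans page of the example blog.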
Basically, brute-forcing every possible nickname and scraping data on the go would take around 4.37 * 10³⁴ seconds, assuming 1000 fetches/second.
Building on the research above, doing the same for only the most likely nickname lengths (6 to 15 characters, a 68% confidence interval) would still take an eternity: about 3 × 10²³ candidates, or roughly 3 × 10²⁰ seconds at the same rate.
(Been there, done that, useless)
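To sanity-check the orders of magnitude, the search space can be counted directly: with 37 possible characters there are 37ⁿ nicknames of length n. The sketch below sums that over a length range and divides by the fetch rate. The function names are illustrative, and it ignores any extra rule Skyrock may have applied (hyphen placement, reserved names), so treat the result as a ballpark only.

```typescript
const CHARSET_SIZE = 37;         // [a-z0-9-]
const FETCHES_PER_SECOND = 1000; // rate assumed in the text

// Number of possible nicknames with a length in [minLength, maxLength].
function candidateCount(minLength: number, maxLength: number): number {
  let total = 0;
  for (let len = minLength; len <= maxLength; len++) {
    total += CHARSET_SIZE ** len;
  }
  return total;
}

// Seconds needed to request every candidate once at the assumed rate.
function secondsToEnumerate(minLength: number, maxLength: number): number {
  return candidateCount(minLength, maxLength) / FETCHES_PER_SECOND;
}

console.log(secondsToEnumerate(1, 24).toExponential(2));
console.log(secondsToEnumerate(6, 15).toExponential(2));
```

Plain floating-point numbers are enough here since only the order of magnitude matters; the full 1 to 24 range lands on the order of 10³⁴ seconds.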
Remember six degrees of separation?
In a nutshell, you're connected to anybody on this planet in six hops or fewer.
The sane approach is to walk through fans whose location is "close" to your target's.
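One way to make "close" concrete is a distance threshold. The project stores its data in Postgres with PostGIS, which can do this kind of filtering in the database; the sketch below is only an in-memory illustration using the haversine formula. The `GeoPoint` and `Fan` shapes and the `closeFans` helper are hypothetical, and it assumes postal codes have already been turned into coordinates.

```typescript
interface GeoPoint {
  lat: number; // degrees
  lon: number; // degrees
}

// Hypothetical shape: not every profile has a usable location.
interface Fan {
  username: string;
  location?: GeoPoint;
}

const EARTH_RADIUS_KM = 6371;

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points (haversine formula).
function haversineKm(a: GeoPoint, b: GeoPoint): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLon = toRadians(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Keep only fans with a known location within maxKm of the target.
function closeFans(fans: Fan[], target: GeoPoint, maxKm: number): Fan[] {
  return fans.filter(
    (fan) => fan.location !== undefined && haversineKm(fan.location, target) <= maxKm,
  );
}
```

Fans without a location are dropped here; depending on how sparse the data is, it may be better to keep them in a separate bucket and revisit them later.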
- Install the dependencies
yarn
# or
npm i
- You will need a database (Postgres with PostGIS enabled; I also used Metabase to get a quick look at the awful amount of data in there)
docker-compose up -d
- Populate your database with what I got from 48 hours of crawling (you need psql installed locally)
yarn psql_init
# or
npm run psql_init
- (Optional) If you plan to keep scraping, you can uncache some predata
yarn uncache
# or
npm run uncache
- Start the application
yarn start:dev
- REPL time. The profile command fetches the fans of "nickname" through 2 levels of recursion
profile("nickname")
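The `profile` command itself lives in the repository; what follows is not its implementation, only a minimal sketch of what walking fans through a fixed number of levels can look like. `crawlFans`, `FanLookup` and the sample graph are hypothetical, and the lookup is kept synchronous and in-memory so the example runs as-is (a real crawler would fetch and parse the fans pages asynchronously).

```typescript
// Hypothetical lookup: returns the fan usernames of a given user.
type FanLookup = (username: string) => string[];

// Breadth-first walk over fans, stopping after maxDepth levels.
function crawlFans(start: string, getFans: FanLookup, maxDepth: number): string[] {
  // Null-prototype object so usernames like "constructor" are safe keys.
  const seen: { [name: string]: boolean } = Object.create(null);
  seen[start] = true;
  const discovered: string[] = [];
  let frontier: string[] = [start];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const user of frontier) {
      for (const fan of getFans(user)) {
        if (!seen[fan]) {
          seen[fan] = true; // never visit the same profile twice
          discovered.push(fan);
          next.push(fan);
        }
      }
    }
    frontier = next;
  }
  return discovered;
}

// Made-up fan graph for illustration.
const sampleGraph: { [name: string]: string[] } = {
  alice: ["bob", "carol"],
  bob: ["dave"],
  carol: ["alice"],
  dave: ["erin"],
};
const lookup: FanLookup = (name) => sampleGraph[name] || [];

console.log(crawlFans("alice", lookup, 2));
```

The `seen` set matters more than it looks: fan relationships are full of cycles, and without it the same profiles would be fetched over and over.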
- Q: Did you succeed?
A: Yes
- Q: What did it cost?
A: Everything
- Q: My target did not fill in their info at that time, what should I do?
A: Some of their friends/fans did fill theirs in, so you should try to iterate on "close" fans (geographic area/age)
- Q: What if I cannot find my target by age or location?
A: You could try fetching the last 5 posts of each blog along with the first ~100 comments. From there, you will need to implement feature detection (surname, city name, school name, etc.). If you end up wading through a huge pile of garbage data, you can try using Zipf's law to trim it.
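One possible reading of the Zipf suggestion, sketched below under loud assumptions: in natural text a handful of words account for most occurrences, so ranking tokens by frequency and cutting both the very top (filler words) and the very bottom (one-off noise) leaves a middle band where names of people, cities and schools are more likely to sit. The function names and both thresholds are hypothetical and would need tuning on real data.

```typescript
// Split text into lowercase word tokens (basic French accents included).
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9àâäçéèêëîïôöùûüÿœ]+/)
    .filter((token) => token.length > 0);
}

// Count how often each token appears across all texts.
function countTokens(texts: string[]): { [token: string]: number } {
  const counts: { [token: string]: number } = Object.create(null);
  for (const text of texts) {
    for (const token of tokenize(text)) {
      counts[token] = (counts[token] || 0) + 1;
    }
  }
  return counts;
}

// Rank tokens by frequency, drop the `headRanks` most frequent ones,
// then drop anything seen fewer than `minCount` times.
function zipfTrim(texts: string[], headRanks: number, minCount: number): string[] {
  const counts = countTokens(texts);
  const ranked = Object.keys(counts).sort(
    (a, b) => counts[b] - counts[a] || (a < b ? -1 : 1),
  );
  return ranked.slice(headRanks).filter((token) => counts[token] >= minCount);
}
```

The surviving tokens are only candidates; they would still have to be matched against lists of surnames, cities or schools to become actual features.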
This project is built on Nest, a progressive Node.js framework for building efficient and scalable server-side applications.
It started from the Nest framework TypeScript starter repository, whose scripts are listed below.
$ npm install
# development
$ npm run start
# watch mode
$ npm run start:dev
# production mode
$ npm run start:prod
# unit tests
$ npm run test
# e2e tests
$ npm run test:e2e
# test coverage
$ npm run test:cov