
Web Scraping Orchestration

Goal

Create an index of Fandom.com and TVtropes.org, then connect them.

TODO

  • Tutorials/Docker Postgres with Backup and Restore
  • Docker Traffic Through VPN
  • Merge SQLite databases
  • Minio Tutorial
  • S3 Backup and Restore
  • Nats Tutorial
  • Scrape some fandom sites and store them as sqlite databases
    • Share code with other people
  • Update docker networking tutorial
  • Scope out 2.0 Scraper
    • Use NATS pub sub + job queue
    • Use a networked database, so no sqlite
    • Store the images of pages in object storage
    • Have other scraping engines such as selenium and puppeteer
    • Use VPN and other proxies
    • Upgrade proxies to be changeable via pubsub
    • Customised Scraper
  • Develop UI for contextual annotation
    • What is this contextual annotation, and how does it relate to the question engine? That needs to be written out.

Logs

  • 2023-04-09
    • We need to store the links
    • Well we don't know if they are in the database or not
    • There are a fuck ton of links
    • We do not care about optimization until later
    • Alright but can't we use the full_url as the primary key
    • Ohhhhhhh that needs to be written down somewhere
      • Well we can do that, what would we need to change
      • We would need to change the insert and stuff
      • This is just far more efficient
    • Can we do joins on strings
      • Why not
      • It is easier than doing a select check for everything
      • YES FAR EASIER
      • Alright let's do it
    • Done, it's now doing the recursive scraping
    • Wonderful what's next
    • Next is to do the network stuff
    • Oh yes, wireguard or openvpn
    • It would be cool to do wireguard
    • Let's find a tutorial
    • Wait this is a separate tutorial
    • Yes, yes it is
    • So IDK if I am scraping correctly; I am extracting all parts of the URL, and I think it is compartmentalized well enough (a sketch of that follows this entry)
    • I should be reading right now
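
A minimal sketch of the full_url-as-primary-key idea from this entry, assuming SQLAlchemy and the stdlib urllib.parse; the table, column, and function names are illustrative, not the project's actual schema:

```python
# Sketch only: full_url as the primary key so inserts can skip the
# select-then-insert check. All names below are illustrative.
from urllib.parse import urlparse

from sqlalchemy import Column, Text, create_engine
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Url(Base):
    __tablename__ = "urls"
    full_url = Column(Text, primary_key=True)  # the URL string itself is the key
    scheme = Column(Text)
    domain = Column(Text)
    path = Column(Text)
    params = Column(Text)
    query = Column(Text)
    fragment = Column(Text)


def add_url(session: Session, full_url: str) -> None:
    """Split the URL into its parts and insert it, silently skipping duplicates."""
    parts = urlparse(full_url)
    stmt = (
        insert(Url)
        .values(
            full_url=full_url,
            scheme=parts.scheme,
            domain=parts.netloc,
            path=parts.path,
            params=parts.params,
            query=parts.query,
            fragment=parts.fragment,
        )
        .on_conflict_do_nothing(index_elements=["full_url"])  # no select-check needed
    )
    session.execute(stmt)


if __name__ == "__main__":
    engine = create_engine("sqlite:///scraper.db")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        add_url(session, "https://gameofthrones.fandom.com/wiki/Jon_Snow")
        session.commit()
```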
  • 2023-04-06
    • Well I just lost all those logs RIP
    • Gotta remember to git stash
    • So where were we
    • We have the scrape URL file finished
    • Yes what comes after that?
    • Now we need to actually scrape it, then we need a session where we delete the scraped entry and add the log
    • Next we need to scrape, upload the contents, then delete the scrape entry and add the log
    • That seems doable
    • So how are we supposed to do the jobs?
    • I mean extract the URLs
    • Is this going to be a separate job
    • Yes it is obviously a separate job
    • So how does the system know when to do it?
    • There are the job logs, and we have the scraping queue
    • Are we renaming it job queue?
    • Ya, then we can add another type of job
    • This is where callback functions come in, doesn't it
    • YES
    • We can just store it in the HTML contents
    • No we store it in the logs, cause the contents can be object storage
    • Do we want to practice JSON in sqlalchemy
    • No we should do that in the tutorial
    • Alright so how do we do this
    • We append the logs with a links_extracted boolean
    • Sure why the fuck not
    • And we do string
    • Do we want to jump to jobs server now?
    • No we want to know how to do this single threaded
    • Then we can do jobs and stuff
    • So the next thing to do would be what exactly?
    • Alright so what would the next thing be?
    • We read through the URLs and decide which ones need to be scraped
    • Are we going to have to process the domains as well
    • Yes, yes we will
    • Then what do we do?
    • We read all the URLs and paths that have not been scraped and put them in the queue after every URL we scrape, until we have a lot of them, like 100
    • Can we do that right now
    • Probably
    • So what was the plan?
    • Subdomain parsing
    • Wait, do we really need that? We can just do the right SQL queries, like DISTINCT with a %.fandom.com filter. Ya, that is over-engineering; trust the library developers
    • So then we just need a function that can feed new URLs into the queue and scrape again until a specific number of logs exist (a sketch of that loop follows this entry)
    • We still need to check http error and log it
    • Alright then, that's basically it, right
    • Ya then we can start doing jobs and start doing the cool shit with Selenium
    • We also need to link what links to what, and to do that we need to know what links are actually useful
    • Oh ya I forgot about that
    • So links?
    • What links matter
    • The ones with the
    • Game of Thrones was too big a wiki to start with; it found 9000 pages
    • We need to track the links between stuff, that is very important, that is the whole point of this; we gotta find that graph sqlite example
    • Wait after we have the data what do we do with it?
    • Paul you should have thought this part out
    • How do we want to look at this data
    • Well we want to see the stories summarized
    • We can also fetch the actual reference texts
    • WE WANT TO LABEL THE EDGES
    • We want to generate meme vectors
    • We want to connect this to TV Tropes
    • Oh ya, how do we do that
    • Alright so next steps are?
      • Track the links between pages, use a graph
      • Find a small simple example rather than GameOfThrones
      • Write the postgres in docker tutorial
        • Maybe do MariaDB and MySQL at the same time
      • Scrape some fandom sites and store them as sqlite databases
        • Share code with other people
      • Update docker networking tutorial
      • Scope out 2.0 Scraper
        • Use NATS pub sub message queue
        • Use a networked database
        • Store the images of pages in object storage
        • Have other scraping engines
        • Use VPN and other proxies
      • Develop UI for contextual annotation
        • What is this contextual annotation, and how does it relate to the question engine? That needs to be written out
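
A rough single-threaded sketch of the loop described in this entry (feed unscraped URLs into a queue, fetch, record the HTTP status, log errors, stop around 100 logs). It assumes plain sqlite3 and requests; the table and column names are placeholders rather than the project's real schema:

```python
# Rough sketch of the single-threaded loop: refill the queue from unscraped
# URLs, fetch each one, log the HTTP status or the error, stop at ~100 logs.
# Table/column names are placeholders; requests stands in for the real fetcher.
import sqlite3

import requests

LOG_LIMIT = 100  # stop once this many scrape logs exist


def unscraped_urls(conn: sqlite3.Connection, batch: int = 10) -> list[str]:
    """Fandom URLs with no scrape log yet (a LIKE filter beats subdomain parsing)."""
    rows = conn.execute(
        """
        SELECT full_url FROM urls
        WHERE domain LIKE '%.fandom.com'
          AND full_url NOT IN (SELECT full_url FROM scrape_logs)
        LIMIT ?
        """,
        (batch,),
    )
    return [row[0] for row in rows]


def scrape_batch(conn: sqlite3.Connection) -> None:
    for url in unscraped_urls(conn):
        try:
            resp = requests.get(url, timeout=30)
            conn.execute(
                "INSERT INTO scrape_logs (full_url, status, links_extracted) VALUES (?, ?, 0)",
                (url, resp.status_code),
            )
            if resp.ok:
                conn.execute(
                    "INSERT INTO page_contents (full_url, html) VALUES (?, ?)",
                    (url, resp.text),
                )
        except requests.RequestException as exc:
            conn.execute(
                "INSERT INTO scrape_errors (full_url, error) VALUES (?, ?)",
                (url, str(exc)),
            )
        conn.commit()


def run(conn: sqlite3.Connection) -> None:
    while conn.execute("SELECT COUNT(*) FROM scrape_logs").fetchone()[0] < LOG_LIMIT:
        scrape_batch(conn)
```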
  • 2023-04-01

    • Alright so what do we do next?
    • Well why are we doing this project?
    • What is the utility of Fandom.com when I do not even read it?
    • Well, we want to find themes and connect them; the same way we connect these themes, we can connect people to work on projects
    • Dude that is based
    • Glad no one reads my shit, talking to myself is cringe
    • I need to talk to myself, no one else wants to listen
    • Well you can make them listen, you would be surprised
    • Thanks alter ego
    • Alright so what schema do we want?
    • We doing those Diagrams?
    • Entity Relationship Diagram
    • Well let's explain the different parts of this project first shall we?
      • We need the database
      • We need the scraper
      • We need jobs based scraper
    • Alright that sounds pretty simple
    • Yes it does
    • Design your schema
    • Alright so what is this Schema supposed to contain?
      • Web pages to scrape
      • The contents of web pages
      • Errors from web pages scraped
      • URLs extracted from web pages
      • Edges, what pages link to each other
    • So this is 5 tables?
    • Sure why not
    • Let's look at how we did it last time?

```sql
CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T(
    SCRAPED_URL_ID INTEGER PRIMARY KEY,
    URL_ID INTEGER,
    DATE_SCRAPED DATE,
    HTML TEXT
);

CREATE TABLE IF NOT EXISTS URLS_T(
    URL_ID INTEGER PRIMARY KEY,
    FULL_URL TEXT,
    SCHEMA TEXT,
    DOMAIN TEXT,
    PATH TEXT,
    PARAMS TEXT,
    QUERY TEXT,
    FRAGMENT TEXT
);
```
    • Okay so that looks okay
    • How do we want to do this? Node or Python
    • Well we should probably use Selenium
    • Ya
    • Cool Selenium can use extensions too
    • We can also get a Mentor for this kind of stuff down the line
    • Okay so we using SQLAlchemy then
    • Yes
    • We doing ORM
    • Sure why not
    • Why don't we just write the ORM code first (a sketch follows this entry)
    • Alright sure
    • Where are we writing this code, github, gitlab, keybase, codeburg
    • Gitlab
    • So do I want to run this on SQLite or Postgres?
    • Does it matter? You need a job system before you need to run postgres
    • Alright
    • So how about we write a SQLAlchemy Tutorial, then we can use it in the Web Scraping Tutorial
    • Ya so what goes in the SQLAlchemy Tutorial
    • Alright so this is another project now right?
    • Yes
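
A hedged SQLAlchemy ORM sketch of the five tables listed above (pages to scrape, contents, errors, extracted URLs, edges). Column names follow the old SQL where they exist; everything else is illustrative, not the project's final models:

```python
# Illustrative ORM version of the five-table schema above; not the final models.
from sqlalchemy import Column, Date, ForeignKey, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Url(Base):  # web pages to scrape
    __tablename__ = "urls_t"
    url_id = Column(Integer, primary_key=True)
    full_url = Column(Text, unique=True)
    scheme = Column(Text)  # the old SQL called this SCHEMA
    domain = Column(Text)
    path = Column(Text)
    params = Column(Text)
    query = Column(Text)
    fragment = Column(Text)


class ScrapedUrl(Base):  # the contents of web pages
    __tablename__ = "scraped_urls_t"
    scraped_url_id = Column(Integer, primary_key=True)
    url_id = Column(Integer, ForeignKey("urls_t.url_id"))
    date_scraped = Column(Date)
    html = Column(Text)


class ScrapeError(Base):  # errors from web pages scraped
    __tablename__ = "scrape_errors_t"
    scrape_error_id = Column(Integer, primary_key=True)
    url_id = Column(Integer, ForeignKey("urls_t.url_id"))
    error = Column(Text)


class ExtractedUrl(Base):  # URLs extracted from web pages
    __tablename__ = "extracted_urls_t"
    extracted_url_id = Column(Integer, primary_key=True)
    source_url_id = Column(Integer, ForeignKey("urls_t.url_id"))
    full_url = Column(Text)


class Edge(Base):  # edges: which page links to which
    __tablename__ = "edges_t"
    edge_id = Column(Integer, primary_key=True)
    from_url_id = Column(Integer, ForeignKey("urls_t.url_id"))
    to_url_id = Column(Integer, ForeignKey("urls_t.url_id"))


if __name__ == "__main__":
    engine = create_engine("sqlite:///scraper.db")
    Base.metadata.create_all(engine)
```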
  • 2023-03-30

    • So how do I scrape everything?
      • Well I can develop a basic web scraper, check for a 200 return, and log errors (sketched after this entry)
      • Didn't we do this before?
      • Yes, when? There was an error table that was very important.
      • Keybase Binding?
      • No
      • ENS Indexing, that's what it was
      • So we're just indexing all the raw HTML into a database
      • Ya why not, otherwise things get complicated
      • This might be a LOT of data
      • Who cares; we need to scope this correctly: scrape by single site, list all sites
      • Can't I just go buy indexes like this off the internet?
      • Probably but good luck
      • I bet people usually just hire someone to do it for them
      • Alright so how do we do this?
      • Design Schema, Scrape, Scale, ETL
      • Alright it might actually be that simple
      • We are going to have to spend a lot of time designing this crap.
      • This seems doable
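
A minimal sketch of that first step from the plan above, assuming requests (the project may end up on Selenium instead); print() stands in for the error table:

```python
# Minimal first pass: fetch a page, check for a 200, log anything else.
# requests is an assumption here; print() stands in for the error table.
import requests


def fetch(url: str) -> str | None:
    """Return the raw HTML on a 200 response, otherwise log and return None."""
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException as exc:
        print(f"error fetching {url}: {exc}")
        return None
    if resp.status_code != 200:
        print(f"bad status {resp.status_code} for {url}")
        return None
    return resp.text


if __name__ == "__main__":
    html = fetch("https://www.fandom.com/")
    print(len(html or ""))
```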