Archiving YouTube and website data

YouTube has become a bit of a dilemma for people like me who enjoy music and the video edits set to it. We love supporting the artists and the editors, but with companies locking down on content, these videos and channels are going offline suddenly and often without warning. I’ve taken to downloading backups of them as often as possible. With a little help from r/datahoarding, I now have a great setup that does this with minimal user intervention.

The fine folks over at r/datahoarding swear by a tool called “youtube-dl”. As an example, here is an install on Ubuntu under WSL in Windows:

 sudo apt install python3-pip ffmpeg
 sudo pip3 install youtube-dl

Then it’s just a matter of providing content to download, with the playlist or channel URL added at the end of the command:

 youtube-dl -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' --format bestvideo+bestaudio/best --continue --sleep-interval 10 --verbose --download-archive PROGRESS.txt --ignore-errors --retries 10 --add-metadata --write-info-json --embed-subs --all-subs

This will put everything from the channel or playlist into its own directory (in this case “Uploads from Half as Interesting”), sleep 10 seconds between downloads, store the info JSON and subtitles, and record progress in PROGRESS.txt so that later runs skip videos that were already downloaded instead of generating excessive traffic. This now runs on a dedicated system, launched by Windows Task Scheduler once a week. The bonus is that I have several playlists set up for download, so I simply add a video to whichever playlist I choose and it is downloaded automatically in the background for future perusal.
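For the several-playlists case, here is a minimal sketch of how the weekly run could be wrapped up. It assumes the URLs live one per line in a text file and hands that file to youtube-dl through its --batch-file option. The file name playlists.txt, the function name archive_playlists and the DRY_RUN switch are made up for illustration, and --verbose is left out only to keep scheduled logs short.

```shell
#!/bin/sh
# Sketch only: playlists.txt, archive_playlists and DRY_RUN are hypothetical names.
# The youtube-dl options are the ones used in the command above, plus --batch-file.

archive_playlists() {
    # File with one playlist or channel URL per line.
    ap_batch_file="${1:-playlists.txt}"

    # Rebuild the positional parameters as the youtube-dl argument list.
    set -- \
        -o '%(playlist)s/%(playlist_index)s - %(title)s.%(ext)s' \
        --format 'bestvideo+bestaudio/best' \
        --continue --sleep-interval 10 \
        --download-archive PROGRESS.txt \
        --ignore-errors --retries 10 \
        --add-metadata --write-info-json --embed-subs --all-subs \
        --batch-file "$ap_batch_file"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command for eyeballing; the output is not shell-quoted.
        printf '%s\n' "youtube-dl $*"
    else
        youtube-dl "$@"
    fi
}
```

Setting DRY_RUN=1 before calling archive_playlists prints the command instead of running it, which is a cheap way to check the options before pointing the scheduler at the script.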

Now, what about backing up an entire website, directory or open directory? Well, there’s a handy tool for that too: wget.

Over at r/opendirectories (I love Reddit), the lads and lasses have found some great data, images, videos, music and more, and it’s always a rush to get those downloaded before they’re gone. In some cases it’s old software and images; other times it’s old music from another country, which is interesting to me and others. In this case, again using the Windows Subsystem for Linux (WSL), you could do something similar to the following:

 /usr/bin/wget -r -c -nH --no-parent --reject="index.html*" ""

In this case, “-r” recurses through the directory, “-nH” keeps wget from creating a top-level folder named after the host, “--no-parent” stops it from climbing above the starting directory, “--reject” skips the index files (not needed), and “-c” continues where it left off if the download is interrupted. This is handy for cloning a site or backing up a large number of items at once. A run can take days and can choke on large files (I’ve only seen issues with files over 70GB; your mileage may vary), but this has worked well so far. I now have a bunch of music from Latin America in a folder for some reason.
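Since “-c” picks up partial files, rerunning the same command after a failure is safe, so a long download can be wrapped in a small retry loop. This is only a sketch under my own assumptions: the function name mirror_dir, the WGET override variable and the default of 3 attempts are invented for illustration. Keep in mind that wget exits non-zero if any single file fails, so one dead link will use up every attempt.

```shell
#!/bin/sh
# Sketch only: mirror_dir, WGET and the attempt count are hypothetical.
# The wget flags are the same ones used in the command above.

# Command to run; defaults to the path used above, can be overridden.
WGET="${WGET:-/usr/bin/wget}"

mirror_dir() {
    md_url="$1"              # directory URL to mirror
    md_max_tries="${2:-3}"   # how many times to run wget before giving up
    md_try=1

    while [ "$md_try" -le "$md_max_tries" ]; do
        # -c lets each new attempt continue partially downloaded files.
        if "$WGET" -r -c -nH --no-parent --reject="index.html*" "$md_url"; then
            return 0
        fi
        echo "wget attempt $md_try of $md_max_tries failed" >&2
        md_try=$((md_try + 1))
    done
    return 1
}
```

Calling mirror_dir with the directory URL and, optionally, a number of attempts runs the download in the current folder and returns non-zero if every attempt failed.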

What are your thoughts? Do you see a lot of videos going missing or being taken down over copyright? Do you have a better way of doing this? Let me know!
