Worked from 2023-03-14 to 2023-09-26

🏷 Tags

🚦 Status

Paused

The goal of this project was to have a way to manage my unorganized code folder dump. I'm unsure if others have this problem, but I often just start messing with code and don’t necessarily organize it from the start. This tool is a way to manage that mess.

Crawling

It currently supports crawling a directory until it finds a git repo or until any of its child folders are git repos. I’ve iterated over multiple ways to accomplish this, which can be seen below for a ~60GB folder containing around 50 git repos and many folders that are not git repos yet. I understand that this is not the best way to measure performance, but it gives a good enough idea.

The initial crawling step checks whether each directory is a git repo or not. Whenever there is a git repo, it also extracts relevant information about the repository.

| Attempt | `time` output | Description |
| --- | --- | --- |
| 1 | 0.18s user 0.19s system 99% cpu 0.378 total | Using a max depth approach |
| 2 | 0.16s user 0.04s system 99% cpu 0.204 total | Following the algorithm below |

Crawling algorithm

This is a mix of breadth-first and depth-first recursive search, following these steps (a rough code sketch follows the list):

1. Get a list of sub-directories and check whether any of them are git repos (breadth)
   1. if yes, drill down into the non-git directories (depth)
      1. start back at step 1 for every such directory
   2. if no, stop
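
As an illustration, here is a minimal sketch of a literal reading of these steps using only the Rust standard library; the `find_repos` name and the `.git`-folder check are assumptions for the example, not the project’s actual code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Illustrative sketch of the crawl steps above.
fn find_repos(dir: &Path, repos: &mut Vec<PathBuf>) -> io::Result<()> {
    // Step 1 (breadth): list the sub-directories and split them into
    // git repos and non-git directories.
    let mut git_dirs = Vec::new();
    let mut other_dirs = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        if path.join(".git").is_dir() {
            git_dirs.push(path);
        } else {
            other_dirs.push(path);
        }
    }

    // Step 1.2: none of the sub-directories are git repos -> stop.
    if git_dirs.is_empty() {
        return Ok(());
    }

    // Record the repos found at this level, then step 1.1 (depth):
    // drill down into the non-git directories, restarting at step 1.
    repos.extend(git_dirs);
    for dir in other_dirs {
        find_repos(&dir, repos)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut repos = Vec::new();
    find_repos(Path::new("."), &mut repos)?;
    for repo in &repos {
        println!("{}", repo.display());
    }
    Ok(())
}
```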

Getting all relevant data

This step is a bit more thorough, gathering more information and interesting data. It is also the step that takes the longest.

  • crawling step (time from the previous table)
  • total byte size (more accurate than du -sh)
  • all programming languages used via tokei
  • read README.md
| Attempt | `time` output | Description |
| --- | --- | --- |
| 1 | 16.07s user 31.78s system 126% cpu 37.782 total | Getting data from every directory |
| 2 | 5.36s user 11.16s system 107% cpu 15.338 total | Getting data from leaf directories |
| 3 | 5.41s user 10.59s system 94% cpu 16.971 total | Getting data from leaf directories + summing size |
| 4 | 5.71s user 15.29s system 311% cpu 6.742 total | Getting data from leaf directories + summing size + parallelization |
| 5 | 5.51s user 4.18s system 564% cpu 1.717 total | Same as before without calculating any size… |
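
For a single leaf repository, the gathering described in the list above could look roughly like this sketch. It assumes the walkdir crate for summing file sizes and tokei’s library API for language statistics; `RepoInfo` and `gather` are hypothetical names, not the project’s actual code.

```rust
use std::fs;
use std::path::{Path, PathBuf};

use tokei::{Config, Languages};
use walkdir::WalkDir;

/// Hypothetical per-repo summary; the field names are illustrative.
struct RepoInfo {
    path: PathBuf,
    byte_size: u64,
    languages: Vec<(String, usize)>, // (language name, lines of code)
    readme: Option<String>,
}

fn gather(path: &Path) -> RepoInfo {
    // Total byte size: sum the length of every file under the repo.
    let byte_size: u64 = WalkDir::new(path)
        .into_iter()
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.file_type().is_file())
        .filter_map(|entry| entry.metadata().ok())
        .map(|meta| meta.len())
        .sum();

    // Programming languages used, via tokei.
    let mut stats = Languages::new();
    stats.get_statistics(&[path], &[], &Config::default());
    let languages: Vec<(String, usize)> = stats
        .iter()
        .map(|(lang, report)| (lang.to_string(), report.code))
        .collect();

    // Read the README, if there is one.
    let readme = fs::read_to_string(path.join("README.md")).ok();

    RepoInfo { path: path.to_path_buf(), byte_size, languages, readme }
}
```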

Parallelization is done with rayon, which makes the code even simpler. From the data above, the biggest bottleneck seems to be the file crawling done by tokei; I tried to improve on that step to no avail.
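
As a rough sketch of what that parallelization amounts to (reusing the hypothetical `gather` and `RepoInfo` from the previous snippet), swapping rayon’s `par_iter` in for a plain iterator is about all it takes:

```rust
use std::path::PathBuf;

use rayon::prelude::*;

// `RepoInfo` and `gather` are the hypothetical pieces sketched above;
// `repo_paths` would come from the crawling step.
fn gather_all(repo_paths: Vec<PathBuf>) -> Vec<RepoInfo> {
    repo_paths
        .par_iter()               // rayon spreads the work across its thread pool
        .map(|path| gather(path)) // per-repo gathering runs in parallel
        .collect()
}
```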

Output example

Here’s a sample of the output, only displaying the file size and relative path/filename:

Screenshot of the crawler

CLI

I had started to implement a CLI with cursive, but never finished it. Here are a few screenshots of how it currently looks:

Screenshot of the CLI project overview
Screenshot of the CLI README view

🛠 Technologies

Languages

Other