Generate files from a CSV
Recently I needed to create a set of files based on a CSV file. Assume you have a comma-separated CSV file with a structure like this:
+------+--------------+--------------+
| 123  | A            | random_hdk   |
| 345  | A            | random_dsa   |
| 456  | B            | random_lts   |
| 678  | A            | random_gcy   |
| 789  | B            | random_zlq   |
| 901  | B            | random_onw   |
| ...  | ...          | ...          |
+------+--------------+--------------+
And we need to create files A and B, named after the second column, with the respective contents taken from the first column:
# file A
123
345
678
# file B
456
789
901
The solution I came up with is based on shell and awk. In my opinion there is little competition for this one-liner:
awk -F "," '{ print $1 > "sensor_lists/"$2}' sensor-ids.csv
-F "," cuts the csv file by comma
$1 first column with ids
$2 second column with identifiers
> redirect output to files based on the second column
sensor_lists the directory name
This one-liner creates files named after the second column and fills each of them with the matching entries from the first column.
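The benchmark below runs this through a wrapper script, ./create_lists.sh, which is not shown in the post. A minimal sketch, assuming it simply wraps the one-liner (the input name random_data.csv and the output directory data/ are taken from the benchmark setup, not from the command above):

```shell
#!/bin/sh
# Hypothetical create_lists.sh -- a sketch, not the author's actual script.
# Paths are assumptions based on the benchmark setup: input random_data.csv,
# output directory data/.
# A few sample rows so the sketch runs standalone:
[ -f random_data.csv ] || printf '123,A,random_hdk\n456,B,random_lts\n678,A,random_gcy\n' > random_data.csv
mkdir -p data
awk -F "," '{ print $1 > "data/"$2 }' random_data.csv
```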
Works, but what about performance? First we create a CSV with random data inside; we expect 972 output files. Let's benchmark the solution with hyperfine. We need to drop the page cache before every run to get a clean measurement each time:
hyperfine --prepare 'rm data/* && sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' --runs 30 --show-output './create_lists.sh'
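The random input itself can be produced with a throwaway generator; a sketch along these lines (the row count, id range, and single-letter group names are all assumptions -- the post only says the data is random):

```shell
#!/bin/sh
# Throwaway input generator -- a sketch; the real data set is not shown
# in the post. 100000 rows and single-letter group names are assumptions.
awk 'BEGIN {
    srand()
    for (i = 1; i <= 100000; i++) {
        id  = 100 + int(rand() * 900)                 # three-digit id
        grp = sprintf("%c", 65 + int(rand() * 26))    # group name A..Z
        printf "%d,%s,random_%d\n", id, grp, i
    }
}' > random_data.csv
```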
We get a performance of:
Time (mean ± σ): 10.697 s ± 0.189 s [User: 10.609 s, System: 0.073 s]
Range (min … max): 10.360 s … 11.076 s 30 runs
Ok, how would a solution in Python work out? I wrote it with quite a few hard-coded lines to avoid unnecessary variable allocations. More could have been done, but let's go with this one for now. I will be using Python 3.11.6 for this experiment.
Code looks like this:
import pandas as pd

# Read only the first two columns: the ids and the group identifiers
data = pd.read_csv("random_data.csv", names=[0, 1, 2], usecols=[0, 1])
# Collect the ids belonging to each group into a list
data = data.groupby(1)[0].apply(list)
for index, value in data.items():
    file_name = f"./data/{index}"
    with open(file_name, "w") as file:
        file.write("\n".join([str(i) for i in value]))
And the performance of it looks like this:
Time (mean ± σ): 1.789 s ± 0.031 s [User: 1.637 s, System: 0.149 s]
Range (min … max): 1.747 s … 1.869 s 30 runs
Somewhat surprisingly, the Python solution is significantly faster than the awk-based one.
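One thing worth checking before trusting the comparison is that the two solutions actually produce identical files. They almost do: awk's print appends a newline to every record, while "\n".join(...) in the Python version writes no trailing newline, so each pair of files differs by exactly one byte. A minimal demonstration (file and directory names here are throwaway examples, not from the post):

```shell
#!/bin/sh
# Demonstrates the trailing-newline difference between the two approaches.
printf '123,A,random_hdk\n' > sample.csv
mkdir -p out_awk
awk -F "," '{ print $1 > "out_awk/"$2 }' sample.csv
wc -c < out_awk/A              # 4 bytes: "123" plus the newline print appends
printf '%s' "123" > out_py_A   # what "\n".join writes: 3 bytes, no newline
wc -c < out_py_A               # 3 bytes
```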