Generate files from a CSV
Recently I needed to create a set of files based on a CSV file. Assume you have a comma-separated CSV file with a structure like this:
+------+--------------+--------------+
| 123  | A            | random_hdk   |
| 345  | A            | random_dsa   |
| 456  | B            | random_lts   |
| 678  | A            | random_gcy   |
| 789  | B            | random_zlq   |
| 901  | B            | random_onw   |
| ...  | ...          | ...          |
+------+--------------+--------------+
And we need to create files A and B, named after the second column, with the respective contents taken from the first column:
# file A
123
345
678
# file B
456
789
901
The solution I came up with is based on shell and awk. In my opinion there is little competition for this one-liner:
awk -F "," '{ print $1 > "sensor_lists/"$2}' sensor-ids.csv
-F "," cuts the csv file by comma
$1 first column with ids
$2 second column with identifiers
> redirect output to files based on the second column
sensor_lists the directory name
This one-liner creates files named after the second column and fills each of them with the matching entries from the first column.
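The benchmark below runs this through a wrapper script, ./create_lists.sh, which is not shown in the post. A minimal sketch, assuming it simply wraps the one-liner (the input name random_data.csv and the output directory data/ are taken from the benchmark setup, not from the command above):

```shell
#!/bin/sh
# Hypothetical create_lists.sh -- a sketch, not the author's actual script.
# Paths are assumptions based on the benchmark setup: input random_data.csv,
# output directory data/.
# A few sample rows so the sketch runs standalone:
[ -f random_data.csv ] || printf '123,A,random_hdk\n456,B,random_lts\n678,A,random_gcy\n' > random_data.csv
mkdir -p data
awk -F "," '{ print $1 > "data/"$2 }' random_data.csv
```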
Works, but what about performance? First we create a CSV with random data inside; we expect 972 output files. Let's benchmark the solution with hyperfine. We need to drop the page cache before every run to get a clean measurement each time:
hyperfine --prepare 'rm data/* && sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' --runs 30 --show-output './create_lists.sh'
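The random input itself can be produced with a throwaway generator; a sketch along these lines (the row count, id range, and single-letter group names are all assumptions -- the post only says the data is random):

```shell
#!/bin/sh
# Throwaway input generator -- a sketch; the real data set is not shown
# in the post. 100000 rows and single-letter group names are assumptions.
awk 'BEGIN {
    srand()
    for (i = 1; i <= 100000; i++) {
        id  = 100 + int(rand() * 900)                 # three-digit id
        grp = sprintf("%c", 65 + int(rand() * 26))    # group name A..Z
        printf "%d,%s,random_%d\n", id, grp, i
    }
}' > random_data.csv
```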
We get a performance of:
Time (mean ± σ): 10.697 s ± 0.189 s [User: 10.609 s, System: 0.073 s]
Range (min … max): 10.360 s … 11.076 s 30 runs
Ok, how would a solution in Python work out? I wrote it with quite a few hard-coded lines to avoid unnecessary variable allocations. More could have been done, but let's go with this one for now. I will be using Python 3.11.6 for this experiment.
Code looks like this:
import pandas as pd

# Read only the first two columns: the ids and the group identifiers
data = pd.read_csv("random_data.csv", names=[0, 1, 2], usecols=[0, 1])
# Collect the ids belonging to each group into a list
data = data.groupby(1)[0].apply(list)
for index, value in data.items():
    file_name = f"./data/{index}"
    with open(file_name, "w") as file:
        file.write("\n".join([str(i) for i in value]))
And the performance of it looks like this:
Time (mean ± σ): 1.789 s ± 0.031 s [User: 1.637 s, System: 0.149 s]
Range (min … max): 1.747 s … 1.869 s 30 runs
Somewhat surprisingly, the Python solution is significantly faster than the awk-based one.
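One thing worth checking before trusting the comparison is that the two solutions actually produce identical files. They almost do: awk's print appends a newline to every record, while "\n".join(...) in the Python version writes no trailing newline, so each pair of files differs by exactly one byte. A minimal demonstration (file and directory names here are throwaway examples, not from the post):

```shell
#!/bin/sh
# Demonstrates the trailing-newline difference between the two approaches.
printf '123,A,random_hdk\n' > sample.csv
mkdir -p out_awk
awk -F "," '{ print $1 > "out_awk/"$2 }' sample.csv
wc -c < out_awk/A              # 4 bytes: "123" plus the newline print appends
printf '%s' "123" > out_py_A   # what "\n".join writes: 3 bytes, no newline
wc -c < out_py_A               # 3 bytes
```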