Line extraction approaches
Recently I discovered the dropwhile and takewhile functions from itertools. It is quite a common problem within a given project to extract the lines of a file that start with prefixes A, B and C. Besides the "traditional approaches", one can also go with a recursive function using dropwhile, as seen below.
from itertools import dropwhile
import re

with open("./qwe", "r") as file_in:
    data = file_in.readlines()

final_lines = []
start_prefixes = ("Mode:", "Class:")
With this common baseline set up, let's extract the lines starting with the mentioned prefixes. Let's misuse dropwhile:
def subtract_lines(lines: list):
    # Drop everything up to the first line that starts with one of the prefixes
    sublines = dropwhile(lambda x: not x.startswith(start_prefixes), lines)
    sublines = list(sublines)
    if sublines:
        final_lines.append(sublines[0])
        if sublines[1:]:
            # Recurse on the remainder to find the next matching line
            subtract_lines(sublines[1:])

subtract_lines(data)
print(f"{final_lines=}")
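To make the recursion concrete, here is a tiny hypothetical run. The contents of ./qwe are not shown above, so the sample lines below are purely an assumption for illustration:

final_lines = []  # reset the accumulator for this illustration
data = [
    "some header line\n",      # assumed sample content, not the real file
    "Mode: read-write\n",
    "an unrelated line\n",
    "Class: A\n",
]
subtract_lines(data)
print(final_lines)  # ['Mode: read-write\n', 'Class: A\n']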
Well, you can say this is a misuse of the function, and it is. With the lines already sitting in the list data, there are simpler comprehension-based solutions:
final_lines = [x for x in data if x.startswith(start_prefixes)]
print("These lines were extracted via a list comprehension "
      f"and startswith\n{final_lines}")

# Generate a regex alternation from the prefixes
re_prefixes = "|".join([i + ".*" for i in start_prefixes])
# Match each line against the regex
final_lines = [x for x in data if re.match(re_prefixes, x)]
print("These lines were extracted via a list comprehension "
      f"and regex\n{final_lines}")
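One small caveat as an aside: the prefixes here are plain text, but if one of them ever contained a regex metacharacter, the pattern should escape it. A minimal adjustment, assuming the same start_prefixes tuple:

# Escape each prefix so metacharacters are treated literally
# (a precaution; not strictly needed for "Mode:" and "Class:")
re_prefixes = "|".join(re.escape(i) + ".*" for i in start_prefixes)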
As we can see, these comprehension-based solutions are no worse than the recursive dropwhile one above.
The only genuinely useful use case for dropwhile that comes to my mind is some sort of log preprocessing: a situation where you need to get rid of a huge header of unknown length before the data lines of interest, as sketched below.
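Here is a minimal sketch of that idea. The log lines and the "DATA" marker are assumptions made up for the example, not any real log format:

from itertools import dropwhile

# Hypothetical log: a header of unknown length followed by the data lines
log_lines = [
    "# generated by some tool\n",
    "# run started at 12:00\n",
    "DATA 1\n",
    "DATA 2\n",
]

# Skip everything until the first line that looks like data (assumed marker)
data_lines = list(dropwhile(lambda line: not line.startswith("DATA"), log_lines))
print(data_lines)  # ['DATA 1\n', 'DATA 2\n']

The complementary takewhile could do the same trick from the other end, collecting lines only until a trailing footer starts.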