Line extraction approaches

Recently I discovered the dropwhile and takewhile functions from itertools. It's quite a common problem within a given project to extract from a file the lines starting with prefixes A, B and C. Besides the "traditional approaches", one can also go with a recursive function using dropwhile, as seen below.
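
As a quick refresher (a minimal sketch on made-up numbers, not data from the project): dropwhile skips items as long as the predicate holds and then yields everything that remains, while takewhile yields items only as long as the predicate holds and stops at the first failure.

from itertools import dropwhile, takewhile

nums = [1, 2, 5, 3, 1]  # hypothetical sample data
# Skip the leading items smaller than 3, keep the rest (even later small ones)
print(list(dropwhile(lambda x: x < 3, nums)))  # [5, 3, 1]
# Keep the leading items smaller than 3, stop at the first one that is not
print(list(takewhile(lambda x: x < 3, nums)))  # [1, 2]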

from itertools import dropwhile
import re

with open("./qwe", "r") as file_in:
    data = file_in.readlines()

final_lines = []
start_prefixes = ("Mode:", "Class:")

Having a common baseline set up, let's extract the lines starting with the prefixes mentioned above. Let's misuse dropwhile:

def subtract_lines(lines: list):
    # Drop everything up to the first line that starts with one of the prefixes
    sublines = dropwhile(lambda x: not x.startswith(start_prefixes), lines)
    sublines = list(sublines)

    # The first remaining line (if any) is a match, so keep it
    if sublines:
        final_lines.append(sublines[0])

    # Recurse on the remainder to collect the next matches
    if sublines[1:]:
        subtract_lines(sublines[1:])


subtract_lines(data)

print(f"{final_lines=}")

Well, you could say this is a misuse of the function, and it is. With the lines already sitting in the list data, there are simpler comprehension-based solutions:

final_lines = [x for x in data if x.startswith(start_prefixes)]
print("These lines were extracted via a list comprehension"
      f"and startswith\n{final_lines}")

# Generate regex prefixes
re_prefixes = "|".join([i + ".*" for i in start_prefixes])
# Match with regex
final_lines = [x for x in data if re.match(re_prefixes, x)]
print("These lines were extracted via a list comprehension"
      f"and regex\n{final_lines}")

As we can see, these solutions are no worse than the dropwhile-based one above.

The only genuinely useful use case that comes to my mind is some sort of log preprocessing: a situation where you need to get rid of a huge header with an unknown line count before the data lines of interest.
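
A minimal sketch of that idea, on made-up log lines (the header marker and contents are hypothetical): drop the header of unknown length, keep everything from the first data line onwards.

from itertools import dropwhile

log_lines = [                      # hypothetical log contents
    "# generated by some tool\n",
    "# host: example\n",
    "2024-01-01 00:00:01 started\n",
    "2024-01-01 00:00:02 ready\n",
]
# Drop header lines of unknown count, keep everything from the first data line on
data_lines = list(dropwhile(lambda x: x.startswith("#"), log_lines))
print(data_lines)  # only the two timestamped lines remain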