0.1 pathlib

pathlib은 산재된 데이터를 체계적으로 관리하고, 데이터 분석이나 엔지니어링 작업을 수행할 때 매우 유용한 도구이다. 파일 시스템에서의 데이터 접근, 조작, 관리를 간결하고 효율적으로 할 수 있게 해주기 때문이다.

0.1.1 `pathlib` 기능

경로 조작: 경로를 쉽게 조합하고, 분해하며, 변경할 수 있다.
파일 시스템 정보 조회: 파일의 존재 유무 확인, 파일 크기 조회, 수정 날짜 조회 등의 정보를 쉽게 얻을 수 있다.
파일 시스템 작업: 파일 또는 디렉토리 생성, 읽기, 쓰기, 이름 변경, 삭제 등의 작업을 할 수 다.
경로 탐색: 특정 패턴이나 조건에 맞는 파일을 경로 내에서 찾을 수 있다.

0.1.2 응용 상황

데이터 파일 구조화 및 접근
- 동적 경로 생성: 다양한 데이터 소스나 폴더 구조에 대한 동적 경로를 쉽게 생성할 수 있다. 예를 들어, 날짜별로 구분된 로그 파일을 처리할 때 Path 객체를 사용하여 해당 날짜의 경로를 쉽게 생성할 수 있다.
- 파일 탐색: 특정 패턴이나 조건에 맞는 파일들을 glob 또는 rglob 메소드를 사용하여 쉽게 찾을 수 있다. 이는 분석 대상 파일을 자동으로 식별하는 데 유용하다.
데이터 파일 읽기 및 쓰기 작업
- 파일 읽기/쓰기: pathlib을 사용하면 파일을 열고 읽거나 쓰는 작업을 직관적으로 수행할 수 있다. Path 객체의 read_text, write_text, read_bytes, write_bytes 메소드를 활용하여 파일 내용을 쉽게 처리할 수 있다.
파일 및 디렉토리 관리
- 파일 생성 및 삭제: touch 메소드로 새 파일을 생성하거나, unlink 메소드로 파일을 삭제할 수 있다.
- 디렉토리 생성 및 삭제: 생성에는 mkdir 메소드를, 삭제에는 rmdir 메소드를 사용할 수 있다.
- 경로 유효성 검사: 파일이나 디렉토리의 존재 여부를 확인하고, 경로가 파일인지 디렉토리인지 등의 속성을 검사할 수 있다.
플랫폼 독립적 경로 처리
- 운영 체제 호환성: pathlib은 윈도우, 맥, 리눅스 등 다양한 운영 체제에서 동일하게 작동하므로, 코드의 이식성이 향상된다.

0.1.3 Example

동적 경로 생성 예시

날짜별 로그 파일이 저장된 디렉토리 구조를 가정하고, 특정 날짜에 해당하는 로그 파일의 경로를 동적으로 생성하는 방법.

from pathlib import Path

# 현재 날짜를 기준으로 경로 생성
current_date = datetime.now().strftime('%Y-%m-%d')
log_dir = Path(f"./logs/{current_date}")

# 해당 날짜의 로그 디렉토리가 없다면 생성
log_dir.mkdir(parents=True, exist_ok=True)

# 예시 로그 파일 경로
log_file_path = log_dir / "error.log"

print(f"오늘 날짜의 로그 파일 경로: {log_file_path}")

파일 탐색 예시
- glob 메소드를 사용하여 특정 패턴(예: 모든 .txt 파일)에 맞는 파일들을 탐색하는 방법
- rglob는 현재 디렉토리뿐만 아니라 모든 하위 디렉토리에서도 탐색을 수행 ```python # 현재 디렉토리의 모든 .txt 파일 탐색 for txt_file in Path(“.”).glob(“*.txt”): print(f”Found text file: {txt_file}“)

1 현재 디렉토리 및 모든 하위 디렉토리의 .txt 파일 탐색

for txt_file in Path(“.”).rglob(“*.txt”): print(f”Found text file in current and subdirectories: {txt_file}“)

2 .csv 파일 탐색

for file_path in data_dir.glob(“*.csv”): print(f”CSV 파일 처리: {file_path}“)

3 파일 읽기 및 처리

for file_path in data_dir.iterdir(): if file_path.is_file() and file_path.suffix == “.txt”: content = file_path.read_text() # 파일 내용 처리 print(f”{file_path.name} 파일의 내용 처리”)



* 파일 읽기/쓰기

```python
# 텍스트 파일 경로 생성
text_file_path = Path("./example.txt")

# 텍스트 파일 쓰기
text_file_path.write_text("Hello, pathlib! This is a text file.")

# 텍스트 파일 읽기
text_content = text_file_path.read_text()
print("Text file content:", text_content)

# 바이너리 파일 경로 생성
binary_file_path = Path("example.bin")

# 바이너리 파일 쓰기 (예제 데이터로 'Hello, pathlib!' 문자열의 바이트를 사용)
binary_file_path.write_bytes(b"Hello, pathlib! This is a binary file.")

# 바이너리 파일 읽기
binary_content = binary_file_path.read_bytes()
print("Binary file content:", binary_content)

파일 생성 및 삭제

# 새 파일 생성
new_file = Path("./new_file.txt")
new_file.touch()  # 파일이 존재하지 않으면 생성

# 파일 존재 여부 확인
if new_file.exists():
    print(f"{new_file} exists.")

# 파일 삭제
new_file.unlink()

# 파일 존재 여부 재확인
if not new_file.exists():
    print(f"{new_file} has been deleted.")

디렉토리 생성 및 삭제

# 새 디렉토리 생성
new_dir = Path("./new_directory")
new_dir.mkdir(parent=True, exist_ok=True)  # 디렉토리가 존재하지 않으면 생성, 이미 있으면 오류 발생하지 않음

# 디렉토리 존재 여부 확인
if new_dir.exists():
    print(f"{new_dir} exists.")

# 디렉토리 삭제
# 주의: rmdir()는 디렉토리가 비어 있을 때만 사용 가능
new_dir.rmdir()

# 디렉토리 존재 여부 재확인
if not new_dir.exists():
    print(f"{new_dir} has been deleted.")

# 현재 디렉토리에 데이터 파일 폴더 생성
data_dir = Path.cwd() / "data_files"
data_dir.mkdir(parents=True, exist_ok=True)

parents=True는 상위 디렉토리가 없는 경우에도 필요한 모든 상위 디렉토리를 함께 생성하라는 의미이다. 예를 들어, /a/b/c/d와 같은 디렉토리를 만들고 싶지만, /a, /a/b, /a/b/c 디렉토리가 아직 없을 때:
- parents=False (기본값)로 설정하고 /a/b/c/d 디렉토리를 생성하려고 하면, 상위 디렉토리가 없기 때문에 FileNotFoundError 오류가 발생.
- parents=True로 설정하면, Python은 먼저 /a, 그 다음 /a/b, /a/b/c를 차례로 생성한 후, 마지막으로 /a/b/c/d를 생성한다. 즉, 지정된 경로에 필요한 모든 상위 디렉토리를 자동으로 생성해준다.
```
# 상위 디렉토리가 없는 깊은 경로 설정
deep_directory_path = Path("./a/b/c/d")
# 모든 필요한 상위 디렉토리와 함께 디렉토리 생성
deep_directory_path.mkdir(parents=True, exist_ok=True)
```
exist_ok=True는 지정된 경로에 디렉토리(폴더)를 생성하는 옵션이다. 이 옵션을 True로 설정하면, 만약 생성하려는 디렉토리가 이미 존재하는 경우에도 오류를 발생시키지 않고, 해당 명령을 무시한다. 즉, 해당 디렉토리의 존재 여부와 상관없이 프로그램이 계속 실행될 수 있도록 한다. exist_ok 매개변수의 기본값은 False 이다. 따라서, exist_ok를 명시적으로 지정하지 않고 디렉토리를 생성하려 할 때 해당 디렉토리가 이미 존재한다면, FileExistsError가 발생 및 프로그램이 중단된다.
이 옵션들을 사용하면, 스크립트의 안정성을 높이고, 디렉토리가 이미 존재하는 상황에서도 원활하게 코드를 실행할 수 있다.
경로 유효성 검사

# 임의의 파일 및 디렉토리 경로
file_path = Path("example.txt")
dir_path = Path("example_dir")

# 파일 및 디렉토리 존재 여부 검사
if file_path.exists():
    print(f"{file_path} exists.")
else:
    print(f"{file_path} does not exist.")

if dir_path.exists():
    print(f"{dir_path} exists.")
else:
    print(f"{dir_path} does not exist.")

# 경로가 파일인지 디렉토리인지 검사
if file_path.is_file():
    print(f"{file_path} is a file.")

if dir_path.is_dir():
    print(f"{dir_path} is a directory.")

3.0.1 실무 적용

root_path = path.cwd() # /home/kmkim/pda/dsp-research-strep-a/kkm
prefix = 'data'
directory_names = ['cfx-baseline-subtraction','pda-raw-sample']
product_names = ['GI-B-I', 'GI-B-II', 'GI-P', 'GI-V', 'RP1', 'RP2', 'RP3', 'RP4', 'STI-CA', 'STI-EA', 'STI-GU']
consumables = ['8-strip','96-cap']
plate_numbers = ['002','005','031','032','035','036','041']

cfx_data = []
raw_data = []

for directory_name in directory_names: 
    for product_name in ['GI-B-I']: #product_names:
        for consumable in ['8-strip']: #consumables:
            for plate_number in plate_numbers:
                full_path = root_path / prefix / directory_name / product_name / consumable / plate_number
                processed_path = full_path / "processed" / "example1"
                processed_path.mkdir(parents=True, exist_ok=True)
                exporting_path =  full_path / "exported_pcrd"
                if 'cfx' in exporting_path: 
                    temp_cfx_data = bioradparse.load_pcrdata(exporting_path, datatype="cfx-xl")
                    cfx_data.append(temp_cfx_data)
                temp_raw_data = bioradparse.load_pcrdata(exporting_path, datatype="cfx-batch-csv")
                raw_data.append(temp_raw_data)
#pathlib.Path(f"./data/baseline-subtracted/processed/example1")

for pcrname, pcrdata in raw_data.items():
    bioradparse.save_pcrdata(raw_data, root_path / "pda-raw-sample" / "processed" / "example1" / f"{pcrname}.parquet")
for pcrname, pcrdata in cfx_data.items():
    bioradparse.save_pcrdata(cfx_data, root_path / "cfx-baseline-subtraction" / "processed" / "example1" / f"{pcrname}.parquet")