Reading Files: Line-by-Line vs Loading Entire File into RAM
When your program reads a file, it makes a fundamental choice: load everything into memory at once, or process it piece by piece. For small files, it doesn't matter. For large ones—logs, datasets, ETL pipelines—this decision can be the difference between a fast program and an out-of-memory crash.
Reading Line-by-Line (Streaming)
Instead of pulling the whole file into RAM, your program reads one line at a time, processes it, and moves on. Once the loop advances, the previous line is no longer referenced and its memory can be reclaimed.
Python example
# Streams the file: only the current line (plus a small read buffer) is in memory
with open("access.log", "r") as f:
    for line in f:
        if "ERROR" in line:
            print(line.strip())
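Even if access.log is 50 GB, this script's memory use stays roughly constant: a read buffer of a few KB plus the current line. The one caveat is line length. A file with no newlines is a single 50 GB "line", and it will be read in one piece.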
Go example
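// Needs imports: bufio, fmt, log, os, strings
file, err := os.Open("access.log")
if err != nil {
	log.Fatal(err)
}
defer file.Close()

scanner := bufio.NewScanner(file)
for scanner.Scan() {
	line := scanner.Text()
	if strings.Contains(line, "ERROR") {
		fmt.Println(line)
	}
}
// Scan returns false on EOF and on failure; Err tells them apart
if err := scanner.Err(); err != nil {
	log.Fatal(err)
}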
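bufio.Scanner handles buffering internally: it reads from the OS in chunks (its buffer starts at 4 KB and grows as needed, up to 64 KB by default), then delivers lines one at a time to your code. A line longer than that limit makes Scan stop with bufio.ErrTooLong, so call scanner.Buffer to raise the cap if your input can contain very long lines.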
Advantages
- Low memory usage. Only a small portion of the file lives in RAM at once.
- Immediate processing. Your program can start outputting results before the file finishes reading.
- Scales to any file size. A 1 GB and a 1 TB file cost about the same in memory.
- Safer for constrained environments like containers with tight memory limits.
Disadvantages
- Single-pass by default. To re-read earlier lines, you have to seek back or reopen the file and scan again.
- Complex multi-step operations. Sorting, joins, and deduplication are hard without holding everything in memory.
- Slightly higher per-line overhead. Every line costs a loop iteration and a fresh string allocation, which a single bulk read avoids.
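Deduplication does not always require the full file in memory. The following is a minimal sketch of one middle ground, assuming that exact-duplicate lines are what you want to drop: it keeps a fixed-size digest per distinct line instead of the line itself. The function name `unique_lines` and the 16-byte digest size are illustrative choices, not something the approaches above prescribe. Memory still grows with the number of distinct lines, only more slowly.

```python
import hashlib
import io
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order.

    Only a 16-byte digest per distinct line is retained, not the line text.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


# Any iterable of lines works, including an open file object.
sample = io.StringIO("a\nb\na\nc\nb\n")
deduped = list(unique_lines(sample))
print(deduped)
```

Because the function accepts any iterable, it can wrap an open file directly, so the streaming loop itself stays unchanged.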
Loading the Entire File into RAM
Read the file once into a data structure (list, array, string), then work with it freely.
Python example
# Loads the entire file into a list of strings (line endings removed)
with open("transactions.csv", "r") as f:
    lines = f.read().splitlines()

# Now you can sort, slice, and scan multiple times
lines.sort()
# Start at 1: at i == 0, lines[i - 1] would wrap around to the last line
for i in range(1, len(lines)):
    if lines[i] == lines[i - 1]:
        print(f"Duplicate found: {lines[i]}")
Go example
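data, err := os.ReadFile("transactions.csv")
if err != nil {
	log.Fatal(err)
}
// Trim trailing newlines so Split does not yield empty strings at the end
lines := strings.Split(strings.TrimRight(string(data), "\n"), "\n")
sort.Strings(lines)
for i := 1; i < len(lines); i++ {
	if lines[i] == lines[i-1] {
		fmt.Printf("Duplicate: %s\n", lines[i])
	}
}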
Advantages
- Random access. Jump to line 10,000 or scan backwards — no seeking required.
- Multiple passes. Sort, filter, then scan again without re-opening the file.
- Simpler code for operations that naturally need the whole dataset (sorting, aggregations, joins).
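- Potentially better CPU cache locality. When the data sits in one contiguous buffer (a byte slice or array), the CPU prefetcher works efficiently. A list of separately allocated string objects does not get this benefit.
- Fewer system calls. One large read replaces many buffer-sized ones, although buffered streaming already avoids a system call per line.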
Disadvantages
- High RAM consumption. A 2 GB file may consume 6–8 GB of RAM after parsing into strings or objects due to per-object overhead.
- Risk of OOM crashes. If the file grows beyond what RAM can hold, your process dies.
- Higher startup latency. Processing can't begin until the full load completes.
The Memory Overhead Trap
A common mistake: assuming a 500 MB file uses 500 MB of RAM when loaded.
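In Python, each string object carries metadata (type pointer, reference count, length, hash), which comes to roughly 40 to 50 bytes per string on 64-bit CPython, plus an 8-byte pointer for its slot in the list. A file with 10 million short lines can easily use 3–5× more RAM than the raw file size.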
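import sys

# Build distinct strings: ["hello world"] * 1_000_000 would store one million
# references to a single shared string object and hide the per-string cost.
lines = [f"hello world {i}" for i in range(1_000_000)]

raw_bytes = sum(len(s) for s in lines)             # text payload only
list_bytes = sys.getsizeof(lines)                  # list header + pointer array
str_bytes = sum(sys.getsizeof(s) for s in lines)   # every string object
print(f"raw text:  {raw_bytes / 1e6:.0f} MB")
print(f"in memory: {(list_bytes + str_bytes) / 1e6:.0f} MB")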
This overhead doesn't accumulate when streaming: each line is still a full string object, but you process it and discard it before reading the next.
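One way to check the difference on your own data is to measure peak allocations for both strategies. The sketch below uses the standard-library `tracemalloc` module on a generated temporary file. The helper names (`peak_bytes`, `count_streaming`, `count_loaded`) and the 200,000-row test file are assumptions made for illustration, and `tracemalloc` only sees allocations made through Python's allocator, so treat the numbers as indicative rather than exact.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(fn, path):
    """Run fn(path) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    fn(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def count_streaming(path):
    # One line alive at a time
    with open(path, "r") as f:
        return sum(1 for _ in f)


def count_loaded(path):
    # Every line alive at once
    with open(path, "r") as f:
        return len(f.readlines())


# Generate a throwaway file of short lines to measure against.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    for i in range(200_000):
        tmp.write(f"row {i}\n")
    path = tmp.name

file_size = os.path.getsize(path)
stream_peak = peak_bytes(count_streaming, path)
load_peak = peak_bytes(count_loaded, path)
os.remove(path)

print(f"file on disk:   {file_size / 1e6:.1f} MB")
print(f"streaming peak: {stream_peak / 1e6:.2f} MB")
print(f"full-load peak: {load_peak / 1e6:.1f} MB")
```

Swapping in a real file path and your actual parsing function gives a more useful estimate than any general multiplier.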
Important Nuance: OS Page Cache
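Even when reading line-by-line, your program is not making one disk read per line. Two layers of buffering sit in between. The language runtime asks the OS for data in buffer-sized chunks (a few KB to 64 KB is typical), and the OS itself reads ahead from disk several pages at a time (pages are commonly 4 KB) and caches those pages in RAM.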
Your code → reads one line
OS → fetches 64 KB from disk (or serves from page cache)
Your code → reads the next line
OS → already in cache, no disk I/O
This means:
- Line-by-line reading doesn't cause excessive disk seeks.
- If you read the same file twice in a short window, the second read may be served entirely from the OS page cache — nearly free.
- The gap in I/O performance between streaming and full-load is smaller than it appears, especially on warm cache.
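The effect of buffering can be observed from user space. The sketch below illustrates the runtime's buffer layer rather than the OS page cache itself: it counts how often Python's `io.BufferedReader` asks the underlying raw stream for data while the code iterates line by line. `CountingRaw` is a hypothetical helper written for this example, and the in-memory byte string stands in for a real file.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts requests from the buffer layer."""

    def __init__(self, data: bytes):
        super().__init__()
        self._src = io.BytesIO(data)
        self.read_calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Called by BufferedReader whenever its buffer needs refilling
        self.read_calls += 1
        return self._src.readinto(b)


# About 1 MB of short lines
data = b"".join(b"line %d\n" % i for i in range(100_000))

raw = CountingRaw(data)
reader = io.BufferedReader(raw, buffer_size=64 * 1024)

line_count = sum(1 for _ in reader)  # iterate line by line
print(f"lines read: {line_count}, raw reads issued: {raw.read_calls}")
```

The number of raw reads tracks the data size divided by the buffer size, not the number of lines, which is why per-line iteration stays cheap.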
Hybrid Approach: Chunk Reads
For very large files where line-by-line is too slow but full-load is too expensive, chunk reads offer a middle ground.
CHUNK_SIZE = 64 * 1024  # 64 KB
# process() stands in for your own handler; the := operator needs Python 3.8+
with open("bigfile.bin", "rb") as f:
    while chunk := f.read(CHUNK_SIZE):
        process(chunk)
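Chunk boundaries rarely line up with record boundaries, so text-oriented code has to carry the incomplete tail of one chunk into the next. The following is a minimal sketch of that bookkeeping, assuming newline-delimited records. The function name `iter_lines_chunked` is illustrative, and the tiny chunk size in the usage lines is chosen only to force splits mid-line.

```python
import io
from typing import BinaryIO, Iterator


def iter_lines_chunked(f: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield newline-delimited records from f, reading chunk_size bytes at a time."""
    tail = b""
    while chunk := f.read(chunk_size):
        parts = (tail + chunk).split(b"\n")
        tail = parts.pop()  # last piece may be an incomplete line
        yield from parts
    if tail:
        yield tail  # final line without a trailing newline


# A 4-byte chunk size forces every line to straddle chunk boundaries.
sample = io.BytesIO(b"alpha\nbeta\ngamma")
records = list(iter_lines_chunked(sample, chunk_size=4))
print(records)
```

Memory use is bounded by the chunk size plus the longest single record, which is the same caveat that applies to line-by-line reading.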
Memory-mapped files (mmap) go further — the OS maps the file into the process's virtual address space and loads only the pages you actually touch:
import mmap

with open("largefile.dat", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Seek and read arbitrary offsets without loading the whole file
        mm.seek(1_000_000)
        header = mm.read(128)
mmap is particularly useful for binary formats where you need random access but can't afford to load everything.
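As a concrete case, here is a sketch of random access into a file of fixed-size binary records. The record layout (a little-endian 32-bit id followed by a 64-bit value) and the helper `read_record` are assumptions invented for this example. The script writes its own temporary file so it can run as-is.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: uint32 id + uint64 value, little-endian, no padding (12 bytes)
RECORD = struct.Struct("<IQ")


def read_record(mm, index):
    """Decode record number `index` directly from the mapped file."""
    return RECORD.unpack_from(mm, index * RECORD.size)


# Write 1,000 records to a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(RECORD.pack(i, i * i))
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages holding this record need to be resident
        rec = read_record(mm, 742)

os.remove(path)
print(rec)
```

Because every record has the same size, the offset is a simple multiplication, and the OS decides which pages to keep in RAM.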
Rule of Thumb
| Scenario | Approach |
|---|---|
| Log scanning, ETL pipelines, large CSVs | Stream line-by-line |
| Sorting, deduplication, multi-pass analytics | Load into RAM |
| File larger than ~25% of available RAM | Stream or chunk |
| Random access patterns on large files | mmap |
| Small files (<100 MB, low memory pressure) | Either works |
High-performance systems often don't choose one or the other — they use buffered chunk reads to balance throughput and memory, or mmap to let the OS manage what actually stays in RAM.
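The table can be turned into a small decision helper. This is only a sketch that encodes the heuristics above. The function name, its parameters, and the returned labels are hypothetical, and the available-RAM figure is passed in by the caller because Python's standard library has no portable way to query it.

```python
def choose_strategy(
    file_size: int,
    available_ram: int,
    needs_whole_dataset: bool = False,
    needs_random_access: bool = False,
) -> str:
    """Suggest a reading strategy from the rule-of-thumb table (sizes in bytes)."""
    # Files above ~25% of available RAM should not be fully loaded
    if file_size > 0.25 * available_ram:
        return "mmap" if needs_random_access else "stream"
    # Sorting, deduplication, and multi-pass work favor loading into RAM
    if needs_whole_dataset or needs_random_access:
        return "load"
    # Small files under low memory pressure: either approach works
    if file_size < 100 * 1024**2:
        return "either"
    return "stream"


GIB = 1024**3
print(choose_strategy(50 * GIB, 16 * GIB))                            # large log scan
print(choose_strategy(1 * GIB, 16 * GIB, needs_whole_dataset=True))   # sort in RAM
print(choose_strategy(50 * GIB, 16 * GIB, needs_random_access=True))  # large binary
```

The thresholds are starting points. Measuring real memory use on representative data should override them.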
Conclusion
Loading a file into RAM wins on simplicity and speed for small-to-medium datasets — random access, sorting, and multi-pass operations all become trivial. Streaming wins on memory efficiency and scalability, letting you process arbitrarily large files safely. The OS page cache narrows the I/O gap between them, but the RAM cost of full-load remains real and can be 3–5× the file size in practice. When in doubt, ask: does your algorithm need the whole file at once? If not, stream it.