markov chain


What can you do if you can (selectively) gain control whenever a program reads or writes a particular memory location?

  • With hardware support
  • With compiler-level support

Address translation design의 goal

  • Memory protection
  • Memory sharing
  • Flexible memory placement
  • Sparse addresses: 전체 address space중 일부만 집중해서 볼 수 있다?
  • Runtime lookup efficiency
  • Compact translation tables
  • Portability

Address translation으로 인해 가능하게 된 것들

  • Process isolation
  • Efficient IPC
  • Efficient memory use
  • 프로그램의 코드가 모두 memory에 올라오기 전에도 프로그램을 작동할 수 있다.
  • Dynamic memory allocation
  • Cache management
  • Program debugging (Seg. fault)
  • Zero-copy I/O
  • Memory-mapped files
  • Demand-paged virtual memory
  • Checkpointing/restart
  • Persistent data structures
  • Process migration
  • Information flow control
  • Distributed shared memory
Memory Translation
Method Pros Cons
Base & Bounds Simple, Fast
Can relocate in physical memory without changing process
Can't protect code section
Can't share memory with other processes
Can't grow stack/heap as needed
Segmentation Can detect if need to copy-on-write Complex
Garbage collecting
Multilevel Paging
Multilevel Paged Segmentation

Paging & fast program start: Remaining pages can be transferred in the background while program is running

Sparse Address Spaces

Multi-level Translation

x86 Multilevel paged segmentation

  • Global Descriptor Table (segment table)
    Array of [Pointer to a segment | Segment length | Access permission ]
    Context switch가 일어나면 GDTR(Global Descriptor Table Register)을 바꾼다.
  • Multilevel page table
    32-bit: 2-level, 64-bit: 4-level

Inverted page table: IBM Power PC는 hierarchical page table 대신 hash로 physical address를 구하는데 이것을 inverted page table이라고 부른다. 이름이 거지같은 것은 역사적 이유. 좀 더 살펴볼것


Cost of translation = Cost of TLB lookup + Prog(TLB miss) * cost of page table lookup

Software Loaded TLB: MIPS arch. 에서는 TLB를 Kernel이 update한다. OS가 두개의 page table을 유지할 필요가 없지만 대신 TLB update에 CPU cycle이 많이 필요하다.

TLB entry는 page일수도 있지만 Super page일수도 있다(entry를 절약). x86에서 TLB entry의 크기는 4K (a page), 2M, 1G일 수 있다.

TLB shootdown: cache synchronization

How expensive is a 4-level page table walk on a modern processor?

With a virtual cache, what do we need to do on a context switch?

What if the virtual cache > page size ?

Aliasing: 둘 이상의 virtual cache entry가 같은 physical memory를 가리키는 것 (뭐가 문제일까?)


Thrashing: Cache의 크기가 너무 작아서 -> Cache hit rate가 너무 낮고 -> Cache가 오히려 시스템의 성능을 떨어트림

Zipf distribution: popularity is self-reinforcing.

Working set: set of memory locations than need to be cached for reasonable cache hit rate

Cache lookup

  • Fully associative: simple. linear search
  • Direct mapped: hashing
  • Set associative:


Hardware implementation of page coloring (256 colors) and set-associative cache lookup

Hardware implementation of page coloring (256 colors) and set-associative cache lookup

Cache eviction policy
method description
MIN Ideal solution
가장 먼 미래까지 사용되지 않을 것을 버린다.
LFU recent past에서 참조 빈도를 count하여 가장 적게 사용한 것을 버린다.

Advantages of Memory-mapped Files

  • Programming simplicity
  • Zero-copy I/O
  • Pipelining: Application과 File을 load하는 부분이 분리되어 있다.
  • IPC: shared memory segment vs temporary file

Demand paging: Page fault가 발생하면 trap을 걸고 disk의 DMA가 끝나면 TLB miss를 다시 처리.

Allocating Page Frame


Device should provide

  • Non-volatile memory
  • Block level random access

File system provides

  • Persistence of data stored in file system
  • Naming
  • Performance (cache, data placement)
  • Controlled access to shared data (permission)


  • Hard link: points i-node
  • Soft link: path or file name

Mount: mapping from name in one file system to root of another


Structure of a disk

Structure of a disk

Error correcting on disk

  • Sector sparing: bad sector가 발생할 때를 대비하여 spare된 sector들이 있다.
  • Slip sparing: bad sector가 발생하면 sector들을 remap하여 sequential behavior를 유지한다.
  • Track skewing

Disk latency = Seek time (1-20ms) + Rotation time (4-15ms) + Transfer time (5-10 us/sector) * N (sector)
Seek time은 disk arm을 움직이는데 걸리는 시간이고, Rotation time은 disk가 회전하여 읽을 위치가 오기를 기다리는 시간이다.

Disk Scheduling
Method Description
Shortest seek time first
SCAN Disk의 한 surface를 극좌표로 나타내면 읽어야 할 지점들의 r에 따라서 한쪽 끝에서 다른 끝으로 움직이며 읽는 것.
CSCAN SCAN을 양 방향 (r이 늘어나거나 줄어드는 방향)이 아니라 한 방향으로만 하는 것.
R-CSCAN CSCAN을 하되 rotation delay까지 고려하는 것

(Flash Drive) Why are random writes so slow?

Flash Translation Layer (Usually on flash drive firmware)

  • Move live pages as need for erasure
  • Wear-leveling
  • Avoid pages that no longer work

How does Flash device know which blocks are live?

TRIM command: File system이 device에게 특정 page가 더이상 쓰이지 않을 것이라 알려주는 것.

Transaction and ACID

  • Atomic:
  • Consistent: moves the system from one legal state to another legal state
  • Isolated: Transaction은 commit되기 전까지 다른 transaction들에게 보여서는 안된다.
  • Durable: Completed된 operation들은 complete된 상태로 남아있어야 한다. (Need recovery)

Critical section은 ACI는 가지지만 D(durability)는 없다.

4 stages of executing a transaction

  • Prepare
  • Commit
  • Write-back
  • Garbage collect

Recovery with Redo logging


Sector (512B) / Page (2~16K) : Magnetic disk drive / Flash drive

Extent: File의 additional writing을 대비하여 scattering과 fragmentation을 줄이기 위해서 한 파일에게 할당해주는 수 track의 넉넉한 공간. NTFS, EXT4, btrfs등이 지원한다.

Free map: List of free disk blocks

Design challenges

  • Index structure: File의 block들을 어떻게 배치할 것인가
  • Index granularity: Block size를 얼마로 할 것인가
  • Free space
  • Locality
  • Reliability
File System Design Options
Index structure Linked list Tree (fixed, asymmetric) Tree (dynamic)
Granularity block block extent
Free space allocation FAT array Bitmap (fixed location) Bitmap (file)
Locality defragmentation Block groups + reserve space Extents, Best fit defragmentation

Named Data in a File System

MS FAT stands for “File Allocation Table”.
Each file: represented as a linked list of blocks
Directories: Map file name to file number, index of the file table.

Berkeley UNIX FFS
i-node table and i-node which have 15 block pointers and metadata.
i-node have …

  • Metadata:  ownership, permission, directory or not, access/creation/modification time.
  • 12 direct data pointer
  • Indirect block pointer: 1 indirect block pointer, 1 doubly indirect block pointer, 1 triply indirect block pointer. indirect block pointer points a block full of data pointers. (4B per entry, 1K entries).

Free space management: Bitmap
Block group allocation: Disk를 가까운 track들의 집합으로 나누고 한 directory 안의 파일들을 같은 그룹에 넣으려고 한다.

FFS block group structure

FFS block group structure

Write operation on FFS

Write operation on FFS

Microsoft NTFS
Master File Table: 

FS Description Pros Cons
FAT linked list Easy to append, delete and find free block Slow random access
Poor locality
No hard links
No reliability
Limited metadata, access control, volume size and file size
FFS i-nodes, tree Efficient and have locality for both small, large files and metadata Inefficient for tiny files
Inefficient when file is mostly contiguous.
Need to reserve free space to prevent fragmentation

“/a/b/c” -> ‘/’ (root)를 포함하여 4개의 i-node, 3개의 directory.
Directory는 file name을 file number로 mapping하는 파일. Small directory는 linear search하고, large directory는 B+Tree로 search한다.
B+Tree는 binary search tree의 extension으로 2개가 아니라 여러개의 entry를 가진다.


4 approaches to reliability

  • File system operation을 sequential하게 수행
  • COW
  • Journaling
  • Log structure

RAID: approach to availability (reliability도 주지 않나?)

Reliability: 시스템이 주어진 기능을 할 확률
Availability: 시스템이 사용가능한 시간의 비율

Approach 1: Careful Ordering
Sequence operations in a specific order
Post-crash recovery
FAT, FFS 등이 사용한다.

Careful Ordering
File System Operation Description Recovery
FAT Append 1. Add data block
2. Mark the corresponding MFT entry as used
3. Update file tail to point to new MFT entry
4. Update access time at Head of file
1. Scan MFT
2. If entry is unlinked, delete data block
3. If access time is incorrect, update
Create 1. Allocate data block
2. Update MFT entry to point to data block
3. Update directory data block with file name -> file number – What if directory spans multiple disk blocks?
4. Update modification time for directory
1. Scan MFT
2. If any unlinked files (not in any directory, delete
3. Scan directories for missing update times
FFS Create 1. Allocate data block
2. Write data block
3. Allocate i-node
4. Write i-node block
5. Update bitmap of free blocks
6. Update directory with file name -> file number
7. Update modification time for directory
1. Scan i-node table
2. If any unlinked files (not in any directory), delete
3. Compare free block bitmap against i-node trees
4. Scan directories for missing update/access times
Move 1. Remove file name from old directory
2. Add file name to new directory
1. Scan all directories to determine set of live files
2. Consider files with valid i-nodes and not in any directory as – New file being created? – File move? – File deletion?
Application Modification 1. Write list of name of each open file to app folder
2. Write changes to backup file
3. Rename backup file to be file (atomic operation provided by file system)
4. Delete list in app folder on clean shutdown
1. On start, see if any files were left open
2. If so, look for backup file
3. If so, ask user to compare versions

FFS: Move and Grep

Approach 2: Copy-on-Write
원래의 data 자리에 update하는 대신 새로운 곳에 새 데이터를 쓴다.

Copy on write in FFS

Write by copy on write in FFS

Update by copy on write in FFS

Update by copy on write in FFS

Locality를 유지하기 위해서 garbage collection이 필요

Approach Pros Cons
Careful Ordering 단순한 disk drive에서도 작동한다.
복잡한 operation들에서도 작동한다.
recovery 시간이 길다.
모든 동작들을 safely interruptible sequence of writes로 바꾸기 어렵다.
여러 operation들이 동시에 일어나면 consistency를 보장하기 어렵다.
Copy On Write Failure를 극복한 correct behavior
빠른 recovery
높은 throughput (특히 update들이 batch되었을 때)
Potential of high latency
작은 change도 많은 write를 필요로 한다.
Garbage collection을 필요로 한다.

Journaling / Logging File Systems
Log/Journal is append-only
ext3/4는 redo logging과 journaling을 모두 지원한다.
What happens if machine crashes?

  • Before transaction start
  • After transaction start, before operations are logged
  • After operations are logged, before commit
  • After commit, before write-back
  • After write-back before garbage collection
  • During recovery

Transaction isolation, Two-phase locking

Log is data storage; no copy back


  • 0: 데이터를 분산시켜 속도를 빠르게
  • 1: 같은 데이터를 여러 디스크에
  • 5: 여러 디스크 중 하나를 Parity로
  • 6: 여러 디스크 중 하나 이상을 Parity로
RAID 5: Rotating parity and striping

RAID 5: Rotating parity and striping

RAID 5 Update: Read old data block, Read old parity block, Calculate new parity, Write new data block, Write new parity block

Zero copy I/O: Kernel과 User가 같은 memory를 share함으로써 redundant한 copy를 없앤다.


Shadow page table: para-virtualization에서 (Guest OS의 수정이 필요) Guest virtual address를 Host physical address로 바로 mapping하는 page table이다. x86에서 hardware support를 통해 full virtualization에서도 위와 같은 것이 가능해졌다 (Intel VT-d technology).

VMM Memory Compression:
Host OS와 여러 Guest OS들 사이에 중복되는 memory들, 예를들면 libc code와 같은 것들을 여러 copy가지고 있지 않고 하나만 가지고 있게 하는것. 한 OS안에서는 잘 관리가 되고 있는데 (COW와 같은 여러 technique들) 여러 OS들 사이에서는 HostOS가 GuestOS의 memory를 tracking할 수 없다는게 함정. 그래서 scavenger process를 두로 주기적으로 메모리를 검사해서 중복되는 page를 검사한다.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.