String
  • Strings of characters, immutable
    s = 'Hello World!'

    # access a single character by index
    print(s[0])  # H

    # slicing
    print(s[:2])  # He

    # slice object: start 0, stop 10, step 2
    sl = slice(0, 10, 2)
    print(s[sl])  # HloWr

    # concatenation
    print(s + ' ...')  # Hello World! ...

    # repetition
    print(s * 2)  # Hello World!Hello World!

    # membership test
    if 'H' in s:
        print('Contains H')

    # raw string literal: backslashes are not treated as escape characters
    print(r'raw string\n')  # raw string\n

    s = 'raw string\n'
    print('%r' % s)  # 'raw string\n' (the repr, with the escape shown literally)
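Because strings are immutable, item assignment fails and any "modification" actually builds a new string. A minimal sketch of that behavior; the variable names are illustrative only:

```python
s = 'Hello World!'

# Item assignment is not allowed on an immutable str.
try:
    s[0] = 'J'
    assigned = True
except TypeError:
    assigned = False

# Build a new string instead; the original is left untouched.
t = 'J' + s[1:]

print(assigned)  # False
print(t)         # Jello World!
print(s)         # Hello World!
```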
    import re

    s = 'Hello World!'

    # capitalize: uppercase the first character, lowercase the rest
    print(s.capitalize())  # Hello world!

    # center: pad both sides to a total width of 20
    print(s.center(20, '*'))  # ****Hello World!****

    # count occurrences of a substring
    print(s.count('l'))  # 3

    # endswith
    if s.endswith('!'):
        print('ends with ! ...')

    # find: return the lowest index of the substring, or -1 if not found
    print(s.find('or'))  # 7

    # index: like find, but raises ValueError if the substring is not found
    print(s.index('or'))  # 7

    # join: concatenate the items with the separator between them
    c = '-'
    print(c.join(['a', 'b', 'c']))  # a-b-c

    # lower
    print(s.lower())  # hello world!

    # replace
    print(s.replace('l', '-'))     # He--o Wor-d! (all occurrences)
    print(s.replace('l', '-', 2))  # He--o World! (at most 2 occurrences)

    # split on runs of whitespace (avoid naming a variable str: it shadows the built-in)
    text = "Line1-abcdef, \nLine2-abc, \nLine4-abcd"
    print(text.split())  # ['Line1-abcdef,', 'Line2-abc,', 'Line4-abcd']

    # split on multiple delimiters with a regular expression
    print(re.split('\n|, ', text))  # ['Line1-abcdef', '', 'Line2-abc', '', 'Line4-abcd']

    # strip leading and trailing whitespace
    print(' Hello ... '.strip())  # Hello ...

    # upper
    print(s.upper())  # HELLO WORLD!
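The difference between find and index only shows up when the substring is missing. A short sketch of both outcomes, using a made-up substring 'xyz' that does not occur in s:

```python
s = 'Hello World!'

# find returns -1 when the substring is missing.
found = s.find('xyz')
print(found)  # -1

# index raises ValueError in the same situation.
try:
    pos = s.index('xyz')
except ValueError:
    pos = None

print(pos)  # None
```

Use find when a missing substring is an expected case, and index when it should be treated as an error.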
    # Template: $-based substitution from the standard library
    from string import Template

    s = Template('$fname, $lname, $fname')

    sub = s.substitute(fname='Lin', lname='Chen')

    print(sub)  # Lin, Chen, Lin
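When a placeholder has no value, substitute raises KeyError, while safe_substitute leaves the placeholder in place. A small sketch of the two behaviors; the template text here is only an illustration:

```python
from string import Template

t = Template('$fname, $lname')

# substitute raises KeyError if a placeholder has no value.
try:
    t.substitute(fname='Lin')
    missing = None
except KeyError as exc:
    missing = exc.args[0]

# safe_substitute leaves unknown placeholders untouched.
partial = t.safe_substitute(fname='Lin')

print(missing)  # lname
print(partial)  # Lin, $lname
```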
    Regular Expression
  • Pattern
  • Flags
    import re

    phone = "2004-959-559 # This is Phone Number"

    # match: checks for a match only at the beginning of the string
    m = re.match(r'(\d+)-(\d+)-\d+.*', phone)  # use () to capture groups
    if m:
        print(type(m))     # the match object type
        print(m.group())   # 2004-959-559 # This is Phone Number
        print(m.group(1))  # 2004
        print(m.groups())  # ('2004', '959')

    # search: finds the first occurrence of the pattern anywhere in the string
    s = re.search(r'\d+', phone)
    if s:
        print(type(s))
        print(s.group())  # 2004

    # findall: returns all non-overlapping matches as a list
    a = re.findall(r'\d+', phone)
    print(a)  # ['2004', '959', '559']

    # sub: replace every match of the pattern
    r = re.sub(r'\d', '*', phone)
    print(r)  # ****-***-*** # This is Phone Number
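Flags and compiled patterns can be combined with the functions above. The following is a sketch only: PHONE_RE and the group names area, prefix, and line are hypothetical names chosen for this example, not part of the re module.

```python
import re

# Hypothetical compiled pattern for values shaped like the phone string above.
# re.VERBOSE allows whitespace and comments inside the pattern.
PHONE_RE = re.compile(
    r"""
    (?P<area>\d+)    # first run of digits
    -
    (?P<prefix>\d+)  # second run of digits
    -
    (?P<line>\d+)    # third run of digits
    """,
    re.VERBOSE,
)

m = PHONE_RE.match("2004-959-559 # This is Phone Number")
print(m.group('area'))  # 2004
print(m.groupdict())    # {'area': '2004', 'prefix': '959', 'line': '559'}

# re.IGNORECASE makes matching case-insensitive.
words = re.findall(r'phone', "Phone phone PHONE", re.IGNORECASE)
print(words)  # ['Phone', 'phone', 'PHONE']
```

Compiling is mainly useful when the same pattern is applied many times; named groups make the result easier to read than positional indexes.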
    Unicode
  • A str is an immutable sequence of Unicode code points
  • Python keeps text as Unicode in memory
  • In Python 3, type 'str' represents Unicode text and type 'bytes' represents a byte string
  • A bytes literal can contain only ASCII characters; any other byte value must be written as an escape such as \xc3
  • Byte strings are typically used in these situations:
    1. Working with binary data, such as images or audio files
    2. Sending or receiving data over network sockets
    3. Reading from or writing to binary files, such as an image file
    4. Cryptographic operations, which take binary input and produce binary output and intermediate values
    5. In some cases, byte strings use less memory and are faster to process than str, for example when decoding can be skipped entirely
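As one illustration of the cryptographic case in the list above, the standard-library hashlib functions accept bytes and reject str. This is a minimal sketch; the sample text and variable names are arbitrary:

```python
import hashlib

text = 'Café'

# Hash functions accept bytes, not str.
try:
    hashlib.sha256(text)
    accepted_str = True
except TypeError:
    accepted_str = False

data = text.encode('utf-8')  # 'é' takes two bytes in UTF-8
digest = hashlib.sha256(data).hexdigest()

print(accepted_str)          # False
print(len(text), len(data))  # 4 5
print(len(digest))           # 64
```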
    # In Python 3, plain and u-prefixed literals are both str
    s = 'Café'   # str
    s = u'Café'  # str (the u prefix is accepted but has no effect)

    # byte string
    b = b'lin'  # bytes

    # str to byte string
    s = 'Café'
    print(s.encode('utf-8'))  # b'Caf\xc3\xa9' (utf-8 is the default encoding)

    # byte string to str
    b = b'Caf\xc3\xa9'
    print(b.decode('utf-8'))  # Café

    # convert between a character and its Unicode code point
    c = ord('陈')      # 38472, int
    print(chr(38472))  # 陈, str
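Encoding and decoding can fail, or silently produce wrong text, when the codec does not fit the data. A short sketch of the common cases, using the same 'Café' sample; the variable names are illustrative:

```python
s = 'Café'

# ASCII cannot represent 'é', so strict encoding raises an error.
try:
    s.encode('ascii')
    ok = True
except UnicodeEncodeError:
    ok = False

# The errors argument selects a fallback instead of raising.
replaced = s.encode('ascii', errors='replace')  # b'Caf?'
ignored = s.encode('ascii', errors='ignore')    # b'Caf'

# Decoding with the wrong codec succeeds here but yields wrong text.
raw = s.encode('utf-8')           # b'Caf\xc3\xa9'
mojibake = raw.decode('latin-1')  # 'CafÃ©'

print(ok, replaced, ignored, mojibake)
```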

  • read str, output str
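    # text mode reads and writes str; assumes temp.txt is UTF-8 encoded
    with open('temp.txt', 'r', encoding='utf-8') as f:  # read str
        line = next(f)  # first line of the file

    with open('output.txt', 'w', encoding='utf-8') as o:  # write str
        o.write(line)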
  • read str, output byte string
    # assumes temp.txt is UTF-8 encoded
    with open('temp.txt', 'r', encoding='utf-8') as f:  # read str
        line = next(f)

    with open('output.txt', 'wb') as o:  # write byte string
        o.write(line.encode('utf-8'))
  • read byte string, output str
    with open('temp.txt', 'rb') as f:  # read byte string
        line = next(f)

    with open('output.txt', 'w', encoding='utf-8') as o:  # write str
        o.write(line.decode('utf-8'))
  • read byte string, output byte string
    with open('temp.txt', 'rb') as f:  # read byte string
        line = next(f)

    with open('output.txt', 'wb') as o:  # write byte string
        o.write(line)
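The four combinations above can be checked with a round trip through a scratch file. This is a sketch under stated assumptions: the file lives in a temporary directory created only for the example, and UTF-8 is assumed as the encoding throughout.

```python
import os
import tempfile

text = 'Café 陈\n'

# Hypothetical scratch file; tempfile keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'temp.txt')

    # Text mode: str goes in and is encoded to UTF-8 on disk.
    # newline='' disables newline translation so the bytes are predictable.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    # Binary mode: the raw encoded bytes come back.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode again: the bytes are decoded back to str.
    with open(path, 'r', encoding='utf-8', newline='') as f:
        roundtrip = f.read()

print(raw)                          # b'Caf\xc3\xa9 \xe9\x99\x88\n'
print(roundtrip == text)            # True
print(raw == text.encode('utf-8'))  # True
```

Passing encoding explicitly avoids depending on the platform default, which may differ between systems.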