Alibaba Cloud Function Compute: Unzipping Large Files

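I recently migrated a website with more than 100 GB of images. After uploading to OSS, I found that the stock Function Compute unzip function adds an extra folder name to the extracted paths, so I modified the function slightly.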
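
To make the difference concrete, here is a minimal sketch of the two key layouts. It assumes the stock function nests every entry under a folder named after the archive (so `source/a.zip` ends up under `processed/a/...`), while the modified function below writes entries directly under the destination prefix. `stock_key` and `flat_key` are hypothetical helper names used only for illustration; they are not part of the function itself.

```python
import os


def stock_key(processed_dir, zip_key, entry_name):
    # Assumed stock layout: <processed_dir><archive name without .zip>/<entry>
    zip_name = os.path.splitext(os.path.basename(zip_key))[0]
    return processed_dir + zip_name + "/" + entry_name


def flat_key(processed_dir, entry_name):
    # Layout used by the modified function: <processed_dir><entry>
    return processed_dir + entry_name


print(stock_key("processed/", "source/a.zip", "img/1.jpg"))
print(flat_key("processed/", "img/1.jpg"))
```

If the archive already contains a top-level folder, the flat layout avoids ending up with two nested folder levels in the bucket.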

 

# -*- coding: utf-8 -*-
'''
Notes:
This function assumes that file and folder names in the archive use one of
the following encodings:
1. macOS/Linux: UTF-8 by default
2. Windows: gb2312 by default, possibly UTF-8

For other encodings, the chardet library is used to guess the encoding, but
detection is not guaranteed to be 100% correct. Test the function first and,
if necessary, adapt it and make sure it passes your own tests.

Latest updates to this function: https://yq.aliyun.com/articles/680958
'''

import helper
import oss2, json
import os
import logging
import chardet
import time

"""
When an object is uploaded to OSS under the source/ prefix, it is decompressed
and the extracted files are stored back in OSS under the PROCESSED_DIR prefix.
For example, with PROCESSED_DIR=processed, the entry img/1.jpg in source/a.zip
is stored as processed/img/1.jpg.
Both prefixes can be changed to suit your requirements.

Details: https://yq.aliyun.com/articles/680958
"""
# Silence the info-level logs printed by the OSS SDK
logging.getLogger("oss2.api").setLevel(logging.ERROR)
logging.getLogger("oss2.auth").setLevel(logging.ERROR)

def handler(event, context):
  """
  Decompress a zip object from OSS and write its entries back to OSS.

  param: event:   The event as a JSON string, with src_bucket, dest_bucket and key,
                  and optionally event_name, marker and time_threshold.
  param: context: The function context, including credentials and runtime info.

      For details, see https://help.aliyun.com/document_detail/56316.html#using-context
  """
  start_time = time.time()
  logger = logging.getLogger()
  logger.setLevel(logging.DEBUG)
  evt = json.loads(event)
  logger.info("Handling event: %s", evt)
  endpoint = 'https://oss-%s-internal.aliyuncs.com' % context.region
  src_client = get_oss_client(context, endpoint, evt["src_bucket"])
  dest_client = get_oss_client(context, endpoint, evt["dest_bucket"])
  key = evt["key"]

  if "ObjectCreated:PutSymlink" == evt.get('event_name'):
    key = src_client.get_symlink(key).target_key
    logger.info("Resolved target key %s from %s", key, evt["key"])
    if key == "":
      # Report the symlink's own key; the resolved target key is empty here
      raise RuntimeError('{} is invalid symlink file'.format(evt["key"]))

  ext = os.path.splitext(key)[1]

  if ext != ".zip":
    raise RuntimeError('{} filetype is not zip'.format(key))

  logger.info("Start to decompress zip file %s", key)

  processed_dir = os.environ.get("PROCESSED_DIR", "")
  if processed_dir and processed_dir[-1] != "/":
    processed_dir += "/"

  zip_fp = helper.OssStreamFileLikeObject(src_client, key)
  
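  # Run for at most `threshold` seconds per invocation; the event value
  # overrides TIME_THRESHOLD, which is only read when the event omits it
  threshold = evt.get("time_threshold")
  if threshold is None:
    threshold = int(os.environ["TIME_THRESHOLD"])
  marker = evt.get("marker", "")
  # While the gate is closed, entries before the marker are skipped
  gate_closed = bool(marker)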

  with helper.zipfile_support_oss.ZipFile(zip_fp) as zf:
    for name in zf.namelist():
      logger.debug("Processing %s", name)
      elapsed_time = time.time() - start_time
      # If elapsed_time exceeds the threshold, return the name as marker
      if threshold and elapsed_time >= threshold:
        return {
          "marker": name
        }
      # If marker is specified, skip names before the marker 
      if gate_closed:
        if marker != name:
          logger.debug("Skipping key %s", name)
          continue
        else:
          gate_closed = False
      if name.endswith("/"):
        logger.debug("Skipping dir %s", name)
        continue
      logger.debug("Unzipping %s", name)
      with zf.open(name) as file_obj:
        # Recover the raw name bytes: names stored without the UTF-8 flag
        # round-trip through cp437; otherwise fall back to UTF-8
        try:
          name = name.encode(encoding='cp437')
        except UnicodeEncodeError:
          name = name.encode(encoding='utf-8')

        # Repeat the name so chardet sees more bytes; longer input is detected more reliably
        detect = chardet.detect( (name*100)[0:100] )
        confidence = detect["confidence"]
        if confidence > 0.8:
          try:
            name = name.decode(encoding=detect["encoding"])
          except (UnicodeDecodeError, LookupError):
            name = name.decode(encoding='gb2312')
        else:
          name = name.decode(encoding="gb2312")
          
        dest_client.put_object(processed_dir + name, file_obj)

  # Reached the end of the archive; an empty marker tells the caller it is done
  return {"marker": ""}

def get_oss_client(context, endpoint, bucket):
  creds = context.credentials
  if creds.security_token is not None:
    auth = oss2.StsAuth(creds.access_key_id, creds.access_key_secret, creds.security_token)
  else:
    # For local testing without an STS token, use the public endpoint
    endpoint = endpoint.replace("-internal", "")
    auth = oss2.Auth(creds.access_key_id, creds.access_key_secret)
  return oss2.Bucket(auth, endpoint, bucket)
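
Because the handler returns a non-empty `marker` when it runs out of time, something has to call it again until the marker comes back empty. The following is a minimal sketch of such a driver loop, not the way the function is actually wired up. `invoke` is a hypothetical callable that runs the function once with the given event (serializing it to JSON as needed) and returns the parsed result; how it is implemented is left open here. The event fields are the ones the handler reads.

```python
def unzip_until_done(invoke, src_bucket, dest_bucket, key, max_rounds=1000):
    """Call the unzip function repeatedly, feeding back the returned marker.

    Returns the number of invocations that were needed.
    """
    marker = ""
    for round_no in range(1, max_rounds + 1):
        event = {
            "src_bucket": src_bucket,
            "dest_bucket": dest_bucket,
            "key": key,
            "marker": marker,  # empty on the first round
        }
        result = invoke(event)
        marker = result.get("marker", "")
        if not marker:
            # An empty marker means the handler reached the end of the archive
            return round_no
    raise RuntimeError("gave up after %d rounds" % max_rounds)
```

Note that every round opens the archive again and skips entries until it reaches the marker, so a very small time threshold means more rounds and more repeated skipping.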

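The encode/decode steps in the handler are easier to follow on a concrete name. Python's standard `zipfile` module decodes entry names that lack the UTF-8 flag as cp437, so a gb2312-encoded name arrives as mojibake. Encoding it back to cp437 recovers the original bytes, which can then be decoded with the right codec. The sketch below uses only the standard library and leaves out the chardet step. `recover_name` is a hypothetical helper, and trying UTF-8 before gb2312 is a simplification of what the handler does, not a drop-in replacement.

```python
def recover_name(name):
    """Undo the cp437 decoding of a zip entry name and decode the raw bytes again."""
    try:
        raw = name.encode("cp437")  # entry was stored without the UTF-8 flag
    except UnicodeEncodeError:
        return name  # already proper text (the UTF-8 flag was set)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("gb2312")


# Simulate the name produced for a gb2312-encoded entry
mojibake = "中文.jpg".encode("gb2312").decode("cp437")
print(recover_name(mojibake))
```

Some short gb2312 byte sequences also happen to be valid UTF-8, so this order of attempts can guess wrong on certain names. That ambiguity is why the handler asks chardet for a confidence score before choosing an encoding.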
 

Posted on 2020-03-23 12:06 by 开发员