sql 数据处理时join字段慎重选择--避免出现double数据!

前几天做BI Tabular表格模型的时候,数据中心核对数据发现模型展现数据比实际数据要多,经过核查之后,发现原来我是一个不经意,在做数据选取的时候,inner join的字段CITY_NAME_CN选取错误。

首先对比事实表和我数据源选取的数据量差异:正常(8080945条记录)

SELECT 
      count(YYYYMMDD)
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK]  --8080945
View Code

我的数据源sql:选取(8110909条记录)

SELECT  [GUID]
      ,[YYYYMMDD]
      ,[BRAND]
      ,[BRAND_GROUP]
      ,[MODEL]
      ,[CHANNEL]
      ,[SALES_UNIT]
      ,[SALES_VALUE]
      ,[PC_TYPE]
      ,[DESIGN]
      ,[PROCESSOR_BRAND]
      ,[PROCESSOR]
      ,[PROCESSOR_NUMBE]
      ,[STORAGE_CAPAC]
      ,[RAM_MB]
      ,[DVD_WRITER]
      ,[SIZE_INCH]
      ,[GPU_BRAND]
      ,[GPU_MODEL]
      ,[ONBOARD_GRAPHIC]
      ,[V_RAM]
      ,[OS_PLANTFORM]
      ,[MICROSOFT_OS]
      ,[CATEGORIES]
      ,[PPC_PRICE_BRAND]
      ,[DPC_PRICE_BRAND]
      ,a.[CITY_NAME_CN]
      ,a.[CITY_NAME_EN]
      ,ISNULL(b.CITY_CODE,0)AS CITY_CODE
      ,[BU_GROUP_CODE]
      ,A.CITY_FULL_NAME_CN
      ,B.CITY_FULL_NAME
      ,Form_Factor
,CITY_GROUP
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A
   LEFT JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN
View Code
SELECT 
      count([YYYYMMDD])
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A
   LEFT JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN
View Code

由此可见,问题出在我的事实表A join 维度表B 的时候出现问题,出现了一对多的问题。

首先,我要查一下维表CB_GEOGRAPHY对于字段CITY_NAME_CN(乡镇名称)是不是有重复,

SELECT a.CITY_NAME_CN,count(1)as num
   FROM [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] a
   --where a.CITY_NAME_CN=N'九江'
   group by a.CITY_NAME_CN having count(1)>1 

 

想一下很容易理解,地理维度表中,不同地区的乡镇名可能出现重名。所以特设了一个CITY_FULL_NAME的字段。那么join之后究竟是哪些字段出现了重复呢,

   SELECT b.CITY_FULL_NAME_CN,B.CITY_NAME_CN FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] B inner join 
   (SELECT a.CITY_NAME_CN,count(1)as num
   FROM [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] a
   --where a.CITY_NAME_CN=N'九江'
   group by a.CITY_NAME_CN having count(1)>1 )aa  ON aa.CITY_NAME_CN=B.CITY_NAME_CN

发现这些记录数为:29964,而且全是出在'九江'这个地方,8080945-8110909=-29964,BINGO!

故:出现了一对多的情况,join之后出现了DOUBLE的数据,

SELECT 
      A.CITY_FULL_NAME_CN
      ,B.CITY_FULL_NAME
      ,A.CITY_NAME_CN
      ,B.CITY_NAME_CN
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A
   inner JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN
   where b.CITY_NAME_CN=N'九江' --59928

SELECT 
      A.CITY_FULL_NAME_CN
      ,B.CITY_FULL_NAME
      ,A.CITY_NAME_CN
      ,B.CITY_NAME_CN
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A
   LEFT JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME
   where b.CITY_NAME_CN=N'九江'   ----29964
View Code

修改之后,LEFT JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME,okay!

SELECT  [GUID]
      ,[YYYYMMDD]
      ,[BRAND]
      ,[BRAND_GROUP]
      ,[MODEL]
   ,[CHANNEL]
      ,CASE [CHANNEL]WHEN 'NON-ITS'THEN'3C'WHEN 'INTERNET B2C'THEN'NC'WHEN 'ASSB'THEN'TC'WHEN 'CS'THEN'TC'ELSE'TC' END AS CHANNEL_TYPE_CODE
      ,[SALES_UNIT]
      ,[SALES_VALUE]
      ,[PC_TYPE]
      ,[DESIGN]
      ,[PROCESSOR_BRAND]
      ,[PROCESSOR]
      ,[PROCESSOR_NUMBE]
      ,[STORAGE_CAPAC]
      ,[RAM_MB]
      ,[DVD_WRITER]
      ,[SIZE_INCH]
      ,[GPU_BRAND]
      ,[GPU_MODEL]
      ,[ONBOARD_GRAPHIC]
      ,[V_RAM]
      ,[OS_PLANTFORM]
      ,[MICROSOFT_OS]
      ,[CATEGORIES]
      ,[PPC_PRICE_BRAND]
      ,[DPC_PRICE_BRAND]
      ,a.[CITY_NAME_CN]
      ,a.[CITY_NAME_EN]
      ,ISNULL(b.CITY_CODE,0)AS CITY_CODE
      ,[BU_GROUP_CODE]
      ,A.[CITY_FULL_NAME_CN]
      ,Form_Factor
      ,isnull(CITY_GROUP,'')as CITY_GROUP
  FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A
   LEFT JOIN  [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME
View Code

 

posted @ 2016-10-11 10:41  依旧一生有你  阅读(1258)  评论(0编辑  收藏  举报