sql 数据处理时join字段慎重选择--避免出现double数据!
前几天做BI Tabular表格模型的时候,数据中心核对数据发现模型展现数据比实际数据要多,经过核查之后,发现原来我是一个不经意,在做数据选取的时候,inner join的字段CITY_NAME_CN选取错误。
首先对比事实表和我数据源选取的数据量差异:正常(8080945条记录)
SELECT count(YYYYMMDD) FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] --8080945
我的数据源sql:选取(8110909条记录)
SELECT [GUID] ,[YYYYMMDD] ,[BRAND] ,[BRAND_GROUP] ,[MODEL] ,[CHANNEL] ,[SALES_UNIT] ,[SALES_VALUE] ,[PC_TYPE] ,[DESIGN] ,[PROCESSOR_BRAND] ,[PROCESSOR] ,[PROCESSOR_NUMBE] ,[STORAGE_CAPAC] ,[RAM_MB] ,[DVD_WRITER] ,[SIZE_INCH] ,[GPU_BRAND] ,[GPU_MODEL] ,[ONBOARD_GRAPHIC] ,[V_RAM] ,[OS_PLANTFORM] ,[MICROSOFT_OS] ,[CATEGORIES] ,[PPC_PRICE_BRAND] ,[DPC_PRICE_BRAND] ,a.[CITY_NAME_CN] ,a.[CITY_NAME_EN] ,ISNULL(b.CITY_CODE,0)AS CITY_CODE ,[BU_GROUP_CODE] ,A.CITY_FULL_NAME_CN ,B.CITY_FULL_NAME ,Form_Factor ,CITY_GROUP FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A LEFT JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN
SELECT count([YYYYMMDD]) FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A LEFT JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN
由此可见,问题出在我的事实表A join 维度表B 的时候出现问题,出现了一对多的问题。
首先,我要查一下维表CB_GEOGRAPHY对于字段CITY_NAME_CN(乡镇名称)是不是有重复,
SELECT a.CITY_NAME_CN,count(1)as num FROM [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] a --where a.CITY_NAME_CN=N'九江' group by a.CITY_NAME_CN having count(1)>1
想一下很容易理解,地理维度表中,不同地区的乡镇名可能出现重名。所以特设了一个CITY_FULL_NAME的字段。那么join之后究竟是哪些字段出现了重复呢,
SELECT b.CITY_FULL_NAME_CN,B.CITY_NAME_CN FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] B inner join (SELECT a.CITY_NAME_CN,count(1)as num FROM [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] a --where a.CITY_NAME_CN=N'九江' group by a.CITY_NAME_CN having count(1)>1 )aa ON aa.CITY_NAME_CN=B.CITY_NAME_CN
发现这些记录数为:29964,而且全是出在'九江'这个地方,8080945-8110909=-29964,BINGO!
故:出现了一对多的情况,join之后出现了DOUBLE的数据,
SELECT A.CITY_FULL_NAME_CN ,B.CITY_FULL_NAME ,A.CITY_NAME_CN ,B.CITY_NAME_CN FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A inner JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_NAME_CN=B.CITY_NAME_CN where b.CITY_NAME_CN=N'九江' --59928 SELECT A.CITY_FULL_NAME_CN ,B.CITY_FULL_NAME ,A.CITY_NAME_CN ,B.CITY_NAME_CN FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A LEFT JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME where b.CITY_NAME_CN=N'九江' ----29964
修改之后,LEFT JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME,okay!
SELECT [GUID] ,[YYYYMMDD] ,[BRAND] ,[BRAND_GROUP] ,[MODEL] ,[CHANNEL] ,CASE [CHANNEL]WHEN 'NON-ITS'THEN'3C'WHEN 'INTERNET B2C'THEN'NC'WHEN 'ASSB'THEN'TC'WHEN 'CS'THEN'TC'ELSE'TC' END AS CHANNEL_TYPE_CODE ,[SALES_UNIT] ,[SALES_VALUE] ,[PC_TYPE] ,[DESIGN] ,[PROCESSOR_BRAND] ,[PROCESSOR] ,[PROCESSOR_NUMBE] ,[STORAGE_CAPAC] ,[RAM_MB] ,[DVD_WRITER] ,[SIZE_INCH] ,[GPU_BRAND] ,[GPU_MODEL] ,[ONBOARD_GRAPHIC] ,[V_RAM] ,[OS_PLANTFORM] ,[MICROSOFT_OS] ,[CATEGORIES] ,[PPC_PRICE_BRAND] ,[DPC_PRICE_BRAND] ,a.[CITY_NAME_CN] ,a.[CITY_NAME_EN] ,ISNULL(b.CITY_CODE,0)AS CITY_CODE ,[BU_GROUP_CODE] ,A.[CITY_FULL_NAME_CN] ,Form_Factor ,isnull(CITY_GROUP,'')as CITY_GROUP FROM [DM_ACCN_T1].[dbo].[FS_MS_GFK] A LEFT JOIN [DM_ACCN_T1].[dbo].[CB_GEOGRAPHY] B ON A.CITY_FULL_NAME_CN=B.CITY_FULL_NAME